Abstract
With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from two weaknesses: i) they cannot control the trajectory of the subject or the process of scene transformation; ii) they can only generate videos with a limited number of frames and thus fail to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model that generates longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a mixed-frequency ControlNet, which utilizes several frames of the generated video to extend it chunk by chunk into a long video in an auto-regressive manner. Additionally, to ensure consistency across different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.
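To make the chunk-by-chunk extension concrete, the following is a minimal conceptual sketch, not the authors' implementation: it assumes hypothetical placeholders (Layout, generate_chunk) and only illustrates how per-frame layout conditions and a small overlap of previously generated frames could drive auto-regressive long-video generation with cross-chunk consistency.

# Conceptual sketch only; all names below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Layout:
    """Per-frame condition: a subject bounding box and a scene label."""
    subject_box: tuple  # (x0, y0, x1, y1), normalized to [0, 1]
    scene: str          # e.g. "forest", "city street"

def generate_chunk(prompt: str,
                   layouts: List[Layout],
                   context_frames: Optional[List[str]] = None) -> List[str]:
    """Stand-in for one diffusion pass over a single chunk.
    A real model would denoise len(layouts) frames conditioned on the text
    prompt, the per-frame layouts, and (for later chunks) a few context
    frames carried over from the previous chunk."""
    return [f"frame(prompt={prompt!r}, scene={l.scene})" for l in layouts]

def generate_long_video(prompt: str, layouts: List[Layout],
                        chunk_len: int = 16, overlap: int = 4) -> List[str]:
    video: List[str] = []
    pos = 0
    while pos < len(layouts):
        chunk_layouts = layouts[pos:pos + chunk_len]
        # Later chunks reuse the last `overlap` generated frames as context,
        # mimicking auto-regressive extension with cross-chunk consistency.
        context = video[-overlap:] if video else None
        video.extend(generate_chunk(prompt, chunk_layouts, context))
        pos += chunk_len
    return video

if __name__ == "__main__":
    # A subject that moves while the scene changes from forest to city street.
    layouts = [Layout((0.2, 0.3, 0.5, 0.9), "forest")] * 20 + \
              [Layout((0.4, 0.3, 0.7, 0.9), "city street")] * 20
    print(len(generate_long_video("a dog walking", layouts)))  # -> 40 frames

In the actual method the overlap frames feed the mixed-frequency ControlNet, and the cross-chunk scheduling mechanism further aligns the denoising of adjacent chunks at inference time; the loop above only shows the control flow.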
Data Availability
The experiments in this paper use public datasets available at the following URLs. Miao et al. (2021): https://www.vspwdataset.com/. Bain et al. (2021b): https://github.com/m-bain/webvid.
References
Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021a). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).
Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021b). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., et al. (2023). Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., & Ramesh, A. (2024). Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators
Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., et al. (2023). Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
Chen, H., Zhang, Y., Wu, S., Wang, X., Duan, X., Zhou, Y., & Zhu, W. (2023). Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations.
Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 3–11.
Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12873–12883).
Frans, K., Soros, L., & Witkowski, O. (2022). Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35, 5207–5218.
Guo, X., Zheng, M., Hou, L., Gao, Y., Deng, Y., Ma, C., et al. (2023). I2v-adapter: A general image-to-video adapter for video diffusion models. arXiv preprint arXiv:2312.16693.
Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2023). Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933.
Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., & Dai, B. (2023). Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations.
He, Y., Yang, T., Zhang, Y., Shan, Y., & Chen, Q. (2022). Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221.
Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., & Shi, H. (2024). Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Hong, W., Ding, M., Zheng, W., Liu, X., & Tang, J. (2022). Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations.
Hu, Z., & Xu, D. (2023). Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073.
Huang, H., Feng, Y., Shi, C., Xu, L., Yu, J., & Yang, S. (2024). Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator. Advances in Neural Information Processing Systems, 36.
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., & Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15954–15964).
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., & Lee, Y. J. (2023). Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22511–22521).
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775–5787.
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., & Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10209–10218).
Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., & Yang, Y. (2021). Vspw: A large-scale dataset for video scene parsing in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4133–4143).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., & Shan, Y. (2024). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 4296–4304).
Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., & Chen, M. (2022). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (pp. 16784–16804).
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., & Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations.
Qiu, H., Xia, M., Zhang, Y., He, Y., Wang, X., Shan, Y., & Liu, Z. (2023). Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., & Sutskever, I. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821–8831).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241).
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22500–22510).
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., et al. (2022). Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations.
Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. In International Conference on Learning Representations.
Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., & Erhan, D. (2022). Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations.
Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., & Zhang, S. (2023). Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
Wang, Z., Li, A., Xie, E., Zhu, L., Guo, Y., Dou, Q., & Li, Z. (2024). Customvideo: Customizing text-to-video generation with multiple subjects. arXiv preprint arXiv:2401.09962.
Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., & Duan, N. (2021). Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806.
Wu, T., Si, C., Jiang, Y., Huang, Z., & Liu, Z. (2023). Freeinit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537.
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M. Z. (2023). Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7452–7461).
Xing, J., Xia, M., Liu, Y., Zhang, Y., He, Y., Liu, H., et al. (2024). Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics.
Xue, H., Hang, T., Zeng, Y., Sun, Y., Liu, B., Yang, H., & Guo, B. (2022). Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5036–5045).
Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3836–3847).
Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018.
Funding
This work is supported by the National Key Research and Development Program of China (No. 2023YFF1205001), the National Natural Science Foundation of China (Nos. 62222209, 62250008, 62102222), the Beijing National Research Center for Information Science and Technology (Grant Nos. BNR2023RC01003 and BNR2023TD03006), and the Beijing Key Lab of Networked Multimedia.
Additional information
Communicated by Long Yang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Wang, X., Chen, H. et al. ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions. Int J Comput Vis 133, 4909–4922 (2025). https://doi.org/10.1007/s11263-025-02413-7