Abstract
With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from two weaknesses: i) they cannot control the trajectory of the subject or the process of scene transformation; ii) they can only generate videos with a limited number of frames and thus fail to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model that generates longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a mixed-frequency ControlNet, which utilizes several frames of the generated video to extend it chunk by chunk into a long video in an auto-regressive manner. Additionally, to ensure consistency across different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.
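To make the chunk-by-chunk extension concrete, the following is a minimal conceptual sketch, not the authors' implementation: it assumes hypothetical placeholders (Layout, generate_chunk) and only illustrates how per-frame layout conditions and a small overlap of previously generated frames could drive auto-regressive long-video generation with cross-chunk consistency.

# Conceptual sketch only; all names below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Layout:
    """Per-frame condition: a subject bounding box and a scene label."""
    subject_box: tuple  # (x0, y0, x1, y1), normalized to [0, 1]
    scene: str          # e.g. "forest", "city street"

def generate_chunk(prompt: str,
                   layouts: List[Layout],
                   context_frames: Optional[List[str]] = None) -> List[str]:
    """Stand-in for one diffusion pass over a single chunk.
    A real model would denoise len(layouts) frames conditioned on the text
    prompt, the per-frame layouts, and (for later chunks) a few context
    frames carried over from the previous chunk."""
    return [f"frame(prompt={prompt!r}, scene={l.scene})" for l in layouts]

def generate_long_video(prompt: str, layouts: List[Layout],
                        chunk_len: int = 16, overlap: int = 4) -> List[str]:
    video: List[str] = []
    pos = 0
    while pos < len(layouts):
        chunk_layouts = layouts[pos:pos + chunk_len]
        # Later chunks reuse the last `overlap` generated frames as context,
        # mimicking auto-regressive extension with cross-chunk consistency.
        context = video[-overlap:] if video else None
        video.extend(generate_chunk(prompt, chunk_layouts, context))
        pos += chunk_len
    return video

if __name__ == "__main__":
    # A subject that moves while the scene changes from forest to city street.
    layouts = [Layout((0.2, 0.3, 0.5, 0.9), "forest")] * 20 + \
              [Layout((0.4, 0.3, 0.7, 0.9), "city street")] * 20
    print(len(generate_long_video("a dog walking", layouts)))  # -> 40 frames

In the actual method the overlap frames feed the mixed-frequency ControlNet, and the cross-chunk scheduling mechanism further aligns the denoising of adjacent chunks at inference time; the loop above only shows the control flow.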
Data Availability
The experiments in this paper use public datasets available at the following URLs. Miao et al. (2021): https://www.vspwdataset.com/. Bain et al. (2021b): https://github.com/m-bain/webvid.
References
Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021a). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).
Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021b). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., et al. (2023). Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., & Ramesh, A. (2024). Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators
Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., et al. (2023). Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
Chen, H., Zhang, Y., Wu, S., Wang, X., Duan, X., Zhou, Y., & Zhu, W. (2023). Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations.
Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 3–11.
Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12873–12883).
Frans, K., Soros, L., & Witkowski, O. (2022). Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35, 5207–5218.
Guo, X., Zheng, M., Hou, L., Gao, Y., Deng, Y., Ma, C., et al. (2023). I2v-adapter: A general image-to-video adapter for video diffusion models. arXiv preprint arXiv:2312.16693.
Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2023). Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933.
Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., & Dai, B. (2023). Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations.
He, Y., Yang, T., Zhang, Y., Shan, Y., & Chen, Q. (2022). Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221.
Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., & Shi, H. (2024). Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Hong, W., Ding, M., Zheng, W., Liu, X., & Tang, J. (2022). Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations.
Hu, Z., & Xu, D. (2023). Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073.
Huang, H., Feng, Y., Shi, C., Xu, L., Yu, J., & Yang, S. (2024). Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator. Advances in Neural Information Processing Systems, 36.
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., & Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15954–15964).
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., & Lee, Y. J. (2023). Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22511–22521).
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775–5787.
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., & Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10209–10218).
Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., & Yang, Y. (2021). Vspw: A large-scale dataset for video scene parsing in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4133–4143).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., & Shan, Y. (2024). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 4296–4304).
Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., & Chen, M. (2022). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (pp. 16784–16804).
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., & Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations.
Qiu, H., Xia, M., Zhang, Y., He, Y., Wang, X., Shan, Y., & Liu, Z. (2023). Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., & Sutskever, I. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821–8831).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241).
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22500–22510).
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., et al. (2022). Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations.
Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. In International Conference on Learning Representations.
Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., & Erhan, D. (2022). Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations.
Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., & Zhang, S. (2023). Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
Wang, Z., Li, A., Xie, E., Zhu, L., Guo, Y., Dou, Q., & Li, Z. (2024). Customvideo: Customizing text-to-video generation with multiple subjects. arXiv preprint arXiv:2401.09962.
Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., & Duan, N. (2021). Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806.
Wu, T., Si, C., Jiang, Y., Huang, Z., & Liu, Z. (2023). Freeinit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537.
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M. Z. (2023). Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7452–7461).
Xing, J., Xia, M., Liu, Y., Zhang, Y., He, Y., Liu, H., et al. (2024). Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics.
Xue, H., Hang, T., Zeng, Y., Sun, Y., Liu, B., Yang, H., & Guo, B. (2022). Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5036–5045).
Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3836–3847).
Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018.
Funding
This work is supported by the National Key Research and Development Program of China (No. 2023YFF1205001), the National Natural Science Foundation of China (Nos. 62222209, 62250008, 62102222), the Beijing National Research Center for Information Science and Technology (Grant Nos. BNR2023RC01003 and BNR2023TD03006), and the Beijing Key Lab of Networked Multimedia.
Additional information
Communicated by Long Yang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Wang, X., Chen, H. et al. ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions. Int J Comput Vis 133, 4909–4922 (2025). https://doi.org/10.1007/s11263-025-02413-7