
ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions

Published in: International Journal of Computer Vision

Abstract

With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: (i) they cannot control the trajectory of the subject or the process of scene transformation; (ii) they can only generate videos with a limited number of frames and therefore fail to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model that generates longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a mixed-frequency ControlNet, which utilizes several frames of the generated video to extend it into a long video chunk by chunk in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.
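To make the chunk-by-chunk extension and cross-chunk scheduling described above more concrete, the following minimal Python sketch illustrates the general auto-regressive pattern. All names here (generate_long_video, denoise_chunk, overlap, the linear blending weights) are hypothetical and chosen for illustration under our own assumptions; this is not ScenarioDiff's actual implementation.

```python
# Hypothetical sketch: auto-regressive, chunk-by-chunk long-video generation
# with a simple overlap-based blend between consecutive chunks.
# Names and the blending rule are illustrative assumptions, not the authors' code.
import torch


def generate_long_video(denoise_chunk, layouts_per_chunk, overlap=4):
    """denoise_chunk(cond_frames, layouts) must return a tensor of shape
    (chunk_len, C, H, W); cond_frames is None for the first chunk."""
    video = []            # accumulated frames, each of shape (C, H, W)
    cond_frames = None    # tail frames carried over from the previous chunk

    for layouts in layouts_per_chunk:
        # Generate the next chunk conditioned on the tail of the previous
        # chunk and on the per-frame scene/subject layout conditions.
        chunk = denoise_chunk(cond_frames, layouts)

        if video:
            # Cross-chunk blending: linearly fade from the previous chunk's
            # tail into the new chunk's head so the transition stays smooth.
            prev_tail = torch.stack(video[-overlap:])
            w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
            blended = (1 - w) * prev_tail + w * chunk[:overlap]
            video[-overlap:] = list(blended)
            video.extend(chunk[overlap:])
        else:
            video.extend(chunk)

        cond_frames = chunk[-overlap:]  # reuse the last frames as conditioning

    return torch.stack(video)  # (num_frames, C, H, W)
```

In this sketch, the denoise_chunk callable stands in for the full denoising pipeline, which in the paper's setting would incorporate the spatial layout fuser and the mixed-frequency ControlNet; only the outer conditioning-and-blending loop is shown.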

Data Availability

The experiments in this paper use publicly available datasets, accessible at the following URLs. Miao et al. (2021): https://www.vspwdataset.com/. Bain et al. (2021b): https://github.com/m-bain/webvid.

References

  • Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021a). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).

  • Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021b). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).

  • Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., et al. (2023). Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf

  • Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., & Ramesh, A. (2024). Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators

  • Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., et al. (2023). VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512

  • Chen, H., Zhang, Y., Wu, S., Wang, X., Duan, X., Zhou, Y., & Zhu, W. (2023). DisenBooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations.

  • Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 3–11.

  • Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12873–12883).

  • Frans, K., Soros, L., & Witkowski, O. (2022). CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35, 5207–5218.

  • Guo, X., Zheng, M., Hou, L., Gao, Y., Deng, Y., Ma, C., et al. (2023). I2V-Adapter: A general image-to-video adapter for video diffusion models. arXiv preprint arXiv:2312.16693

  • Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2023). SparseCtrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933

  • Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., & Dai, B. (2023). AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations.

  • He, Y., Yang, T., Zhang, Y., Shan, Y., & Chen, Q. (2022). Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221

  • Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., & Shi, H. (2024). StreamingT2V: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.

  • Hong, W., Ding, M., Zheng, W., Liu, X., & Tang, J. (2022). CogVideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations.

  • Hu, Z., & Xu, D. (2023). VideoControlNet: A motion-guided video-to-video translation framework by using diffusion model with ControlNet. arXiv preprint arXiv:2307.14073

  • Huang, H., Feng, Y., Shi, C., Xu, L., Yu, J., & Yang, S. (2024). Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator. Advances in Neural Information Processing Systems, 36.

  • Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., & Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  • Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15954–15964).

  • Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., & Lee, Y. J. (2023). GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22511–22521).

  • Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775–5787.

  • Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., & Tan, T. (2023). VideoFusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10209–10218).

  • Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., & Yang, Y. (2021). VSPW: A large-scale dataset for video scene parsing in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4133–4143).

  • Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.

  • Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., & Shan, Y. (2024). T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 4296–4304).

  • Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., & Chen, M. (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (pp. 16784–16804).

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., & Rombach, R. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations.

  • Qiu, H., Xia, M., Zhang, Y., He, Y., Wang, X., Shan, Y., & Liu, Z. (2023). FreeNoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125

  • Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., & Sutskever, I. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821–8831).

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695).

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241).

  • Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22500–22510).

  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.

  • Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., et al. (2022). Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations.

  • Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. In International Conference on Learning Representations.

  • Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., & Erhan, D. (2022). Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations.

  • Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., & Zhang, S. (2023). ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571

  • Wang, Z., Li, A., Xie, E., Zhu, L., Guo, Y., Dou, Q., & Li, Z. (2024). CustomVideo: Customizing text-to-video generation with multiple subjects. arXiv preprint arXiv:2401.09962

  • Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., & Duan, N. (2021). GODIVA: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806

  • Wu, T., Si, C., Jiang, Y., Huang, Z., & Liu, Z. (2023). FreeInit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537

  • Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M. Z. (2023). BoxDiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7452–7461).

  • Xing, J., Xia, M., Liu, Y., Zhang, Y., He, Y., Liu, H., et al. (2024). Make-Your-Video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics.

  • Xue, H., Hang, T., Zeng, Y., Sun, Y., Liu, B., Yang, H., & Guo, B. (2022). Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5036–5045).

  • Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721

  • Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3836–3847).

  • Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022). MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018

Funding

This work is supported by the National Key Research and Development Program of China (No. 2023YFF1205001), the National Natural Science Foundation of China (Nos. 62222209, 62250008, 62102222), the Beijing National Research Center for Information Science and Technology under Grant Nos. BNR2023RC01003 and BNR2023TD03006, and the Beijing Key Lab of Networked Multimedia.

Author information

Corresponding authors

Correspondence to Xin Wang or Wenwu Zhu.

Additional information

Communicated by Long Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Wang, X., Chen, H. et al. ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions. Int J Comput Vis 133, 4909–4922 (2025). https://doi.org/10.1007/s11263-025-02413-7
