MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions

International Journal of Computer Vision

Abstract

Current video diffusion models (VDMs) mostly rely on text conditions, which limits control over video appearance and geometry. This study introduces a new model, MoonShot, that conditions on both image and text for enhanced control. It features the Multimodal Video Block (MVB), which integrates a motion-aware dual cross-attention layer for precise alignment of appearance and motion with the provided prompts, and a spatiotemporal attention layer for large motion dynamics. It can also incorporate pre-trained Image ControlNet modules for geometry conditioning without extra video training. Experiments show that our model significantly improves visual quality and motion fidelity, and its versatility allows for applications in personalized video generation, animation, and editing, making it a foundational tool for controllable video creation. More video results can be found here.
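
The motion-aware dual cross-attention described in the abstract can be pictured as two parallel cross-attention branches over the same video latents, one attending to text embeddings and one to image embeddings, with their outputs combined residually. The sketch below is a minimal PyTorch illustration of that idea reconstructed from the abstract's high-level description; the class name, dimensions, and the simple additive fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): video latents attend to text and
# image condition tokens in two parallel cross-attention branches.
import torch
import torch.nn as nn


class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, text_dim: int = 768, image_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # One cross-attention branch per conditioning modality.
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, heads, kdim=image_dim, vdim=image_dim, batch_first=True)

    def forward(self, x, text_emb, image_emb):
        # x:         (batch * frames, spatial_tokens, dim)  latent video features
        # text_emb:  (batch * frames, n_text_tokens, text_dim)
        # image_emb: (batch * frames, n_image_tokens, image_dim)
        h = self.norm(x)
        out_text, _ = self.attn_text(h, text_emb, text_emb)
        out_image, _ = self.attn_image(h, image_emb, image_emb)
        # Residual combination of the two conditioning signals (assumed fusion).
        return x + out_text + out_image


if __name__ == "__main__":
    # Toy shapes: 2 clips of 16 frames, 64 spatial tokens, CLIP-style condition tokens.
    layer = DualCrossAttention()
    x = torch.randn(2 * 16, 64, 320)
    text = torch.randn(2 * 16, 77, 768)
    image = torch.randn(2 * 16, 257, 768)
    print(layer(x, text, image).shape)  # torch.Size([32, 64, 320])
```

In MoonShot, a block of this kind would sit inside each Multimodal Video Block alongside the spatiotemporal attention layer; the exact condition encoders and fusion scheme are described in the full paper.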

Data Availability

The datasets analyzed during the current study are available as follows: WebVid-10M (Bain et al., 2021) is available at: https://github.com/m-bain/webvid. InternVideo (Wang et al., 2022) is available at: https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid.

References

  • An, J., Zhang, S., Yang, H., Gupta, S., Huang, J.B., Luo, J., Yin, X. (2023). Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477

  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S. (2017). Stochastic variational video prediction. arXiv preprint arXiv:1710.11252

  • Bain, M., Nagrani, A., Varol, G., Zisserman, A. (2021). Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1728–1738

  • Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K. (2023a). Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22563–22575

  • Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K. (2023b). Align your latents: High-resolution video synthesis with latent diffusion models. In: CVPR

  • Ceylan, D., Huang, C.H., Mitra, N.J. (2023). Pix2video: Video editing using image diffusion. arXiv:2303.12688

  • Chai, W., Guo, X., Wang, G., Lu, Y. (2023). Stablevideo: Text-driven consistency-aware diffusion video editing. arXiv preprint arXiv:2308.09592

  • Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y. (2024). Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv:2401.09047

  • Chen, L., Zhao, M., Liu, Y., Ding, M., Song, Y., Wang, S., Wang, X., Yang, H., Liu, J., Du, K., et al. (2023a). Photoverse: Tuning-free image customization with text-to-image diffusion models. arXiv preprint arXiv:2309.05793

  • Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., Lin, L. (2023b). Control-a-video: Controllable text-to-video generation with diffusion models. arXiv:2305.13840

  • Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H. (2023c). Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481

  • Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A. (2023). Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011

  • Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-a-scene: Scene-based text-to-image generation with human priors. In X. V. Part (Ed.), Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings (pp. 89–106). Springer.

  • Ge, S., Hayes, T., Yang, H., Yin, X., Pang, G., Jacobs, D., Huang, J.B., Parikh, D. (2022). Long video generation with time-agnostic vqgan and time-sensitive transformer. arXiv preprint arXiv:2204.03638

  • Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.B., Liu, M.Y., Balaji, Y. (2023). Preserve your own correlation: A noise prior for video diffusion models. arXiv preprint arXiv:2305.10474

  • Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T. (2023). Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373

  • Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S.S., Shah, A., Yin, X., Parikh D., Misra I. (2023). Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709

  • Gu, Y., Wang, X., Wu, J.Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W., et al. (2023). Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292

  • Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., Dai B. (2023). Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725

  • Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F. (2022). Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495

  • He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q. (2022). Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840–6851.

  • Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al. (2022). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303

  • Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J. (2022). Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868

  • Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D., Dittadi, A. (2022). Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696

  • Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685

  • Hu, J., Shen, L., Sun, G. (2018). Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

  • Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  • Jiang, Y., Wu, T., Yang, S., Si, C., Lin, D., Qiao, Y., Loy, C.C., Liu, Z. (2024). Videobooth: Diffusion-based video generation with image prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6689–6700

  • Karras, T., Aittala, M., Aila, T., Laine, S. (2022). Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364

  • Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H. (2023a). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439

  • Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H. (2023b). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439

  • Kim, Y., Nam, S., Cho, I., Kim, S.J. (2019). Unsupervised keypoint learning for guiding class-conditional video prediction. Advances in neural information processing systems 32

  • Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2023a). Multi-concept customization of text-to-image diffusion. In: CVPR

  • Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2023b). Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1931–1941

  • Le Moing, G., Ponce, J., Schmid, C. (2021). Ccvs: Context-aware controllable video synthesis. NeurIPS

  • Li, D., Li, J., Hoi, S.C. (2023a). Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. arXiv preprint arXiv:2305.14720

  • Li, X., Chu, W., Wu, Y., Yuan, W., Liu, F., Zhang, Q., Li, F., Feng, H., Ding, E., Wang, J. (2023b). Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398

  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H. (2018). Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 600–615

  • Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J. (2023). Video-p2p: Video editing with cross-attention control. arXiv:2303.04761

  • Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., Zhao, D., Zhou, J., Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In: CVPR

  • Ma, Z., Zhou, D., Yeh, C.H., Wang, X.S., Li, X., Yang, H., Dong, Z., Keutzer, K., Feng, J. (2024). Magic-me: Identity-specific video customized diffusion. arXiv preprint arXiv:2402.09368

  • Molad, E., Horwitz, E., Valevski, D., Acha, A.R., Matias, Y., Pritch, Y., Leviathan, Y., Hoshen, Y. (2023). Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329

  • Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X. (2023). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453

  • Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R. (2023). Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18444–18455

  • Nikankin, Y., Haim, N., Irani, M. (2022). Sinfusion: Training diffusion models on a single image or video. arXiv preprint arXiv:2211.11743

  • Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  • Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., Wang, X. (2019). Video generation from single semantic label map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3733–3742

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952

  • Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L. (2017). The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675

  • Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., Chen, Q. (2023). Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535

  • Ren, W., Yang, H., Zhang, G., Wei, C., Du, X., Huang, S., Chen, W. (2024). Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: CVPR, pp 10684–10695

  • Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K. (2022). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242

  • Saito, M., Matsumoto, E., Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In: ICCV

  • Schuhmann, C., Vencu, R., Beaumont, R., Coombes, T., Gordon, C., Katta, A., Kaczmarczyk, R., Jitsev, J. (2022). LAION-5B: A new era of open large-scale multi-modal datasets. https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/

  • Shen, X., Li, X., Elhoseiny, M. (2023). Mostgan-v: Video generation with temporal motion styles. In: CVPR

  • Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. (2022). Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792

  • Skorokhodov, I., Tulyakov, S., Elhoseiny, M. (2021). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. arXiv preprint arXiv:2112.14683

  • Smith, J.S., Hsu, Y.C., Zhang, L., Hua, T., Kira, Z., Shen, Y., Jin, H. (2023). Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027

  • Srivastava, N., Mansimov, E., Salakhudinov, R. (2015). Unsupervised learning of video representations using lstms. In: ICML

  • Tian, Y., Ren, J., Chai, M., Olszewski, K., Peng, X., Metaxas, D.N., Tulyakov, S. (2021). A good image generator is what you need for high-resolution video synthesis. In: ICLR

  • Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J. (2018). Mocogan: Decomposing motion and content for video generation. In: CVPR

  • Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717

  • Voleti, V., Jolicoeur-Martineau, A., Pal, C. (2022). Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853

  • Vondrick, C., Pirsiavash, H., Torralba, A. (2016). Generating videos with scene dynamics. NIPS

  • Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K. (2023). P+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522

  • Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S. (2023a). Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571

  • Wang, W., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., Shen, C. (2023b). Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599

  • Wang, W., Yang, H., Tuo, Z., He, H., Zhu, J., Fu, J., Liu, J. (2023c). Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874

  • Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J. (2023d). Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018

  • Wang, X., Zhang, S., Yuan, H., Qing, Z., Gong, B., Zhang, Y., Shen, Y., Gao, C., Sang, N. (2024). A recipe for scaling up text-to-video generation with text-free videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6572–6582

  • Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., Xing, S., Chen, G., Pan, J., Yu, J., Wang, Y., Wang, L., Qiao, Y. (2022). Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191

  • Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al. (2023e). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103

  • Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W. (2023). Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848

  • Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., Sapiro, G., Duan, N. (2021). Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806

  • Wu, C., Liang, J., Ji, L., Yang, F., Fang, Y., Jiang, D., Duan, N. (2022a). Nüwa: Visual synthesis pre-training for neural visual world creation. In: ECCV, Springer, pp 720–736

  • Wu, J.Z., Ge, Y., Wang, X., Lei, W., Gu, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z. (2022b). Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565

  • Xing, J., Xia, M., Zhang, Y., Chen, H., Wang, X., Wong, T.T., Shan, Y. (2023). Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190

  • Xiong, W., Luo, W., Ma, L., Liu, W., & Luo, J. (2018). Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2364–2373

  • Xu, J., Mei, T., Yao, T., Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. In: CVPR

  • Yan, W., Zhang, Y., Abbeel, P., Srinivas, A. (2021). Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157

  • Yang, R., Srivastava, P., Mandt, S. (2022). Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481

  • Yang, S., Zhou, Y., Liu, Z., Loy, C.C. (2023). Rerender a video: Zero-shot text-guided video-to-video translation. In: ACM SIGGRAPH Asia Conference Proceedings

  • Ye, H., Zhang, J., Liu, S., Han, X., Yang, W. (2023). Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721

  • Yu, S., Tack, J., Mo, S., Kim, H., Kim, J., Ha, J.W., & Shin, J. (2021). Generating videos with dynamics-aware implicit generative adversarial networks. In: ICLR

  • Zhang, D.J., Wu, J.Z., Liu, J.W., Zhao, R., Ran, L., Gu, Y., Gao, D., & Shou, M.Z. (2023a). Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818

  • Zhang, L., Rao, A., & Agrawala, M. (2023b). Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision

  • Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qing, Z., Wang, X., Zhao, D., & Zhou, J. (2023c). I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145

  • Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q. (2023d). Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077

  • Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., & You, Y. (2024). Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora

  • Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018

Download references

Author information

Corresponding author

Correspondence to Mike Zheng Shou.

Additional information

Communicated by Shengfeng He.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, D.J., Li, D., Le, H. et al. MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions. Int J Comput Vis 133, 3629–3644 (2025). https://doi.org/10.1007/s11263-025-02346-1
