Abstract
Current video diffusion models (VDMs) mostly rely on text conditions, limiting control over video appearance and geometry. This study introduces MoonShot, a new model that conditions on both image and text for enhanced control. It features the Multimodal Video Block (MVB), which integrates a motion-aware dual cross-attention layer for precise alignment of appearance and motion with the provided prompts, and a spatiotemporal attention layer for large motion dynamics. It can also incorporate pre-trained Image ControlNet modules for geometry conditioning without extra video training. Experiments show that our model significantly improves visual quality and motion fidelity, and its versatility allows for applications in personalized video generation, animation, and editing, making it a foundational tool for controllable video creation. More video results can be found here.
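To make the multimodal conditioning idea concrete, the sketch below shows one way a decoupled (dual) cross-attention layer over text and image embeddings could be wired up in PyTorch. It is a minimal illustration under assumed shapes and names (DualCrossAttention, cond_dim, and image_scale are hypothetical), not the authors' implementation of the Multimodal Video Block.

```python
# Minimal sketch of a dual cross-attention layer conditioned on both text and
# image embeddings. Names, shapes, and the additive fusion are assumptions for
# illustration, not the paper's exact architecture.
import torch
import torch.nn as nn


class DualCrossAttention(nn.Module):
    """Attends video tokens to text tokens and image tokens separately,
    then fuses the two results additively."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, kdim=cond_dim,
                                               vdim=cond_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, num_heads, kdim=cond_dim,
                                                vdim=cond_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable scale controlling how strongly the image condition mixes in.
        self.image_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x, text_emb, image_emb):
        # x:         (batch*frames, tokens, dim)       spatial video tokens
        # text_emb:  (batch*frames, n_text, cond_dim)  text condition tokens
        # image_emb: (batch*frames, n_img,  cond_dim)  image condition tokens
        h = self.norm(x)
        out_text, _ = self.attn_text(h, text_emb, text_emb)
        out_image, _ = self.attn_image(h, image_emb, image_emb)
        return x + out_text + self.image_scale * out_image


if __name__ == "__main__":
    block = DualCrossAttention(dim=320, cond_dim=768)
    x = torch.randn(2 * 16, 64, 320)      # 2 videos x 16 frames, 64 tokens each
    text = torch.randn(2 * 16, 77, 768)   # CLIP-style text embeddings
    image = torch.randn(2 * 16, 4, 768)   # a few image condition tokens
    print(block(x, text, image).shape)    # torch.Size([32, 64, 320])
```

In a full video diffusion model, a block like this would sit alongside spatial and spatiotemporal self-attention within each UNet stage; the sketch only illustrates the text/image conditioning path described in the abstract.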
Data Availability
The datasets analyzed during the current study are available as follows: WebVid-10M (Bain et al., 2021) is available at: https://github.com/m-bain/webvid. InternVideo (Wang et al., 2022) is available at: https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid.
References
An, J., Zhang, S., Yang, H., Gupta, S., Huang, J.B., Luo, J., Yin, X. (2023). Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477
Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S. (2017). Stochastic variational video prediction. arXiv preprint arXiv:1710.11252
Bain, M., Nagrani, A., Varol, G., Zisserman, A. (2021). Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1728–1738
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K. (2023a). Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22563–22575
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K. (2023b). Align your latents: High-resolution video synthesis with latent diffusion models. In: CVPR
Ceylan, D., Huang, C.H., Mitra, N.J. (2023). Pix2video: Video editing using image diffusion. arXiv preprint arXiv:2303.12688
Chai, W., Guo, X., Wang, G., Lu, Y. (2023). Stablevideo: Text-driven consistency-aware diffusion video editing. arXiv preprint arXiv:2308.09592
Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y. (2024). Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047
Chen, L., Zhao, M., Liu, Y., Ding, M., Song, Y., Wang, S., Wang, X., Yang, H., Liu, J., Du, K., et al. (2023a). Photoverse: Tuning-free image customization with text-to-image diffusion models. arXiv preprint arXiv:2309.05793
Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., Lin, L. (2023b). Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840
Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H. (2023c). Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481
Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A. (2023). Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-a-scene: Scene-based text-to-image generation with human priors. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, Springer, pp 89–106
Ge, S., Hayes, T., Yang, H., Yin, X., Pang, G., Jacobs, D., Huang, J.B., Parikh, D. (2022). Long video generation with time-agnostic vqgan and time-sensitive transformer. arXiv preprint arXiv:2204.03638
Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.B., Liu, M.Y., Balaji, Y. (2023). Preserve your own correlation: A noise prior for video diffusion models. arXiv preprint arXiv:2305.10474
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T. (2023). Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373
Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S.S., Shah, A., Yin, X., Parikh D., Misra I. (2023). Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709
Gu, Y., Wang, X., Wu, J.Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W., et al. (2023). Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292
Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., Dai B. (2023). Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725
Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F. (2022). Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495
He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q. (2022). Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems 30
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840–6851.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al. (2022). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303
Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J. (2022). Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868
Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D., Dittadi, A. (2022). Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
Hu, J., Shen, L., Sun, G. (2018). Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Jiang, Y., Wu, T., Yang, S., Si, C., Lin, D., Qiao, Y., Loy, C.C., Liu, Z. (2024). Videobooth: Diffusion-based video generation with image prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6689–6700
Karras, T., Aittala, M., Aila, T., Laine, S. (2022). Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364
Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H. (2023a). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439
Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H. (2023b). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439
Kim, Y., Nam, S., Cho, I., Kim, S.J. (2019). Unsupervised keypoint learning for guiding class-conditional video prediction. Advances in neural information processing systems 32
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2023a). Multi-concept customization of text-to-image diffusion. In: CVPR
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2023b). Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1931–1941
Le Moing, G., Ponce, J., Schmid, C. (2021). Ccvs: Context-aware controllable video synthesis. NeurIPS
Li, D., Li, J., Hoi, S.C. (2023a). Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. arXiv preprint arXiv:2305.14720
Li, X., Chu, W., Wu, Y., Yuan, W., Liu, F., Zhang, Q., Li, F., Feng, H., Ding, E., Wang, J. (2023b). Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H. (2018). Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 600–615
Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J. (2023). Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., Zhao, D., Zhou, J., Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In: CVPR
Ma, Z., Zhou, D., Yeh, C.H., Wang, X.S., Li, X., Yang, H., Dong, Z., Keutzer, K., Feng, J. (2024). Magic-me: Identity-specific video customized diffusion. arXiv preprint arXiv:2402.09368
Molad, E., Horwitz, E., Valevski, D., Acha, A.R., Matias, Y., Pritch, Y., Leviathan, Y., Hoshen, Y. (2023). Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329
Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X. (2023). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453
Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R. (2023). Conditional image-to-video generation with latent flow diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18444–18455
Nikankin, Y., Haim, N., Irani, M. (2022). Sinfusion: Training diffusion models on a single image or video. arXiv preprint arXiv:2211.11743
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193
Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., Wang, X. (2019). Video generation from single semantic label map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3733–3742
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L. (2017). The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675
Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., Chen, Q. (2023). Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535
Ren, W., Yang, H., Zhang, G., Wei, C., Du, X., Huang, S., Chen, W. (2024). Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: CVPR, pp 10684–10695
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K. (2022). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242
Saito, M., Matsumoto, E., Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In: ICCV
Schuhmann, C., Vencu, R., Beaumont, R., Coombes, T., Gordon, C., Katta, A., Kaczmarczyk, R., Jitsev, J. (2022). LAION-5B: A new era of open large-scale multi-modal datasets. https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
Shen, X., Li, X., Elhoseiny, M. (2023). Mostgan-v: Video generation with temporal motion styles. In: CVPR
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. (2022). Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov, I., Tulyakov, S., Elhoseiny, M. (2021). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. arXiv preprint arXiv:2112.14683
Smith, J.S., Hsu, Y.C., Zhang, L., Hua, T., Kira, Z., Shen, Y., Jin, H. (2023). Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027
Srivastava, N., Mansimov, E., Salakhudinov, R. (2015). Unsupervised learning of video representations using lstms. In: ICML
Tian, Y., Ren, J., Chai, M., Olszewski, K., Peng, X., Metaxas, D.N., Tulyakov, S. (2021). A good image generator is what you need for high-resolution video synthesis. In: ICLR
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J. (2018). Mocogan: Decomposing motion and content for video generation. In: CVPR
Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Voleti, V., Jolicoeur-Martineau, A., Pal, C. (2022). Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853
Vondrick, C., Pirsiavash, H., Torralba, A. (2016). Generating videos with scene dynamics. NIPS
Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K. (2023). P+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522
Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S. (2023a). Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571
Wang, W., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., Shen, C. (2023b). Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599
Wang, W., Yang, H., Tuo, Z., He, H., Zhu, J., Fu, J., Liu, J. (2023c). Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874
Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J. (2023d). Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018
Wang, X., Zhang, S., Yuan, H., Qing, Z., Gong, B., Zhang, Y., Shen, Y., Gao, C., Sang, N. (2024). A recipe for scaling up text-to-video generation with text-free videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6572–6582
Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., Xing, S., Chen, G., Pan, J., Yu, J., Wang, Y., Wang, L., Qiao, Y. (2022). Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191
Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al. (2023e). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W. (2023). Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848
Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., Sapiro, G., Duan, N. (2021). Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806
Wu, C., Liang, J., Ji, L., Yang, F., Fang, Y., Jiang, D., Duan, N. (2022a). Nüwa: Visual synthesis pre-training for neural visual world creation. In: ECCV, Springer, pp 720–736
Wu, J.Z., Ge, Y., Wang, X., Lei, W., Gu, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z. (2022b). Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565
Xing, J., Xia, M., Zhang, Y., Chen, H., Wang, X., Wong, T.T., Shan, Y. (2023). Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190
Xiong, W., Luo, W., Ma, L., Liu, W., & Luo, J. (2018). Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2364–2373
Xu, J., Mei, T., Yao, T., Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. In: CVPR
Yan, W., Zhang, Y., Abbeel, P., Srinivas, A. (2021). Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yang, R., Srivastava, P., Mandt, S. (2022). Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481
Yang, S., Zhou, Y., Liu, Z., Loy, C.C. (2023). Rerender a video: Zero-shot text-guided video-to-video translation. In: ACM SIGGRAPH Asia Conference Proceedings
Ye, H., Zhang, J., Liu, S., Han, X., Yang, W. (2023). Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721
Yu, S., Tack, J., Mo, S., Kim, H., Kim, J., Ha, J.W., & Shin, J. (2021). Generating videos with dynamics-aware implicit generative adversarial networks. In: ICLR
Zhang, D.J., Wu, J.Z., Liu, J.W., Zhao, R., Ran, L., Gu, Y., Gao, D., & Shou, M.Z. (2023a). Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818
Zhang, L., Rao, A., & Agrawala, M. (2023b). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543
Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qing, Z., Wang, X., Zhao, D., & Zhou, J. (2023c). I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145
Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q. (2023d). Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077
Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., & You, Y. (2024). Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora
Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018
Additional information
Communicated by Shengfeng He.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, D.J., Li, D., Le, H. et al. MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions. Int J Comput Vis 133, 3629–3644 (2025). https://doi.org/10.1007/s11263-025-02346-1