Abstract
Significant advancements have been achieved in large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, Show-1 can be readily adapted for motion customization and video stylization through simple finetuning of the temporal attention layers. Our model achieves state-of-the-art performance on standard video generation benchmarks. Code for Show-1 is publicly available, and additional video results are provided on the project page.
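As a rough illustration of the pipeline described above, the following minimal PyTorch sketch shows the two-stage structure: a pixel-space denoiser produces a low-resolution video conditioned on the text, and a latent-space denoiser then refines an encoded, partially re-noised upsampling of it. All names here (ToyDenoiser, denoise_loop, generate) and the toy encoder/decoder are illustrative assumptions, not the official Show-1 implementation or its actual sampler.

```python
# Minimal sketch of the hybrid pixel/latent cascade, using toy denoisers and a
# toy encoder/decoder in place of the real Show-1 checkpoints. All names are
# illustrative assumptions, not the official Show-1 API.
import torch
import torch.nn.functional as F

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a text-conditioned video U-Net (pixel- or latent-space)."""
    def __init__(self, channels):
        super().__init__()
        self.net = torch.nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x, t, text_emb):
        # A real VDM would inject t and text_emb via cross-attention; here we
        # only keep the tensor shapes consistent.
        return self.net(x)

def denoise_loop(model, x, text_emb, steps):
    """Crude denoising loop: repeatedly predict and subtract noise."""
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), i, device=x.device)
        eps = model(x, t, text_emb)
        x = x - eps / steps  # placeholder update rule, not a real DDIM/DDPM step
    return x

@torch.no_grad()
def generate(text_emb, frames=8, low=64, high=256, latent_down=8):
    # Stage 1: a pixel-based VDM produces a low-resolution video (B, C, T, H, W)
    # whose content follows the text prompt closely.
    pixel_vdm = ToyDenoiser(3)
    lowres = denoise_loop(pixel_vdm, torch.randn(1, 3, frames, low, low),
                          text_emb, steps=50)

    # Stage 2: "expert translation" with a latent-based VDM. Upsample, encode to
    # a latent space, partially re-noise, and denoise again so the latent model
    # refines details and removes low-resolution artifacts.
    upsampled = F.interpolate(lowres, size=(frames, high, high), mode="trilinear")
    latents = F.avg_pool3d(upsampled, (1, latent_down, latent_down))  # toy "encoder"
    latents = latents + 0.3 * torch.randn_like(latents)               # partial noise
    latent_vdm = ToyDenoiser(3)
    refined = denoise_loop(latent_vdm, latents, text_emb, steps=20)
    highres = F.interpolate(refined, size=(frames, high, high), mode="trilinear")  # toy "decoder"
    return highres

video = generate(text_emb=None)
print(video.shape)  # torch.Size([1, 3, 8, 256, 256])
```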
About this article
Cite this article
Zhang, D. J., Wu, J. Z., Liu, J. W., et al. Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation. Int J Comput Vis 133, 1879–1893 (2025). https://doi.org/10.1007/s11263-024-02271-9