1 Introduction
The field of computer vision has witnessed remarkable advancements in recent years, largely driven by the rapid progress in generative models. The advent of large-scale generative models has significantly transformed content creation and manipulation, enabling the synthesis of high-quality images, videos, and 3D content. These models provide unprecedented control over visual elements, facilitating applications in digital art, virtual reality, and interactive media.
Early generative models, such as Gaussian Mixture Models and Variational Autoencoders (VAEs), laid the groundwork for probabilistic modeling but were limited in scalability and realism. The introduction of Generative Adversarial Networks (GANs) and autoregressive models brought substantial improvements in image quality and diversity, while diffusion models further enhanced visual coherence, particularly in high-resolution synthesis. At the same time, transformer-based architectures have enabled more structured and controllable content generation. These advancements have made large-scale generative models the backbone of modern AI-driven content creation, with applications ranging from style transfer, image inpainting, and object manipulation to high-fidelity video synthesis and 3D scene generation.
Despite their success, several critical challenges remain. Large-scale models require massive computational resources, making efficient training and deployment a pressing issue. In video generation, maintaining temporal consistency remains an ongoing challenge, especially when synthesizing long sequences. In 3D content creation, ensuring multi-view consistency and structural fidelity is crucial for high-quality object generation. Additionally, domain adaptation, robustness, and fairness are major concerns as generative models become widely adopted in real-world applications. Finally, the trustworthiness of generative models, including bias mitigation, ethical considerations, and security, has become an increasingly important topic in AI research.
To address these challenges, this special issue aims to highlight state-of-the-art research in generative modeling for content creation and manipulation. We received 53 submissions, of which 21 high-quality papers were accepted after rigorous peer review. These works cover a wide range of topics, including image and video synthesis, controllable generative models, 3D content generation, interactive editing, domain adaptation, and trustworthiness.
2 Text-to-Image and Text-to-Video Generation
Recent advancements in text-conditioned generative models have significantly improved the quality and realism of synthesized content, enabling controllable image and video generation.
- Ge et al. introduce an approach that enhances text-to-image synthesis by incorporating enriched textual embeddings, capturing fine-grained semantics and expressive details. Their method improves alignment between text prompts and generated images, leading to more diverse and visually appealing results.
- Wang et al. improve video consistency by introducing a swap attention mechanism that balances spatial and temporal coherence in diffusion-based text-to-video generation. This technique enhances the smoothness of generated motion, reducing flickering and abrupt scene changes; an illustrative sketch of interleaved spatial-temporal attention follows this list.
- LaVie presents a cascaded latent diffusion framework for high-resolution and temporally stable video generation. By progressively refining video frames, the method achieves both sharp details and long-range temporal consistency.
- Show-1 explores a hybrid approach that integrates pixel-based and latent-space diffusion models, improving motion stability in video synthesis. This fusion enables higher fidelity in dynamic scenes while maintaining efficient computation.
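For readers who want a concrete picture of interleaved spatial and temporal attention in video diffusion models, the minimal PyTorch sketch below alternates the two attention types over a latent of shape (batch, frames, tokens, channels). It is an illustrative reading of the general design pattern, not the implementation of Wang et al.; all module names, shapes, and hyperparameters are placeholders of our own.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative block that alternates spatial and temporal self-attention
    over a video latent of shape (B, F, N, C):
    B = batch, F = frames, N = spatial tokens, C = channels."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, F, N, C = x.shape

        # Spatial attention: tokens within each frame attend to each other.
        s = x.reshape(B * F, N, C)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]
        x = s.reshape(B, F, N, C)

        # Temporal attention: the same spatial location attends across frames,
        # which is what encourages frame-to-frame coherence.
        t = x.permute(0, 2, 1, 3).reshape(B * N, F, C)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm)[0]
        return t.reshape(B, N, F, C).permute(0, 2, 1, 3)

# Toy usage: 2 videos, 8 frames, a 16x16 latent grid (256 tokens), 64 channels.
block = SpatialTemporalBlock(dim=64)
video = torch.randn(2, 8, 256, 64)
print(block(video).shape)  # torch.Size([2, 8, 256, 64])
```

In practice such blocks sit inside a denoising U-Net or transformer and are conditioned on text embeddings; the sketch only isolates the spatial-temporal factorization that underlies this family of methods.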
While text-to-image and video synthesis have seen substantial improvements, achieving fine-grained control over generated content remains a crucial research direction.
3 Controllable and Interactive Generative Models
Controllability is a key requirement for practical deployment of generative models, enabling users to guide synthesis processes according to specific constraints.
- Moonshot introduces a motion-aware conditioning framework, enabling precise user control over video generation and editing. By incorporating multimodal constraints, the method allows users to manipulate motion trajectories, pacing, and scene composition.
- Zhang et al. propose a compact yet effective layout generation model that balances computational efficiency with strong performance. Their approach optimizes text-conditioned layout generation while significantly reducing model size, making it suitable for resource-constrained settings.
- FastComposer presents a novel localized attention mechanism, allowing different subjects to be independently controlled within a single generated image. This method ensures seamless multi-subject composition while avoiding blending artifacts; a toy sketch of region-restricted cross-attention follows this list.
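To illustrate how localized attention can keep multiple subjects from blending, the sketch below masks a cross-attention map so that each subject's conditioning token can influence only its designated image region. The masks, token indices, and tensor shapes are hypothetical; this is a generic sketch of region-restricted cross-attention, not FastComposer's actual mechanism or code.

```python
import torch

def localized_cross_attention(q, k, v, subject_masks, subject_token_idx):
    """Cross-attention in which each subject token is restricted to its own
    image region, so two subjects do not blend into one another.

    q: (N, d)          image-patch queries (N patches)
    k, v: (T, d)       text-token keys/values (T tokens)
    subject_masks: (S, N)      binary masks, one spatial region per subject
    subject_token_idx: list of S token indices, one per subject
    """
    d = q.shape[-1]
    scores = q @ k.t() / d ** 0.5            # (N, T) raw attention logits

    # For every subject, forbid its token from being attended to outside
    # the region assigned to that subject.
    for s, tok in enumerate(subject_token_idx):
        outside_region = subject_masks[s] == 0    # (N,) bool
        scores[outside_region, tok] = float("-inf")

    attn = scores.softmax(dim=-1)             # (N, T) attention weights
    return attn @ v                            # (N, d) attended features

# Toy usage: 64 patches, 8 text tokens, two subjects owning left/right halves.
q = torch.randn(64, 32); k = torch.randn(8, 32); v = torch.randn(8, 32)
masks = torch.zeros(2, 64)
masks[0, :32] = 1   # subject 1 owns the first half of the patches
masks[1, 32:] = 1   # subject 2 owns the second half
out = localized_cross_attention(q, k, v, masks, subject_token_idx=[2, 5])
print(out.shape)  # torch.Size([64, 32])
```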
With these improvements in controllability, generative models are also driving innovations in 3D content creation, as explored in the next section.
4 3D Generative Modeling and Neural Rendering
Generative models are rapidly expanding into 3D content creation, enabling applications in virtual environments, game development, and digital design.
- SLIDE proposes a multi-view consistent 3D synthesis pipeline that improves texture generation and geometric fidelity. The framework integrates mesh and texture generation to produce structurally coherent and visually appealing 3D assets.
- Hyper-3DG introduces a hypergraph-based text-to-3D framework, enhancing structural accuracy and fine-grained detail. By leveraging hypergraph representations, the model captures spatial dependencies more effectively than traditional architectures.
- Sun et al. present a motion synthesis model that generates realistic dyadic human interactions with natural communicative gestures. The model learns from real-world conversational motion data, enabling lifelike nonverbal expressions and gestures.
- Instant3D proposes an efficient text-to-3D pipeline, significantly reducing computational overhead while maintaining high-quality 3D synthesis. The method accelerates 3D generation without sacrificing geometric precision, making it suitable for real-time applications.
- Nath et al. explore an implicit neural representation approach that enhances shape awareness in generative 3D models. Their method improves shape consistency by incorporating polynomial implicit functions, allowing for smoother and more structured 3D surfaces; a toy polynomial implicit representation is sketched after this list.
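For context on what a polynomial implicit representation can look like, the toy model below multiplies its running features element-wise with an affine projection of the raw coordinates at every layer, so the polynomial degree of the represented function grows with depth rather than relying on periodic positional encodings. The layer sizes and scalar output head are illustrative assumptions, not the configuration used by Nath et al.

```python
import torch
import torch.nn as nn

class PolynomialINR(nn.Module):
    """Toy polynomial implicit representation: maps 3D coordinates to an
    occupancy/SDF-like scalar. Each layer multiplies the features by an
    affine map of the coordinates, so depth L yields up to a degree-L
    polynomial in (x, y, z)."""

    def __init__(self, coord_dim: int = 3, width: int = 128, depth: int = 4):
        super().__init__()
        self.coord_proj = nn.ModuleList(
            nn.Linear(coord_dim, width) for _ in range(depth))
        self.feat_proj = nn.ModuleList(
            nn.Linear(width, width) for _ in range(depth))
        self.head = nn.Linear(width, 1)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (num_points, 3) in normalized object space.
        feats = torch.ones(coords.shape[0], self.head.in_features,
                           device=coords.device)
        for cp, fp in zip(self.coord_proj, self.feat_proj):
            # Element-wise product with an affine map of the coordinates
            # raises the polynomial degree of the representation by one.
            feats = fp(feats * cp(coords))
        return self.head(feats)  # (num_points, 1) implicit value

# Toy usage: query the implicit field at 1024 random 3D points.
model = PolynomialINR()
points = torch.rand(1024, 3) * 2 - 1
print(model(points).shape)  # torch.Size([1024, 1])
```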
As generative models advance 3D content creation, they are also reshaping interactive content editing and animation, making generative techniques more accessible to creative professionals.
5 Generative Models for Content Editing and Animation
Generative models have become essential tools for image and video editing and animation, enabling seamless modification and automation.
- Yildirim et al. improve image editing precision using a residual-based warping technique, enhancing flexibility and realism. The approach enables high-fidelity modifications while preserving structural details, making it particularly effective for face and portrait editing.
- AniClipart applies text-to-video priors to animate clipart-style images, extending generative capabilities to 2D animation. By leveraging text-conditioned motion priors, the method generates smooth, expressive animations for static illustrations.
- Xing et al. develop a framework for fine-grained garment transfer, improving the realism of virtual try-on applications. Their model adapts clothing identity while preserving fabric textures and structural consistency, enhancing online shopping experiences.
- CogCartoon explores story visualization techniques, bridging text-based narratives with animated sequences. By synthesizing sequential frames with temporal consistency, the method enables automatic storytelling from written descriptions.
Beyond content creation, ensuring robustness and adaptability in generative models is a major concern, particularly in real-world deployments.
6 Generative Domain Adaptation and Robustness
Robustness and adaptability are essential for deploying generative models in dynamic environments and cross-domain applications.
- Li et al. introduce a one-shot domain adaptation framework, improving 3D GAN generalization to unseen data. Their method leverages limited target samples to adjust model parameters without requiring extensive retraining.
- Oh et al. tackle the stability-plasticity tradeoff, proposing a geodesic distillation loss for CLIP-based models. This loss function helps maintain generative model adaptability while preventing catastrophic forgetting; a generic sketch of a geodesic feature-space loss follows this list.
- Zhou et al. enhance text-to-image synthesis with conditional masking, ensuring stronger semantic alignment. By using masked conditioning, the model better understands context and produces more coherent outputs.
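As a point of reference for geodesic distillation on CLIP-style features, the sketch below penalizes the squared arc length between L2-normalized student and teacher embeddings on the unit hypersphere rather than their Euclidean gap. It is a generic formulation written for illustration; the precise loss and regularization used by Oh et al. should be taken from their paper.

```python
import torch
import torch.nn.functional as F

def geodesic_distillation_loss(student_feats: torch.Tensor,
                               teacher_feats: torch.Tensor) -> torch.Tensor:
    """Distillation loss measured along the unit hypersphere.

    Both feature sets are L2-normalized, so they lie on the sphere where
    CLIP-style embeddings are compared; the loss is the squared arc length
    (angle) between corresponding student and teacher embeddings.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    # Clamp to keep arccos numerically stable at the boundary.
    cos = (s * t).sum(dim=-1).clamp(-1.0 + 1e-6, 1.0 - 1e-6)
    angle = torch.arccos(cos)          # geodesic distance on the sphere
    return (angle ** 2).mean()

# Toy usage: pull 16 student embeddings toward frozen teacher embeddings
# (e.g. from a pretrained CLIP encoder) of dimension 512.
student = torch.randn(16, 512, requires_grad=True)
teacher = torch.randn(16, 512)
loss = geodesic_distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```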
As these models become more widely used, ethical concerns and trustworthiness are becoming increasingly important.
7 Surveys and Theoretical Insights
Two survey papers provide comprehensive insights into generative model trustworthiness and 3D neural stylization.
- Fan et al. examine bias, fairness, and security risks in generative models, outlining challenges in making these models more reliable. They discuss evaluation frameworks and mitigation strategies for ethical generative AI.
- Chen et al. present an in-depth review of 3D stylization techniques, covering applications in art, design, and entertainment. Their survey categorizes recent works, highlighting advancements in geometric and texture-based stylization.
8 Conclusion
The works presented in this special issue illustrate the rapid advancements in large-scale generative models, spanning high-resolution content synthesis, controllability, 3D generation, domain adaptation, and trustworthiness. The selected papers not only push the boundaries of generative AI but also address key challenges in efficiency, consistency, and ethical considerations.
The progress in text-to-image and text-to-video models demonstrates increasing fidelity and control, yet challenges remain in enhancing interpretability and computational efficiency. Controllable generation has made strides in structured content creation, but further work is needed to develop more intuitive interaction mechanisms. In 3D content creation, generative models now achieve higher geometric accuracy and multi-view consistency, though optimizing computational demands remains crucial for practical deployment.
Applications in content editing and animation showcase how generative models can enhance creative workflows, yet ensuring robustness and seamless integration into production pipelines remains a priority. Meanwhile, domain adaptation and robustness are vital for generative models to generalize across datasets and real-world scenarios, requiring continued research in adaptation techniques and stability.
As generative models become increasingly embedded in digital media, their trustworthiness and ethical implications must be carefully addressed. The survey papers in this issue provide a comprehensive outlook on the challenges of reliability, bias, and security, underscoring the need for transparent evaluation metrics and safeguards against misuse.
We hope this collection of works serves as a valuable reference for researchers and practitioners in generative AI. We extend our sincere gratitude to all authors, reviewers, and the editorial team for their efforts in making this special issue possible. We look forward to continued progress in this evolving and impactful field.