1 Introduction

The creation of high-fidelity and animatable 3D human avatars is essential in various fields, including the media industry, VR/AR, and game design. However, it is a labor-intensive task that typically requires pre-captured templates and extensive work from experienced artists. Therefore, a user-friendly system that can generate and animate 3D avatars is of great value, and this objective is the primary goal of this work. Existing 3D avatar creation methods can be classified into three categories: (1) template-based generation pipelines (Met, 2023), (2) 3D generative models (Hong et al., 2022a; Zhang et al., 2022) and (3) 2D-lifting methods (Poole et al., 2022; Lin et al., 2023). Avatars generated using template-based methods typically exhibit relatively simple topology and texture. On the other hand, 3D generative models often struggle to generalize to arbitrary avatars with diverse appearances due to the scarcity and limited diversity of accessible 3D models. Yet, in real-world applications, users often desire high-quality 3D avatars with intricate structures and artistic styles.

Recently, 2D-lifting methods have shown that 2D generation models trained on large-scale image datasets possess strong generalizability, making them suitable for 3D content creation. Representative works such as DreamFusion (Poole et al., 2022) and Magic3D (Lin et al., 2023) employ 2D diffusion models as supervision to optimize 3D representations using Score Distillation Sampling. More recent studies (Cao et al., 2023; Huang et al., 2023; Kolotouros et al., 2023) incorporate parametric human priors (Loper et al., 2015; Alldieck et al., 2021) into the 2D-lifting optimization process to facilitate 3D human avatar creation. However, these methods often focus on generating static avatars, which are challenging to animate, or they produce low-quality animatable 3D avatars that suffer from blurriness, lack of detail, and poor pose controllability (Cao et al., 2023; Jiang et al., 2023; Hong et al., 2022b; Kolotouros et al., 2023; Huang et al., 2023), thereby failing to meet the requirements of practical applications. Consequently, there is a growing need for more advanced solutions capable of generating high-fidelity, animatable 3D avatars.

In this work, we propose AvatarStudio, a novel framework designed for creating high-quality 3D avatars from textual descriptions while offering flexible animation ability. Our method introduces a new 3D human representation that incorporates articulated human modeling into an explicit mesh representation. The former enables animating the generated avatars to desired poses, while the latter allows us to fully harness the power of 2D diffusion priors at high resolution. Though conceptually straightforward, effectively training an articulated mesh representation to generate high-quality and animatable avatars presents a significant challenge, stemming from the absence of proper mesh initialization and of an effective pose-controllable 2D guidance mechanism.

Fig. 1

AvatarStudio generates high-fidelity, animatable 3D avatars featuring realistic textures and detailed geometry given text inputs. A unique feature of AvatarStudio is its easy-to-use animation ability, allowing users to animate the generated avatars via multimodal signals, such as a dancing video or a motion described by text (e.g., “A person is doing boxing”). Moreover, it supports the creation of avatars with distinct artistic styles (e.g., sketch style) given an additional reference style image

To address these challenges, we employ a two-pronged approach. Specifically, we first train a human NeRF from scratch with a fixed pre-defined canonical pose, which largely eases the training difficulty. Initialized from this NeRF, we optimize a SMPL-guided (Loper et al., 2015) articulated mesh represented by DMTet (Shen et al., 2021). This mesh representation, rendered through an efficient rasterizer (Laine et al., 2020), enables the production of high-resolution images up to \(512^2\), thereby facilitating the creation of high-fidelity avatars. To effectively optimize the proposed representation from text, we utilize pre-trained 2D diffusion models as priors. Unlike previous methods that use Stable Diffusion (Rombach et al., 2021) or a skeleton-conditioned ControlNet (Zhang et al., 2023a) for SDS supervision, which are prone to inaccurate pose control and the Janus problem (Cao et al., 2023; Huang et al., 2023), we propose a novel ControlNet conditioned on DensePose (Güler et al., 2018) as guidance, which offers dual benefits: (1) the 3D-aware DensePose ensures a more stable and view-consistent avatar creation process; (2) it provides more precise pose control over the generated avatars.

As shown in Fig. 1, AvatarStudio can create high-fidelity, animatable avatars from text. We evaluate it quantitatively and qualitatively, verifying its superiority over previous state-of-the-art methods. Thanks to its easy-to-use animation capability, it allows users to animate the generated avatars using multimodal signals (e.g., video and text). Moreover, by simply plugging in an additional adapter (Ye et al., 2023), AvatarStudio can create avatars with unique artistic styles given a reference style image, further expanding the range of applications and customization options for 3D avatar creation.

2 Related Works

Text-Guided 3D Content Generation. The successful advancement of text-guided 2D image generation has paved the way for text-guided 3D content creation. Notable examples include CLIP-forge (Sanghi et al., 2021), DreamFields (Jain et al., 2021), and CLIP-Mesh (Khalid et al., 2022), which utilize the widely acclaimed CLIP (Radford et al., 2021) to optimize underlying 3D representations, such as NeRFs and textured meshes. DreamFusion (Poole et al., 2022) proposes to use the Score Distillation Sampling (SDS) loss, derived from a pre-trained diffusion model (Saharia et al., 2022), as supervision during optimization. Subsequent improvements over DreamFusion include optimizing 3D representations in a latent space (Metzer et al., 2022) and in a coarse-to-fine manner (Lin et al., 2023). Following this line of research, TEXTure (Richardson et al., 2023) opts to generate texture maps for a given 3D mesh, while ProlificDreamer (Wang et al., 2023) introduces variational score distillation to produce promising results. Fantasia3D (Chen et al., 2023) disentangles the modeling and learning of geometry and appearance for generating 3D assets. However, despite these advancements, when it comes to avatar creation, these techniques often exhibit limitations including low-quality generation, the presence of the Janus problem, and incorrect rendering of body parts. In contrast, AvatarStudio enables high-fidelity and animatable generation of 3D avatars from text prompts.

Text-Guided 3D Avatar Generation. To enable 3D avatar generation from text, several approaches have been proposed. Avatar-CLIP (Hong et al., 2022b) sets the foundation by initializing human geometry with a shape VAE and utilizing CLIP (Radford et al., 2021) to assist in geometry and texture generation. DreamAvatar (Cao et al., 2023) and AvatarCraft (Jiang et al., 2023) integrate the human parametric model with pre-trained 2D diffusion models for 3D avatar creation. DreamHuman (Kolotouros et al., 2023) further introduces a camera zoom-in technique to refine the local details of the resulting avatars; TADA (Liao et al., 2024) proposes a hybrid representation for avatar animation; while DreamWaltz (Huang et al., 2023) incorporates a skeleton-conditioned ControlNet and develops an occlusion-aware SDS guidance for pose-aligned supervision. Although these methods achieve animatable results, they suffer from low-quality generation issues, such as blurriness, coarseness, and insufficient detail. Additionally, the weak SDS guidance and the inherent sparsity of skeleton conditioning make it difficult to generate multi-view consistent avatars with accurate pose controllability. HumanNorm (Huang et al., 2024) and SEEAvatar (Xu et al., 2023a) introduce a paradigm that first produces the geometry of the human body and then performs texture generation. In contrast, we propose an articulated textured mesh representation for 3D human modeling, enabling effective avatar animation and high-resolution rendering. It allows the model to fully utilize 2D diffusion priors at high resolution, leading to higher-quality generation. Moreover, we use a DensePose-conditioned ControlNet for SDS guidance to ensure more stable, view-consistent avatar creation and improved pose control. A concurrent work, AvatarVerse (Zhang et al., 2024), also employs DensePose as conditioning for SDS guidance. However, AvatarVerse is limited to static avatar generation, making it hard to animate the resulting avatars in a user-friendly way.

Fig. 2

The overview of AvatarStudio. It takes a text prompt as input to optimize an articulated textured mesh representation via a DensePose-conditioned ControlNet for high-quality and animatable 3D avatar creation. To facilitate the optimization process, it leverages several simple yet effective strategies, like part-aware super-resolution and dual-space training. See the main text for more details

3 Preliminaries

Score Distillation Sampling. The key technique for lifting a pre-trained 2D diffusion model \(\varvec{\epsilon }_\phi \) into a 3D representation \(\theta \) is Score Distillation Sampling (SDS), which can be used to guide the generation of 3D content given an input text prompt y. Specifically, given an image \(\mathcal {I} = g(\theta )\) rendered from a differentiable 3D model g, we add random noise \(\varvec{\epsilon }\) to obtain a noisy image \(\mathcal {I}_t\). The SDS loss then computes the gradient with respect to \(\theta \) by minimizing the difference between the predicted noise \(\varvec{\epsilon }_\phi \left( \mathcal {I}_t; y, t\right) \) and the added noise \(\varvec{\epsilon }\), which can be formulated as:

$$\begin{aligned} \nabla _\theta \mathcal {L}_{\text {SDS}}\left( \phi , \mathcal {I} = g(\theta )\right) =\mathbb {E}_{t, \varvec{\epsilon }}\left[ w(t)\left( \varvec{\epsilon }_\phi \left( \mathcal {I}_t ; y, t\right) -\varvec{\epsilon }\right) \frac{\partial \mathcal {I}}{\partial \theta }\right] , \end{aligned}$$
(1)

where \(\mathcal {I}_t\) is the noisy image at noise level t and w(t) denotes a weighting function that depends on t.
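To make this concrete, below is a minimal PyTorch-style sketch of how the SDS gradient in Eq. 1 is typically realized in practice, treating the diffusion model as a frozen noise predictor. The diffusion interface, the particular weighting \(w(t)=1-\bar{\alpha }_t\), and all variable names are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch

def sds_loss(diffusion, x, text_emb, t, alphas_cumprod):
    """Minimal SDS sketch: add noise at level t, query the frozen diffusion
    model for its noise estimate, and let (eps_hat - eps) act as a gradient
    on the rendered image x (the diffusion weights receive no gradient)."""
    eps = torch.randn_like(x)                            # added noise
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)            # \bar{alpha}_t
    x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * eps      # noisy image at level t
    with torch.no_grad():                                # frozen 2D prior
        eps_hat = diffusion(x_t, t, text_emb)            # predicted noise, text-conditioned
    w_t = 1.0 - a_t                                      # one common choice of w(t)
    grad = w_t * (eps_hat - eps)                         # Eq. 1 without the Jacobian term
    # Reparameterize so autograd injects `grad` into x and hence into theta.
    return (grad.detach() * x).sum()
```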

SMPL. Skinned Multi-Person Linear model (Loper et al., 2015) is a parametric human model that represents a wide range of human body poses and shapes. It defines a deformable mesh \(\mathcal {M}(\xi , \beta ) = (\mathcal {V}, \mathcal {S})\), where \(\xi \) and \(\beta \) denote the pose and shape parameters, \(\mathcal {V}\) is the set of \(N_{v} = 6890\) vertices, and \(\mathcal {S}\) is the set of linear blend skinning (LBS) weights assigned for each vertex. It provides an articulated geometric proxy to the underlying dynamic human body. In this paper, we develop an articulated 3D human representation for animatable avatar creation by generalizing the LBS of SMPL to clothed human modeling (Fig. 2).

4 Methodology

In this work, we aim to generate high-fidelity and animatable 3D human avatars from text inputs. In Sec. 4.1, we first present how to design an articulated explicit mesh representation for animatable avatar modeling and high-resolution rendering. In Sec. 4.2, we elaborate on how to optimize the proposed representation from text inputs via a DensePose-conditioned ControlNet and introduce several simple yet effective strategies to facilitate the generation process.

4.1 Articulated 3D Human Modeling

SMPL-Guided Avatar Articulation. To create animatable human avatars, we incorporate a simple yet effective SMPL-guided articulation into the 3D human modeling process, which drives the generated avatar to the desired poses. Given the SMPL parameters \(p = (\xi , \beta )\), our model first generates a template avatar with a pre-defined pose in the canonical space, and then deforms it to the target pose defined by p in the deformed space. We leverage the inverse transformation of SMPL LBS to guide the deformation of our human representation. Specifically, given a point \(\mathbf {x_d}\) in the deformed space, we first find its nearest vertex \(v^*\) on the corresponding SMPL mesh, and then use the skinning weights of \(v^*\) to deform \(\mathbf {x_d}\) to the corresponding point \(\mathbf {x_c}\) in the canonical space:

$$\begin{aligned} \mathbf {x_c} = \mathcal {G}^{-1} \cdot \mathbf {x_d}, \quad \mathcal {G} = \sum \limits _{i=1}^{N_j} s_i^* \cdot B_i (\xi , \beta ), \end{aligned}$$
(2)

where \(s_i^*\) is the skinning weight of vertex \(v^*\) w.r.t. the i-th joint, \(B_i (\xi , \beta )\) is the bone transformation matrix of joint i, and \(N_j = 24\) is the number of joints. In addition, we employ a non-rigid deformation scheme (Peng et al., 2021; Chen et al., 2021) to learn a residual deformation that compensates for areas where the SMPL deformation is inaccurate. Specifically, we add an MLP-based deformation network that models surface changes with the articulation of the avatar: \(\Delta n_i = \text {MLP}(\text {Concat}[\text {Embed}(\mathbf {x_c}), p])\), i.e., we feed the position-embedded \(\mathbf {x_c}\) and the SMPL parameter p into the MLP. We found that adding such a deformation makes the animation results more robust, especially for avatars with complex clothing, as shown on our project page (video grid with the caption “More animation results-AvatarStudio with non-rigid deformation”).
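As an illustration of the SMPL-guided deformation in Eq. 2, the sketch below maps deformed-space points back to the canonical space with a nearest-vertex lookup and inverse LBS. Tensor shapes and function names are hypothetical, and batching as well as the non-rigid residual are omitted for brevity.

```python
import torch

def deform_to_canonical(x_d, smpl_verts, skinning_weights, bone_transforms):
    """SMPL-guided inverse LBS (Eq. 2), simplified.
    x_d:              (N, 3) query points in the deformed space
    smpl_verts:       (V, 3) posed SMPL vertices
    skinning_weights: (V, J) per-vertex LBS weights S
    bone_transforms:  (J, 4, 4) bone transformation matrices B_i(xi, beta)
    """
    # Nearest SMPL vertex v* for every query point.
    nn_idx = torch.cdist(x_d, smpl_verts).argmin(dim=1)            # (N,)
    w = skinning_weights[nn_idx]                                   # (N, J) weights s_i^*
    # Blend bone transforms: G = sum_i s_i^* * B_i(xi, beta).
    G = torch.einsum('nj,jab->nab', w, bone_transforms)            # (N, 4, 4)
    # Apply the inverse transform to reach the canonical space: x_c = G^{-1} x_d.
    x_d_h = torch.cat([x_d, torch.ones_like(x_d[:, :1])], dim=-1)  # homogeneous coords
    x_c_h = torch.einsum('nab,nb->na', torch.inverse(G), x_d_h)
    return x_c_h[:, :3]
```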

Articulated Textured Mesh Representation. Most existing text-to-avatar methods (Jiang et al., 2023; Cao et al., 2023) represent avatars as NeRFs and render them volumetrically. However, volumetric rendering typically requires a considerable computation budget, limiting the resolution of generated images; as a result, these methods often produce low-quality avatars that are coarse and lack detail. In contrast, we opt for an explicit mesh representation for animatable human modeling, which supports high-resolution rendering via an efficient rasterizer (Laine et al., 2020). Nevertheless, we empirically observe that directly optimizing the mesh representation for avatar generation produces degenerated results (see Fig. 6 for more details), due to the high dimensionality of the mesh space and the complexity of human bodies.

To address this, we propose a coarse-to-fine pipeline to optimize the proposed representation. In the first stage, we adopt NeRF to learn a static human in a fixed pre-defined canonical space by leveraging the low-resolution diffusion prior as guidance, which largely alleviates the optimization difficulties. We use the hash grid encoding from InstantNGP (Müller et al., 2022) with a two-layer MLP to predict the density and color. To further accelerate the learning process, we adopt a residual prediction scheme on top of the SMPL-derived density field, which serves as a strong geometric prior. In the second stage, we use a differentiable surface representation, i.e., Deep Marching Tetrahedra (DMTet) (Shen et al., 2021), to model the avatar as a textured mesh, which is initialized from the coarse NeRF using the marching cubes algorithm (Lorensen & Cline, 1998). The explicit mesh representation allows us to improve the generation quality by optimizing with a high-resolution diffusion prior (e.g., \(512\times 512\)). Please refer to the Appendix for more details.

For articulated avatar modeling, we establish the correspondence between the canonical and deformed spaces via the SMPL-guided deformation. Specifically, for a point \(\mathbf {x_d}\) in the deformed space, we first find the corresponding point \(\mathbf {x_c}\) in the canonical space (see Eq. 2). We then predict a signed distance offset from the surface of the mesh extracted from the coarse model for geometry refinement. The final fine-stage signed distance \(d_{fine}(\mathbf {x_c})\) for the point \(\mathbf {x_d}\) can be computed as:

$$\begin{aligned} d_{fine}(\mathbf {x_c}) = d_{coarse}(\mathbf {x_c}) + \Delta d (\mathbf {x_c}), \end{aligned}$$
(3)

where \(d_{coarse}(\mathbf {x_c})\) is the signed distance value from the coarse stage and \(\Delta d (\mathbf {x_c})\) is the residual SDF value predicted by a two-layer MLP. This formulation allows us to animate the generated avatars to arbitrary poses by simply deforming the canonical one. For mesh texture modeling, we employ the neural color field initialized from the coarse stage and optimize it at the higher rendering resolution.
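For readers who prefer code, a minimal sketch of the fine-stage SDF query in Eq. 3 is given below. The module layout and feature width are assumptions; in particular, the positional embedding of \(\mathbf {x_c}\) is omitted and the coarse-stage SDF is treated as frozen.

```python
import torch
import torch.nn as nn

class FineSDF(nn.Module):
    """Fine-stage signed distance field (Eq. 3): coarse SDF plus a residual."""

    def __init__(self, coarse_sdf, hidden_dim=32):
        super().__init__()
        self.coarse_sdf = coarse_sdf                  # query d_coarse(x_c) from the coarse stage
        self.residual = nn.Sequential(                # two-layer MLP predicting the SDF offset
            nn.Linear(3, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x_c):
        with torch.no_grad():                         # coarse geometry kept fixed in this sketch
            d_coarse = self.coarse_sdf(x_c)           # (N, 1)
        return d_coarse + self.residual(x_c)          # d_fine = d_coarse + delta d
```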

4.2 Text-to-Avatar Generation

DensePose SDS Optimization. Optimizing the proposed articulated mesh representation alone is insufficient for creating high-quality animatable 3D avatars, because it lacks effective guidance from 2D pre-trained models. The core idea is to optimize the 3D model by distilling prior knowledge from a pre-trained diffusion model using the Score Distillation Sampling (SDS) loss. Although an image diffusion model can guide content generation, it struggles to synthesize a human avatar with the correct pose due to the absence of conditioning signals. To address this issue, we adopt a DensePose-conditioned ControlNet that leverages the more expressive DensePose signal as the condition for avatar generation. This approach helps to alleviate the inaccurate pose control and the Janus problem that arise when applying pure Stable Diffusion (Cao et al., 2023) or a sparse skeleton-conditioned ControlNet (Huang et al., 2023) for guidance. Specifically, given the SMPL parameter p, we render the human image \(\mathcal {I}= g(\theta , p)\) from the 3D human model g parametrized by \(\theta \). We also render the SMPL mesh defined by p as the DensePose condition \(\mathcal {I}_{cond}(\theta , p)\) from the same camera viewpoint as \(\mathcal {I}\). DensePose (Güler et al., 2018) partitions the human body mesh into 24 distinct parts, each corresponding to specific body regions (e.g., arms, legs, head). Consequently, each triangular face in the body mesh is assigned to one of the 24 parts; this association is managed by a face-to-index tensor that labels which part each face belongs to. To achieve this, we adopt rendering scripts from the PyTorch3D tutorial and the DensePose repository. The use of SMPL-derived DensePose maps allows our approach to bypass potential inaccuracies associated with estimation-based DensePose methods, ensuring reliable input data for our framework. The DensePose-conditioned SDS loss can be defined as follows:

$$\begin{aligned} \begin{aligned}&\nabla _{\theta } \mathcal {L}_{SDS} (\phi , \mathcal {I}= g(\theta , p)) = \\&\mathbb {E}_{t, \varvec{\epsilon }} \left[ \omega (t) (\hat{\varvec{\epsilon }}_{\phi }(\mathcal {I}_t; y, \mathcal {I}_{cond}, t) - \varvec{\epsilon }) \frac{\partial \mathcal {I} }{\partial \theta }\right] , \end{aligned} \end{aligned}$$
(4)

where \(p = (\xi , \beta )\) is the SMPL parameter, \(\mathcal {I}_t\) denotes the noisy image at noise level t, \(\omega (t)\) is a weighting function that depends on the noise level t, \(\varvec{\epsilon }\) is the added noise, and y is the input text prompt. Compared to skeleton-conditioned ControlNet, DensePose-conditioned ControlNet offers two benefits: (1) 3D-aware DensePose ensures a more stable and view-consistent avatar creation process; (2) it enables more accurate pose control of the generated avatars.
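To illustrate how such a DensePose condition map can be rendered from the posed SMPL mesh, the sketch below rasterizes the mesh and labels each pixel with the DensePose part of the face it hits, assuming a PyTorch3D-style rasterizer and a precomputed face-to-part index tensor. The actual scripts in this work follow the PyTorch3D tutorial and the DensePose repository, so treat this only as a simplified stand-in.

```python
import torch
from pytorch3d.renderer import MeshRasterizer, RasterizationSettings
from pytorch3d.structures import Meshes

def render_densepose_condition(verts, faces, face_to_part, cameras, image_size=512):
    """Rasterize the posed SMPL mesh and label every pixel with the DensePose
    part (1..24) of the face it hits; background pixels stay 0.
    `cameras` is a PyTorch3D camera (e.g., FoVPerspectiveCameras), and
    `face_to_part` is the (F,) face-to-part index tensor described above."""
    mesh = Meshes(verts=[verts], faces=[faces])
    settings = RasterizationSettings(image_size=image_size, faces_per_pixel=1)
    fragments = MeshRasterizer(cameras=cameras, raster_settings=settings)(mesh)
    pix_to_face = fragments.pix_to_face[0, ..., 0]     # (H, W), -1 where no face is hit
    part_map = torch.zeros_like(pix_to_face)
    hit = pix_to_face >= 0
    part_map[hit] = face_to_part[pix_to_face[hit]]     # face index -> DensePose part id
    return part_map                                    # later colorized into the condition image
```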

CFG Rescale. To ensure better alignment with the input text, existing works often use a large classifier-free guidance (CFG) scale when optimizing the avatar representation with SDS. However, a large CFG scale can produce severe color saturation, making the generated avatars look unreal. To alleviate this issue, we apply the CFG rescale trick from Lin et al. (2024) to adjust the denoised \(\hat{x}_0\). Please refer to Lin et al. (2024) for more details.
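Concretely, the rescale trick renormalizes the CFG-guided prediction to the standard deviation of the text-conditional prediction and then blends the two with the rescale factor (0.5 in our experiments). The sketch below follows the formulation of Lin et al. (2024); the guidance scale and variable names are illustrative.

```python
import torch

def cfg_with_rescale(x0_cond, x0_uncond, guidance_scale=100.0, rescale=0.5):
    """Classifier-free guidance on the denoised prediction x0 with the rescale
    trick: renormalize the guided prediction to the std of the conditional one,
    then blend the two with the rescale factor (0.5 in our experiments)."""
    x_cfg = x0_uncond + guidance_scale * (x0_cond - x0_uncond)       # standard CFG
    std_cond = x0_cond.std(dim=list(range(1, x0_cond.ndim)), keepdim=True)
    std_cfg = x_cfg.std(dim=list(range(1, x_cfg.ndim)), keepdim=True)
    x_rescaled = x_cfg * (std_cond / (std_cfg + 1e-8))               # match conditional statistics
    return rescale * x_rescaled + (1.0 - rescale) * x_cfg
```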

Additional Training Strategies. Directly generating full-body avatars often results in blurry outcomes that lack fine details. To improve the fidelity of the generated avatars, we employ a part-level super-resolution strategy. By leveraging the body prior from SMPL, we can easily identify the positions of different body parts (i.e., head, hand, upper body, lower body, and arm). We zoom in on each part and apply SDS as before to refine their texture and geometric details. To guide this fine-grained optimization, we use corresponding text prompts for each body part (e.g., “\(\texttt {The headshot of <name>}\)”, “\(\texttt {The right hand of <name>}\)”, etc.), where \(\mathtt{<name>}\) is the textual description of an avatar (Kolotouros et al., 2023).
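A minimal sketch of how such part-level prompts and camera zoom targets might be assembled is shown below. The part list and prompt templates follow the description above, whereas the part-to-joint mapping and the function interface are hypothetical.

```python
import random

# Prompt templates for the body parts named above; <name> is the avatar description.
PART_PROMPTS = {
    "head":       "The headshot of {name}",
    "right hand": "The right hand of {name}",
    "upper body": "The upper body of {name}",
    "lower body": "The lower body of {name}",
    "arm":        "The arm of {name}",
}

def sample_part_view(name, smpl_joints, part_to_joint, rng=random):
    """Pick a body part, build its text prompt, and return the 3D point the
    zoomed-in camera should look at (a hypothetical SMPL-joint mapping)."""
    part = rng.choice(list(PART_PROMPTS))
    prompt = PART_PROMPTS[part].format(name=name)
    look_at = smpl_joints[part_to_joint[part]]   # center the zoom-in camera on this joint
    return prompt, look_at
```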

To improve the quality of animation while preserving high-quality textures and geometry, we adopt a dual-space training strategy (Cao et al., 2023; Kolotouros et al., 2023) that jointly optimizes the human avatar in both the canonical space and the deformed space. We use the “A-pose” in the canonical space as it is a standard pose for natural humans. Within the deformed space, we sample various poses during training to enhance pose control generalization and accuracy. Specifically, we randomly sample human poses from VPoser (Pavlakos et al., 2019), a variational autoencoder that learns a latent representation of the human pose prior.

Fig. 3

Qualitative comparisons with four SOTA methods. AvatarStudio generates more realistic and higher-resolution avatars with fine-grained geometries, like cloth wrinkles, compared with other methods. The prompts we used for comparisons: 1st row: “A standing Captain Jack Sparrow from Pirates of the Caribbean”; “A man wearing a white tank top and shorts”, 2nd row: “Joker”; “A karate master wearing a Black belt”, 3rd row: “Stormtrooper”; “A man wearing a jean jacket and jean trousers”. Best viewed in \(2\times \) zoom. For more results, please refer to Appendix and project page

Fig. 4

Our AvatarStudio produces high-quality and detailed geometry. Best viewed in 2\(\times \) zoom

5 Experiments

In this section, we first verify AvatarStudio’s ability for 3D avatar creation from text inputs. Then, we conduct ablation studies to analyze the effectiveness of each component. Finally, we showcase the applications of AvatarStudio, including multimodal avatar animation and style-guided creation.

Implementation Details. To train the DensePose-conditioned ControlNet, we sample human images from the LAION dataset (Schuhmann et al., 2022) and annotate them using a pre-trained DensePose model (Güler et al., 2018), resulting in around 1.2M image pairs. The ControlNet training is based on the Stable Diffusion v2.1 base model (\(512^2\)) and takes about 2 days using 16 NVIDIA V100 GPUs. Our AvatarStudio is implemented in the threestudio (Guo et al., 2023) codebase. For each text prompt, AvatarStudio trains the 3D model with a batch size of 1 for 8k and 2k iterations in the coarse and fine stages, respectively, using the AdamW optimizer (Kingma and Ba, 2015) with a learning rate of 0.01. The entire training process takes around 1.5 h on a single NVIDIA V100 GPU. For the SDS guidance, the maximum and minimum timesteps decrease from 0.98 to 0.5 and 0.02, respectively, over the first 6,000 steps of the coarse stage; in the fine stage, they are fixed to 0.5 and 0.02. We set the rescale factor to 0.5 for the CFG rescale trick. The rendering resolution begins at \(64^2\) and increases to \(256^2\) after the first 4,000 steps in the coarse stage, and is set to \(512^2\) in the fine stage. For more implementation details, please refer to the Appendix.
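As a sketch, the annealed SDS timestep bounds described above can be written as follows; the linear decay is an assumption, since the exact annealing curve is not specified here.

```python
def sds_timestep_bounds(step, stage="coarse", anneal_steps=6000):
    """Annealed SDS timestep bounds (as fractions of the diffusion schedule).
    Coarse stage: both bounds start at 0.98 and decay over the first 6k steps
    to 0.5 (max) and 0.02 (min); fine stage: fixed at (0.5, 0.02)."""
    if stage == "fine":
        return 0.5, 0.02
    r = min(step / anneal_steps, 1.0)      # assumed linear annealing
    t_max = 0.98 + r * (0.5 - 0.98)
    t_min = 0.98 + r * (0.02 - 0.98)
    return t_max, t_min
```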

5.1 Qualitative Comparison

We present a qualitative comparison against DreamAvatar (Cao et al., 2023), DreamWaltz (Huang et al., 2023), DreamHuman (Kolotouros et al., 2023) and AvatarVerse (Zhang et al., 2024) in Fig. 3. Benefiting from the explicit mesh representation, AvatarStudio outperforms DreamAvatar and DreamWaltz significantly in terms of both geometry and texture, producing richer details across all cases. In comparison with AvatarVerse, our AvatarStudio generates avatars with clearer appearances (1st and 3rd rows) and aligns more closely with the input texts (2nd row). Moreover, thanks to its articulation modeling, a standout feature of AvatarStudio is its ability to support avatar animation (see Fig. 10), which is not available in AvatarVerse. These results clearly demonstrate the superiority of AvatarStudio for text-guided 3D avatar creation. We also visualize the normal maps of the generated avatars in Figs. 4 and 14, showing that our method is robust to different input prompts and produces high-quality geometry.

User Study. To quantitatively evaluate AvatarStudio, we conduct user studies comparing our results with those of four SOTA methods under the same text prompts. We randomly pick 30 prompts for evaluation, and each prompt is evaluated by 20 volunteers. Each user is asked to select the preferred 3D model among the rendered videos for the corresponding prompt. In Fig. 5, we first compare AvatarStudio with DreamAvatar (Cao et al., 2023) and DreamWaltz (Huang et al., 2023) for specific character generation, and then compare with AvatarVerse and DreamHuman (Kolotouros et al., 2023) in terms of realistic human generation. As shown in Fig. 5, users prefer our model by a significant margin over all other methods.

CLIP Score. We use CLIP score (Detlefsen et al., 2022) as an evaluation metric to measure the consistency between the generated avatars and input texts for the above methods. For each method, we render the generated avatars from four evenly distributed horizontal views and calculate the averaged CLIP score for these rendered images and the input text. Similar to the user study, we compare the proposed method with DreamAvatar and DreamWaltz in terms of specific character generation and compare with AvatarVerse and DreamHuman for realistic human generation. The CLIP scores for DreamAvatar, DreamWaltz, and ours are 30.45, 31.52, and 32.80, respectively, while the CLIP scores for DreamHuman, AvatarVerse, and ours are 29.54, 28.88, and 32.17, respectively. Our AvatarStudio consistently outperforms all these methods, verifying its effectiveness in creating more accurate avatars in alignment with the input texts.
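As a sketch of this evaluation protocol, the CLIP score of a single avatar could be computed with TorchMetrics (Detlefsen et al., 2022) as shown below; the particular CLIP backbone and the uint8 image format are illustrative assumptions.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

@torch.no_grad()
def avatar_clip_score(rendered_views, prompt,
                      clip_model="openai/clip-vit-base-patch16"):
    """Average CLIP score between renderings of one avatar and its prompt.
    `rendered_views` is a (4, 3, H, W) uint8 tensor of the four evenly spaced
    horizontal views; the CLIP backbone here is an illustrative choice."""
    metric = CLIPScore(model_name_or_path=clip_model)
    metric.update(list(rendered_views), [prompt] * len(rendered_views))
    return metric.compute().item()
```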

Fig. 5

User preference

Fig. 6

Ablation studies on different 3D representations. Please zoom in for a better view. Please also refer to the project page for the video examples

5.2 Ablation Studies

Avatar Representation. Our approach, AvatarStudio, utilizes an articulated mesh representation in a coarse-to-fine manner, with the coarse stage represented by a NeRF. To explore the impact of different 3D representations, we optimize 3D avatars from text using either mesh-only (DMTet) or NeRF-only representations. As shown in Fig. 6, directly optimizing meshes for avatar creation leads to collapsed results, while using a NeRF-only representation often yields avatars of lower quality. In contrast, our proposed articulated representation, which combines NeRF and mesh, successfully generates high-resolution images with fine details, demonstrating its effectiveness.

Fig. 7

Ablation studies. Please zoom in for a better view. Please also refer to the project page for more video examples

Fig. 8

More examples for part-aware super-resolution and CFG rescale. Please zoom in for a better view. Please also refer to the project page for the video results

Fig. 9

Ablation studies on different guidance. Please zoom in for a better view. Please also refer to the project page for the video examples

Part-Aware Super-Resolution and CFG Rescale Strategy. Furthermore, we explore the individual impacts of part-aware super-resolution (SR) and the CFG rescale strategy (Figs. 7, 8). As shown in Figs. 7a and 8a, the CFG rescale method mitigates the color saturation issue, producing a more natural appearance for the generated avatar. With the addition of part-aware super-resolution, the model produces sharper appearances and finer local details, such as on faces and belts (see Fig. 7b). These studies validate the effectiveness of each proposed component, demonstrating their substantial contribution to the final result.

Fig. 10

AvatarStudio facilitates the animation of avatars using multimodal signals. We demonstrate examples of animated avatars, using a video-driven motion and b text-driven motion (“Michael Jackson is doing Moonwalk”, and “A pregnant person is dancing”), respectively. Please refer to the project page for more examples

DensePose-Conditioned ControlNet. AvatarStudio uses a ControlNet conditioned on DensePose for SDS guidance. To assess its efficacy, we compare the performance of our method when trained with Stable Diffusion (SD) or a skeleton-conditioned ControlNet (see Fig. 9). We observe that the model guided by Stable Diffusion generates avatars with incorrect poses and lower quality due to the lack of pose-aware guidance, which results in inaccurate animations. While the skeleton-conditioned ControlNet improves pose control, it still suffers from inaccuracies in foot positioning and head orientation. In contrast, our DensePose-conditioned diffusion guidance achieves precise and stable pose control, accompanied by high-quality textures, which validates the importance of DensePose-conditioned guidance in the avatar creation process. Moreover, we observe that the skeleton-conditioned ControlNet also suffers from the Janus problem (Fig. 9). This is because the keypoints reside inside the SMPL mesh, making it difficult to determine whether they are occluded, which can guide the text-to-3D model to yield incorrect back-view images. In contrast, the DensePose control signal provides a more detailed and accurate description of a person’s pose and viewpoint, guiding the model to generate reasonable 3D avatars and effectively mitigating the Janus problem. Leveraging DensePose as SDS guidance for 3D generation thus offers significant advantages over keypoint- or skeleton-based guidance: (1) skeleton-based guidance, while effective in many scenarios, is relatively sparse; this sparsity can lead to ambiguity in distinguishing between frontal and back views (the “Janus problem”), and the same skeleton-represented pose can map to multiple real human poses, leading to inaccuracies in the generated 3D avatars; (2) DensePose, on the other hand, provides a more detailed and accurate description of a person’s pose, and its dense nature allows a more precise mapping between the guidance and the actual human pose, thereby alleviating the Janus problem and enhancing view consistency.

To quantitatively assess the pose controllability of avatars generated with different diffusion guidances, we predict the SMPL parameters for posed avatar images using a pre-trained 3D human reconstruction model, HybrIK (Li et al., 2021), where the images are rendered using given SMPL parameters. We calculate the Mean Squared Error (MSE, reported in units of \(10^{-2}\)) between the input and the estimated SMPL parameters. Specifically, for each avatar, we generate 120 posed images using 120 fixed SMPL parameters in a frontal view and compute the average MSE across these images as the final score. The MSE for Stable Diffusion, the skeleton-conditioned ControlNet, and our method is 9.0, 7.7, and 5.9, respectively. AvatarStudio achieves the best pose control with the lowest MSE, further verifying the effectiveness of DensePose guidance.
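A minimal sketch of this pose-controllability metric, assuming the driving and re-estimated SMPL parameters are stacked into arrays of matching shape:

```python
import numpy as np

def pose_control_mse(gt_smpl_params, est_smpl_params):
    """Pose-controllability metric: MSE (reported x 1e2) between the SMPL
    parameters used to pose the avatar and those re-estimated from the
    renderings by an off-the-shelf reconstructor such as HybrIK."""
    gt = np.asarray(gt_smpl_params, dtype=np.float64)    # (120, D) driving parameters
    est = np.asarray(est_smpl_params, dtype=np.float64)  # (120, D) estimated parameters
    return float(np.mean((gt - est) ** 2) * 1e2)
```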

Fig. 11

Our AvatarStudio also supports style-guided avatar creation by simply providing an additional style image. Note that the provided style image can be combined with text prompts to enable flexible avatar creation (e.g., a policewoman in Pixar Disney style)

5.3 Applications

Multimodal Animation. A crucial feature of our method is its capability to provide high-quality, natural, and easy-to-use animation, allowing users to drive avatars with multimodal signals (e.g., video, text, and audio). Figure 10 illustrates the animation of avatars created by AvatarStudio using either video (Fig. 10a) or text (Fig. 10b). For video-driven animation, we first employ VIBE (Kocabas et al., 2021) to estimate SMPL sequences from the driving video, which are then used to animate the generated avatar. For text-driven animation, we adopt MDM (Tevet et al., 2023) to convert text into SMPL sequences. Despite adopting a simple SMPL-guided animation, our method produces plausible animations with natural movements. The consistency of these results w.r.t. the SMPL motions is attributed to two factors: (1) the pre-trained 2D diffusion model providing SDS guidance can correct areas where the SMPL deformation is inaccurate, leading to better animation results; (2) the strong DensePose prior provides rich pose information, helping the model learn the avatar under different poses. As such, AvatarStudio can leverage any multimodal-to-motion method that generates SMPL sequences for animation, showing the versatility and potential of our method in creating realistically animated avatars from diverse text prompts. For more results, please refer to our project page.

Style-Guided Avatar Creation. Moreover, we show that AvatarStudio supports stylized avatar creation by simply providing an additional style image. To achieve this, we employ IP-Adapter (Ye et al., 2023), an adapter that enables image prompt capability for pre-trained text-to-image diffusion model via a decoupled cross-attention design. We plug the IP-Adapter into our DensePose-conditioned ControlNet and optimize with SDS as before. Without bells and whistles, AvatarStudio can generate high-quality avatars of various styles of interest as shown in Fig. 11. Note that the provided style image can be combined with text prompts to enable flexible avatar creation (e.g., a policewoman in Pixar Disney style in Fig. 11). This capability expands its application, allowing users to create stylized avatars catering to specific aesthetic desires.

6 Conclusion

In this paper, we introduce AvatarStudio for creating high-fidelity and animatable 3D avatars from only textual inputs. AvatarStudio introduces articulated modeling into an explicit 3D mesh representation to support avatar animation while offering high rendering quality. To further improve pose controllability and view consistency, we leverage a DensePose-conditioned ControlNet for Score Distillation Sampling supervision. We also introduce several simple yet effective strategies, such as part-aware super-resolution for improving the fidelity of each body part, dual-space training for improving robustness across different poses, and CFG rescale for alleviating the color saturation issue. As a result, AvatarStudio supports various downstream applications, including multimodal avatar animation (e.g., video- or text-driven) and style-guided avatar creation.