1 Introduction

The creation of high-fidelity and animatable 3D human avatars is essential in various fields, including the media industry, VR/AR, and game design. However, it is a labor-intensive task that typically requires pre-captured templates and extensive work from experienced artists. Therefore, a user-friendly system that can generate and animate 3D avatars is of great value, and this objective is the primary goal of this work. Existing 3D avatar creation methods can be classified into three categories: (1) template-based generation pipelines (Met, 2023), (2) 3D generative models (Hong et al., 2022a; Zhang et al., 2022) and (3) 2D-lifting methods (Poole et al., 2022; Lin et al., 2023). Avatars generated using template-based methods typically exhibit relatively simple topology and texture. On the other hand, 3D generative models often struggle to generalize to arbitrary avatars with diverse appearances due to the scarcity and limited diversity of accessible 3D models. Yet, in real-world applications, users often desire high-quality 3D avatars with intricate structures and artistic styles.

Recently, 2D-lifting methods have shown that 2D generation models trained on large-scale image datasets possess strong generalizability, making them suitable for 3D content creation. Representative works such as DreamFusion (Poole et al., 2022) and Magic3D (Lin et al., 2023) employ 2D diffusion models as supervision to optimize 3D representations using Score Distillation Sampling. More recent studies (Cao et al., 2023; Huang et al., 2023; Kolotouros et al., 2023) incorporate parametric human priors (Loper et al., 2015; Alldieck et al., 2021) into the 2D-lifting optimization process to facilitate 3D human avatar creation. However, these methods often focus on generating static avatars, which are challenging to animate, or they produce low-quality animatable 3D avatars that suffer from blurriness, lack of detail, and poor pose controllability (Cao et al., 2023; Jiang et al., 2023; Hong et al., 2022b; Kolotouros et al., 2023; Huang et al., 2023), thereby failing to meet the requirements of practical applications. Consequently, there is a growing need for more advanced solutions capable of generating high-fidelity, animatable 3D avatars.

In this work, we propose AvatarStudio, a novel framework designed for creating high-quality 3D avatars from textual descriptions while offering flexible animation ability. Our method introduces a new 3D human representation that incorporates articulated human modeling into an explicit mesh representation. The former enables animating the generated avatars to desired poses, while the latter allows us to fully harness the power of 2D diffusion priors at high resolution. Though conceptually straightforward, effectively training an articulated mesh representation to generate high-quality and animatable avatars presents a significant challenge, stemming from the absence of proper mesh initialization and of an effective pose-controllable 2D guidance mechanism.

Fig. 1

AvatarStudio generates high-fidelity, animatable 3D avatars featuring realistic textures and detailed geometry given text inputs. A unique feature of AvatarStudio is its easy-to-use animation ability, allowing users to animate the generated avatars via multimodal signals, such as a dancing video or a motion described by text (e.g., “A person is doing boxing”). Moreover, it supports the creation of avatars with distinct artistic styles (e.g., sketch style) given an additional reference style image

To address these challenges, we employ a two-pronged approach. Specifically, we first train a human NeRF from scratch with a fixed pre-defined canonical pose, which largely eases the training difficulty. Initialized from this NeRF, we optimize a SMPL-guided (Loper et al., 2015) articulated mesh represented by DMTet (Shen et al., 2021). This mesh representation, rendered through an efficient rasterizer (Laine et al., 2020), enables the production of high-resolution images up to \(512^2\), thereby facilitating the creation of high-fidelity avatars. To effectively optimize the proposed representation from text, we utilize pre-trained 2D diffusion models as priors. Unlike previous methods that use Stable Diffusion (Rombach et al., 2021) or a skeleton-conditioned ControlNet (Zhang et al., 2023a) for SDS supervision, which are prone to inaccurate pose control and the Janus problem (Cao et al., 2023; Huang et al., 2023), we propose a novel ControlNet conditioned on DensePose (Güler et al., 2018) as guidance, which offers dual benefits: (1) the 3D-aware DensePose ensures a more stable and view-consistent avatar creation process; (2) it provides more precise pose control over the generated avatars.

As shown in Fig. 1, AvatarStudio can create high-fidelity, animatable avatars from text. We evaluate it quantitatively and qualitatively, verifying its superiority over previous state-of-the-art methods. Thanks to its easy-to-use animation capability, it allows users to animate the generated avatars using multimodal signals (e.g., video and text). Moreover, by simply plugging in an additional adapter (Ye et al., 2023), AvatarStudio can create avatars with unique artistic styles given a reference style image, further expanding the range of applications and customization options for 3D avatar creation.

2 Related Works

Text-Guided 3D Content Generation. The successful advancement of text-guided 2D image generation has paved the way for text-guided 3D content creation. Notable examples include CLIP-forge (Sanghi et al., 2021), DreamFields (Jain et al., 2021), and CLIP-Mesh (Khalid et al., 2022), which utilize the widely acclaimed CLIP (Radford et al., 2021) to optimize underlying 3D representations, such as NeRFs and textured meshes. DreamFusion (Poole et al., 2022) proposes to use the Score Distillation Sampling (SDS) loss, derived from a pre-trained diffusion model (Saharia et al., 2022), as supervision during optimization. Subsequent improvements over DreamFusion include optimizing 3D representations in a latent space (Metzer et al., 2022) and in a coarse-to-fine manner (Lin et al., 2023). Following this line of research, TEXTure (Richardson et al., 2023) opts to generate texture maps for a given 3D mesh, while ProlificDreamer (Wang et al., 2023) introduces variational score distillation to produce promising results. Fantasia3D (Chen et al., 2023) disentangles the modeling and learning of geometry and appearance for generating 3D assets. However, despite these advancements, when it comes to avatar creation, these techniques often exhibit limitations including low-quality generation, the presence of the Janus problem, and incorrect rendering of body parts. In contrast, AvatarStudio enables high-fidelity and animatable generation of 3D avatars from text prompts.

Text-Guided 3D Avatar Generation. To enable 3D avatar generation from text, several approaches have been proposed. Avatar-CLIP (Hong et al., 2022b) sets the foundation by initializing human geometry with a shape VAE and utilizing CLIP (Radford et al., 2021) to assist in geometry and texture generation. DreamAvatar (Cao et al., 2023) and AvatarCraft (Jiang et al., 2023) integrate the human parametric model with pre-trained 2D diffusion models for 3D avatar creation. DreamHuman (Kolotouros et al., 2023) further introduces a camera zoom-in technique to refine the local details of the resulting avatars; TADA (Liao et al., 2024) proposes a hybrid representation for avatar animation; while DreamWaltz (Huang et al., 2023) incorporates a skeleton-conditioned ControlNet and develops an occlusion-aware SDS guidance for pose-aligned supervision. Although these methods achieve animatable results, they suffer from low-quality generation issues, such as blurriness, coarseness, and insufficient detail. Additionally, the weak SDS guidance and the inherent sparsity of skeleton conditioning make it difficult to generate multi-view consistent avatars with accurate pose controllability. HumanNorm (Huang et al., 2024) and SEEAvatar (Xu et al., 2023a) introduce a paradigm that first produces the geometry of the human body and then performs texture generation. In contrast, we propose an articulated textured mesh representation for 3D human modeling, enabling effective avatar animation and high-resolution rendering. It allows the model to fully utilize 2D diffusion priors at high resolution, leading to higher-quality generation. Moreover, we use a DensePose-conditioned ControlNet for SDS guidance to ensure more stable, view-consistent avatar creation and improved pose control. A concurrent work, AvatarVerse (Zhang et al., 2024), also employs DensePose as conditioning for SDS guidance. However, AvatarVerse is limited to static avatar generation, making it hard to animate the resulting avatars in a user-friendly way.

Fig. 2

The overview of AvatarStudio. It takes a text prompt as input to optimize an articulated textured mesh representation via a DensePose-conditioned ControlNet for high-quality and animatable 3D avatar creation. To facilitate the optimization process, it leverages several simple yet effective strategies, like part-aware super-resolution and dual-space training. See the main text for more details

3 Preliminaries

Score Distillation Sampling. The key technique for lifting a pre-trained 2D diffusion model \(\varvec{\epsilon }_\phi \) into a 3D representation \(\theta \) is Score Distillation Sampling (SDS), which can be used to guide the generation of 3D content given an input text prompt y. Specifically, given an image \(\mathcal {I} = g(\theta )\) rendered from a differentiable 3D model g, we add random noise \(\varvec{\epsilon }\) to obtain a noisy image \(\mathcal {I}_t\). The SDS loss then computes the gradient with respect to \(\theta \) by minimizing the difference between the predicted noise \(\varvec{\epsilon }_\phi \left( \mathcal {I}_t; y, t\right) \) and the added noise \(\varvec{\epsilon }\), which can be formulated as:

$$\begin{aligned} \nabla _\theta \mathcal {L}_{\text {SDS}}\left( \phi , \mathcal {I} = g(\theta )\right) =\mathbb {E}_{t, \varvec{\epsilon }}\left[ w(t)\left( \varvec{\epsilon }_\phi \left( \mathcal {I}_t ; y, t\right) -\varvec{\epsilon }\right) \frac{\partial \mathcal {I}}{\partial \theta }\right] , \end{aligned}$$
(1)

where \(\mathcal {I}_t\) is the noisy image at noise level t and w(t) denotes a weighting function that depends on t.
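To make this concrete, below is a minimal PyTorch-style sketch of how the SDS gradient in Eq. 1 is typically realized in practice, treating the diffusion model as a frozen noise predictor. The diffusion interface, the particular weighting \(w(t)=1-\bar{\alpha }_t\), and all variable names are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch

def sds_loss(diffusion, x, text_emb, t, alphas_cumprod):
    """Minimal SDS sketch: add noise at level t, query the frozen diffusion
    model for its noise estimate, and let (eps_hat - eps) act as a gradient
    on the rendered image x (the diffusion weights receive no gradient)."""
    eps = torch.randn_like(x)                            # added noise
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)            # \bar{alpha}_t
    x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * eps      # noisy image at level t
    with torch.no_grad():                                # frozen 2D prior
        eps_hat = diffusion(x_t, t, text_emb)            # predicted noise, text-conditioned
    w_t = 1.0 - a_t                                      # one common choice of w(t)
    grad = w_t * (eps_hat - eps)                         # Eq. 1 without the Jacobian term
    # Reparameterize so autograd injects `grad` into x and hence into theta.
    return (grad.detach() * x).sum()
```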

SMPL. Skinned Multi-Person Linear model (Loper et al., 2015) is a parametric human model that represents a wide range of human body poses and shapes. It defines a deformable mesh \(\mathcal {M}(\xi , \beta ) = (\mathcal {V}, \mathcal {S})\), where \(\xi \) and \(\beta \) denote the pose and shape parameters, \(\mathcal {V}\) is the set of \(N_{v} = 6890\) vertices, and \(\mathcal {S}\) is the set of linear blend skinning (LBS) weights assigned for each vertex. It provides an articulated geometric proxy to the underlying dynamic human body. In this paper, we develop an articulated 3D human representation for animatable avatar creation by generalizing the LBS of SMPL to clothed human modeling (Fig. 2).

4 Methodology

In this work, we aim to generate high-fidelity and animatable 3D human avatars from text inputs. In Sec. 4.1, we first present how to design an articulated explicit mesh representation for animatable avatar modeling and high-resolution rendering. In Sec. 4.2, we elaborate on how to optimize the proposed representation from text inputs via a DensePose-conditioned ControlNet and introduce several simple yet effective strategies to facilitate the generation process.

4.1 Articulated 3D Human Modeling

SMPL-Guided Avatar Articulation. To create animatable human avatars, we incorporate a simple yet effective SMPL-guided articulation into the 3D human modeling process, which drives the generated avatar to the desired poses. Given the SMPL parameters \(p = (\xi , \beta )\), our model first generates a template avatar with a pre-defined pose in the canonical space, and then deforms it to the target pose defined by p in the deformed space. We leverage the inverse transformation of SMPL LBS to guide the deformation of our human representation. Specifically, given a point \(\mathbf {x_d}\) in the deformed space, we first find its nearest vertex \(v^*\) on the corresponding SMPL mesh, and then use the skinning weights of \(v^*\) to deform \(\mathbf {x_d}\) to the corresponding point \(\mathbf {x_c}\) in the canonical space:

$$\begin{aligned} \mathbf {x_c} = \mathcal {G}^{-1} \cdot \mathbf {x_d}, \quad \mathcal {G} = \sum \limits _{i=1}^{N_j} s_i^* \cdot B_i (\xi , \beta ), \end{aligned}$$
(2)

where \(s_i^*\) is the skinning weight of vertex \(v^*\) w.r.t. the i-th joint, \(B_i (\xi , \beta )\) is the bone transformation matrix of joint i, and \(N_j = 24\) is the number of joints. In addition, we employ a non-rigid deformation scheme (Peng et al., 2021; Chen et al., 2021) to learn a residual deformation that compensates for areas where the SMPL deformation is inaccurate. Specifically, we add an MLP-based deformation network that models surface changes with the articulation of the avatar: \(\Delta n_i = \text {MLP}(\text {Concat}[\text {Embed}(\mathbf {x_c}), p])\), i.e., we feed the position-embedded \(\mathbf {x_c}\) and the SMPL parameter p into the MLP. We found that adding such a deformation makes the animation results more robust, especially for avatars with complex clothing, as shown on our project page (video grid with the caption “More animation results-AvatarStudio with non-rigid deformation”).
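As an illustration of the SMPL-guided deformation in Eq. 2, the sketch below maps deformed-space points back to the canonical space with a nearest-vertex lookup and inverse LBS. Tensor shapes and function names are hypothetical, and batching as well as the non-rigid residual are omitted for brevity.

```python
import torch

def deform_to_canonical(x_d, smpl_verts, skinning_weights, bone_transforms):
    """SMPL-guided inverse LBS (Eq. 2), simplified.
    x_d:              (N, 3) query points in the deformed space
    smpl_verts:       (V, 3) posed SMPL vertices
    skinning_weights: (V, J) per-vertex LBS weights S
    bone_transforms:  (J, 4, 4) bone transformation matrices B_i(xi, beta)
    """
    # Nearest SMPL vertex v* for every query point.
    nn_idx = torch.cdist(x_d, smpl_verts).argmin(dim=1)            # (N,)
    w = skinning_weights[nn_idx]                                   # (N, J) weights s_i^*
    # Blend bone transforms: G = sum_i s_i^* * B_i(xi, beta).
    G = torch.einsum('nj,jab->nab', w, bone_transforms)            # (N, 4, 4)
    # Apply the inverse transform to reach the canonical space: x_c = G^{-1} x_d.
    x_d_h = torch.cat([x_d, torch.ones_like(x_d[:, :1])], dim=-1)  # homogeneous coords
    x_c_h = torch.einsum('nab,nb->na', torch.inverse(G), x_d_h)
    return x_c_h[:, :3]
```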

Articulated Textured Mesh Representation. Most existing text-to-avatar methods (Jiang et al., 2023; Cao et al., 2023) represent avatars as NeRFs and render them volumetrically. However, volumetric rendering typically requires a considerable computation budget, limiting the resolution of generated images; as a result, these methods often produce low-quality avatars that are coarse and lack detail. In contrast, we opt for an explicit mesh representation for animatable human modeling, which supports high-resolution rendering via an efficient rasterizer (Laine et al., 2020). Nevertheless, we empirically observe that directly optimizing the mesh representation for avatar generation produces degenerated results (see Fig. 6 for more details), due to the high dimensionality of the mesh space and the complexity of human bodies.

To address this, we propose a coarse-to-fine pipeline to optimize the proposed representation. In the first stage, we adopt NeRF to learn a static human in a fixed pre-defined canonical space by leveraging the low-resolution diffusion prior as guidance, which largely alleviates the optimization difficulties. We use the hash grid encoding from InstantNGP (Müller et al., 2022) with a two-layer MLP to predict the density and color. To further accelerate the learning process, we adopt a residual prediction scheme on top of the SMPL-derived density field, which serves as a strong geometric prior. In the second stage, we use a differentiable surface representation, i.e., Deep Marching Tetrahedra (DMTet) (Shen et al., 2021), to model the avatar as a textured mesh, which is initialized from the coarse NeRF using the marching cubes algorithm (Lorensen & Cline, 1998). The explicit mesh representation allows us to improve the generation quality by optimizing with a high-resolution diffusion prior (e.g., \(512\times 512\)). Please refer to the Appendix for more details.

For articulated avatar modeling, we establish the correspondence between the canonical and deformed spaces via the SMPL-guided deformation. Specifically, for a point \(\mathbf {x_d}\) in the deformed space, we first find the corresponding point \(\mathbf {x_c}\) in the canonical space (see Eq. 2). We then predict a signed distance offset from the surface of the mesh extracted from the coarse model for geometry refinement. The final fine-stage signed distance \(d_{fine}(\mathbf {x_c})\) for the point \(\mathbf {x_d}\) can be computed as:

$$\begin{aligned} d_{fine}(\mathbf {x_c}) = d_{coarse}(\mathbf {x_c}) + \Delta d (\mathbf {x_c}), \end{aligned}$$
(3)

where \(d_{coarse}(\mathbf {x_c})\) is the signed distance value from the coarse stage and \(\Delta d (\mathbf {x_c})\) is the residual SDF value predicted by a two-layer MLP. This formulation allows us to animate the generated avatars to arbitrary poses by simply deforming the canonical one. For mesh texture modeling, we employ the neural color field initialized from the coarse stage and optimize it at the higher rendering resolution.
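For readers who prefer code, a minimal sketch of the fine-stage SDF query in Eq. 3 is given below. The module layout and feature width are assumptions; in particular, the positional embedding of \(\mathbf {x_c}\) is omitted and the coarse-stage SDF is treated as frozen.

```python
import torch
import torch.nn as nn

class FineSDF(nn.Module):
    """Fine-stage signed distance field (Eq. 3): coarse SDF plus a residual."""

    def __init__(self, coarse_sdf, hidden_dim=32):
        super().__init__()
        self.coarse_sdf = coarse_sdf                  # query d_coarse(x_c) from the coarse stage
        self.residual = nn.Sequential(                # two-layer MLP predicting the SDF offset
            nn.Linear(3, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x_c):
        with torch.no_grad():                         # coarse geometry kept fixed in this sketch
            d_coarse = self.coarse_sdf(x_c)           # (N, 1)
        return d_coarse + self.residual(x_c)          # d_fine = d_coarse + delta d
```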

4.2 Text-to-Avatar Generation

DensePose SDS Optimization. Optimizing the proposed articulated mesh representation alone is insufficient for creating high-quality animatable 3D avatars, because it lacks effective guidance from 2D pre-trained models. The core idea is to optimize the 3D model by distilling prior knowledge from a pre-trained diffusion model using the Score Distillation Sampling (SDS) loss. Although an image diffusion model can guide content generation, it struggles to synthesize a human avatar with the correct pose due to the absence of conditioning signals. To address this issue, we adopt a DensePose-conditioned ControlNet that leverages the more expressive DensePose signal as the condition for avatar generation. This approach helps to alleviate the inaccurate pose control and the Janus problem that arise when applying pure Stable Diffusion (Cao et al., 2023) or a sparse skeleton-conditioned ControlNet (Huang et al., 2023) for guidance. Specifically, given the SMPL parameter p, we render the human image \(\mathcal {I}= g(\theta , p)\) from the 3D human model g parametrized by \(\theta \). We also render the SMPL mesh defined by p as the DensePose condition \(\mathcal {I}_{cond}(\theta , p)\) from the same camera viewpoint as \(\mathcal {I}\). DensePose (Güler et al., 2018) partitions the human body mesh into 24 distinct parts, each corresponding to specific body regions (e.g., arms, legs, head). Consequently, each triangular face in the body mesh is assigned to one of the 24 parts; this association is managed by a face-to-index tensor that labels which part each face belongs to. To achieve this, we adopt rendering scripts from the PyTorch3D tutorial and the DensePose repository. The use of SMPL-derived DensePose maps allows our approach to bypass potential inaccuracies associated with estimation-based DensePose methods, ensuring reliable input data for our framework. The DensePose-conditioned SDS loss can be defined as follows:

$$\begin{aligned} \begin{aligned}&\nabla _{\theta } \mathcal {L}_{SDS} (\phi , \mathcal {I}= g(\theta , p)) = \\&\mathbb {E}_{t, \varvec{\epsilon }} \left[ \omega (t) (\hat{\varvec{\epsilon }}_{\phi }(\mathcal {I}_t; y, \mathcal {I}_{cond}, t) - \varvec{\epsilon }) \frac{\partial \mathcal {I} }{\partial \theta }\right] , \end{aligned} \end{aligned}$$
(4)

where \(p = (\xi , \beta )\) is the SMPL parameter, \(\mathcal {I}_t\) denotes the noisy image at noise level t, \(\omega (t)\) is a weighting function that depends on the noise level t, \(\varvec{\epsilon }\) is the added noise, and y is the input text prompt. Compared to skeleton-conditioned ControlNet, DensePose-conditioned ControlNet offers two benefits: (1) 3D-aware DensePose ensures a more stable and view-consistent avatar creation process; (2) it enables more accurate pose control of the generated avatars.
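To illustrate how such a DensePose condition map can be rendered from the posed SMPL mesh, the sketch below rasterizes the mesh and labels each pixel with the DensePose part of the face it hits, assuming a PyTorch3D-style rasterizer and a precomputed face-to-part index tensor. The actual scripts in this work follow the PyTorch3D tutorial and the DensePose repository, so treat this only as a simplified stand-in.

```python
import torch
from pytorch3d.renderer import MeshRasterizer, RasterizationSettings
from pytorch3d.structures import Meshes

def render_densepose_condition(verts, faces, face_to_part, cameras, image_size=512):
    """Rasterize the posed SMPL mesh and label every pixel with the DensePose
    part (1..24) of the face it hits; background pixels stay 0.
    `cameras` is a PyTorch3D camera (e.g., FoVPerspectiveCameras), and
    `face_to_part` is the (F,) face-to-part index tensor described above."""
    mesh = Meshes(verts=[verts], faces=[faces])
    settings = RasterizationSettings(image_size=image_size, faces_per_pixel=1)
    fragments = MeshRasterizer(cameras=cameras, raster_settings=settings)(mesh)
    pix_to_face = fragments.pix_to_face[0, ..., 0]     # (H, W), -1 where no face is hit
    part_map = torch.zeros_like(pix_to_face)
    hit = pix_to_face >= 0
    part_map[hit] = face_to_part[pix_to_face[hit]]     # face index -> DensePose part id
    return part_map                                    # later colorized into the condition image
```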

CFG Rescale. To ensure better alignment with the input text, existing works often use a large classifier-free guidance (CFG) scale when optimizing the avatar representation with SDS. However, a large CFG scale can produce severe color saturation, making the generated avatars look unreal. To alleviate this issue, we apply the CFG rescale trick from Lin et al. (2024) to adjust the denoised \(\hat{x}_0\). Please refer to Lin et al. (2024) for more details.
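Concretely, the rescale trick renormalizes the CFG-guided prediction to the standard deviation of the text-conditional prediction and then blends the two with the rescale factor (0.5 in our experiments). The sketch below follows the formulation of Lin et al. (2024); the guidance scale and variable names are illustrative.

```python
import torch

def cfg_with_rescale(x0_cond, x0_uncond, guidance_scale=100.0, rescale=0.5):
    """Classifier-free guidance on the denoised prediction x0 with the rescale
    trick: renormalize the guided prediction to the std of the conditional one,
    then blend the two with the rescale factor (0.5 in our experiments)."""
    x_cfg = x0_uncond + guidance_scale * (x0_cond - x0_uncond)       # standard CFG
    std_cond = x0_cond.std(dim=list(range(1, x0_cond.ndim)), keepdim=True)
    std_cfg = x_cfg.std(dim=list(range(1, x_cfg.ndim)), keepdim=True)
    x_rescaled = x_cfg * (std_cond / (std_cfg + 1e-8))               # match conditional statistics
    return rescale * x_rescaled + (1.0 - rescale) * x_cfg
```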

Additional Training Strategies. Directly generating full-body avatars often results in blurry outcomes that lack fine details. To improve the fidelity of the generated avatars, we employ a part-level super-resolution strategy. By leveraging the body prior from SMPL, we can easily identify the positions of different body parts (i.e., head, hand, upper body, lower body, and arm). We zoom in on each part and apply SDS as before to refine their texture and geometric details. To guide this fine-grained optimization, we use corresponding text prompts for each body part (e.g., “\(\texttt {The headshot of <name>}\)”, “\(\texttt {The right hand of <name>}\)”, etc.), where \(\mathtt{<name>}\) is the textual description of an avatar (Kolotouros et al., 2023).
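A minimal sketch of how such part-level prompts and camera zoom targets might be assembled is shown below. The part list and prompt templates follow the description above, whereas the part-to-joint mapping and the function interface are hypothetical.

```python
import random

# Prompt templates for the body parts named above; <name> is the avatar description.
PART_PROMPTS = {
    "head":       "The headshot of {name}",
    "right hand": "The right hand of {name}",
    "upper body": "The upper body of {name}",
    "lower body": "The lower body of {name}",
    "arm":        "The arm of {name}",
}

def sample_part_view(name, smpl_joints, part_to_joint, rng=random):
    """Pick a body part, build its text prompt, and return the 3D point the
    zoomed-in camera should look at (a hypothetical SMPL-joint mapping)."""
    part = rng.choice(list(PART_PROMPTS))
    prompt = PART_PROMPTS[part].format(name=name)
    look_at = smpl_joints[part_to_joint[part]]   # center the zoom-in camera on this joint
    return prompt, look_at
```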

To improve the quality of animation while preserving high-quality textures and geometry, we adopt a dual-space training strategy (Cao et al., 2023; Kolotouros et al., 2023) that jointly optimizes the human avatar in both the canonical space and the deformed space. We use the “A-pose” in the canonical space as it is a standard pose for natural humans. Within the deformed space, we sample various poses during training to enhance pose control generalization and accuracy. Specifically, we randomly sample human poses from VPoser (Pavlakos et al., 2019), a variational autoencoder that learns a latent representation of the human pose prior.

Fig. 3

Qualitative comparisons with four SOTA methods. AvatarStudio generates more realistic and higher-resolution avatars with fine-grained geometries, like cloth wrinkles, compared with other methods. The prompts we used for comparisons: 1st row: “A standing Captain Jack Sparrow from Pirates of the Caribbean”; “A man wearing a white tank top and shorts”, 2nd row: “Joker”; “A karate master wearing a Black belt”, 3rd row: “Stormtrooper”; “A man wearing a jean jacket and jean trousers”. Best viewed in \(2\times \) zoom. For more results, please refer to Appendix and project page

Fig. 4

Our AvatarStudio produces high-quality and detailed geometry. Best viewed in 2\(\times \) zoom

5 Experiments

In this section, we first verify AvatarStudio’s ability for 3D avatar creation from text inputs. Then, we conduct ablation studies to analyze the effectiveness of each component. Finally, we showcase the applications of AvatarStudio, including multimodal avatar animation and style-guided creation.

Implementation Details. To train the DensePose-conditioned ControlNet, we sample human images from the LAION dataset (Schuhmann et al., 2022) and annotate them using a pre-trained DensePose model (Güler et al., 2018), resulting in around 1.2M image pairs. The ControlNet training is based on the Stable Diffusion v2.1 base model (\(512^2\)) and takes about 2 days using 16 NVIDIA V100 GPUs. Our AvatarStudio is implemented in the threestudio (Guo et al., 2023) codebase. For each text prompt, AvatarStudio trains the 3D model with a batch size of 1 for 8k and 2k iterations in the coarse and fine stages, respectively, using the AdamW optimizer (Kingma and Ba, 2015) with a learning rate of 0.01. The entire training process takes around 1.5 h on a single NVIDIA V100 GPU. For the SDS guidance, the maximum and minimum timesteps decrease from 0.98 to 0.5 and 0.02, respectively, over the first 6,000 steps of the coarse stage; in the fine stage, they are fixed to 0.5 and 0.02. We set the rescale factor to 0.5 for the CFG rescale trick. The rendering resolution begins at \(64^2\) and increases to \(256^2\) after the first 4,000 steps in the coarse stage, and is set to \(512^2\) in the fine stage. For more implementation details, please refer to the Appendix.
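As a sketch, the annealed SDS timestep bounds described above can be written as follows; the linear decay is an assumption, since the exact annealing curve is not specified here.

```python
def sds_timestep_bounds(step, stage="coarse", anneal_steps=6000):
    """Annealed SDS timestep bounds (as fractions of the diffusion schedule).
    Coarse stage: both bounds start at 0.98 and decay over the first 6k steps
    to 0.5 (max) and 0.02 (min); fine stage: fixed at (0.5, 0.02)."""
    if stage == "fine":
        return 0.5, 0.02
    r = min(step / anneal_steps, 1.0)      # assumed linear annealing
    t_max = 0.98 + r * (0.5 - 0.98)
    t_min = 0.98 + r * (0.02 - 0.98)
    return t_max, t_min
```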

5.1 Qualitative Comparison

We present a qualitative comparison against DreamAvatar (Cao et al., 2023), DreamWaltz (Huang et al., 2023), DreamHuman (Kolotouros et al., 2023) and AvatarVerse (Zhang et al., 2024) in Fig. 3. Benefiting from the explicit mesh representation, AvatarStudio outperforms DreamAvatar and DreamWaltz significantly in terms of both geometry and texture, producing richer details across all cases. In comparison with AvatarVerse, our AvatarStudio generates avatars with clearer appearances (1st and 3rd rows) and aligns more closely with the input texts (2nd row). Moreover, thanks to its articulation modeling, a standout feature of AvatarStudio is its ability to support avatar animation (see Fig. 10), which is not available in AvatarVerse. These results clearly demonstrate the superiority of AvatarStudio for text-guided 3D avatar creation. We also visualize the normal maps of the generated avatars in Figs. 4 and 14, showing that our method is robust to different input prompts and produces high-quality geometry.

User Study. To quantitatively evaluate AvatarStudio, we conduct user studies comparing our results with those of four SOTA methods under the same text prompts. We randomly pick 30 prompts for evaluation, and each prompt is evaluated by 20 volunteers. Each user is asked to select the preferred 3D model among the rendered videos for the corresponding prompt. In Fig. 5, we first compare AvatarStudio with DreamAvatar (Cao et al., 2023) and DreamWaltz (Huang et al., 2023) for specific character generation, and then compare with AvatarVerse and DreamHuman (Kolotouros et al., 2023) in terms of realistic human generation. As shown in Fig. 5, users prefer our model by a significant margin over all other methods.

CLIP Score. We use CLIP score (Detlefsen et al., 2022) as an evaluation metric to measure the consistency between the generated avatars and input texts for the above methods. For each method, we render the generated avatars from four evenly distributed horizontal views and calculate the averaged CLIP score for these rendered images and the input text. Similar to the user study, we compare the proposed method with DreamAvatar and DreamWaltz in terms of specific character generation and compare with AvatarVerse and DreamHuman for realistic human generation. The CLIP scores for DreamAvatar, DreamWaltz, and ours are 30.45, 31.52, and 32.80, respectively, while the CLIP scores for DreamHuman, AvatarVerse, and ours are 29.54, 28.88, and 32.17, respectively. Our AvatarStudio consistently outperforms all these methods, verifying its effectiveness in creating more accurate avatars in alignment with the input texts.
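As a sketch of this evaluation protocol, the CLIP score of a single avatar could be computed with TorchMetrics (Detlefsen et al., 2022) as shown below; the particular CLIP backbone and the uint8 image format are illustrative assumptions.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

@torch.no_grad()
def avatar_clip_score(rendered_views, prompt,
                      clip_model="openai/clip-vit-base-patch16"):
    """Average CLIP score between renderings of one avatar and its prompt.
    `rendered_views` is a (4, 3, H, W) uint8 tensor of the four evenly spaced
    horizontal views; the CLIP backbone here is an illustrative choice."""
    metric = CLIPScore(model_name_or_path=clip_model)
    metric.update(list(rendered_views), [prompt] * len(rendered_views))
    return metric.compute().item()
```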

Fig. 5

User preference

Fig. 6

Ablation studies on different 3D representations. Please zoom in for a better view. Please also refer to the project page for the video examples

5.2 Ablation Studies

Avatar Representation. Our approach, AvatarStudio, utilizes an articulated mesh representation in a coarse-to-fine manner, with the coarse stage represented by a NeRF. To explore the impact of different 3D representations, we optimize 3D avatars from text using either mesh-only (DMTet) or NeRF-only representations. As shown in Fig. 6, directly optimizing meshes for avatar creation leads to collapsed results, while using a NeRF-only representation often yields avatars of lower quality. In contrast, our proposed articulated representation, which combines NeRF and mesh, successfully generates high-resolution images with fine details, demonstrating its effectiveness.

Fig. 7

Ablation studies. Please zoom in for a better view. Please also refer to the project page for more video examples

Fig. 8

More examples for part-aware super-resolution and CFG rescale. Please zoom in for a better view. Please also refer to the project page for the video results

Fig. 9

Ablation studies on different guidance. Please zoom in for a better view. Please also refer to the project page for the video examples

Part-Aware Super-Resolution and CFG Rescale Strategy. Furthermore, we explore the individual impacts of part-aware super-resolution (SR) and the CFG rescale strategy (Figs. 7, 8). As shown in Figs. 7a and 8a, the CFG rescale method mitigates the color saturation issue, producing a more natural appearance for the generated avatar. With the addition of part-aware super-resolution, the model produces sharper appearances and finer local details, such as on faces and belts (see Fig. 7b). These studies validate the effectiveness of each proposed component, demonstrating their substantial contribution to the final result.

Fig. 10

AvatarStudio facilitates the animation of avatars using multimodal signals. We demonstrate examples of animated avatars, using a video-driven motion and b text-driven motion (“Michael Jackson is doing Moonwalk”, and “A pregnant person is dancing”), respectively. Please refer to the project page for more examples

DensePose-Conditioned ControlNet. AvatarStudio uses a ControlNet conditioned on DensePose for SDS guidance. To assess its efficacy, we compare the performance of our method when trained with Stable Diffusion (SD) or a skeleton-conditioned ControlNet (see Fig. 9). We observe that the model guided by Stable Diffusion generates avatars with incorrect poses and lower quality due to the lack of pose-aware guidance, which results in inaccurate animations. While the skeleton-conditioned ControlNet improves pose control, it still suffers from inaccuracies in foot positioning and head orientation. In contrast, our DensePose-conditioned diffusion guidance achieves precise and stable pose control, accompanied by high-quality textures, which validates the importance of DensePose-conditioned guidance in the avatar creation process. Moreover, we observe that the skeleton-conditioned ControlNet also suffers from the Janus problem (Fig. 9). This is because the keypoints reside inside the SMPL mesh, making it difficult to determine whether they are occluded, which can guide the text-to-3D model to yield incorrect back-view images. In contrast, the DensePose control signal provides a more detailed and accurate description of a person’s pose and viewpoint, guiding the model to generate reasonable 3D avatars and effectively mitigating the Janus problem. Leveraging DensePose as SDS guidance for 3D generation thus offers significant advantages over keypoint- or skeleton-based guidance: (1) skeleton-based guidance, while effective in many scenarios, is relatively sparse; this sparsity can lead to ambiguity in distinguishing between frontal and back views (the “Janus problem”), and the same skeleton-represented pose can map to multiple real human poses, leading to inaccuracies in the generated 3D avatars; (2) DensePose, on the other hand, provides a more detailed and accurate description of a person’s pose, and its dense nature allows a more precise mapping between the guidance and the actual human pose, thereby alleviating the Janus problem and enhancing view consistency.

To quantitatively assess the pose controllability of avatars generated with different diffusion guidances, we predict the SMPL parameters for posed avatar images using a pre-trained 3D human reconstruction model, HybrIK (Li et al., 2021), where the images are rendered using given SMPL parameters. We calculate the Mean Squared Error (MSE, reported in units of \(10^{-2}\)) between the input and the estimated SMPL parameters. Specifically, for each avatar, we generate 120 posed images using 120 fixed SMPL parameters in a frontal view and compute the average MSE across these images as the final score. The MSE for Stable Diffusion, the skeleton-conditioned ControlNet, and our method is 9.0, 7.7, and 5.9, respectively. AvatarStudio achieves the best pose control with the lowest MSE, further verifying the effectiveness of DensePose guidance.
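A minimal sketch of this pose-controllability metric, assuming the driving and re-estimated SMPL parameters are stacked into arrays of matching shape:

```python
import numpy as np

def pose_control_mse(gt_smpl_params, est_smpl_params):
    """Pose-controllability metric: MSE (reported x 1e2) between the SMPL
    parameters used to pose the avatar and those re-estimated from the
    renderings by an off-the-shelf reconstructor such as HybrIK."""
    gt = np.asarray(gt_smpl_params, dtype=np.float64)    # (120, D) driving parameters
    est = np.asarray(est_smpl_params, dtype=np.float64)  # (120, D) estimated parameters
    return float(np.mean((gt - est) ** 2) * 1e2)
```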

Fig. 11

Our AvatarStudio also supports style-guided avatar creation by simply providing an additional style image. Note that the provided style image can be combined with text prompts to enable flexible avatar creation (e.g., a policewoman in Pixar Disney style)

5.3 Applications

Multimodal Animation. A crucial feature of our method is its capability to provide high-quality, natural, and easy-to-use animation, allowing users to drive avatars with multimodal signals (e.g., video, text, and audio). Figure 10 illustrates the animation of avatars created by AvatarStudio using either video (Fig. 10a) or text (Fig. 10b). For video-driven animation, we first employ VIBE (Kocabas et al., 2021) to estimate SMPL sequences from the driving video, which are then used to animate the generated avatar. For text-driven animation, we adopt MDM (Tevet et al., 2023) to convert text into SMPL sequences. Despite adopting a simple SMPL-guided animation, our method produces plausible animations with natural movements. The consistency of these results w.r.t. the SMPL motions is attributed to two factors: (1) the pre-trained 2D diffusion model providing SDS guidance can correct areas where the SMPL deformation is inaccurate, leading to better animation results; (2) the strong DensePose prior provides rich pose information, helping the model learn the avatar under different poses. As such, AvatarStudio can leverage any multimodal-to-motion method that generates SMPL sequences for animation, showing the versatility and potential of our method in creating realistically animated avatars from diverse text prompts. For more results, please refer to our project page.

Style-Guided Avatar Creation. Moreover, we show that AvatarStudio supports stylized avatar creation by simply providing an additional style image. To achieve this, we employ IP-Adapter (Ye et al., 2023), an adapter that enables image prompt capability for pre-trained text-to-image diffusion model via a decoupled cross-attention design. We plug the IP-Adapter into our DensePose-conditioned ControlNet and optimize with SDS as before. Without bells and whistles, AvatarStudio can generate high-quality avatars of various styles of interest as shown in Fig. 11. Note that the provided style image can be combined with text prompts to enable flexible avatar creation (e.g., a policewoman in Pixar Disney style in Fig. 11). This capability expands its application, allowing users to create stylized avatars catering to specific aesthetic desires.

6 Conclusion

In this paper, we introduce AvatarStudio for creating high-fidelity and animatable 3D avatars from only textual inputs. AvatarStudio introduces articulated modeling into an explicit 3D mesh representation to support avatar animation while offering high rendering quality. To further improve pose controllability and view consistency, we leverage a DensePose-conditioned ControlNet for Score Distillation Sampling supervision. We also introduce several simple yet effective strategies, such as part-aware super-resolution for improving the fidelity of each body part, dual-space training for improving robustness across different poses, and CFG rescale for alleviating the color saturation issue. As a result, AvatarStudio supports various downstream applications, including multimodal avatar animation (e.g., video- or text-driven) and style-guided avatar creation.