1 Introduction

Digital art and visual design have become prevalent in our daily living spaces, expressing visually captivating aesthetics and the unique tastes and emotions of human beings. With the rise of artificial intelligence (AI), a new generation of toolsets for visual content creation has emerged, such as generative image synthesis (Stable Diffusion, DALL\(\cdot \)E 3, Midjourney) (Rombach et al., 2022; OpenAI, 2023; Midjourney, 2023) and video synthesis (Sora, RunwayML) (OpenAI, 2024; Runway, 2023). AI-based visual content creation also extends to the 3D domain, notably by lifting images to 3D scenes (LUMA AI) (Luma, 2023) and by creative text-guided 3D generation and design (DreamFusion, Meshy, SplineAI) (Poole et al., 2023; Meshy, 2024; Spline, 2023).

Fig. 1

The survey delves into the realm of neural stylization on diverse 3D representations, including meshes, point clouds, volumes, and neural fields. Neural stylization with visual, textual, and geometric features retrieved from large-scale neural models empowers artistic, photorealistic, and semantic transformation of the geometry and appearance of 3D scenes. Images adapted from Liu et al. (2018), Ma et al. (2023), Cao et al. (2020), Yin et al. (2021), Wang et al. (2023b), Zhang et al. (2022), Zhang et al. (2023c), Song et al. (2023), Haque et al. (2023)

It is noteworthy that the nature of visual concepts in our living space is tied to a critical factor: style. Formally, style is a way to express individuality and creativity across different media and practices. In relevant industries such as animation, architectural and interior design, gaming, augmented reality, virtual reality, and artwork creation, assets are often created in particular styles so that, together, they produce the intended look and feel of the final scene. A common practice is to first create the assets and then post-process them to match certain styles, a step often known as stylization. The advent of modern deep learning has led to the emergence of neural stylization, a family of methods that automatically create visual content in desired styles, facilitating the exploration of aesthetics in the creation of visual data. Neural stylization is applicable to visual data in general, including images, videos, and 3D data. Unlike image stylization, which has been well developed over the past decade, neural stylization for 3D data remains an open area with new creative vistas and practical applications to explore.

This report delves into the latest developments in the creation of 3D digital art using neural stylization methods, as shown in Fig. 1. Neural stylization can facilitate the automated design and creation of explicit meshes, textures, and volumetric assets, which supports seamless usage in the traditional rendering pipeline and accelerates labor-intensive manual tasks such as modeling, texturing, and simulation. Neural stylization also enables efficient and controllable manipulation or transformation of neural scenes, which typically use neural networks instead of shaders to generate images (Fig. 2). Neural stylization has already shown practical importance in various applications, including 3D texture design and artistic simulation in movie making (Navarro & Rice, 2021; Kanyuk et al., 2023; Hoffman et al., 2023), virtual production (Manzaneque, 2023), mixed reality experiences (Tseng et al., 2022; Taniguchi, 2019), and artwork creation (Guljajeva & Canet, 2022).

Fig. 2

A general overview of mesh-based and radiance field-based rendering pipelines. Images adapted from Yin et al. (2021), G. Kim et al. (2024), Chen and Wang (2024b), Avrahami et al. (2022), Spline (2023)

Fig. 3

Structure of our survey

Despite its advantages, performing neural stylization in 3D presents new technical challenges such as multi-view consistency, view sampling and rendering, as well as robustness issues including the relative scarcity of 3D datasets, and memory consumption for training and inference. This report provides a comprehensive discussion and summary of advanced 3D neural stylization techniques that address these challenges, highlighting the power of neural rendering (Tewari et al., 2022), vision-language models (Radford et al., 2021), and large-scale generative models (Rombach et al., 2022).

The structure of this report is depicted in Fig. 3, which is outlined as follows: Sect. 2 reviews fundamentals of 2D neural stylization and important visual or textural feature priors, which act as components or backbones of 3D neural stylization techniques. In Sect. 3, we introduce a taxonomy for neural stylization and discuss advanced stylization methods on various types of 3D representations in depth with summaries of practical tips to guide future works. In Sect. 4, we summarize popular datasets and evaluation metrics for 3D stylization, and particularly, we deliver a benchmark of 3D neural stylization to serve as a reference for the performance of selected methods. Sect. 5 introduces diverse applications of 3D neural stylization, demonstrating its practical value in a wide range of domains. Finally, Sect. 6 highlights promising research directions with practical significance.

1.1 Definition and Terminologies

Definition 1

3D neural stylization refers to the process of employing deep learning techniques and stylization algorithms to generate stylized 3D digital assets, or stylized renderings of such assets, including the alteration of appearance and/or geometry.

3D neural stylization is well connected to the following terminologies and techniques.

\(\bullet \) Neural Style Transfer (NST) refers to a class of algorithms that manipulate digital images or videos so that they adopt the appearance or visual style of a target reference while preserving the original content features. It can be regarded as a foundation of 3D neural stylization, as most image-guided 3D stylization methods rely on it. We refer to former surveys (Jing et al., 2019; Singh et al., 2021; Zhan et al., 2023) and provide a concise review in Sect. 2.1.

\(\bullet \) Neural Rendering is a class of techniques that “learn to render and/or represent a scene from real-world imagery, which can be an unordered set of images, or structured, multi-view images or videos” (Tewari et al., 2022), many of which focus on generating photorealistic renderings from neural representations. In contrast, neural stylization aims at modifying the visual appearance and aesthetic characteristics of existing digital representations to obtain artistic or photorealistic rendering results. We refer readers to existing surveys for insight into neural rendering (Tewari et al., 2020, 2022), neural fields (Xie et al., 2022) and image generation (Zhan et al., 2023).

\(\bullet \) Non-photorealistic Rendering (NPR) is an area of computer graphics that enables abstract stylized rendering for either 3D models, 3D images or 2D images, such as toon shading, Gooch shading (Gooch et al., 1998), stroke-based painterly rendering (Haeberli, 1990; Hertzmann, 1998), patch-based texture synthesis and transfer (Efros & Freeman, 2001; Hertzmann et al., 2001). Mainly leveraging programmable shaders and image-processing techniques, NPR has been widely used in the realms of animation making, digital content creation (Artineering, 2018) and game development (McGuire et al., 2010). However, these techniques require creating handcrafted style patterns and rules, which are labor-intensive and entail domain expertise. In contrast, neural stylization enables fast production with arbitrary style references and has been applied to accelerate cinematic digital production (Joshi et al., 2017; Navarro & Rice, 2021; Hoffman et al., 2023).

\(\bullet \) Neural Scene Editing has become more practical in the recent few years thanks to the contribution of large language models (LLMs) and vision-language models (VLMs) (Radford et al., 2021; Li et al., 2023a). Editing methods focus on adding, modifying, or removing objects in a scene, or manipulating some regions of interest. By contrast, stylization methods focus on the transfer of overall appearance and the adoption of specific aesthetic characteristics. Still, stylization methods share critical ideas with editing methods, and some of the methods covered in this survey can also apply to scene editing (Koo et al., 2023; Song et al., 2023; Bao et al., 2023).

1.2 Related Surveys

In the literature, there exist comprehensive surveys on 2D neural style transfer (Jing et al., 2019; Singh et al., 2021), surveys on generative image models (Zhan et al., 2023; Croitoru et al., 2023; Yang et al., 2023), and surveys on neural field representations (Xie et al., 2022) and rendering (Tewari et al., 2020, 2022). Our survey aims to explore the potential of connecting neural stylization techniques with both traditional and advanced 3D representations, thereby offering valuable resources for style-based 3D digital designs. To the best of our knowledge, this paper is the first comprehensive review to summarize neural stylization techniques and applications specifically tailored to 3D data, highlighting the immense capabilities of neural stylization in the 3D domain.

2 Background

In this section, we provide a brief discussion of neural style transfer for images, which serves as the fundamental building block for discussing 3D neural stylization (Sect. 2.1). This covers techniques leveraging visual or textual guidance for image style transfer and manipulation, as well as insights on their linkages to the 3D stylization domain. We also discuss generic methods for 3D content generation with a focus on the state-of-the-art diffusion models for 3D generation (Sect. 2.2).

Fig. 4

Pipeline comparisons of 2D neural style transfer. a Single-style transfer via optimization (Gatys et al., 2016; Kwon & Ye, 2022; Johnson et al., 2016). b Arbitrary style transfer via feature fusion or transformation (Huang & Belongie, 2017; Li et al., 2019; Liu et al., 2021). c Image-to-image translation with style condition via generative models (Huang et al., 2018; Deng et al., 2022; Wen et al., 2023; Zhang et al., 2023d)

2.1 Neural Style Transfer

The basic idea of neural style transfer is to reproduce the style of a reference image on an input image while keeping the original content of the input. The content representation of an image can be extracted by predicting its features with a pre-trained or trainable encoder (Simonyan & Zisserman, 2015; Huang et al., 2018). The style can be represented by a Gram matrix, a dot-product matrix that measures the correlation between each pair of feature channels extracted by a pre-trained network (Gatys et al., 2016). Alternatively, the style of an image can be characterized by the spatially invariant statistics (i.e., channel-wise mean and variance) of its features (Dumoulin et al., 2017; Huang & Belongie, 2017). With the rise of vision-language pre-trained models, textual embeddings have also been widely employed to represent content or style information (Radford et al., 2021; Li et al., 2023a). In Fig. 4, we provide a high-level pipeline comparison of different types of neural style transfer methods, including single-style transfer via optimization, arbitrary style transfer via feed-forward networks, and style transfer via generative models. We discuss each type of method below.

2.1.1 Single Style Transfer

One simple method for neural style transfer (Fig. 4a) is to optimize a white-noise image into a new image that shares the content of a source image and the style of a reference image (Gatys et al., 2016). Given a content source image c and a style reference image s, the optimization minimizes a combined objective \({\mathcal {L}}_{total} = {\mathcal {L}}_{c}(c, cs) + \lambda {\mathcal {L}}_{s}(s, cs)\), where the total loss consists of a content loss \({\mathcal {L}}_{c}\), the squared error of VGG features between the content image c and the output stylized image cs, and a Gram-matrix style loss \({\mathcal {L}}_{s}\) between the style image s and the output stylized image. \(\lambda \) is a weighting hyperparameter.
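
To make the objective concrete, the following PyTorch sketch optimizes an image under a VGG content loss and a Gram-matrix style loss. The layer indices, loss weights, and the initialization from the content image (the original formulation starts from white noise) are illustrative assumptions rather than the exact settings of Gatys et al. (2016).

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-16 feature extractor (assumes a recent torchvision).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = [3, 8, 15, 22]   # ReLU activations used for the Gram (style) loss
CONTENT_LAYER = 15              # a mid-level activation used for the content loss

def extract(x, layers):
    """Run VGG and collect feature maps at the requested layer indices."""
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in layers:
            feats[i] = out
    return feats

def gram(f):
    """Dot-product (Gram) matrix of a (B, C, H, W) feature map, size-normalized."""
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def stylize(content, style, steps=300, lam=1e4, lr=0.05):
    """content, style: (1, 3, H, W) tensors, ImageNet-normalized."""
    cs = content.clone().requires_grad_(True)   # could also start from white noise
    opt = torch.optim.Adam([cs], lr=lr)
    c_feat = extract(content, [CONTENT_LAYER])[CONTENT_LAYER].detach()
    s_grams = {i: gram(f).detach() for i, f in extract(style, STYLE_LAYERS).items()}
    for _ in range(steps):
        feats = extract(cs, set(STYLE_LAYERS) | {CONTENT_LAYER})
        loss_c = F.mse_loss(feats[CONTENT_LAYER], c_feat)
        loss_s = sum(F.mse_loss(gram(feats[i]), s_grams[i]) for i in STYLE_LAYERS)
        loss = loss_c + lam * loss_s            # L_total = L_c + lambda * L_s
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cs.detach()
```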

Instead of optimizing an image for each transfer, we can train a single feed-forward network with a perceptual loss to perform style transfer for arbitrary content images (Johnson et al., 2016). At inference, real-time stylization can be performed simply by forwarding an arbitrary content image through the network. Although performing network inference is much faster than running an optimization, one still needs to retrain the network for each different style.

The rise of vision-language models (Radford et al., 2021) makes it possible to perform style transfer using text prompts as guidance. Given CLIP, whose text encoder and image encoder share the same latent embedding space (Radford et al., 2021), text-guided image style transfer can be achieved by maximizing text-image semantic similarity as the style loss, usually formulated as the CLIP loss defined by

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{clip}(cs,s_{sty})&= 1 - sim( E_{I}(cs) , E_{T}(s_{sty}) ), \end{aligned} \end{aligned}$$
(1)

where cs and \(s_{sty}\) are the stylized image and style text prompt, \(E_{I}\) and \(E_{T}\) are the pre-trained CLIP image encoder and text encoder, respectively. \(sim(A, B)\) is the cosine similarity between two feature vectors. One can also use the directional CLIP loss (Patashnik et al., 2021; Gal et al., 2022) to achieve better style transfer quality:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{dir}(c, cs, s_{src}, s_{sty})&= 1- sim(\Delta I, \Delta T), \end{aligned} \end{aligned}$$
(2)

where \(\Delta I = E_{I}(cs) - E_{I}(c), \Delta T = E_{T}(s_{sty}) - E_{T}(s_{src})\), c is the content image, cs is the stylized image, \(s_{src}\) and \(s_{sty}\) are the source (content) style prompt and target style prompt. An example of \(s_{src}\) and \(s_{sty}\) can be “Photo” and “Picasso style painting”, respectively (Kwon & Ye, 2022). Since CLIP does not support high-resolution image embedding, a patch-wise version of the directional CLIP loss with augmented patches can be used for better artistic semantic texture transfer (Kwon & Ye, 2022).
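
Below is a minimal PyTorch sketch of the CLIP loss (Eq. 1) and the directional CLIP loss (Eq. 2), assuming the OpenAI CLIP package and image tensors already resized and normalized for CLIP; the ViT-B/32 backbone and the example prompts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP), assumed installed

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
for p in model.parameters():          # CLIP acts as a frozen critic
    p.requires_grad_(False)

def embed_text(prompt):
    tokens = clip.tokenize([prompt]).to(device)
    return F.normalize(model.encode_text(tokens), dim=-1)

def clip_style_loss(cs, style_prompt):
    """Eq. 1: one minus the cosine similarity between the stylized image
    embedding and the style text embedding."""
    e_img = F.normalize(model.encode_image(cs), dim=-1)
    return 1.0 - (e_img * embed_text(style_prompt)).sum(dim=-1).mean()

def directional_clip_loss(c, cs, src_prompt, sty_prompt):
    """Eq. 2: align the image-space edit direction (stylized minus content)
    with the text-space direction (target style minus source style)."""
    d_img = model.encode_image(cs) - model.encode_image(c)
    d_txt = embed_text(sty_prompt) - embed_text(src_prompt)
    d_img, d_txt = F.normalize(d_img, dim=-1), F.normalize(d_txt, dim=-1)
    return 1.0 - (d_img * d_txt).sum(dim=-1).mean()

# Example usage (images are (B, 3, 224, 224) tensors normalized for CLIP):
# loss = directional_clip_loss(content, stylized, "Photo", "Picasso style painting")
```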

2.1.2 Arbitrary Style Transfer: AdaIN and LST

To enable the model to transfer arbitrary styles without re-training, one can employ an autoencoder with style fusion (Fig. 4b). Particularly, to fuse the content and style features, we can use adaptive instance normalization (AdaIN) that directly regulates the mean and variance of the feature maps of the content image to match those of the target style image (Huang & Belongie, 2017):

$$\begin{aligned} AdaIN(c,s) = \sigma ( F (s)) \left( \frac{ F (c) - \mu ( F (c))}{\sigma ( F (c))} \right) + \mu ( F (s)), \end{aligned}$$
(3)

where each VGG feature map \( F (\cdot )\) is normalized separately. The transformed feature maps are fed into a learned decoder to generate the final output.
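
A minimal AdaIN sketch on VGG-style feature maps is shown below; in the full pipeline the transformed features would be passed to the learned decoder, which is omitted here.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization (Eq. 3): shift the channel-wise mean and
    standard deviation of the content feature map to match those of the style
    feature map. Inputs are feature maps of shape (B, C, H, W)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```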

Alternatively, we can fuse content and style features by learning an affine feature transformation matrix \( T \) from content features \( F (c)\) and style features \( F (s)\) through a convolutional neural network, as proposed by linear style transfer (LST)  (Li et al., 2019):

$$\begin{aligned} LST(c,s) = T \cdot ( F (c) - \mu ( F (c))) + \mu ( F (s)). \end{aligned}$$
(4)
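
The following sketch illustrates the idea of Eq. 4 with a deliberately simplified transformation predictor; the actual LST model computes \( T \) from compressed feature covariances with small CNNs, so the predictor below is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class LinearTransformPredictor(nn.Module):
    """Simplified stand-in for the LST transformation module (Li et al., 2019):
    it predicts a C x C matrix T from global content and style statistics."""
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.fc = nn.Linear(2 * channels, channels * channels)

    def forward(self, f_c, f_s):
        stats = torch.cat([f_c.mean(dim=(2, 3)), f_s.mean(dim=(2, 3))], dim=1)
        return self.fc(stats).view(-1, self.channels, self.channels)

def lst(f_c, f_s, T):
    """Eq. 4: apply the predicted linear transform to centered content features
    and re-add the style mean."""
    b, c, h, w = f_c.shape
    c_mean = f_c.mean(dim=(2, 3), keepdim=True)
    s_mean = f_s.mean(dim=(2, 3), keepdim=True)
    centered = (f_c - c_mean).reshape(b, c, h * w)
    return torch.bmm(T, centered).reshape(b, c, h, w) + s_mean
```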

2.1.3 Generative Models for Style Transfer

In addition to optimization-based and feed-forward inference methods, neural style transfer can be achieved through image synthesis tasks such as image-to-image translation with generative models. In particular, image-to-image translation (I2I) achieves style transfer by translating an image from a source domain to a target domain using a generative model such as a generative adversarial network (GAN) (Goodfellow et al., 2014) (Fig. 4c). Deterministic I2I translation methods focus on task-specific domain-to-domain translation and do not require style guidance via reference images (Isola et al., 2017; Zhu et al., 2017; Liu et al., 2017). Multi-modal I2I translation models enable translation guided by examples or latent features (Huang et al., 2018; Lee et al., 2018; Chang et al., 2020; Chen et al., 2022a). While GAN-based I2I models generate high-fidelity images, they are domain-specific and resource-intensive to train compared to the earlier methods (Fig. 4a, b).

Recently, diffusion models have demonstrated state-of-the-art performance for image synthesis (Rombach et al., 2022). A particular strength of diffusion models that contributes to their wide adoption is their ability to learn across different data modalities. Similarly to the spirit of CLIP loss for style transfer, text-guided diffusion models leverage textual embedding of the text prompt for conditional image generation and thus allow style transfer via text-to-image generation and text-guided I2I translation (Rombach et al., 2022; Saharia et al., 2022; OpenAI, 2023). Among the publicly available text-to-image diffusion models, the Stable Diffusion series (Rombach et al., 2022; Podell et al., 2023; Esser et al., 2024) is the most representative. They have inspired a vast amount of research work and a broad range of downstream applications.

Numerous works have explored the potential of text-to-image diffusion models for style transfer. One track fine-tunes the diffusion model (usually its U-Net) or learns a special textual embedding from a set of images with the target style, including techniques such as DreamBooth (Ruiz et al., 2023), LoRA (Hu et al., 2022; Frenkel et al., 2024), and textual inversion (Gal et al., 2022; Zhang et al., 2023d). Among them, B-LoRA (Frenkel et al., 2024) jointly fine-tunes two blocks of LoRA layers to capture the style and the content of an image respectively, enabling the transfer of the learned style to unseen content, or of the learned content to new styles. The textual inversion method InST (Zhang et al., 2023d) binds a special token to a textual embedding inverted from a style image; this style can then be transferred to other images by including the token in the prompt during inference.

Another track leverages the attention modules in diffusion U-Nets to embed style information without per-style optimization. Spatially invariant feature statistics, as discussed in Sect. 2.1.1, represent style effectively. Diffusion in Style (Everaert et al., 2023) pre-computes the mean and variance for Gaussian noise sampling based on the style feature statistics. StyleID (Chung et al., 2024f) employs AdaIN for noise initialization and replaces the content's self-attention keys and values with those of the style image. StyleAlign (Hertz et al., 2024) further applies AdaIN to the queries and keys of a sequence of generated images to ensure style consistency. DEADiff (Qi et al., 2024) focuses on cross-attention in high-resolution layers, utilizing Q-Former (Li et al., 2023a) for style extraction instead of AdaIN. InstantStyle (Wang et al., 2024) isolates style for transfer by subtracting content CLIP embeddings from the corresponding image CLIP embeddings.
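
As an illustration of the key/value substitution used by StyleID-like methods, the sketch below lets the content branch's queries attend to the style branch's keys and values; tensor shapes and the way this hooks into the diffusion U-Net are assumptions.

```python
import torch

def stylized_self_attention(q_content, k_style, v_style):
    """Attention with the content branch's queries and the style branch's keys
    and values. Tensors are (B, heads, tokens, dim); where and at which
    timesteps this replaces the U-Net's own self-attention is omitted here."""
    scale = q_content.shape[-1] ** -0.5
    attn = torch.softmax(q_content @ k_style.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_style
```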

Moreover, large pre-trained image editing models such as Instruct-Pix2Pix (Brooks et al., 2023) showcase good style transfer capability. Leveraging LLMs such as GPT-3 (Brown et al., 2020), Instruct-Pix2Pix automates the generation of diverse text prompts and adopts Prompt-to-Prompt (2023b) to create corresponding image pairs. A standard diffusion model is then trained on these pairs, except that the source image is concatenated to the input of the first network layer as conditioning information. However, its performance degrades when intricate styles are difficult to express in natural language.

2.1.4 Linking 2D to 3D Stylization

The exploration of 2D neural style transfer offers valuable insights into style features and their transformation: the spatially invariant statistics (mean and variance) of visual feature maps can represent the image style, and one can shift these statistics (to align with those of another image) to control the style of an image. These insights, along with several other practical techniques, can boost 3D stylization research. We briefly exemplify three directions of 3D stylization below.

  1. Loss Function Design. Sect. 2.1.1 revisited the basic version of loss functions for 2D style transfer tasks. When it comes to 3D stylization, it is straightforward to employ similar optimization losses in a view-by-view manner while maintaining multi-view consistency via the 3D representation. For instance, Liu et al. (2018) apply 2D latent content and style losses (Gatys et al., 2016) to supervise mesh surface morphing; Chen et al. (2024e) optimize texture style jointly with semantics-aware target image features (mean and standard deviation) and textual features (Eqs. 1-2).

  2. Feed-forward Feature Transform. The success of feed-forward arbitrary style transfer through feature statistics transformation in the 2D domain inspires feed-forward 3D neural style transfer with 3D-aware feature representations. For example, StyleGaussian (Liu et al., 2024) applies AdaIN (Eq. 3) to VGG (Simonyan & Zisserman, 2015) features stored in 3D Gaussians for efficient 3D style transfer. FPRF (Kim et al., 2024) applies a semantics-aware local AdaIN to features stored in tri-planes. StyleRF (Liu et al., 2023) proposes a modified volume-adaptive instance normalization for features obtained from feature grids.

  3. Stylization with Generative Priors. Burgeoning 2D large generative models (Sect. 2.1.3) have been leveraged to handle stylization and even address 3D consistency issues (with geometry priors). A simple yet effective way is to directly stylize a sampled view as the target using pre-trained generative models, followed by backpropagation of the 2D error, as in IN2N (Haque et al., 2023) and IG2G (Vachha & Haque, 2024). A more advanced approach is score distillation, which leverages the capability of diffusion models to process multi-modal guidance for better controllability. Score distillation was first proposed and widely adopted in 3D generation tasks; we discuss it in the next subsection.

2.2 3D Content Generation

Equipped with a background in neural style transfer, we now briefly discuss 3D content generation methods, which provide the foundation and valuable insights for the 3D neural stylization methods covered subsequently.

3D Representations In contrast to image representations, there exist various representations for learning to generate 3D content. Conventional 3D representations are mostly explicit, including triangle and polygon meshes, point clouds, and voxel grids (volumes). Advances in deep learning have spurred increasing interest in using neural networks to represent 3D data as neural fields, notably neural radiance fields (NeRF) (Mildenhall et al., 2020). Subsequently, notable hybrid or compact radiance field representations have appeared, exemplified by neural graphics primitives (NGP) (Müller et al., 2022) and 3D Gaussian splatting (3DGS) (Kerbl et al., 2023). Implicit representations, such as signed distance functions (SDF) and their truncated versions (TSDF), have also gained popularity for representing implicit shapes. We refer readers to an existing survey for a comprehensive overview of neural fields in visual computing (Xie et al., 2022).

3D Generative Models Existing 3D generative models have explored different types of 3D representations such as point clouds, voxel grids, meshes, and implicit fields (Zhao et al., 2021; Qi et al., 2017; Wu et al., 2015; Masci et al., 2015; Chen & Zhang, 2019). 3D data-driven generative models are trained on large-scale 3D assets with diverse appearances and shapes, which are challenging to collect (Chang et al., 2015; Deitke et al., 2023; Liu et al., 2019). Inspired by neural volume rendering, a line of 3D-aware image synthesis works has emerged that learns 3D generation from readily accessible 2D data (Niemeyer & Geiger, 2021; Nguyen-Phuoc et al., 2020; Chan et al., 2022; Gu et al., 2022). Because volume rendering is slow and resource-intensive, leading to long training times and low resolution, one can leverage a reduced 3D representation such as tri-planes within a GAN framework for efficient, high-quality image and 3D data generation (Chan et al., 2022).

Fig. 5

3D generation architecture with score distillation sampling loss. A pre-trained denoising U-Net supervises NeRF optimization. Image adapted from Poole et al. (2023)

3D Generation via Diffusion Priors In a similar spirit to 3D-aware image synthesis, it is of great interest to generate 3D data from priors learned by 2D diffusion models (OpenAI, 2023; Ruiz et al., 2023). DreamFusion (Poole et al., 2023) proposed a 3D generation pipeline that optimizes a NeRF by leveraging a text-guided diffusion model as a critic on the NeRF-rendered images (Fig. 5). The optimization gradient is a weighted noise residual multiplied by the Jacobian of the rendering process, a technique known as score distillation sampling (SDS).
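
A hedged sketch of the SDS update is shown below; `unet` and `scheduler` stand in for a pre-trained text-to-image diffusion model (e.g., one from the diffusers library), and their exact interfaces, the timestep range, and the weighting are assumptions rather than DreamFusion's actual implementation.

```python
import torch

def sds_loss(rendered, text_embed, unet, scheduler, t_range=(0.02, 0.98)):
    """Score distillation sampling (SDS) surrogate loss. `rendered` is a
    differentiably rendered view (or its latent) carrying gradients back to
    the 3D representation; `unet(noisy, t, text_embed)` and `scheduler`
    (exposing `num_train_timesteps` and `alphas_cumprod`) are placeholders
    for a pre-trained text-to-image diffusion model."""
    n = scheduler.num_train_timesteps
    t = torch.randint(int(t_range[0] * n), int(t_range[1] * n),
                      (rendered.shape[0],), device=rendered.device)
    noise = torch.randn_like(rendered)
    alpha_bar = scheduler.alphas_cumprod.to(rendered.device)[t].view(-1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * rendered + (1.0 - alpha_bar).sqrt() * noise

    with torch.no_grad():                      # the diffusion model is a frozen critic
        noise_pred = unet(noisy, t, text_embed)

    w = 1.0 - alpha_bar                        # a common weighting choice
    grad = w * (noise_pred - noise)            # SDS gradient w.r.t. the rendered view
    # Multiplying the detached gradient by `rendered` makes backprop deliver
    # exactly `grad` to the renderer / NeRF parameters, skipping the U-Net Jacobian.
    return (grad.detach() * rendered).sum()
```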

Fig. 6

Taxonomy of neural stylization. Images from Richardson et al. (2023), Aurand et al. (2022), Mildenhall et al. (2020), Liu et al. (2018), Haque et al. (2023), Yin et al. (2021), Zhang et al. (2022), Liu et al. (2023), Chen et al. (2023a), Pang et al. (2023)

Fig. 7

Hierarchical classification of selected image- and text-guided 3D neural stylization methods

Several variants have been proposed to resolve serious problems of the SDS loss, such as oversaturation, over-smoothing, and lack of details, and to generate more realistic and high-definition 3D objects. Variational score distillation (VSD) trains a LoRA model to better estimate the data distribution of rendered images for effective updating (Wang et al., 2023c). Delta denoising score (DDS) computes the difference between two SDS scores as guidance (Hertz et al., 2023a), while posterior distillation sampling (PDS) aligns the stochastic latents of the source image and the target image instead of noise variables (Koo et al., 2024). Further works explore instilling geometric information into score distillation (Yang et al., 2023; Yeh et al., 2024).

Moreover, 3D datasets are also exploited to provide geometric priors such as canonical coordinate maps (Li et al., 2024a) and normal maps (Long et al., 2024) for better multi-view consistency. Zero-1-to-3 (Liu et al., 2023) proposed a view-conditioned diffusion model that accepts a relative camera transformation as an extra condition and synthesizes novel views from any single view of a 3D model. Recent 3D generation works further improve visual quality by combining 2D priors from text-to-image diffusion with 3D-aware priors from view-conditioned diffusion (Qian et al., 2024; Sun et al., 2024).

3 3D Neural Stylization

In this section, we first establish a taxonomy for neural stylization and give an example of the categorization of selected 3D neural stylization methods (Sect. 3.1). In the subsequent sections, we will discuss state-of-the-art 3D neural stylization techniques on diverse 3D representations, such as meshes (Sect. 3.2), neural fields (Sect. 3.3), volumetric data (Sect. 3.4), point clouds (Sect. 3.5), and implicit shapes (Sect. 3.6). We then discuss a set of guidelines for practical implementations of 3D stylization (Sect. 3.7).

3.1 Taxonomy

Our taxonomy for neural stylization methods consists of the following aspects:

  • Representations. We categorize stylization methods based on data representations such as image, mesh, volume, point cloud, and neural field.

  • Neural Style Feature. We categorize based on image visual features, textual semantic features, or 3D latent features derived from pre-trained models, typically neural classifiers or generative models.

  • Optimization. This refers to whether a method is optimization-based or prediction-based, and whether it supports a single style, multiple styles, or arbitrary styles.

  • Stylization Genres. This refers to different types of stylization, mainly geometry stylization, which operates on asset shapes and surface patterns, and appearance stylization, which focuses on color, texture, and visual patterns to match target styles ranging from artistic paintings to realistic concepts.

To guide the reader through the main section of this survey, we illustrate the taxonomy in Fig. 6, and provide a hierarchical classification of the 3D stylization methods in Fig. 7. Let us now discuss 3D neural stylization methods by following the categorization based on 3D representations below.

3.2 Mesh-Based Stylization

In computer graphics and 3D modeling, a mesh is a collection of vertices, edges, and faces that define the geometric structure of an object. Objects represented by meshes can also store additional appearance attributes, such as vertex colors, materials, UV coordinates, and texture maps. Additionally, neural networks such as multi-layer perceptrons (MLPs) can represent these attributes, including neural textures (Thies et al., 2019; Oechsle et al., 2019), neural reflectance fields (Baatz et al., 2022), neural visibility fields (Srinivasan et al., 2021), and neural vertex attributes (Michel et al., 2022; Ma et al., 2023; Lei et al., 2022). By using differentiable renderers (Ravi et al., 2020; Laine et al., 2020; Fuji Tsang et al., 2022), we can optimize these explicit or implicit attribute representations for 3D geometry manipulation and appearance editing. For example, one can predict vertex positions and colors (Michel et al., 2022), SVBRDF parameters and normals (Lei et al., 2022), or synthesize new texture images (Richardson et al., 2023). The following sections cover critical techniques for mesh-based stylization, including geometric deformation (Sect. 3.2.1) and texture synthesis (Sect. 3.2.2) to align with provided image, text, or 3D shape guidance. Table 1 shows a comparison of recent mesh-based stylization methods.

3.2.1 Surface Geometric Deformation

3D neural stylization enables deforming mesh geometry to align with artistic visual patterns or a specified shape, guided by visual references or textual descriptions. This capability facilitates creative 3D modeling such as surface engraving effect (Liu et al., 2018) and geometry morphing (Gao et al., 2023). Existing works learn geometry variations in the form of vertex position displacement, e.g., explicit displacement via differentiable rendering (Liu et al., 2018) and implicit displacement via neural networks (Michel et al., 2022; Ma et al., 2023; Gao et al., 2023).

For example, Paparazzi (Liu et al., 2018) is a neural stylization method based on differentiable rendering that propagates changes in the image domain to changes of the mesh vertex positions. It takes a triangle mesh as input and applies latent VGG content and style losses (Sect. 2.1, Gatys et al. (2016)) between rendered images and a gray-scale style image to update the vertex positions. After convergence, the mesh surface is stylized with artistic strokes and motifs from the style image.
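
The following sketch captures this optimization loop in PyTorch; `render` and `style_loss` are placeholders for a differentiable renderer (e.g., nvdiffrast or PyTorch3D) and the 2D style loss of Sect. 2.1, so their interfaces are assumptions rather than Paparazzi's actual implementation.

```python
import torch

def stylize_mesh_vertices(verts, faces, style_gray, render, style_loss,
                          steps=500, lr=1e-3):
    """Optimize per-vertex displacements so that rendered gray-scale views of
    the mesh match a style image under a 2D style loss. `render(verts, faces)`
    returns a differentiable gray-scale rendering; `style_loss(img, style)` is
    e.g. the Gram-matrix VGG loss sketched in Sect. 2.1.1."""
    offsets = torch.zeros_like(verts, requires_grad=True)   # learn displacements
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(steps):
        img = render(verts + offsets, faces)                 # differentiable rendering
        loss = style_loss(img, style_gray)                   # 2D error flows to vertices
        opt.zero_grad()
        loss.backward()
        opt.step()
    return verts + offsets.detach()
```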

Table 1 Summary of selected mesh-based neural stylization. PBR refers to physically-based rendering

With the CLIP loss (Sect. 2.1.1, Eq. 1), recent works have explored text-guided alteration of mesh geometry and/or appearance (Michel et al., 2022; Ma et al., 2023). Text2Mesh (2022) and X-Mesh (2023) incorporate a neural style field, an MLP that maps vertex coordinates to a vertex color (offset) and a vertex position offset. The stylized mesh, with updated vertex colors and positions, is rendered into multiple colored and gray-scale images, which are used to compute the CLIP loss against a given text prompt.

Besides updating with the CLIP loss, X-Mesh (Ma et al., 2023) employs an attention module that directly takes the prompt's CLIP embedding as additional input alongside the vertex coordinate embeddings. Accordingly, X-Mesh converges within a few minutes to high-quality stylized results. TextDeformer (Gao et al., 2023) upgrades local CLIP-guided mesh geometric stylization (Wang et al., 2022; Michel et al., 2022) to global and smooth mesh deformation through Jacobians (Aigerman et al., 2022). Instead of learning position displacements directly, it assigns a Jacobian matrix to each triangle and solves a Poisson problem (Aigerman et al., 2022) to compute the corresponding vertex deformation map, achieving deformations that range from low-frequency shape changes to high-frequency details.

Alternatively, 3DStyleNet (Yin et al., 2021) learns joint geometric and texture style transfer from one 3D object to another, as well as interpolation of geometric and texture styles. The method consists of a 3D part-aware geometric style transfer network and a 2D texture style transfer network. The authors abstract the geometry of an object with a set of 3D Gaussian ellipsoids and employ a learned 3D part-aware affine transformation field based on the Linear Blend Skinning (LBS) model (Lindholm et al., 2001), while the mesh texture is transferred using regular image style transfer techniques (Sect. 2.1, LST (2019)). The two networks are pre-trained on non-textured mesh models (TurboSquid, 2023; Chang et al., 2015; Renderpeople, 2023) and images (WikiArt (2016) and COCO (2014)), respectively, and are then jointly optimized with part-aware, content, and style losses on multi-views rendered via a differentiable renderer (Laine et al., 2020).

A recent work (Haetinger et al., 2024) further explores geometry stylization for dynamic meshes, enabling efficient production of stylized physics simulations and animations. It employs neural neighbor style transfer (Kolkin et al., 2022) instead of a Gram-matrix loss to guide the style transfer, which produces higher-quality high-frequency details by replacing each individual feature of the content image with its closest feature in the style image. The key to effective and natural global and local stylization is a multi-level parameterization of the mesh vertex positions, which allows the 2D error to propagate sufficiently through differentiable rendering. An additional mechanism for interpolating and smoothing vertex displacements across frames improves temporal coherency. These enhancements result in high-quality, artifact-free mesh stylizations, suitable for creating unique artistic looks in simulations and 3D asset design.

3.2.2 Texture Synthesis

Mesh textures are essential for representing complex visual appearance with color and patterns, for which image- and text-guided neural style transfer is well developed (Sect. 2.1). Several methods have explored texture transformation and synthesis using visual features such as VGG features and diffusion priors (Höllein et al., 2022; Lei et al., 2022; Richardson et al., 2023; Cao et al., 2023; Yang et al., 2023). We categorize these methods by their optimization and learning techniques and discuss them below.

Optimize via 2D Features With an artistic image reference, StyleMesh (Höllein et al., 2022) proposed a depth- and angle-aware texture optimization scheme for reconstructed indoor rooms. It optimizes an explicit texture image by backpropagating gradients computed from 2D content and style losses (Sect. 2.1, Gatys et al. (2016)) between each view of the scene and the style reference image. The method leverages depth and normal information from the mesh, mitigating view-dependent stretching and scaling artifacts that commonly arise from conventional 2D losses in 3D scenarios, as shown in Fig. 8. Nonetheless, StyleMesh relies heavily on posed images of reconstructed scenes and ground-truth depth.

Optimizing mesh appearance colors with style descriptions using CLIP is effective but may not always achieve realistic results (Michel et al., 2022; Lei et al., 2022; Ma et al., 2023). Recently, text-to-image diffusion models have gained popularity for their ability to synthesize high-fidelity images. Researchers have therefore started to explore lifting 2D diffusion priors to 3D generation (Poole et al., 2023; Wang et al., 2023c; Lin et al., 2023; Chen et al., 2023b) and stylization (Chen et al., 2023a; Yang et al., 2023; Zeng et al., 2024; Youwang et al., 2024). Among these, TEXTure (Richardson et al., 2023) is a text-guided 3D texture painting method that iteratively paints the texture in a view-by-view manner. To maintain 3D consistency, each view painting iteration is guided by a view-dependent trimap that indicates “keep”, “refine”, and “generate” regions, controlling the amount of newly generated content in the texture. Alongside the trimap, a rendered RGB image and depth map are fed into a pre-trained depth-to-image diffusion model (Sect. 2.1.3, ControlNet (2023b)) to obtain a synthesized view, which is finally projected back onto the texture map via optimization. Beyond texturing, TEXTure supports various tasks such as texture transfer, texture editing, and multi-view image transfer.

Fig. 8

Stretched pattern artifacts from stylization in screen space. Image adapted from Kato et al. (2018)

A concurrent work to TEXTure is Text2Tex (Chen et al., 2023a), which likewise progressively inpaints the texture image from different views with the help of a confidence trimap and a depth-to-image ControlNet (2023b). The method presets several axis-aligned viewpoints and alternately updates the next best view (see Table 4), following a more robust automatic view-scheduling strategy that addresses blurriness and stretching artifacts. Paint3D (Zeng et al., 2024) employs coarse-to-fine UV diffusion models to further refine incomplete areas of the multi-view inpainted texture in high definition. For indoor room scenarios, DreamSpace (Yang et al., 2024) synthesizes an indoor panorama with additional diffusion-based inpainting (Zhang et al., 2023b) for more consistent texture synthesis.

Optimize in Latent Space While running the entire generative diffusion process for multi-view painting provides an efficient approach to texture synthesis, it often results in inconsistent texture patterns and overall style. Instead, TexFusion (Cao et al., 2023) updates a 3D-consistent latent texture at each denoising step from multiple views conditioned on previous denoising steps. To ensure consistency, the final texture image is optimized by distilling multi-view images decoded by a pre-trained depth-conditioned diffusion model (Sect. 2.1.3, Stable Diffusion (2022)) into a neural color field mapping 3D coordinates to RGB values (Müller et al., 2022). Similarly, Knodt and Gao (2023) update a latent texture map from multiple views via MultiDiffusion (Bar-Tal et al., 2023), a multi-window joint diffusion technique for multi-view consistency.

Optimize via Score Distillation Inspired by score distillation sampling (SDS) techniques (Sect. 2.2, Fig. 5), several works employ SDS and its variants for texture optimization, focusing on a single object or a single room (Yang et al., 2023; Guo et al., 2023; Wu et al., 2023; Chen et al., 2024a; Yeh et al., 2024). In particular, TextureDreamer (Yeh et al., 2024) and 3DStyle-Diffusion (Yang et al., 2023) adopt a neural field representation with BRDF parameters (Lei et al., 2022) to facilitate photorealistic rendering. Both works (Yang et al., 2023; Yeh et al., 2024) incorporate geometry-conditioned score distillation from ControlNet (2023b), leveraging additional inputs such as depth, normal, and camera pose. Additionally, Decorate3D (Guo et al., 2023) and HyperDreamer (Wu et al., 2023) utilize super-resolution diffusion techniques to enhance the synthesis of textures at higher resolutions.

Optimize with 3D Shape Supervision Point-UV Diffusion (Yu et al., 2023) explores texture synthesis leveraging shape attributes of the mesh model, such as vertex coordinates, normals, and segment masks. The proposed coarse-to-fine texture synthesis framework, which combines a point diffusion network (Liu et al., 2019; Zhou et al., 2021) with a UV diffusion network, enables unconditional texture synthesis for arbitrary mesh models of each training category in the ShapeNet dataset (Chang et al., 2015). The pipeline can also take additional image or text guidance through the CLIP encoder. Given a mesh and visual or textual style guidance, the point diffusion model generates colors for points sampled from the mesh, which are then projected onto the 2D UV space to create a coarse texture image. Subsequently, the UV diffusion model utilizes the coarse texture and additional shape attributes to predict the high-fidelity texture.

Table 2 Summary of selected neural field stylization methods. 3D Repr., Struct., geo., app. refer to 3D representation, data structure, geometry and appearance, respectively

3.3 Neural Field-Based Stylization

A neural field is “a field that is parameterized fully or in part by a neural network” (Po et al., 2023). The advanced 3D representations of neural fields, especially neural radiance fields (NeRFs) (Mildenhall et al., 2020; Sun et al., 2022; Müller et al., 2022; Kerbl et al., 2023), store scene geometry and appearance in a neural network or explicit data structure, enabling photorealistic rendering and 3D stylization in the latent space. Compared to mesh models (Fig. 2), which rely on texture maps to store various visual information such as albedo, roughness, metalness, and baked lighting, neural fields store learned features that are mapped to RGB images during rendering. Therefore, we can either stylize novel views during rendering without modifying the original neural field (Sect. 3.3.1), or stylize the neural field itself by updating the stored latent features (Sect. 3.3.2). Methods that stylize novel views learn a universal style transformation module for 3D-aware view features, thus avoiding additional training for each style instance. In contrast, approaches that stylize the neural field itself require optimization for each style or input reference set, but the resulting stylized neural field assets allow regular neural rendering and thus seamless usage in related tools and software. Table 2 summarizes neural field-based stylization works in terms of taxonomy and technical comparison. Please refer to the related surveys for a comprehensive review of neural fields and their applications (Xie et al., 2022; Chen & Wang, 2024b).

3.3.1 Feed-Forward Novel View Stylization

A straightforward approach to stylize a 3D scene is to stylize its novel views (Huang et al., 2021). However, a simple combination of existing 2D stylization and novel view synthesis methods leads to blurry and inconsistent results. Instead, LSNV (Huang et al., 2021) proposed a feed-forward point cloud feature transformation model: it first reconstructs a 3D point cloud by back-projecting points of feature maps extracted from multi-view images under depth guidance, and then feeds these features into a transformation network (similar to LST (2019) in Sect. 2.1) to obtain stylized point cloud features, which are decoded to render novel views.

Similarly to neural style transfer, one can also perform arbitrary style transfer for novel views of a NeRF scene (Chiang et al., 2022c; Huang et al., 2022; Liu et al., 2023; Kim et al., 2024). This involves a two-phase process consisting of geometry reconstruction training (NeRF training) and appearance stylization training. Chiang et al. (2022c) proposed to transfer arbitrary artistic styles to novel views of a NeRF++ representation (Zhang et al., 2020) for large outdoor \(360^{\circ }\) unbounded scenes. In the reconstruction phase, they separate geometry (density output) and appearance (view-dependent color output) into two branches, as illustrated in Fig. 9. In the stylization phase, they fix the geometry branch and use an MLP hypernetwork (Ha et al., 2016), fed with style features from a pre-trained VAE encoder, to update the parameters of the appearance branch. Since NeRF++ cannot render high-resolution images quickly, they propose small-patch sub-sampling (Schwarz et al., 2020) to compute the content and style losses (Huang & Belongie, 2017).

Fig. 9

Canonical NeRF with geometry and appearance branches in 3D stylization

As discussed in Sect. 2.1.2, AdaIN and LST are two mainstream techniques for arbitrary style transfer, which are adapted in StylizedNeRF (Huang et al., 2022) and StyleRF (Liu et al., 2023), respectively. StylizedNeRF applies a mutual learning strategy between a 2D arbitrary style transfer AdaIN model (Huang & Belongie, 2017) and a NeRF to achieve multi-view consistency and stylization of the NeRF appearance. Specifically, they pre-train the style transfer model under the supervision of multi-view content, style, and an additional consistency loss. During mutual learning, they jointly train the style transfer model and a new appearance MLP branch while keeping NeRF's geometry branch fixed.

StyleRF (Liu et al., 2023) employs a style transformation mechanism similar to LSNV: it learns a 3D feature grid lifted from pre-trained VGG features and applies a linear style transformation to the weighted features of sampled points during the ray marching process of NeRF (Chen et al., 2022). A 2D CNN decoder then generates the stylized views. StyleRF also adopts two-stage training: a feature-grid learning and reconstruction stage without viewing-direction input (based on TensoRF (Chen et al., 2022)), and a stylization stage with fixed geometry. Moreover, StyleRF demonstrated the advantage of data-driven style training through style interpolation and multi-style transfer with a 3D mask.

Later works further explore arbitrary photorealistic style transfer (Chen et al., 2022b), unbounded urban-scale scene style transfer (Kim et al., 2024) based on the K-Planes representation and 2D-to-3D lifted DINO semantic features (Fridovich-Keil et al., 2023; Caron et al., 2021), and point cloud or mesh reconstruction from stylized novel views (Ibrahimli et al., 2024).

3.3.2 Optimization-Based Neural Field Stylization and Editing

In this section, we explore stylizing a neural radiance field (NeRF) by updating the scene information and features stored in neural networks or explicit data structures, rather than processing views during the rendering phase. Most NeRF-based approaches for appearance optimization follow a two-step procedure (Zhang et al., 2022; Fan et al., 2022; Zhang et al., 2023c; Pang et al., 2023). First, the scene's geometry and appearance are reconstructed from multiple posed views. Then, during the optimization of the appearance style, the geometry can be either fixed or self-distilled, as depicted in Fig. 9. Notably, some methods also incorporate geometry updates during the stylization phase (Nguyen-Phuoc et al., 2022; Wang et al., 2023a; Haque et al., 2023).

A General Optimization Framework Nguyen-Phuoc et al. (2022) proposed SNeRF, a general alternating optimization pipeline for novel view stylization that works with arbitrary off-the-shelf 2D style transfer methods and any NeRF variant (Fig. 10). The method proceeds sequentially: first, the NeRF scene is reconstructed from the original content multi-views; then, rendered multi-views are stylized and used to fine-tune the NeRF in a loop. Over several iterations, the entire NeRF representation gradually becomes stylized in a 3D-consistent manner.

Fig. 10

A general framework for NeRF stylization through alternating updates of multi-views (Nguyen-Phuoc et al., 2022; Haque et al., 2023). Images adapted from Haque et al. (2023)

Haque et al. (2023) further extended this framework to text-guided NeRF editing and introduced Instruct-NeRF2NeRF, which edits a NeRF scene by leveraging 2D diffusion priors. Similar to SNeRF (2022), they adopt an iterative process that alternately updates the training views and the NeRF scene (Fig. 10). A training view is replaced by editing a rendered view from that viewpoint with an off-the-shelf image-to-image diffusion model, e.g., Instruct-Pix2Pix (Brooks et al., 2023), and NeRF training then continues on the updated training data.

Extensive experiments validated the flexibility of this general framework, showcasing its compatibility with off-the-shelf 2D style transfer methods across a range of NeRF variants. Despite promising results, the framework is limited by its time-consuming iterative nature and is vulnerable to variations among the stylized views, which can lead to style dilution and inconsistency. To address this issue, ViCA-NeRF (Dong & Wang, 2023) employs view-consistency-aware NeRF editing, which establishes explicit connections between different views and propagates editing information from edited to unedited views.
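
The alternating framework can be summarized by the following hedged sketch; `nerf`, `stylize_2d`, and the dataset interface are placeholders rather than a specific implementation, and the update granularity (all views per round here, versus one view at a time in Instruct-NeRF2NeRF) is a simplification.

```python
def alternating_stylization(nerf, dataset, stylize_2d, rounds=10, steps_per_round=1000):
    """Iteratively replace training views with stylized/edited renderings and
    fine-tune the NeRF on them, so the stylization becomes 3D-consistent."""
    for _ in range(rounds):
        # 1) Stylize (or edit) renderings from the training viewpoints and
        #    write them back into the training set.
        for idx, camera in dataset.cameras():
            rendered = nerf.render(camera)
            dataset.update_view(idx, stylize_2d(rendered))
        # 2) Continue NeRF optimization on the updated training data.
        nerf.train_steps(dataset, steps_per_round)
    return nerf
```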

Image-guided Radiance Field Stylization Image-guided neural style transfer (Sect. 2.1) has undergone significant development over the years and has inspired a multitude of works focusing on image-guided NeRF stylization. Similarly, most works follow a two-step optimization process, including NeRF reconstruction and appearance stylization stages.

Notably, ARF (Zhang et al., 2022) introduced the artistic radiance field approach, utilizing a content loss (Gatys et al., 2016) and a nearest neighbor feature matching (NNFM) style loss (Eq. 5) to optimize and stylize the appearance of a reconstructed NeRF scene given an exemplar style image. NNFM minimizes the cosine distance between VGG features of the style reference and the rendered image:

$$\begin{aligned} {\mathcal {L}}_{nnfm} = \frac{1}{N}\sum _{i,j}\min _{i',j'} D(F(cs)_{ij}, F(s)_{i'j'}), \end{aligned}$$
(5)

where \(F(\cdot )_{ij}\) denotes the feature vector at pixel location \((i,j)\) of the feature map \(F(\cdot )\), and D is the cosine distance. Experiments validated that the NNFM loss yields more visually appealing NeRF stylization results than the typical Gram matrix loss (Gatys et al., 2016) or the CNNMRF loss (Li & Wand, 2016).
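
A minimal PyTorch sketch of the NNFM loss of Eq. 5 is given below, assuming batch-size-one VGG feature maps; practical implementations typically chunk the pairwise distance computation to bound memory.

```python
import torch
import torch.nn.functional as F

def nnfm_loss(feat_cs, feat_s):
    """For every feature vector of the rendered view, find its nearest style
    feature under cosine distance and average those distances (Eq. 5).
    Inputs are feature maps of shape (1, C, H, W)."""
    c = feat_cs.shape[1]
    x = F.normalize(feat_cs.reshape(c, -1).t(), dim=1)   # (N_cs, C) rendered features
    y = F.normalize(feat_s.reshape(c, -1).t(), dim=1)    # (N_s,  C) style features
    cos_dist = 1.0 - x @ y.t()                           # pairwise cosine distances
    return cos_dist.min(dim=1).values.mean()             # min over style, mean over pixels
```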

However, ARF still lacks explicit semantic correspondences. To address this limitation, Ref-NPR (Zhang et al., 2023c) proposes to first stylize a single view using structure-preserving 2D stylization algorithms (Sect. 2.1) or manual editing, and then use this stylized reference view to construct a reference voxel dictionary for the scene, which enables matching of semantic and color features between the edited view and the scene. The follow-up work CoARF (Zhang et al., 2024b) allows style transfer with precise control over specific objects indicated by 2D segmentation masks; these semantics are also used when computing the NNFM loss. Concurrently, ReGS (Mei et al., 2024) enables high-quality stylization that mimics reference textures while maintaining real-time rendering for free-view navigation, achieved by adapting 3D Gaussian Splatting (3DGS) and regularizing with scene depth.

Drawing on the disentanglement of content and style representations in style transfer (Huang et al., 2018; Lee et al., 2018), Fan et al. (2022) proposed a generalizable model consisting of a style MLP and a content MLP that separately encode the style image and the input scene, and an amalgamation MLP that fuses style and content features to output the final color and density. Additionally, Pang et al. (2023) considered semantic style matching and added a segmentation output to the geometry branch with an extra segmentation MLP after the hash encoding of iNGP (Müller et al., 2022). Both works use conditional style representations by feeding a one-hot vector or style index to the neural field, enabling conditional stylization over several styles (Fan et al., 2022; Pang et al., 2023).

Apart from the artistic style transfer approaches, LipRF (Zhang et al., 2023d) addressed the challenges in 3D photorealistic stylization by leveraging a Lipschitz MLP to transform the radiance appearance field during the stylization training stage. The scene views are first stylized by 2D photorealistic style transfer methods (Yoo et al., 2019; Wu et al., 2022) and then used to train the Lipschitz MLP.

Text-guided Radiance Field Stylization Recent advances in vision-language models and text-to-image diffusion models (Sect. 2.1.3) have inspired a body of work on text-guided or text-to-image-guided 3D scene stylization and editing (Wang et al., 2022, 2023a, b; Song et al., 2023; Bao et al., 2023; Haque et al., 2023; Sella et al., 2023; Zhuang et al., 2023; Shum et al., 2024). Here we briefly discuss some advances in text-guided NeRF stylization, showcasing their potential for rapid prototyping and customization of 3D asset designs.

NeRF-Art (Wang et al., 2023a) proposed text-guided NeRF stylization with profound semantics. Unlike the simple color and shape stylization of objects in CLIP-NeRF (Wang et al., 2022), which uses a CLIP-based matching loss (Eq. 1), NeRF-Art realizes complex stylization of diverse shapes and scenes, such as turning a human face into a Tolkien elf. The method fine-tunes a pre-trained NeRF (Yariv et al., 2021) using the relative directional CLIP loss (Eq. 2), local and global contrastive CLIP-based losses (Chen et al., 2020), a perceptual loss (Johnson et al., 2016), and a weight regularization (Barron et al., 2022) for sharper details. The follow-up work of Wang et al. (2023b) further enhances semantics-aware stylization with a semantic contrastive loss and a CLIP model fine-tuned on the ArtBench artwork database (Liao et al., 2022) for accurate artistic textual embeddings.

SINE (Bao et al., 2023) instead employs a two-branch editing field to learn geometric and appearance adjustments. Given an edited view of a pre-trained NeRF scene, the method establishes a mesh prior using either DIF (Deng et al., 2021) for specific object categories or ARAP (Sorkine & Alexa, 2007) with depth estimation (Bhat et al., 2021) and 2D feature matching (Jiang et al., 2021), and then composes a texture prior based on semantic features and structural self-similarity (Caron et al., 2021; Tumanyan et al., 2022). To preserve irrelevant areas, it distills a semantic feature field from DINO (Caron et al., 2021) and clusters features in the edited region of the edited view. To achieve precise manipulation of specific regions, Blending-NeRF (Song et al., 2023) leverages CLIPSeg (Lüddecke & Ecker, 2022), a pre-trained image segmentation model, and employs a region loss for supervision.

To support precise editing while keeping irrelevant regions untouched, DreamEditor (Zhuang et al., 2023) directly uses 2D diffusion priors. Specifically, the method transforms the NeRF into a mesh-based neural field via marching cubes (Lorensen & Cline, 1987) and distillation, with each mesh vertex assigned geometry and color features. To localize the edit regions aligned with a text prompt, the method leverages DreamBooth (Ruiz et al., 2023) to fine-tune Stable Diffusion (Rombach et al., 2022) using views sampled from a spherical viewing trajectory centered on the scene, and then retrieves 2D attention maps (Hertz et al., 2023b) as view masks, which are back-projected to the 3D scene to form the 3D editing mask. Finally, the geometry and color features, as well as the mesh vertex positions in the 3D masked region, are jointly optimized by the SDS loss (Sect. 2.2). Similar to implicit shape deformation methods (Bao et al., 2023; Gao et al., 2023), DreamEditor employs mesh vertex regularizers, including Laplacian rigidity losses (Sumner et al., 2007) among neighboring vertices, for smooth mesh deformation.

Facing the limitation of slow optimization in NeRF editing, ED-NeRF (Park et al., 2024) proposed to edit a latent NeRF using 2D latent diffusion priors (Rombach et al., 2022), improving editing efficiency. However, multi-view rendering of latent features lacks geometric consistency, so they introduce a refinement layer with ResNet blocks and self-attention layers to refine inconsistent multi-view latent features. During editing, ED-NeRF employs the delta denoising score (DDS) (Hertz et al., 2023a), the difference between two SDS scores conditioned on two different text prompts, together with a masked DDS for the target region segmented by CLIPSeg (Lüddecke & Ecker, 2022) and SAM (Kirillov et al., 2023) to keep irrelevant regions unchanged.

Recent advancements in 3DGS-based scene editing offer new solutions for efficient optimization, fine-grained control, and high-quality scene segmentation. GaussCtrl (Wu et al., 2024) and GaussianEditor (Chen et al., 2024d) introduce text-driven editing for 3D Gaussian Splatting. GaussCtrl emphasizes multi-view consistency and depth-conditioned editing, while GaussianEditor enhances control and precision using Gaussian semantic tracing and hierarchical splatting. Gaussian Grouping (Ye et al., 2025) extends Gaussian Splatting by incorporating identity encoding for object segmentation, enabling fine-grained scene understanding and versatile editing applications. These methods collectively enable real-time, high-quality, and efficient 3D scene manipulation across a wide range of applications, such as object removal, inpainting, and style transfer.

Table 3 Summary of selected neural stylization works for volume, point clouds and implicit shapes

3.4 Volume Stylization

Compared to other representations, volume is an intuitive representation of 3D data, as naturally extended from 2D image representation. 3D neural stylization can be performed on a volume representation, e.g., image-guided neural style transfer on volumetric simulation, particularly dynamic smoke (Kim et al., 2019) and fluids (Kim et al., 2020). Table 3 summarizes works for volumetric stylization.

Kim et al. (2019) use a pre-trained Inception CNN model (Szegedy et al., 2016) as the single-view feature extractor and apply the content and style losses (Sect. 2.1) for semantics-aware abstract style transfer. They proposed a transport-based neural style transfer (TNST) method on grid-based voxels that optimizes a velocity field (i.e., voxel movement) from several views Poisson-sampled around a small area of the camera trajectory, using a differentiable smoke renderer to produce grayscale images that represent pixel-wise volume density. For temporal consistency across frames of the smoke simulation, the velocity field of the current frame is composed as a linear combination of the recursively aligned velocity fields of neighboring frames. Later, Kim et al. (2020) adopt particle-based attributes from multi-scale grids and optimize position, density, and color per particle, which intrinsically ensures better temporal consistency than the recursive alignment of velocity fields (Kim et al., 2019). It also largely improves efficiency by directly smoothing density gradients across adjacent frames for temporal consistency, and by stylizing only keyframes and interpolating particle attributes in between.
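The core mechanics of transport-based stylization can be sketched as follows: a density grid is advected by a learnable velocity field, rendered differentiably, and the velocity is optimized against an image-space loss. This is a simplified sketch under stated assumptions: a single axis-aligned orthographic projection stands in for TNST's differentiable smoke renderer and multi-view Poisson sampling, and a plain MSE against a placeholder target stands in for the VGG content/style losses of Sect. 2.1.

```python
import torch
import torch.nn.functional as F

def advect(density, velocity):
    """Semi-Lagrangian advection of a density grid by a per-voxel velocity field (in voxel units).
    density: (1, 1, D, H, W); velocity: (1, 3, D, H, W) with channels assumed to be (vx, vy, vz)."""
    _, _, D, H, W = density.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xx, yy, zz], dim=-1).float()          # (D, H, W, 3), order (x, y, z)
    back = base - velocity[0].permute(1, 2, 3, 0)             # backtraced sample positions
    scale = torch.tensor([W - 1.0, H - 1.0, D - 1.0])
    grid = (back / scale * 2 - 1).unsqueeze(0)                # normalize to [-1, 1] for grid_sample
    return F.grid_sample(density, grid, align_corners=True)

def render(density):
    """Orthographic projection along depth: a crude differentiable grayscale 'smoke renderer'."""
    img = density.sum(dim=2)                                  # (1, 1, H, W)
    return img / (img.max() + 1e-8)

# Optimize the velocity field so the rendered projection matches a (placeholder) stylized target.
D = H = W = 32
density = torch.rand(1, 1, D, H, W)
velocity = torch.zeros(1, 3, D, H, W, requires_grad=True)
target = torch.rand(1, 1, H, W)                               # stand-in for a style-driven target
opt = torch.optim.Adam([velocity], lr=0.5)
for _ in range(50):
    loss = F.mse_loss(render(advect(density, velocity)), target)
    opt.zero_grad(); loss.backward(); opt.step()
```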

Subsequently, one can use a feed-forward network to achieve fast volumetric stylization (Aurand et al., 2022), reaching production-level quality (Kanyuk et al., 2023; Hoffman et al., 2023). One can also learn an arbitrary appearance style transfer model for volumetric simulation via a volume autoencoder (Guo et al., 2021).

3.5 Point Cloud Stylization

A point cloud is a discrete set of data points in 3D space, which may represent 3D shapes or objects. Each point can enclose additional attributes such as colors, normals (Pfister et al., 2000) and spherical harmonic coefficients (Kerbl et al., 2023) for rendering, or latent features (Huang et al., 2021) for 3D stylization.

There are a few works that attempt stylization for point clouds (Table 3). For example, PSNet (Cao et al., 2020) is a PointNet-based (Qi et al., 2017) stylization network for point cloud color and geometry style transfer guided by either a point cloud example or an image example. Analogous to extracting content and style features from a pre-trained model (Gatys et al., 2016), PSNet uses a PointNet-based classifier with two separate shared MLPs, taking intermediate outputs as the geometry/content representation and the Gram matrices of these outputs as the appearance/style representation. PSNet then optimizes the geometry and/or appearance colors of the source point cloud with point-cloud-based content and style losses (Gatys et al., 2016), simply replacing VGG features with PSNet features. Since the Gram-based style representation is invariant to the number and order of the input points, an example style image treated as a set of colored points can stylize the source colored point cloud with the target color style alone, without shape deformation. For point clouds generated by a generative model, one can instead learn to map a point cloud to the latent space for editing. PointInverter (Kim et al., 2023) employs 3D point cloud GAN inversion and introduces an efficient way to map a 3D point cloud into the latent space of a 3D GAN based on SP-GAN, a sphere-guided 3D point cloud generator (Li et al., 2021). PointInverter resolves the point ordering issue during 3D point cloud inversion while preserving point correspondences, which enables point editing in latent space.
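A minimal sketch of the Gram-matrix idea behind PSNet's appearance branch is given below; the shared per-point MLP and its layer sizes are stand-ins for the pre-trained PointNet-style extractor used in the paper. Because the Gram matrix is averaged over points, the style target can be a point set of any size and ordering, including the pixels of a style image treated as colored points; a geometry branch over xyz coordinates works analogously.

```python
import torch

# Shared per-point MLP standing in for PSNet's pre-trained PointNet-style appearance extractor
# (layer sizes are illustrative assumptions).
feature_mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128),
)

def gram(colors):
    """Gram matrix of per-point features: (N, 3) rgb -> (C, C); independent of N and point order."""
    f = feature_mlp(colors)
    return f.t() @ f / colors.shape[0]

# Optimize only the colors of the source point cloud toward the style statistics.
src_rgb = torch.rand(2048, 3, requires_grad=True)   # colors of the source point cloud
style_rgb = torch.rand(64 * 64, 3)                  # style image pixels treated as a point set
opt = torch.optim.Adam([src_rgb], lr=1e-2)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(gram(src_rgb), gram(style_rgb))
    opt.zero_grad(); loss.backward(); opt.step()
```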

3.6 Implicit Shape Editing

An implicit primitive shape is a 3D surface represented by an implicit distance function, such as a signed distance function (SDF) or truncated signed distance function (TSDF). Powered by learning-based techniques, neural implicit shapes can represent 3D geometry as occupancy networks (Mescheder et al., 2019; Peng et al., 2020), distance fields (Park et al., 2019), volumetric radiance fields (Mildenhall et al., 2020), and Gaussian mixture models (GMMs). These neural continuous implicit representations have gained significant attention in the field of 3D shape generation and editing (Table 3).

Particularly, NeuralWavelet (Hu et al., 2024) utilizes a compact wavelet representation consisting of coarse and detail coefficient volumes and designs a pair of diffusion generative models for coarse and detail 3D shape generation. During shape learning, an encoder is jointly trained to map the coarse coefficient volume to a condensed latent code, which serves as a controllable condition for shape generation, inversion, and manipulation. Recent approaches such as SPAGHETTI (Hertz et al., 2022) and SALAD (Koo et al., 2023) adopt 3D generative models equipped with a hybrid representation that disentangles shapes at the part level into extrinsic approximate structure and intrinsic geometric detail. In this hybrid representation, each part of the 3D shape is characterized by a set of extrinsic parameters forming a Gaussian ellipsoid in 3D space (a 3D position with a covariance) that captures the approximate structure of that part. This part-level extrinsic-intrinsic disentanglement enables 3D shape generation and implicit shape manipulation such as local adjustment and part mixing. SALAD further incorporates text-guided shape part segmentation (Koo et al., 2022) and performs text-guided shape completion. These hybrid-representation methods with text guidance offer promising advances in 3D shape generation, editing, and manipulation, allowing for more intuitive and controlled transformations of 3D shapes.

3.7 Practical Guidelines

This section discusses practical aspects of 3D neural stylization methods, summarizing several design choices including 3D consistency, controllability, generalization, and efficiency.

3D Consistency. A particular challenge when performing stylization of 3D data is to ensure that view consistency is achieved so that the styles appear similar across views. We discuss common strategies to achieve view consistency, as follows.

\(\bullet \) View Sampling. A reasonable camera sampling strategy is necessary for multi-view optimization without posed views. A common strategy is to (randomly) sample around pre-defined principal cameras or along the camera trajectory, as summarized in Table 4; a minimal sketch is given below. Michel et al. (2022) devised a training view selection scheme that samples views around the anchor view with the highest CLIP similarity to the target prompt. Data augmentation is a common trick as well, such as random perspective transformation and random resize-and-crop (Michel et al., 2022; Ma et al., 2023; Chen et al., 2024e), rendering with random backgrounds (Hwang et al., 2023), and mirroring and rotating subject elements (Aurand et al., 2022).
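Below is a minimal sketch, assuming a simple look-at camera model, of sampling random perspective views jittered around an anchor direction on a sphere; the angular ranges, radius, and helper names are illustrative, and each sampled pose would feed the differentiable renderer, optionally followed by the augmentations listed above.

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world matrix for a camera at cam_pos looking at the target."""
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up); right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, true_up, -forward, cam_pos
    return c2w

def sample_views_around_anchor(anchor_azim, anchor_elev, radius=2.5,
                               n_views=5, jitter_deg=30.0):
    """Sample camera poses on a sphere, jittered around an anchor view (angles in degrees)."""
    poses = []
    for _ in range(n_views):
        azim = np.deg2rad(anchor_azim + np.random.uniform(-jitter_deg, jitter_deg))
        elev = np.deg2rad(anchor_elev + np.random.uniform(-jitter_deg, jitter_deg))
        pos = radius * np.array([np.cos(elev) * np.sin(azim),
                                 np.sin(elev),
                                 np.cos(elev) * np.cos(azim)])
        poses.append(look_at(pos))
    return poses

# e.g., jitter around a front-facing anchor view of an object placed at the origin
poses = sample_views_around_anchor(anchor_azim=0.0, anchor_elev=15.0)
```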

\(\bullet \) Constant Geometry. Appearance-only 3D stylization requires keeping the 3D geometry constant before and after stylization. The most common strategy is to fix the geometry during the stylization stage. In particular, some neural field stylization works use hyper MLPs to predict the parameters of the stylized appearance branch while fixing the geometry branch parameters for 3D geometric consistency (Chiang et al., 2022c; Chen et al., 2022b; Wang et al., 2023b).

\(\bullet \) View-independent Appearance. Some appearance stylization works aim to maintain multi-view color consistency. However, some 3D representations introduce multi-view appearance inconsistency, for example, view-dependent effects in radiance fields. To preserve multi-view color consistency, such scenes are often optimized without the viewing direction input, sacrificing view-dependent effects for better multi-view appearance consistency (Zhang et al., 2022, 2023c; Liu et al., 2023; Pang et al., 2023).

Table 4 Summary of training view sampling in selected 3D neural stylization methods on object data. “Cam” refers to the camera type, such as orthogonal or perspective camera. “#View” is the number of sampled views per iteration/frame, unless indicated otherwise. “Aug” indicates rendered view augmentation

\(\bullet \) 2D Priors. With the growing popularity of large-scale pre-trained vision models including VGG, CLIP, DINO, and diffusion models (Simonyan & Zisserman, 2015; Radford et al., 2021; Caron et al., 2021; Rombach et al., 2022; Zhang et al., 2023b), 3D-aware stylization can be achieved through multi-view optimization utilizing these 2D visual priors (Zhang et al., 2022; Wang et al., 2023a; Bao et al., 2023; Haque et al., 2023; Koo et al., 2024).

\(\bullet \) 3D Priors. With numerous well-curated 3D datasets, pre-trained point cloud priors (Qi et al., 2017; Li et al., 2021; Nichol et al., 2022) are popular for representing coarse geometry and facilitating geometric deformation in mesh and point cloud stylization works (Yin et al., 2021; Cao et al., 2020; Kim et al., 2023). Other works leverage additional geometric guidance, such as depth maps, normal maps, and camera poses, for precise control of 3D-aware synthesis (Höllein et al., 2022; Yu et al., 2023; Guo et al., 2023).

\(\bullet \) Multi-view Attention. The temporal attention mechanism is widely used in video generation (Blattmann et al., 2023), where an attention layer is applied across latent video frames to improve frame consistency. This concept can be adapted to the 3D domain. For example, VcEdit (Wang et al., 2025) inverse-renders the cross-attention maps from all views onto each Gaussian of the source 3DGS to create an averaged 3D attention map. This 3D map is then rendered back to 2D and replaces the original per-view cross-attention maps, yielding more coherent predictions; a toy sketch of this consolidation follows.
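The sketch below samples each view's attention map at the pixels that visible points/Gaussians project to, averages the samples across views, and writes the averaged value back to every view. The point-projection shortcut, tensor shapes, and random inputs are assumptions for illustration; VcEdit itself consolidates through the 3DGS rasterizer's inverse rendering rather than this simplification.

```python
import torch

def consolidate_attention(attn, uv, visible):
    """
    attn:    (V, H, W)  per-view 2D cross-attention maps.
    uv:      (V, P, 2)  integer (x, y) pixel coords of P points/Gaussians in each view.
    visible: (V, P)     visibility mask.
    Returns the per-point averaged attention (P,) and per-view maps with the averaged
    values written back to the projected pixels.
    """
    V = attn.shape[0]
    view_idx = torch.arange(V).unsqueeze(1)                      # (V, 1)
    sampled = attn[view_idx, uv[..., 1], uv[..., 0]]             # (V, P) attention at projections
    weights = visible.float()
    point_attn = (sampled * weights).sum(0) / weights.sum(0).clamp(min=1)  # average over views
    consolidated = attn.clone()
    for v in range(V):
        m = visible[v]
        consolidated[v, uv[v, m, 1], uv[v, m, 0]] = point_attn[m]
    return point_attn, consolidated

# Random data standing in for diffusion cross-attention maps and 3DGS projections.
V, H, W, P = 4, 64, 64, 500
attn = torch.rand(V, H, W)
uv = torch.randint(0, 64, (V, P, 2))
visible = torch.rand(V, P) > 0.2
point_attn, consolidated = consolidate_attention(attn, uv, visible)
```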

Controllability. Stylization requires diversity and flexibility for users to design assets. We summarize some common strategies for different levels of control in 3D stylization.

\(\bullet \) Pre-trained Diffusion Models. State-of-the-art diffusion models provide powerful controllability for content creation. For example, TextureDreamer (Yeh et al., 2024) uses DreamBooth (Ruiz et al., 2023) to distill texture information from input reference images, and ControlNet (Zhang et al., 2023b; Sect. 2.1.3) to process additional 2D conditions such as depth, normal, and edge maps.

\(\bullet \) Semantic Alignment. Pre-trained segmentation models (e.g., Segment Anything (Kirillov et al., 2023)), semantic labels, or descriptions can be integrated to empower semantics-aware stylization. Table 5 lists semantic matching tricks commonly used in 3D neural stylization. Some text-guided approaches rely on deliberate textual descriptions for local stylization (Michel et al., 2022; Ma et al., 2023; Wang et al., 2023b), while others consider explicit visual semantic matching, such as Text2Scene with 3D label inputs (Hwang et al., 2023). Gao et al. (2023) proposed a regularization term to preserve identity for smooth deformation. Other approaches use or lift explicit semantic matching with off-the-shelf tools (Huang et al., 2022; Zhang et al., 2023c; Pang et al., 2023; Song et al., 2023; Kim et al., 2024). In addition, the reviewed works (Wang et al., 2023a, b) demonstrated that contrastive learning can effectively improve the learning of directional cues, such as textual semantics, in text-guided stylization.

\(\bullet \) Explicit Representation. An explicit scene representation allows easier control and more precise manipulation. For example, DreamEditor (Zhuang et al., 2023) transforms the NeRF into a mesh-based neural field with each mesh vertex assigned geometry and color features; GaussianEditor (Chen et al., 2024d) edits explicit 3D Gaussians, while SPAGHETTI (Hertz et al., 2022) and SALAD (Koo et al., 2023) manipulate explicit part-level Gaussian ellipsoids.

\(\bullet \) 3D Shape Inversion. A commonly adopted technique for shape manipulation is 3D shape inversion, which inverts a 3D representation into the latent space learned by large-scale pre-trained 3D generative models (Nichol et al., 2022; Qi et al., 2017). This approach enables shape editing in latent space and has been explored in works such as NeuralWavelet (Hu et al., 2024), SPAGHETTI (Hertz et al., 2022), PointInverter (Kim et al., 2023), and SALAD (Koo et al., 2023).

Table 5 Summary of semantic alignment in selected 3D neural stylization methods

Efficiency. Efficiency in 3D stylization is influenced by various factors such as the style optimization method and the 3D representation. To enhance efficiency and minimize resource consumption, we summarize several useful tricks below.

\(\bullet \) Optimize in a Coarse-to-Fine Manner. Instead of stylizing directly at the final high resolution, some works perform 3D style optimization at low resolution and then apply decoding or super-resolution for higher efficiency. For example, Cao et al. (2023); Knodt and Gao (2023) optimize latent features for 3D consistency and efficient diffusion, Guo et al. (2023); Wu et al. (2023) employ super-resolution models, and Yu et al. (2023); Zeng et al. (2024) use texture refinement models to obtain high-quality outputs.

\(\bullet \) Scene Representations. For neural fields, various advanced representations support fast training or rendering, such as Plenoxels (2022), SNeRG (2021), iNGP (2022), DVGO (2022), TensoRF (2022), MobileNeRF (2023c), and 3DGS (2024b). Table 2 lists neural field stylization methods along with their base reconstruction techniques. Some methods also use neural fields to represent style features, such as the neural style fields for meshes in Michel et al. (2022); Ma et al. (2023).

\(\bullet \) Optimize Rendering and Back-propagation. In NeRF stylization, naive NeRF rendering is memory-intensive due to the large number of ray samples and point queries, while style losses require full images. Therefore, some works use patch-based training (Chiang et al., 2022c) or deferred gradient back-propagation (Zhang et al., 2022), and some separate the forward and backward passes, computing the loss at full resolution and back-propagating patch-wise (Zhang et al., 2023d; Wang et al., 2023a, b); a minimal sketch of deferred back-propagation is given below.
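The deferred back-propagation trick (as used in ARF) can be sketched as follows: render the full image without building the autograd graph, compute the gradient of the full-image loss with respect to the cached pixels once, then re-render in patches with gradients enabled and inject the cached per-pixel gradient. In this sketch a tiny coordinate MLP stands in for per-pixel radiance field rendering, and the loss is a placeholder; both are assumptions for illustration.

```python
import torch

# Tiny coordinate MLP standing in for "render this pixel from the radiance field".
model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
H = W = 64
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)            # (H*W, 2) pixel coordinates

def image_loss(img):
    return (img - 0.5).pow(2).mean()                             # stand-in for a full-image style loss

# 1) Full-resolution forward pass without building the autograd graph.
with torch.no_grad():
    full = model(coords).reshape(H, W, 3)

# 2) Gradient of the loss w.r.t. the cached image pixels (computed once).
full_leaf = full.clone().requires_grad_(True)
image_loss(full_leaf).backward()
pixel_grad = full_leaf.grad.reshape(-1, 3)                       # (H*W, 3) cached per-pixel gradient

# 3) Re-render patch by patch with gradients and inject the cached pixel gradient;
#    parameter gradients accumulate across patches before the optimizer step.
patch = 1024
for start in range(0, coords.shape[0], patch):
    out = model(coords[start:start + patch])
    out.backward(gradient=pixel_grad[start:start + patch])
```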

\(\bullet \) Feed-forward Networks. Instead of optimizing scene representation parameters, some works used feed-forward networks for efficient training/inference (Ma et al., 2023; Huang et al., 2021; Chen et al., 2024c; Aurand et al., 2022).

Generalization. Generalizing stylization techniques across different scenarios and datasets is crucial for real-world applications, such as the gaming and entertainment industries. We discuss some designs for generalization here.

\(\bullet \) Universal Style Transfer Module. Data-driven 3D neural stylization models train a universal 3D style transfer module that can generalize to new styles in a zero-shot manner. This type of work usually operates on novel view rendering rather than optimizing the scene features, as discussed in Sect. 3.3.1.

\(\bullet \) General Optimization Framework. As presented in Sect. 3.3.2, SNeRF (Nguyen-Phuoc et al., 2022) and Instruct-NeRF2NeRF (Haque et al., 2023) introduced general frameworks for single-scene optimization with either an image or a textual reference. They apply to any scene and any style, for either geometry or appearance stylization.

Table 6 Popular datasets for performance evaluation on 3D neural stylization. Pt. is the abbreviation of point cloud. \(^a\)Train/Test sets

4 Datasets and Evaluation

In this section, we summarize the frequently used datasets for 3D neural stylization, introduce existing evaluation metrics and criteria for 2D and 3D stylization, and provide a benchmark of state-of-the-art 3D neural stylization works as the reference for future works.

4.1 Datasets

Datasets are essential for effective training and thorough validation of 3D stylization models in terms of applicable scenarios, stylization diversity, etc. Table 6 lists selected popular 3D and 2D datasets for evaluating 3D neural stylization works, identifying their modality, scale, and other noteworthy features.

4.2 Criteria and Metrics

The stylization and evaluation of 3D assets are commonly conducted through multi-view renderings, which can be attributed to the inherent way people perceive and process 3D content, as well as to the advances of large pre-trained 2D vision models. It is also possible to conduct stylization and evaluation directly in 3D space, mainly for the point cloud representation. For instance, 3DStyleNet (Yin et al., 2021) utilizes the L1-Chamfer distance to guide 3D shape transfer (a minimal sketch follows), PSNet (Cao et al., 2020) directly extracts style features from a point cloud using a modified PointNet (Qi et al., 2017) structure, and SpiceE (Sella & Averbuch-Elor, 2023) introduces point cloud input as a 3D shape prior for 3D generation. Achlioptas et al. (2023) provide a summary of evaluation metrics for 3D shape transfer.
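For reference, the (L1-)Chamfer distance between two point sets can be computed as below; this brute-force pairwise version is a sketch for illustration, whereas practical implementations rely on KD-trees or CUDA kernels.

```python
import torch

def chamfer_distance(a, b, p=1):
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3).
    p=1 gives the L1-Chamfer variant; p=2 the squared-L2 form common elsewhere."""
    diff = a.unsqueeze(1) - b.unsqueeze(0)                 # (N, M, 3) pairwise differences
    if p == 1:
        dist = diff.abs().sum(-1)                          # (N, M) L1 distances
    else:
        dist = diff.pow(2).sum(-1)                         # (N, M) squared L2 distances
    return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()

# Example: distance between a shape and a noisy copy of itself.
a = torch.rand(1024, 3)
b = a + 0.01 * torch.randn_like(a)
print(chamfer_distance(a, b).item())
```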

We derive several critical aspects below from the state-of-the-art 3D neural stylization works for evaluating 3D stylization performance. Overall, the main consideration includes the alignment with the target style, the preservation of the original content, the consistency between different views, the visual quality of the stylized results, and the efficiency of training/inference.

\(\bullet \) Style Similarity Stylization tasks are driven by guidance information (i.e., the style reference), mainly images and text. For measuring style similarity between the reference image and the rendered output, the Gram matrix loss and the AdaIN loss (i.e., MSE of the channel-wise mean and variance) introduced in Sect. 2.1 are heavily used. When the style reference is a textual prompt, the most popular choice in recent works is CLIPScore (Hessel et al., 2021), which quantifies the correspondence between the rendered image and the textual prompt; a sketch of its computation is given below.
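The sketch below computes CLIPScore with the Hugging Face transformers CLIP wrapper. The backbone choice and the 2.5 rescaling of the clipped cosine similarity follow Hessel et al. (2021); individual papers may use different CLIP variants, so treat these as illustrative defaults.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """CLIPScore = w * max(cos(image, text), 0) with w = 2.5 (Hessel et al., 2021)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(-1)
    return (2.5 * cos.clamp(min=0)).item()

# Example usage on a rendered stylized view (file name is a placeholder):
# score = clip_score(Image.open("stylized_view.png"), "a bronze statue of a deer")
```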

\(\bullet \) Content Preservation Content preservation is achieved to different extents across stylization works. In some works (Chiang et al., 2022c; Chen et al., 2022b; Wang et al., 2023b), stylization is applied only to appearance while the 3D geometry is locked, which prevents any morphing of geometric content. Other works (Zhang et al., 2022, 2023c) aim to balance stylization and content texture preservation by training exclusively with view-independent texture colors, hence showcasing multi-view color consistency.

\(\bullet \) Multi-view Consistency Explicit 3D representations inherently provide multi-view geometric consistency. To measure multi-view appearance consistency, some 3D stylization works borrow the short-range and long-range temporal consistency evaluation from video processing (Lai et al., 2018), computing the warping error and the warped LPIPS via optical flow or depth estimation (Chiang et al., 2022c; Liu et al., 2023; Nguyen-Phuoc et al., 2022; Huang et al., 2021; Höllein et al., 2022); a sketch of the warping error follows. CLIP can also be applied to evaluate multi-view semantic consistency by encoding adjacent views into CLIP space, as proposed in Haque et al. (2023); Ma et al. (2023).
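The warping error can be sketched as below: the neighboring view is backward-warped into the reference view with estimated optical flow and a masked error is computed over the valid region. The placeholder inputs and the masked MSE are assumptions; in practice RAFT (or a depth-based warp) supplies the flow and occlusion mask, and LPIPS typically replaces the plain MSE.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp image (N, 3, H, W) with flow (N, 2, H, W) given in pixels (dx, dy)."""
    N, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)      # (1, 2, H, W) pixel grid
    coords = base + flow
    coords_x = coords[:, 0] / (W - 1) * 2 - 1
    coords_y = coords[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([coords_x, coords_y], dim=-1)              # (N, H, W, 2) normalized coords
    return F.grid_sample(image, grid, align_corners=True)

def warping_error(view_i, view_j, flow_i_to_j, mask):
    """Masked mean squared error between view i and view j warped into view i's frame."""
    warped_j = warp(view_j, flow_i_to_j)
    diff = (view_i - warped_j).pow(2).sum(1, keepdim=True)        # (N, 1, H, W)
    return (diff * mask).sum() / mask.sum().clamp(min=1)

# Placeholders standing in for rendered stylized views, RAFT flow, and an occlusion mask.
view_i, view_j = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
flow = torch.zeros(1, 2, 128, 128)
mask = torch.ones(1, 1, 128, 128)
print(warping_error(view_i, view_j, flow, mask).item())
```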

\(\bullet \) Visual Quality For image synthesis, we expect synthesized images to look natural and contain as few artifacts as possible. The Inception Score (IS) (Salimans et al., 2016) is designed to measure the quality and diversity of generated images. Another popular metric is the Fréchet Inception Distance (FID) (Heusel et al., 2017), which compares the distribution of generated images with the distribution of real images and is well suited to deciding whether generated images resemble objects in the target domain. For instance, TSNeRF (Wang et al., 2023b) used FID to evaluate the distance between stylized rendered views and the target art database.

\(\bullet \) Robustness and Efficiency For model robustness, Ref-NPR (Zhang et al., 2023c) proposed to measure PSNR between rendered results around a specific view angle. It is not essential for general 3D neural stylization evaluation, but it can be taken as a robustness reference. Regarding efficiency, important factors include the training time, inference speed, memory usage, data accessibility, model size and usability.

\(\bullet \) User Study The above metrics do not necessarily reflect or align with human preference, especially for subjective factors such as naturalness and attractiveness. Therefore, conducting a user study is usually a suitable option. A typical user study involves recruiting participants, preparing study materials and questionnaires, collecting answers, and analyzing statistics. In 3D stylization, the most frequently evaluated aspects are “stylization quality” and “temporal consistency” (Huang et al., 2021; Chiang et al., 2022c; Chen et al., 2022b; Liu et al., 2023).

4.3 Benchmark of 3D Stylization

In this section, we provide a benchmark evaluation in Table 7 of state-of-the-art mesh-based and neural field-based neural stylization methods in terms of the criteria discussed above, followed by a high-level discussion of the insights gained. The methods can be categorized into text-guided or image-guided object mesh texture stylization, text-guided neural field stylization, and image-guided neural field artistic stylization.

Table 7 Evaluation of selected works across 3D representation and guidance type. w/depth stands for with pre-trained depth-ControlNet. SNeRF-G is reproduced by Plenoxels (2022) and Gatys et al. (2016)

4.3.1 Experimental Settings

\(\bullet \) Datasets. For mesh-based stylization methods with image guidance, we create 300 object-image pairs from the Objaverse (Deitke et al., 2023) dataset: 100 objects paired with their own rendered images, 100 with rendered images of other objects in the same category, and 100 with rendered images from different categories. The first two parts evaluate the capacity for “in-domain” texture transfer, while the last part tests “out-of-domain” texture transfer. All images are rendered at \(2048\times 2048\) resolution. For TEXTure (Richardson et al., 2023), we fine-tuned ten diffusion models following their official requirements to conduct texture transfer. For mesh-based stylization methods with text guidance, we use the same 100 selected objects from Objaverse and directly use the object names to construct the text prompts.

For neural field-based stylization methods with image guidance, we include eight scenes from three public datasets: single-object scenes (chair, mic) in NeRF-Synthetic (Mildenhall et al., 2020), which are inward-facing \(360^{\circ }\) objects without background; forward-facing real scenes (fern, flower, horns, trex) in the LLFF dataset (Mildenhall et al., 2019); and large unbounded real scenes (Truck, Playground) in the Tanks&Temples dataset (Knapitsch et al., 2017). For StyleRF and INS, the masked large scenes Caterpillar and Truck without background (Knapitsch et al., 2017) are used instead. The style reference images are selected from the WikiArt (Painter by numbers, 2016) dataset: 120 style references for feed-forward methods and 6 for single-style optimization methods. We select artistic images here because neural field-based methods usually involve larger scenes with multiple objects and backgrounds, for which single-object images would not lead to satisfying results. Conversely, such artistic images with abstract concepts tend to ignore the detailed semantics of a concrete object and are thus not suitable for most mesh stylization methods, which focus on a single object. For neural field-based stylization methods with text guidance, we select two unbounded and two forward-facing scenes (farm, campsite; fangzhou, person) from InstructN2N (Haque et al., 2023) and test four style prompts for each.

\(\bullet \) Metrics. For style similarity, we compute the Gram loss for artistic image-guided works and CLIPScore for the others. FID (Heusel et al., 2017) is adopted for measuring the visual quality of the selected methods, using the rendered views of stylized results as evaluation samples and the style reference images as ground truths. The original rendered views of the selected objects from the Objaverse dataset are used as ground truths for evaluating text-guided mesh stylization works. Since the FID metric needs to be calculated over a large number of evaluation images, we skipped a few works for which a large amount of ground-truth data is hard to obtain (Haque et al., 2023; Dong & Wang, 2023; Vachha & Haque, 2024; Koo et al., 2024) or whose relatively slow optimization prevents generating a sufficient number of evaluation samples (Fan et al., 2022; Nguyen-Phuoc et al., 2022; Zhang et al., 2022, 2023c).

For multi-view consistency, we utilize CLIP-Var (Li et al., 2025), taking the minimum cosine similarity between CLIP features of uniformly sampled views as the metric, which derives from the idea that images of the same object from multiple views share the same semantics (see the sketch below). For the artistic style transfer task, we compute short-range and long-range warp errors with masked LPIPS scores via the off-the-shelf optical flow estimator RAFT (Teed & Deng, 2020).
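Given CLIP embeddings of the uniformly sampled rendered views (from any CLIP image encoder), the metric reduces to a minimum pairwise cosine similarity, as in the sketch below; this reflects our reading of the metric, and the exact definition of Li et al. (2025) may differ in detail.

```python
import torch

def clip_var(view_features):
    """Minimum pairwise cosine similarity among CLIP features of sampled views.
    view_features: (V, D) embeddings from any CLIP image encoder; higher means more consistent."""
    f = view_features / view_features.norm(dim=-1, keepdim=True)
    sim = f @ f.t()                                         # (V, V) cosine similarities
    off_diag = sim[~torch.eye(len(f), dtype=torch.bool)]    # drop the trivial self-similarities
    return off_diag.min()

# e.g., features of 8 uniformly sampled views of a stylized scene (random placeholder here)
print(clip_var(torch.randn(8, 512)).item())
```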

The GPU consumption, pre-training time, and optimization time are measured on RTX 5880 Ada GPUs with 48GB memory per GPU. The pre-training time denotes the typical duration of the additional training a method requires before conducting stylization (the original authors may have provided trained weights), such as training a ControlNet (Deng et al., 2024), training a feature transformation module (Liu et al., 2023; Chiang et al., 2022c), or training the original 3D reconstruction. Please refer to our evaluation code repository for details: https://github.com/chenyingshu/advances_3d_neural_stylization.

4.3.2 Discussion

Through theoretical analysis, benchmarking, and practical experience, we aim to address one research question: how do various factors, such as 3D representation, optimization method, and guidance, impact stylization outcomes across dimensions like visual quality, consistency, and efficiency? We delve into this question through the following key points.

\(\bullet \) Optimization—How to conduct efficient optimization? When aiming for efficiency, effective strategies include employing large pre-trained models and training task-specific adapter modules. For example, TEXTure and FlashTex (Richardson et al., 2023; Deng et al., 2024) can synthesize high-quality stylized textures in under five minutes by leveraging large pre-trained diffusion models as priors. Additionally, some methods use feed-forward processing to improve efficiency at stylization inference, removing the need for per-style optimization, as demonstrated by StyleRF and StyleGaussian (Liu et al., 2023, 2024).

\(\bullet \) Guidance—How to provide effective guidance? In stylization, a visual prompt can efficiently convey intricate details, especially for complex designs or expectations that are hard to articulate in natural language. Conversely, textual prompts offer greater flexibility and allow easy adjustments. In Table 7, image-guided mesh stylization methods (Richardson et al., 2023; Zeng et al., 2024; Perla et al., 2024) exhibit higher CLIP-scores than text-guided approaches (Richardson et al., 2023; Youwang et al., 2024; Zhang et al., 2024d; Deng et al., 2024). This disparity stems from how the CLIP-score is computed: it captures multi-concept features from the inputs and measures their similarity, so image-guided texture transfer can directly reconstruct features from the reference image and thus easily achieves higher scores, whereas meticulous prompt engineering is required to achieve similar results with natural language.

Beyond 2D guidance, 3D guidance proves effective for tasks like 3D shape transfer, often through the point cloud representation, which enables 3D shape similarity computation with metrics like the Chamfer distance. The point cloud representation also offers efficiency for physics simulation, scalability, and other advantages. 3D Gaussian Splatting (Kerbl et al., 2023), which is akin to point clouds, has great potential in such topics (Kotovenko et al., 2024).

\(\bullet \) Visual Quality—How to enrich visual effects while reducing artifacts? State-of-the-art 3D stylization methods improve visual quality by developing view-dependent appearance based on empirical CG models (Deng et al., 2024), utilizing multiple vision priors (Haque et al., 2023; Youwang et al., 2024; Zhang et al., 2022), and data-driven learning (Huang et al., 2021; Liu et al., 2023). For example, Easi-Tex (Perla et al., 2024) employs a pre-trained IP-Adapter (Ye et al., 2023) and an edge ControlNet to faithfully extract texture and shape details, respectively, demonstrating both high visual quality and style similarity even when there are significant discrepancies between the input and the reference object (denoted as the “out-of-domain” type in Table 7). FlashTex (Deng et al., 2024) trains a novel LightControlNet, which learns from numerous rendered images of objects with different materials and provides rich visual details in texture generation.

\(\bullet \) Consistency - How to ensure multi-view consistency? Existing works strive to achieve multi-view consistency when rendering photorealistic or artistic 3D scenes. As mentioned in the practical guidelines (Sect. 3.7), some works directly construct view-independent objects/scenes (Zhang et al., 2022; Liu et al., 2023; Zeng et al., 2024), which largely sidesteps the issue, although providing view-dependent effects or making the object/scene light-aware can significantly improve overall quality (Zhang et al., 2024f; Deng et al., 2024). Similar to the ideas linking 2D to 3D stylization in Sect. 2.1.4, one way is to devise a dedicated loss term that enforces multi-view consistency (Zhang et al., 2023c; Mei et al., 2024). ARF (Zhang et al., 2022) offers another way, applying a simple linear transformation of colors in RGB space to all rendered views to match the color statistics of the style image, which greatly improves consistency between the rendered views (a sketch follows). Last but not least, we can incorporate additional guidance (depth maps, normal maps, etc.) into generative models. As seen in Table 7, FlashTex (Deng et al., 2024) with depth ControlNet achieves the highest multi-view consistency among the text-guided mesh stylization methods. Compared to Paint3D and Easi-Tex, which use only one image as the style reference, TEXTure achieves an overall higher CLIP-Var, probably benefiting from fine-tuning a diffusion model with multi-view renderings of the target object for texture transfer.
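The color-statistics matching idea can be sketched as a single affine map in RGB space that aligns the mean and covariance of the rendered pixels with those of the style image; the symmetric square-root construction below is one common choice and may differ in detail from ARF's exact implementation.

```python
import torch

def matsqrt(m):
    """Symmetric positive-definite matrix square root via eigendecomposition."""
    w, v = torch.linalg.eigh(m)
    return v @ torch.diag(w.clamp(min=1e-8).sqrt()) @ v.t()

def match_color_statistics(content, style):
    """content: (N, 3), style: (M, 3) RGB pixels in [0, 1].
    Returns content recolored so its mean and covariance match the style's."""
    mu_c, mu_s = content.mean(0), style.mean(0)
    cov_c = torch.cov(content.t()) + 1e-6 * torch.eye(3)
    cov_s = torch.cov(style.t()) + 1e-6 * torch.eye(3)
    a = matsqrt(cov_s) @ torch.linalg.inv(matsqrt(cov_c))   # 3x3 linear map A with A cov_c A^T = cov_s
    return (content - mu_c) @ a.t() + mu_s

# Apply the same affine map to the pixels of every rendered view for cross-view consistency.
views = torch.rand(4, 3, 128, 128)
style = torch.rand(3, 256, 256)
recolored = match_color_statistics(views.permute(0, 2, 3, 1).reshape(-1, 3),
                                   style.permute(1, 2, 0).reshape(-1, 3))
recolored_views = recolored.reshape(4, 128, 128, 3).permute(0, 3, 1, 2).clamp(0, 1)
```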

Fig. 11
figure 11

Neural field stylization results. Zoom in for details

\(\bullet \) Scalability - How to adapt stylization to different scales of scenes? Mesh-based stylization thrives on 3D object assets, which makes it suitable for benchmarking with accessible 3D object datasets. Some researchers also attempt to stylize room-scale (Chen et al., 2024a; Höllein et al., 2022) or city-scale scenarios (Chen et al., 2024e), which involve complex semantics and view sampling strategies. With neural fields, 3D representations support scenes of various scales, from naive NeRF and grid-based radiance fields for object-centric and forward-facing scenes (Fan et al., 2022; Zhang et al., 2022; Liu et al., 2023) to 3DGS for scenes in the wild (Vachha & Haque, 2024; Liu et al., 2024). Stylization techniques can be tailored to specific representations considering their structure types, such as grids (Liu et al., 2023) and points (Huang et al., 2021), as well as their optimization strategies, as summarized in Table 2 (Fig. 11).

5 Applications

The burgeoning technologies for generating and manipulating 3D assets are unleashing creativity and revolutionizing the way we perceive and interact with visual content. 3D neural stylization sheds light on a new paradigm that offers endless aesthetic possibilities, from classic paintings to futuristic concepts; enhances immersive experiences in virtual and augmented reality environments; and integrates seamlessly into cross-industry applications, including advertising and marketing, fashion and product design, film and game development, architecture and environment visualization, and interactive education and learning. Some examples are visualized in Fig. 12. In this section, we present representative and promising applications of 3D neural stylization.

Fig. 12
figure 12

Applications of 3D neural stylization. Images adapted from Orghidan et al. (2022); Chen et al. (2024a); Han et al. (2024); Volinga (2023); Kanyuk et al. (2023); Zhang et al. (2022); Wu et al. (2023); Li et al. (2023c)

5.1 3D Asset Design

3D asset design and modeling involve constructing shapes, textures, materials, etc. Harnessing advanced neural stylization techniques, automatic 3D design becomes more flexible and controllable, driven by text prompts, images, or 3D references.

5.1.1 Single Object Design

In the creation of consumer products, the stylization of a single object can enhance its market appeal. Neural stylization enables fast illustration of ideas for more effective discussions among designers, developers, and customers, especially in the prototyping phase. Text-guided mesh stylization provides a flexible way to design 3D assets. For example, after the launch of Text2Mesh (Michel et al., 2022), digital R&D designers and artists employed this tool for 3D gaming asset design (Orghidan et al., 2022) and artwork creation (Guljajeva & Canet, 2022). Advanced techniques for automatic 3D shape mixing and morphing (Hui et al., 2022; Gao et al., 2023) are also promising for speedy 3D asset design and production.

5.1.2 Room Decoration

Current digital room/house decoration tools support adding pre-made assets to a virtual scene and adjusting their placement, or using the device's camera to map the room and place provided furniture (Yu et al., 2011; Global, 2024; Houzz, 2024; Planner5D, 2024). However, such asset libraries provide limited types and styles of 3D models, and the overall style of the whole space is neglected. Recent works that leverage 3D neural stylization explore more possibilities in room decoration. DreamSpace (Yang et al., 2024) allows users to personalize the appearance of real-world scene reconstructions with text prompts and delivers immersive VR experiences on HMD devices. SceneTex (Chen et al., 2024a) generates high-quality textures for 3D indoor scenes from given text prompts, providing consistent stylization for the whole space. Instead of stylizing the entire room, Text2Scene (Hwang et al., 2023) focuses on stylizing individual object meshes in an indoor scene.

5.2 3D Avatar Stylization

Avatar stylization is a long-standing and popular research area that enables interesting applications such as cartoonization of 2D or 3D-aware portraits (Jang et al., 2021; Song et al., 2021; Yang et al., 2022; Zhang et al., 2023a). With neural stylization techniques and novel 3D representations such as NeRF, stylization solutions for 3D avatars have emerged (Pérez et al., 2024; Zhang et al., 2024c; Han et al., 2024).

For example, the general NeRF stylization framework SNeRF (Nguyen-Phuoc et al., 2022) supports style transfer for dynamic NeRF avatars. 3DFaceHybrid (Feng & Singhal, 2024) achieves arbitrary style transfer for a NeRF-based face by lifting 2D pre-trained face style transfer knowledge (Yang et al., 2022) to the 3D face mesh. StyleAvatar (Pérez et al., 2024) enables either image- or text-guided stylization for animatable avatars from a phone scan (Cao et al., 2022) with CLIP supervision. TECA (Zhang et al., 2024c) generates a detailed 3D avatar composition based on a given text description; the avatar includes a mesh-based face and body, and NeRF-based hair, clothing, and other accessories. HeadSculpt (Han et al., 2024) generates and edits a 3D-consistent head avatar from text prompts via diffusion priors (Brooks et al., 2023), supporting edits such as realistic or artistic head generation, expression editing, cartoonization, etc.

5.3 Non-Photorealistic Rendering

Compared to traditional non-photorealistic rendering techniques (Gooch et al., 1998; Gooch & Gooch, 2001) with low-level control of simple strokes and textures, neural stylization realizes general stylization for arbitrary style targets, offering high-level controllability via references and semantics, and faster production of stylized assets. 3D artistic stylization works have shown the potential of efficient NPR for entire scenes (Huang et al., 2021; Chiang et al., 2022c; Huang et al., 2022; Liu et al., 2023; Zhang et al., 2022; Fan et al., 2022; Nguyen-Phuoc et al., 2022; Zhang et al., 2023c; Wang et al., 2023a; Haque et al., 2023).

5.4 Physically Based Rendering

Physical properties in 3D scenes enable photorealistic rendering and editing.

5.4.1 Texture Stylization

TANGO (Lei et al., 2022) optimizes texture material parameters with CLIP supervision. It trains MLPs that, given a surface point and its normal, output SVBRDF parameters and a normal offset, enabling photorealistic rendering. Its follow-up work, 3DStyle-Diffusion (Yang et al., 2023), further incorporates a depth-guided ControlNet (Zhang et al., 2023b) for score distillation, enabling high-quality, fine-grained texture stylization.

5.4.2 3D Generation and Editing

HyperDreamer (Wu et al., 2023) achieves single-image-to-3D generation with physical decomposition into semantics, albedo, specular, roughness, and normal, supporting diverse downstream tasks such as relighting and text-guided, part-aware editing. Decorate3D (Guo et al., 2023) converts a NeRF scene to a mesh for geometry and material decomposition; the decoupled geometry and UV texture representations support controllable texture editing and generation with text instructions.

5.4.3 Simulation

ClimateNeRF (Li et al., 2023c) fuses weather physical simulation with NeRF rendering to create NeRF scenes with realistic weather effects such as smog, snow, and floods. PhysGaussian (Xie et al., 2024) integrates physics-based dynamics simulation, specifically the Material Point Method (MPM) simulation, to deform a 3DGS scene. By merging realistic rendering and physical simulation, these approaches have the potential to enhance the realism of virtual games and films.

5.5 Industrial Production

3D neural stylization provides automatic stylization techniques for 3D assets including meshes, point clouds, volumetric simulations, and novel views. Stylized assets can be seamlessly integrated into traditional computer graphics rendering pipelines and software, such as meshes with newly stylized textures, re-colored point clouds, and stylized volumetric simulations. Implicitly reconstructed scenes, such as NeRF, can be exported as textured meshes or rendered by game engine plugins such as Luma AI's Unreal Engine NeRF plug-ins (Luma, 2023). Automated 3D environment synthesis holds great promise for virtual film production. For instance, combining environmental NeRF with light stages (Manzaneque, 2023) enables cost-effective scene shooting using the Volinga suite (Volinga, 2023). Non-photorealistic stylized 3D assets and scenes are particularly beneficial for animation production, as demonstrated by the film Elemental (Hoffman et al., 2023; Kanyuk et al., 2023). Moreover, these techniques find possible applications in VR and video game development (Menapace et al., 2022), enabling rapid stylization and editing of 3D scenes (Liu et al., 2023; Fang et al., 2023).

6 Open Challenges and Future Works

From this survey, we identify under-explored problems and notable challenges in 3D neural stylization that are worth investigating in future work, which we discuss below.

6.1 Generalization

6.1.1 Large-scale Scene Stylization

Most 3D neural stylization works focus on objects or object-centric scenes (Michel et al., 2022; Hertz et al., 2022), room-scale scenes (Pang et al., 2023; Höllein et al., 2022), and outdoor inward-facing scenes (Huang et al., 2021; Chiang et al., 2022c). Though G. Kim et al. (2024) extended novel view stylization to city scenes, it does not output stylized 3D assets; StyleCity (Chen et al., 2024e) stylizes urban texture and sky but relies on a heavy mesh representation. 3D assets and scenes can scale up to multi-room indoor scenes (Straub et al., 2019; Huang et al., 2022), architectural scenes (Martin-Brualla et al., 2021; Wang et al., 2021), multi-block outdoor scenes (Tancik et al., 2022; Turki et al., 2023), and even city-scale scenes (Xiangli et al., 2022; Xu et al., 2023; Li et al., 2023b). These complex scenarios with intricate semantics are challenging for semantic alignment and computational efficiency.

6.1.2 4D Scene Stylization

Limited literature exists on 4D scene stylization, with only a few notable works such as SNeRF (2022) and S-DyRF (2024b) that stylize dynamic portraits and small scenes. Stylizing time-varying scenes with dynamic geometry and appearance changes (Park et al., 2021; Song et al., 2023; Yang et al., 2024), or integrating time-related special effects (Shih et al., 2013; Logacheva et al., 2020), poses significant challenges for maintaining spatiotemporal consistency.

6.1.3 Generalizable Text-guided Stylization

Text-guided 3D scene stylization and editing are still in the early stages. Data-driven, generalizable text-guided 3D scene stylization or editing without re-training has seldom been explored (Fang et al., 2023) and demands more attention. It is worth investigating since current large vision-language and language models such as BLIP (Li et al., 2023a) and GPT (Brown et al., 2020) enable virtually unlimited image-text pair generation for data-driven model training.

6.2 Controllability

6.2.1 3D Reference-Guided Stylization

Various modalities have been explored as style references, especially images and text prompts. 3D-to-3D geometric and appearance style transfer with 3D shape or 3D scene guidance is still underexplored (Yin et al., 2021). While 3D features can provide 3D-aligned holistic references for stylization, 3D feature extraction suffers from limited 3D datasets covering only a few categories. With the rapid development of 2D-to-3D lifting techniques, there is potential to leverage large-scale pre-trained 2D models, such as 2D diffusion models (Rombach et al., 2022; Zhang et al., 2023b), as priors for context-aware scene stylization with 3D references. In addition, we expect more 3D pre-trained feature extractors and generative models trained on data in the wild to boost 3D context-aligned style transfer.

6.2.2 Multi-modal Controls

Currently, most research focuses on single-reference guidance for stylization, while multi-modal references can provide higher accuracy and controllability for precise manipulation and design (Pang et al., 2023; Bao et al., 2023; Zhuang et al., 2024). Therefore, it is worthwhile to explore joint supervision incorporating visual (Simonyan & Zisserman, 2015), textual (Radford et al., 2021), semantic (Caron et al., 2021), and geometric features for 3D stylization.

6.3 Efficiency

6.3.1 Real-time Arbitrary Style Transfer of 3D Scenes

Modern photo and video filters support real-time processing (Ruder et al., 2016; Jamriška et al., 2019). For instance, Ioannou and Maddock (2023) proposed a simplified style transfer architecture embedded in the Unity rendering pipeline, enabling real-time depth-aware 2D style transfer. However, stylizing a 3D scene in real time given arbitrary styles remains challenging for some 3D representations due to slow optimization (Michel et al., 2022; Richardson et al., 2023; Cao et al., 2023; Yang et al., 2023) and slow rendering of neural fields (Chiang et al., 2022c; Zhang et al., 2022). Even though some novel view stylization works (Li et al., 2019; Liu et al., 2023; Chen et al., 2024c) achieve arbitrary style transfer for speedy novel view synthesis, they fail to obtain stylized 3D scenes instantly. It is worth exploring improvements to stylization speed by leveraging feed-forward networks (Aurand et al., 2022), 3D generative models (Cao et al., 2020), and advanced 3D representations such as 3DGS (Liu et al., 2024; Zhang et al., 2024a).

6.4 3D Consistency

6.4.1 Comprehensive View Planning for Complex Scenes

Existing works have primarily focused on planning training views for object or room scenes for 3D stylization (Michel et al., 2022; Hwang et al., 2023; Richardson et al., 2023; Chen et al., 2023a), overlooking the crucial aspects of semantic- and instance-level view planning. Hence, it is a compelling research opportunity to investigate effective strategies for planning views in scenes characterized by intricate semantics such as cityscapes and multi-room scenarios.

6.4.2 3D-Holistic Style Feature of Scenes

The majority of reviewed works are supervised by 2D pixel-level features extracted from multi-views by large-scale pre-trained models (Michel et al., 2022; Zhang et al., 2022; Kim et al., 2019), since large-scale 3D pre-trained models are still rare and expensive. Even though some works lift 2D content features to 3D before stylization (Huang et al., 2021; Liu et al., 2023; Huang et al., 2022), they still use view-dependent style features for the final 3D stylization supervision, and lifting 2D features to 3D at every iteration is impractical. Some works supervise stylization with a 3D-aware style feature by averaging features of several views of a small object (Michel et al., 2022; Ma et al., 2023), which does not scale to more views under limited memory. Per-view or multi-view supervision may not suffice to represent the style of the whole 3D scene and, worse, may dilute the current single-view style with conflicting gradients from other views (Gao et al., 2023). More research is needed on efficient 3D-aware, and even 3D-holistic, style features for 3D stylization.

6.5 Evaluation

6.5.1 Standardized Evaluation Across Modalities

The current evaluation metrics do not always align with human preference. User studies are still widely adopted, but they preclude precise quantitative analysis of method performance. The heterogeneity of datasets across modalities also poses great challenges for fair and comprehensive comparison of works on different modalities. We expect future evaluation protocols to vary from our benchmark, while the underlying concerns should remain similar to the criteria in Sect. 4.2.

7 Conclusion

The report has explored the advancements in neural stylization techniques for diverse 3D data, including meshes, volumes, neural fields, point clouds, and implicit shapes. Through this comprehensive survey of 3D neural stylization techniques and corresponding applications, we highlighted the importance of neural stylization in accelerating the creative process, enabling fine-grained control over stylization, and enhancing artistic expression in various domains such as movie making, virtual production, and video game development. Furthermore, we have introduced a taxonomy for neural stylization, providing a framework for categorizing new works in the field. Our analysis and discussion of advanced techniques underscored the ongoing research efforts aimed at addressing limitations and pushing the boundaries of neural stylization in the 3D digital domain. In addition, we proposed a benchmark of 3D neural stylization, with which we aim to offer reference and inspiration for future 3D stylization works. Finally, we introduced practical applications and discussed open challenges and future directions for 3D neural stylization.