1 Introduction

Fig. 1

Comparison with baselines for multi-subject image generation. We use scientists’ names in the text prompt for text-only methods (SD, MJ). Text-only methods only perform well when subjects are present in the training dataset but struggle to maintain the identity otherwise. Fine-tuning-based methods blend the identity of different persons (TI rows 1 and 2, CD rows 1, 2, 4), deviate from the text instruction and only generate a single subject (TI row 4), or generate images that do not resemble any specific reference (CD row 3)

Recent advancements in text-to-image generation (Ramesh et al., 2021; Chang et al., 2023; Kang et al., 2023; Ding et al., 2021), particularly diffusion models (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Sohl-Dickstein et al., 2015), have opened new frontiers in content creation. Subject-driven text-to-image generation permits personalization to new individuals given a few sample images (Ruiz et al., 2023; Casanova et al., 2021; Nitzan et al., 2022; Gal et al., 2023a; Kumari et al., 2023), allowing the generation of images featuring specific subjects in novel scenes, styles, and actions. However, existing subject-driven text-to-image generation methods still suffer from two key limitations: the cost of fine-tuning for personalization and identity blending for multiple subjects. Personalization is costly because these methods typically require fine-tuning the model for each new subject to achieve good fidelity. The computational overhead and high hardware demands of model tuning, largely due to the memory consumption (Chen et al., 2016) and computation of backpropagation, constrain the applicability of these models across various platforms. Furthermore, although some methods have been proposed to reduce the per-subject fine-tuning cost, existing tuning-free techniques struggle with multi-subject generation (Fig. 1) because of the “identity blending” issue (Fig. 2 left), in which the model mixes the distinct characteristics of different subjects (subject A looks like subject B and vice versa).

We propose FastComposer, a tuning-free, personalized multi-subject text-to-image generation method. Our key idea is to replace the generic word tokens, such as “person”, with an embedding that captures an individual’s unique identity in the text conditioning. We use a vision encoder to derive this identity embedding from a referenced image and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning. Our design allows the generation of images featuring specified subjects with only forward passes and can be further integrated with model compression techniques (Xiao et al., 2022; Bolya et al., 2023; Han et al., 2016) to boost deployment efficiency.

Fig. 2

Two challenges faced by existing subject-driven image generation methods. First, current methods blend the distinct characteristics of different subjects (identity blending), as shown in the left panel, where Newton resembles Einstein. Cross-attention localization (Sect. 4.2) solves this problem. Second, they suffer from subject overfitting, where they overfit to the input image and ignore the text instruction. Delayed subject conditioning (Sect. 4.3) addresses this issue

To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary cause (Fig. 4). When the text includes two “person” tokens, each token’s attention map attends to both persons in the image rather than linking each token to a distinct person. To address this, we propose supervising the cross-attention maps of subjects with segmentation masks during training (i.e., cross-attention localization), using standard segmentation tools (Cheng et al., 2022). This supervision explicitly guides the model to map subject features to distinct and non-overlapping regions of the image, thereby facilitating the generation of high-quality multi-subject images (Fig. 2 left). We note that segmentation and cross-attention localization are only required during the training phase.

Naively applying subject-augmented conditioning leads to subject overfitting (Fig. 2 right), restricting the user’s ability to edit subjects based on textual directives. To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Fig. 5).

FastComposer enables inference-only generation of multiple-subject images across diverse scenarios (Fig. 1). FastComposer achieves 300\(\times \)–2500\(\times \) speedup and 2.8\(\times \)–6.7\(\times \) memory saving compared to fine-tuning-based methods, requiring zero extra storage for new subjects. FastComposer paves the way for low-cost, personalized, and versatile text-to-image generation.

2 Related Work

2.1 Subject-Driven Image Generation

Subject-driven image generation aims to render a particular subject unseen at the initial training stage. Given a limited number of example images of the subject, it seeks to synthesize novel renditions in diverse contexts. DreamBooth (Ruiz et al., 2023), textual-inversion (Gal et al., 2023a), and custom-diffusion (Kumari et al., 2023) use optimization-based methods to embed subjects into diffusion models. This is achieved by either fine-tuning the model weights (Ruiz et al., 2023; Kumari et al., 2023) or inverting the subject image into a text token that encodes the subject identity (Gal et al., 2023a). Recently, tuning-encoder (Roich et al., 2022) reduces the total number of fine-tuning steps by first generating a set of inverted latent codes using a pre-trained encoder and then refining these codes through several fine-tuning steps to better preserve subject identities. However, all these tuning-based methods (Gal et al., 2023b; Kumari et al., 2023; Gal et al., 2023a; Ruiz et al., 2023) require resource-intensive backpropagation, and the hardware must be capable of fine-tuning the model, which is neither feasible on edge devices such as smartphones nor scalable for cloud-based applications. In contrast, our FastComposer amortizes the costly subject tuning during the training phase, enabling instantaneous personalization of multiple subjects using simple feedforward passes at test time.

A number of concurrent works have explored tuning-free methods. X&Fuse (Kirstain et al., 2023) concatenates the reference image with the noisy latent for image conditioning. ELITE (Wei et al., 2023) and InstantBooth (Shi et al., 2023) use global and local mapping networks to project reference images into word embeddings and inject reference image patch features into cross-attention layers to enhance local details. IP-Adapter (Ye et al., 2023) introduces a small module for fine-tuning specific subjects and employs a decoupled cross-attention strategy for adapter reuse. Despite impressive results for single-object customization, their architecture design restricts their applicability to multiple-subject settings, as they rely on global interactions between the generated image and the reference input image. UMM-Diffusion (Ma et al., 2023) and Face0 (Valevski et al., 2023) share an architecture similar to ours. InstantID (Wang et al., 2024) uses a face encoder, an image prompt module, and IdentityNet to generate identity-preserving images without modifying the pre-trained model. More recent works, such as W+Adapter (Li et al., 2023b) and PhotoMaker (Li et al., 2023c), enhance our method by using better retrieval-based training sets and architectural optimizations to inject subject information. However, these works still struggle to generate multiple subjects in a single image. In comparison, our method supports multi-subject composition via a cross-attention localization supervision mechanism (Sect. 4.2).

2.2 Multi-Subject Image Generation

Custom-Diffusion (Kumari et al., 2023) enables multi-concept composition by jointly fine-tuning the diffusion model for multiple concepts. However, it typically handles concepts with clear semantic distinctions, such as animals and their related accessories or backgrounds. The method encounters challenges when dealing with subjects within similar categories, often generating the same person twice when composing two different individuals (Fig. 1). SpaText (Avrahami et al., 2023b), eDiff-I (Balaji et al., 2022), Paint by Word (Andonian et al., 2021), and Collage Diffusion (Sarukkai et al., 2023) enable multi-object composition through a layout-to-image generation process. A user-provided segmentation mask determines the final layout, which is then transformed into a high-resolution image using a diffusion model, often with attention modulation techniques. Nevertheless, these techniques either compose generic objects without customization (Avrahami et al., 2023b) or demand the costly textual-inversion process to encode instance-specific details (Sarukkai et al., 2023). Additionally, while these techniques offer precise control over object locations, they require users to provide detailed layouts, which can be challenging for complex scenes with rich interactions. In contrast, our approach simplifies the creation process by generating multi-subject conditioned images from just a text input, significantly reducing the burden on users and facilitating an easier design process.

Lately, several works have further enhanced the performance of multi-subject image generation, though they often necessitate prolonged tuning for each subject. Among them, CelebA-basis (Yuan et al., 2023) constructs a basis in the CLIP text embedding space; given a new reference subject, the model then optimizes coefficients of this basis to accurately match the target face. Break-a-scene (Avrahami et al., 2023a) takes a different approach by first segmenting reference images into several concepts and optimizing a textual embedding for each. Mix-of-show (Gu et al., 2023) trains individual LoRAs for each subject and then composes them to generate images through region-aware cross-attention. SVDiff (Han et al., 2023) fine-tunes the singular values of diffusion model weight matrices to enable personalization. It further innovates in multi-subject generation by employing a Cut-Mix-Unmix data augmentation strategy, where multiple subjects are concatenated into a single training example, helping the network correlate each customized embedding with its specific object. However, this pseudo-dataset approach can deviate significantly from real natural images, often limiting the diversity of the generated images.

2.3 Attention in Diffusion Models

Our cross-attention localization technique for mitigating identity blending in multi-subject image generation builds upon prior research into the role of attention mechanisms within diffusion models (Hertz et al., 2023; Parmar et al., 2023; Tumanyan et al., 2022; Liu et al., 2024; Patashnik et al., 2023; Cao et al., 2023). Notable studies, such as Prompt-to-Prompt (Hertz et al., 2023) and Pix2Pix-Zero (Parmar et al., 2023), enable layout-preserving image editing or translation by modifying the text prompt while keeping the cross-attention maps unchanged.

Fig. 3

Training and inference pipeline of FastComposer. Given a text description and images of multiple subjects, FastComposer uses an image encoder to extract the features of the subjects and augments the corresponding text tokens. The diffusion model is trained to generate multi-subject images with augmented conditioning. We use cross-attention localization (Sect. 4.2) to boost multi-subject generation quality, and delayed subject conditioning to avoid subject overfitting (Sect. 4.3)

3 Preliminaries

3.1 Stable Diffusion

We use the state-of-the-art StableDiffusion (SD) model as our backbone network. The SD model consists of three components: a variational autoencoder (VAE), a U-Net, and a text encoder. The VAE encoder \(\mathcal {E}\) compresses the image x to a smaller latent representation z, which is subsequently perturbed by Gaussian noise \(\varepsilon \) in the forward diffusion process. The U-Net, parameterized by \(\theta \), denoises the noisy latent representation by predicting the noise. This denoising process can be conditioned on text prompts through the cross-attention mechanism, where the text encoder \(\psi \) maps the text prompt \(\mathcal {P}\) to conditional embeddings \(\psi (\mathcal {P})\). During training, the network is optimized to minimize the loss function given by the equation below:

$$\begin{aligned} \mathcal {L}_{\text {noise}} = \mathbb {E}_{z\sim \mathcal {E}(x),\mathcal {P},\varepsilon \sim \mathcal {N}(0,1),t} \left[ || \varepsilon - \varepsilon _\theta (z_t, t, \psi (\mathcal {P})) ||_2^2 \right] , \end{aligned}$$
(1)

where \(z_t\) is the latent code at time step t. At inference time, random noise \(z_T\) is sampled from \(\mathcal {N}(0,1)\) and iteratively denoised by the U-Net into the clean latent representation \(z_0\). Finally, the VAE decoder \(\mathcal {D}\) generates the final image by mapping the latent code back to pixel space: \(\hat{x} = \mathcal {D}(z_0)\).
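For concreteness, below is a minimal sketch of the training objective in Eq. 1, written against diffusers-style interfaces (AutoencoderKL, UNet2DConditionModel, DDPMScheduler, and a CLIP text encoder). The function name and the 0.18215 latent scaling factor are illustrative assumptions, not a reproduction of the actual training code.

```python
import torch
import torch.nn.functional as F

def sd_denoising_loss(vae, unet, text_encoder, scheduler, pixel_values, input_ids):
    """Single training step of Eq. 1: predict the noise added to a latent."""
    # Encode the image into the latent space (scaled as in Stable Diffusion v1).
    latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215

    # Sample Gaussian noise and a random timestep per example.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)

    # Forward diffusion: perturb the clean latents with noise at timestep t.
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Text conditioning psi(P) from the CLIP text encoder.
    cond = text_encoder(input_ids)[0]

    # The U-Net predicts the noise; the loss is the MSE in Eq. 1.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    return F.mse_loss(noise_pred, noise)
```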

3.2 Text-Conditioning via Cross-Attention Mechanism

In the SD model, the U-Net employs a cross-attention mechanism to denoise the latent code conditioned on text prompts. For simplicity, we use the single-head attention mechanism in our discussion. Let \(\mathcal {P}\) represent the text prompts with n tokens and \(\psi \) denote the text encoder, which is typically a pre-trained CLIP text encoder. The encoder converts \(\mathcal {P}\) into a list of d-dimensional embeddings, \(\psi (\mathcal {P}) = c \in \mathbb {R}^{n\times d}\). The cross-attention layer accepts the spatial latent code \(z \in \mathbb {R}^{(h \times w) \times f}\) and the text embeddings c as inputs, where h and w are the height and width of the 2-D latent code, and f is the number of dimensions of the latent space. It then projects the latent code and text embeddings into Query, Key, and Value matrices: \(Q = W^q z\), \(K = W^k c\), and \(V = W^v c\). Here, \(W^q \in \mathbb {R}^{f \times d'}, W^k, W^v \in \mathbb {R}^{d \times d'}\) represent the weight matrices of the three linear layers, and \(d'\) is the dimension of the Query, Key, and Value embeddings. The cross-attention layer then computes the attention scores \(A = \text {Softmax}(\frac{QK^T}{\sqrt{d'}}) \in [0,1]^{(h \times w) \times n}\), and takes a weighted sum over the Value matrix to obtain the cross-attention output \(z_\text {attn} = AV \in \mathbb {R}^{(h \times w) \times d'}\). Intuitively, the cross-attention mechanism “scatters” textual information to the 2D latent code space, and A[i, j, k] represents the amount of information flow from the k-th text token to the (i, j) latent pixel. Our method is based on this semantic interpretation of the cross-attention map, and we will discuss it in detail in Sect. 4.2.
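The computation above can be summarized by the following minimal single-head sketch; the function and variable names mirror the notation in this section and are not tied to any specific library.

```python
import torch

def cross_attention(z, c, W_q, W_k, W_v):
    """Single-head cross-attention from Sect. 3.2.

    z: (h*w, f) spatial latent code; c: (n, d) text embeddings;
    W_q: (f, d'); W_k, W_v: (d, d') projection weights.
    """
    Q = z @ W_q                      # (h*w, d')
    K = c @ W_k                      # (n, d')
    V = c @ W_v                      # (n, d')
    d_prime = Q.shape[-1]
    # A[i, k]: information flow from the k-th text token to latent pixel i.
    A = torch.softmax(Q @ K.T / d_prime ** 0.5, dim=-1)   # (h*w, n)
    return A @ V                     # (h*w, d'), the cross-attention output
```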

Fig. 4

In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merges their identities. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more clearly separated

4 FastComposer

4.1 Tuning-Free Subject-Driven Image Generation with an Image Encoder

4.1.1 Augmenting Text Representation with Subject Embedding

To achieve tuning-free subject-driven image generation, we propose to augment text prompts with visual features extracted from reference subject images. Given a text prompt \(\mathcal {P} = \{w_1, w_2, \dots w_n\}\), a list of reference subject images \(\mathcal {S} = \{s_1, s_2, \dots s_m\}\), and an index list indicating which subject corresponds to which word in the text prompt \(\mathcal {I} = \{i_1, i_2, \dots i_m\}, i_j \in \{1, 2, \dots , n\}\), we first encode the text prompt \(\mathcal {P}\) and reference subjects \(\mathcal {S}\) into embeddings using the pre-trained CLIP text and image encoders \(\psi \) and \(\phi \), respectively. Next, we employ a multilayer perceptron (MLP) to augment the text embeddings with visual features extracted from the reference subjects. We concatenate (represented by ||) the word embeddings with the visual features and feed the resulting augmented embeddings into the MLP. This process yields the final conditioning embeddings \(c' \in \mathbb {R}^{n\times d}\), defined as follows:

$$\begin{aligned} c'_{i} = {\left\{ \begin{array}{ll} \psi (\mathcal {P})_i, & i \notin \mathcal {I} \\ \text {MLP}(\psi (\mathcal {P})_i || \phi (s_{j})), & i = i_j\in \mathcal {I} \end{array}\right. } \end{aligned}$$
(2)

Figure 3 gives a concrete example of our augmentation approach.
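A minimal sketch of Eq. 2 is given below. The module name, MLP depth, and embedding dimensions (768 for the SD v1.5 text tokens, 1024 for the CLIP vision features) are illustrative assumptions rather than the exact architecture used in FastComposer.

```python
import torch
import torch.nn as nn

class SubjectAugmenter(nn.Module):
    """Sketch of Eq. 2: fuse a subject's image feature into its word embedding."""

    def __init__(self, text_dim=768, image_dim=1024, hidden_dim=768):
        super().__init__()
        # MLP mapping [word embedding || image feature] back to the text dimension.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, text_embeds, subject_feats, subject_token_idx):
        """
        text_embeds:       (n, d)     token embeddings psi(P)
        subject_feats:     (m, d_img) image features phi(s_j)
        subject_token_idx: list of m token positions (the index list I)
        """
        c_aug = text_embeds.clone()
        for j, i in enumerate(subject_token_idx):
            fused = torch.cat([text_embeds[i], subject_feats[j]], dim=-1)
            c_aug[i] = self.mlp(fused)          # augmented token c'_i
        return c_aug
```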

4.1.2 Subject-Driven Image Generation Training

To enable inference-only subject-driven image generation, we train the image encoder, the MLP module, and the U-Net with the denoising loss (Fig. 3). We create a subject-augmented image-text paired dataset to train our model, where noun phrases from image captions are paired with subject segments appearing in the target images. We first use a dependency parsing model to chunk all noun phrases (e.g., “a woman”) in image captions and a panoptic segmentation model to segment all subjects present in the image. We then pair these subject segments with corresponding noun phrases in the captions with a greedy matching algorithm based on text and image similarity (Radford et al., 2021; Reimers & Gurevych, 2019). The process of constructing the subject-augmented image-text dataset is detailed in Sect. 5.1.1. In the training phase, we employ subject-augmented conditioning, as outlined in Eq. 2, to denoise the perturbed target image. We also mask the subjects’ backgrounds with random noise before encoding, preventing overfitting to the subjects’ backgrounds. Consequently, FastComposer can directly use natural subject images during inference without explicit background segmentation.

4.2 Localizing Cross-Attention Maps with Subject Segmentation Masks

We observe that traditional cross-attention maps tend to attend to all subjects at the same time, which leads to identity blending in multi-subject image generation (Fig. 4top). We propose to localize cross-attention maps with subject segmentation masks during training to solve this issue.

Fig. 5

Effects of using different ratios of timesteps for subject conditioning. A ratio between 0.6 and 0.8 yields good results and achieves a balance between prompt consistency and identity preservation

4.2.1 Understanding the Identity Blending in Diffusion Models

Prior research (Hertz et al., 2023) shows that the cross-attention mechanism within diffusion models governs the layout of generated images. The scores in cross-attention maps represent “the amount of information flow from a text token to a latent pixel.” We hypothesize that identity blending arises from the unrestricted cross-attention mechanism, as a single latent pixel can attend to all text tokens. If one subject’s region attends to multiple reference subjects, identity blending occurs. In Fig. 4, we confirm our hypothesis by visualizing the average cross-attention map within the second up-sampling layer of the U-Net of the diffusion model. In the unregularized model, two reference subject tokens often influence the same generated person at the same time, causing a mix of features from both subjects. We argue that proper cross-attention maps should resemble an instance segmentation of the target image, clearly separating the features related to different subjects. To achieve this, we add a regularization term to the subject cross-attention maps during training to encourage focusing on specific instance areas. Segmentation maps and cross-attention regularization are only used during training, not at test time.

4.2.2 Localizing Cross-Attention with Segmentation Masks

As discussed in Sect. 3.2, a cross-attention map \(A \in [0,1]^{(h \times w) \times n}\) connects latent pixels to conditional embeddings at each layer, where A[i, j, k] denotes the information flow from the k-th conditional token to the (i, j) latent pixel. Ideally, the subject token’s attention map should focus solely on the subject region rather than spreading throughout the entire image, preventing identity blending among subjects. To accomplish this, we propose localizing the cross-attention map using the reference subject’s segmentation mask. Let \(\mathcal {M} = \{M_1, M_2, \dots M_m\}\) represent the reference subjects’ segmentation masks, \(\mathcal {I} = \{i_1, i_2, \dots i_m\}\) be the index list indicating which subject corresponds to each word in the text prompt, and \(A_{i} = A[:,:,i] \in [0,1]^{(h \times w)}\) be the cross-attention map of the i-th subject token. We supervise the cross-attention map \(A_{i_j}\) to be close to the segmentation mask \(M_j\) of the j-th subject, i.e., \(A_{i_j} \approx M_j\). We employ a balanced L1 loss to minimize the distance between the cross-attention map and the segmentation mask:

$$\begin{aligned} \mathcal {L}_{\text {loc}} = \frac{1}{m}\sum _{j=1}^m (\text {mean}(A_{i_j}[\bar{M}_j]) - \text {mean}(A_{i_j}[M_j])), \end{aligned}$$
(3)

where \(\bar{M}_j\) denotes the complement of the segmentation mask \(M_j\) (0s become 1s and vice versa). The final training objective of FastComposer is given by:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {noise}} + \lambda \mathcal {L}_{\text {loc}}, \end{aligned}$$
(4)

where the localization loss is weighted by the hyperparameter \(\lambda = 0.001\). Motivated by prior work (Hertz et al., 2023; Chefer et al., 2023), we apply the localization loss to the downsampled cross-attention maps, i.e., the middle 5 blocks of the U-Net, which are known to contain more semantic information. As illustrated in Fig. 4, our localization technique enables the model to precisely allocate attention to reference subjects at test time, which prevents identity blending between subjects.
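A minimal sketch of the localization objective in Eqs. 3 and 4 is shown below, assuming the subject-token cross-attention maps have already been gathered and resized to the mask resolution; variable names are illustrative.

```python
import torch

def balanced_l1_localization_loss(attn_maps, masks):
    """Eq. 3: pull each subject token's attention inside its segmentation mask.

    attn_maps: (m, h, w) cross-attention maps A_{i_j} for the m subject tokens
    masks:     (m, h, w) binary segmentation masks M_j at the same resolution
    """
    # Mean attention inside the mask (rewarded) and outside the mask (penalized).
    inside = (attn_maps * masks).sum(dim=(1, 2)) / masks.sum(dim=(1, 2)).clamp(min=1)
    outside = (attn_maps * (1 - masks)).sum(dim=(1, 2)) / (1 - masks).sum(dim=(1, 2)).clamp(min=1)
    return (outside - inside).mean()

# Final objective (Eq. 4): loss = noise_loss + 0.001 * balanced_l1_localization_loss(...)
```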

4.3 Delayed Subject Conditioning in Iterative Denoising

During inference, using the augmented text representation directly often leads to images that closely resemble the subjects while ignoring the textual directives. This occurs because the image layout forms at the early phases of the denoising process, and premature augmentation from the reference image causes the resulting image to stray from the text instructions. Prior methods (Gal et al., 2023b; Roich et al., 2022) mitigate this issue by generating an initial latent code and refining it through iterative model fine-tuning. However, this process is resource-intensive and requires high-end devices for model fine-tuning. Inspired by Style Mixing (Karras et al., 2019), we propose a simple delayed subject conditioning scheme, which enables inference-only subject conditioning while striking a balance between identity preservation and editability.

Specifically, we perform image augmentation only after the layout has been created using a text-only prompt. In this framework, our time-dependent noise prediction model can be represented as:

$$\begin{aligned} \epsilon _t = {\left\{ \begin{array}{ll} \epsilon _\theta (z_t, t, c) & \text {if } t > \alpha T, \\ \epsilon _\theta (z_t, t, c') & \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

where c denotes the original text embedding, \(c'\) denotes the text embedding augmented with the input image embedding, and \(\alpha \) is a hyperparameter indicating the ratio of subject conditioning. We ablate the effect of using different \(\alpha \) values in Fig. 5. Empirically, \(\alpha \in [0.6, 0.8]\) yields good results that balance prompt consistency and identity preservation, though it can be easily tuned for specific instances.
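The following is a minimal sketch of Eq. 5 inside a diffusers-style sampling loop; the number of inference steps and the function names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def denoise_with_delayed_conditioning(unet, scheduler, z_T, c_text, c_subject, alpha=0.7):
    """Eq. 5: text-only conditioning early (layout), subject-augmented late (appearance)."""
    scheduler.set_timesteps(50)
    T = scheduler.config.num_train_timesteps
    z = z_T
    for t in scheduler.timesteps:
        # Early steps (large t > alpha*T) use the text-only embedding c;
        # later steps switch to the subject-augmented embedding c'.
        cond = c_text if t > alpha * T else c_subject
        noise_pred = unet(z, t, encoder_hidden_states=cond).sample
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z
```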

5 Experiments

5.1 Setup

5.1.1 Dataset Construction

We build a subject-augmented image-text paired dataset based on the FFHQ-wild (Karras et al., 2019) dataset to train our models. First, we use the BLIP-2 model (Li et al., 2023a) blip2-opt-6.7b-coco to generate captions for all images. Next, we employ the Mask2Former model (Cheng et al., 2022) mask2former-swin-large-coco-panoptic to generate panoptic segmentation masks for each image. We then leverage the spaCy (Honnibal & Montani, 2017) library to chunk all noun phrases in the image captions and expand numbered plural phrases (e.g., “two women”) into singular phrases connected by “and” (e.g., “a woman and a woman”). Finally, we use a greedy matching algorithm to match noun phrases with image segments, scoring each pair by the product of the image-text similarity from the OpenCLIP model (Ilharco et al., 2021) CLIP-ViT-H-14-laion2B-s32B-b79K and the label-text similarity from the Sentence-Transformer (Reimers & Gurevych, 2019) model stsb-mpnet-base-v2, as sketched below. We reserve 1000 images for validation and testing purposes.
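A minimal sketch of the greedy matching step follows. The two scoring functions stand in for the OpenCLIP image-text similarity and the Sentence-Transformer label-text similarity named above; the tie-breaking details are simplified assumptions.

```python
def greedy_match(phrases, segments, image_text_sim, label_text_sim):
    """Greedily pair caption noun phrases with image segments by combined similarity.

    image_text_sim(phrase, segment): similarity of the phrase and the segment crop.
    label_text_sim(phrase, segment): similarity of the phrase and the segment label.
    """
    # Score every (phrase, segment) pair by the product of the two similarities.
    candidates = [
        (image_text_sim(p, s) * label_text_sim(p, s), i, j)
        for i, p in enumerate(phrases)
        for j, s in enumerate(segments)
    ]
    candidates.sort(reverse=True)

    matches, used_phrases, used_segments = [], set(), set()
    for score, i, j in candidates:
        if i in used_phrases or j in used_segments:
            continue                       # each phrase / segment matched at most once
        matches.append((phrases[i], segments[j], score))
        used_phrases.add(i)
        used_segments.add(j)
    return matches
```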

5.1.2 Training Details

We start training from the StableDiffusion v1-5 (Rombach et al., 2022) model. To encode the visual inputs, we use OpenAI’s clip-vit-large-patch14 vision model, which serves as the partner model of the text encoder in SDv1-5. During training, we freeze the text encoder and only train the U-Net, the MLP module, and the last two transformer blocks of the vision encoder. We train our models for 150k steps on 8 NVIDIA A6000 GPUs, with a constant learning rate of 1e-5 and a batch size of 128. We only augment segments whose COCO (Lin et al., 2014) label is “person” and set a maximum of 4 reference subjects during training, with each subject having a 10% chance of being dropped. To maintain the model’s capability for text-only generation, we train on text-only conditioning for 10% of the samples. To facilitate classifier-free guidance sampling (Ho & Salimans, 2022), we train without any condition on 10% of the instances. For half of the training samples, we apply the denoising loss only in the subject region to enhance the generation quality in the subject area. The conditioning-dropout scheme is sketched below.
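In the following sketch, the probabilities match the description above, while the helper name and code organization are illustrative assumptions.

```python
import random

def sample_conditioning(text_embeds, augmented_embeds, null_embeds,
                        p_uncond=0.1, p_text_only=0.1):
    """Choose the conditioning for one training sample.

    10% of samples are unconditional (for classifier-free guidance),
    10% use text-only conditioning, and the rest use subject-augmented
    conditioning (Eq. 2, with each subject independently dropped 10% of the time).
    """
    r = random.random()
    if r < p_uncond:
        return null_embeds          # unconditional training
    if r < p_uncond + p_text_only:
        return text_embeds          # text-only conditioning
    return augmented_embeds         # subject-augmented conditioning
```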

Table 1 Comparison between our method and baseline approaches on single-subject image generation

5.1.3 Evaluation Metrics

We evaluate image generation quality on identity preservation and prompt consistency. Identity preservation is determined by detecting faces in the reference and generated images using MTCNN (Zhang et al., 2016), and then calculating a pairwise identity similarity using FaceNet (Schroff et al., 2015). For multi-subject evaluation, we identify all faces within the generated images and use a greedy matching procedure between the generated faces and reference subjects. The minimum similarity value among all subjects measures overall identity preservation. We evaluate the prompt consistency using the average CLIP-L/14 image-text similarity following textual-inversion (Gal et al., 2023a). For efficiency evaluation, we consider the total time for customization, including fine-tuning (for tuning-based methods) and inference. We also measure peak memory usage during the entire procedure.
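A sketch of the identity-preservation metric is shown below. It assumes the facenet-pytorch implementations of MTCNN and FaceNet with their default settings; only the underlying detection and embedding methods are specified above, so the library choice is an assumption.

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                                 # face detector
facenet = InceptionResnetV1(pretrained='vggface2').eval()     # face embedding model

def face_embedding(pil_image):
    """Detect a face and return its FaceNet embedding (None if no face found)."""
    face = mtcnn(pil_image)
    if face is None:
        return None
    with torch.no_grad():
        return facenet(face.unsqueeze(0))[0]

def identity_similarity(reference_img, generated_img):
    """Cosine similarity between reference and generated face embeddings."""
    e_ref, e_gen = face_embedding(reference_img), face_embedding(generated_img)
    if e_ref is None or e_gen is None:
        return 0.0
    return torch.nn.functional.cosine_similarity(e_ref, e_gen, dim=0).item()
```

For multi-subject images, all detected faces are greedily matched to the reference subjects with this similarity, and the minimum matched similarity is reported, as described above.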

Fig. 6

Comparison of different methods on single subject image generation. For text-only methods (i.e., StableDiffusion and Midjourney), we use names of scientists or generic terms like man and woman in the text prompt

5.2 Single-Subject Image Generation

Our first evaluation targets the performance of single-subject image generation. We benchmark our approach against leading optimization-based approaches, including DreamBooth (Ruiz et al., 2023), Textual-Inversion (Gal et al., 2023a), and Custom Diffusion (Kumari et al., 2023). Additionally, we include a comparison with ELITE (Wei et al., 2023), a concurrent tuning-free method. We use the implementations from the diffusers library (von Platen et al., 2022) and provide the detailed hyperparameters in the appendix. We assess the capabilities of these methods in generating personalized content for unseen subjects derived from the Celeb-A dataset (Liu et al., 2015). To construct our evaluation benchmark, we develop a broad range of text prompts encapsulating a wide spectrum of scenarios, such as recontextualization, stylization, accessorization, and diverse actions. The entire test set comprises 15 subjects, with 30 unique text prompts allocated to each. An exhaustive list of text prompts is available in the appendix. We use five images per subject to fine-tune the optimization-based methods, given our observation that these methods overfit and simply reproduce the reference image when a single reference image is used. In contrast, our model employs a single randomly selected image for each subject. As shown in Table 1, FastComposer surpasses all baselines, delivering superior identity preservation and prompt consistency. Remarkably, in comparison to optimization-based methods, it achieves a 300\(\times \)–1200\(\times \) speedup and a 2.8\(\times \) reduction in memory usage. Figure 6 shows qualitative results of single-subject personalization, employing different approaches across an array of prompts. Significantly, our model matches the text consistency of text-only methods and exceeds all baseline strategies in terms of identity preservation, using only a single reference image and forward passes.

Table 2 Comparison between our method and baseline approaches on multiple-subject image generation

5.3 Multi-Subject Image Generation

We then consider a more complex setting: multi-object, subject-driven image generation. We examine the quality of multi-subject generation by using all possible combinations (105 pairs in total) formed from the 15 subjects described in Sect. 5.2, allocating 21 prompts to each pair for assessment. Table 2 shows a quantitative analysis contrasting FastComposer with the baseline methods. Optimization-based methods (Kumari et al., 2023; Ruiz et al., 2023; Gal et al., 2023a) frequently falter in maintaining identity preservation, often generating generic images or images that blend identities among different reference subjects. FastComposer, on the other hand, preserves the unique features of different subjects, yielding a significantly improved identity preservation score. Furthermore, our prompt consistency is on par with tuning-based approaches (Gal et al., 2023a; Kumari et al., 2023). Qualitative comparisons are shown in Fig. 1.

5.4 Generated Image Quality and Diversity

5.4.1 Automatic Evaluation

In Tables 3 and 4, we compare the quality and diversity of the generated images between our method and the baseline methods in the single- and multi-subject settings, respectively. We use ImageReward (Xu et al., 2024) and Aesthetic Score (Schuhmann et al., 2022) to measure image quality. To evaluate diversity, we adopt a diversity score from Kang et al. (2024), generating four images from the same prompt and measuring the inter-seed pairwise distance. Our approach achieves image quality and diversity comparable to previous baselines, while enabling significantly better subject-driven generation (see Tables 1 and 2).

Table 3 Comparison of image quality and diversity between our method and baseline approaches on single-subject image generation
Table 4 Comparison of image quality and diversity between our method and baseline approaches on multi-subject image generation
Fig. 7

User study comparing our model with competing baselines (Wei et al., 2023; Kumari et al., 2023) for single subject image generation

Fig. 8

User study comparing our model with competing baselines (Kumari et al., 2023; Gal et al., 2023a) for multiple subject image generation

5.4.2 Human Evaluation

To further verify our method’s effectiveness, we conducted an extensive user study comparing our model’s outputs with those from competing methods. We used a subset of 100 (subject, prompt) pairs from our evaluation dataset. For each comparison, a random set of three evaluators was asked to choose the image that better represents the text prompt or the reference subjects. Detailed information about the human evaluation is included in Appendix D.

As shown in Figs. 7 and 8, our model achieves much higher subject consistency while maintaining comparable prompt alignment. Notably, in multi-subject generation, our model outperforms competing methods like custom diffusion (Kumari et al., 2023) and textual inversion (Gal et al., 2023a) over 80% of the time. While Custom Diffusion (Kumari et al., 2023) exhibits higher prompt alignment than our approach, it suffers from significantly worse identity preservation. This suggests that the improved alignment is mostly due to random outputs that ignore the reference.

5.5 Extending FastComposer Beyond Human Faces

In Figs. 9 and 10, we apply FastComposer on the LSUN-cat (Yu et al., 2015) dataset and successfully generate high-quality single and multiple cat images without subject-specific fine-tuning.

Fig. 9

FastComposer can generate subjects beyond people. Here we show single-cat image generation results

Fig. 10

FastComposer’s multiple-cat image generation results

Table 5 Comparison between our method and baseline approaches on single-cat image generation
Table 6 Comparison between our method and baseline approaches on multiple-cat image generation

In Tables 5 and 6, we show the quantitative results of single- and multiple-cat image generation, comparing against baselines. To assess identity preservation, we compute the pairwise DINO distance (Caron et al., 2021) between the cats detected in the generated images and the reference subjects. We use six cats from the Custom Diffusion (Kumari et al., 2023) evaluation dataset as the testing subjects. The results demonstrate that our method excels in generating cat images without the need for model fine-tuning at deployment time. Previous methods often overfit to the reference subjects (e.g., DreamBooth) or fail to effectively learn subject conditioning (e.g., Custom Diffusion). Additionally, these methods frequently either merge attributes from different subjects or completely ignore the second subject in multi-subject image generation. Our approach strikes a favorable balance between preserving identity and adhering to prompts. With additional computational resources and training data, extending to more categories of subjects is possible, which we leave for future work.

5.6 Combining with ControlNet

In Fig. 11, we combine FastComposer with a pre-trained ControlNet model (Zhang et al., 2023), illustrating our technique’s ability to generate subject-specific images guided by specified poses. This integration highlights the flexibility and additional control that FastComposer offers.

Fig. 11

FastComposer combined with ControlNet can generate subject images following pose guides without fine-tuning

5.7 Taking Multiple Inputs for a Subject

Figure 12 shows that FastComposer can take multiple reference inputs for a single subject, which facilitates the creation of images with a broader spectrum of variations while mitigating the risk of over-reliance on a singular reference point. By averaging the subject embeddings derived from various reference images, our approach yields images with pronounced local variations. Notably, taking averaged subject embeddings results in discernible differences in aspects such as gaze direction and head positioning, demonstrating the nuanced capabilities of our model when supplied with an enriched data input.
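A minimal sketch of this embedding averaging is shown below, assuming an image_encoder callable that returns the identity feature \(\phi (s_j)\) for one reference image; the function name is illustrative.

```python
import torch

def average_subject_embedding(image_encoder, reference_images):
    """Average the identity embeddings of several references of the same subject."""
    feats = torch.stack([image_encoder(img) for img in reference_images], dim=0)
    return feats.mean(dim=0)   # used in place of phi(s_j) in Eq. 2
```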

Fig. 12

FastComposer supports taking multiple inputs of a single subject. Multiple inputs lead to enhanced local variations, such as differences in gaze direction (upper row) and head pose (bottom row)

Table 7 Ablation studies on the cross-attention localization supervision

5.8 Ablation Study

5.8.1 Delayed Subject Conditioning

Figure 5 shows the impact of varying the ratio of timesteps devoted to subject conditioning, a hyperparameter in our delayed subject conditioning approach. As this ratio increases, the model improves in identity preservation but loses editability. A ratio between 0.6 and 0.8 achieves a favorable balance on the tradeoff curve.

5.8.2 Cross-Attention Localization Loss

Table 7 presents the ablation studies on our proposed cross-attention localization loss. The baseline is trained in the same setting but excludes the localization loss. Our method demonstrates a substantial enhancement of the identity preservation score. Figure 4 shows the qualitative comparisons. Incorporating the localization loss allows the model to focus on particular reference subjects, thereby avoiding identity blending.

Table 8 Ablation studies on architectural configurations of FastComposer

5.8.3 Architectural Configurations

Table 8 presents the ablation study results, illustrating the impact of various architectural configurations on the performance of FastComposer. The configurations evaluated include: (1) training only the cross-attention module in the U-Net of the StableDiffusion model, (2) freezing the image encoder during training, and (3) replacing the subject token entirely with the image token instead of using a multi-layer perceptron (MLP) module to fuse the information. We use single-person image generation to measure model performance across these configurations. The experimental results show that all alternative configurations consistently underperform compared to the final architecture of FastComposer. Specifically, training only the cross-attention module yields moderate identity preservation but lower prompt consistency. Freezing the image encoder further reduces identity preservation without significant gains in prompt consistency. Replacing the subject token with the image token drastically diminishes identity preservation, indicating the importance of the MLP module for effective information fusion.

5.8.4 Number of Subjects

Beyond generating two subjects, we test the limits of FastComposer by generating 3 and 4 subjects in one image; the results are shown in Figs. 13 and 14. We find that FastComposer can generate at most three subjects in a single image, and generating more than three subjects fails (e.g., the output might only display 3 subjects, with one of them exhibiting blended features from two subjects). This limitation arises from the long-tail distribution of the number of people per image in the FFHQ-wild dataset, where images with more than three people are rare (less than 3%). We believe training on a larger dataset with a more balanced distribution of subject counts can mitigate this issue.

Fig. 13

Generating images with three subjects

Fig. 14

Generating images with four subjects. Our method fails due to subject fusion (merging multiple subjects into one) almost all the time

6 Conclusion

We propose FastComposer, a tuning-free method for personalized, multi-subject text-to-image generation. We achieve tuning-free subject-driven image generation by utilizing a pre-trained vision encoder, making this process efficient and accessible across various platforms. We also propose a novel delayed subject conditioning technique to balance the preservation of subject identity and the flexibility of image editability. Additionally, FastComposer effectively tackles the identity blending issue in multi-subject generation by supervising cross-attention maps during pre-training. FastComposer enables the creation of high-quality, personalized, and complex images, opening up new possibilities in the field of text-to-image generation.