1 Introduction

Fig. 1

Comparison with baselines for multi-subject image generation. We use scientists’ names in the text prompt for text-only methods (SD, MJ). Text-only methods only perform well when subjects are present in the training dataset but struggle to maintain the identity otherwise. Fine-tuning-based methods blend the identity of different persons (TI rows 1 and 2, CD rows 1, 2, 4), deviate from the text instruction and only generate a single subject (TI row 4), or generate images that do not resemble any specific reference (CD row 3)

Recent advancements in text-to-image generation (Ramesh et al., 2021; Chang et al., 2023; Kang et al., 2023; Ding et al., 2021), particularly diffusion models (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Sohl-Dickstein et al., 2015), have opened new frontiers in content creation. Subject-driven text-to-image generation permits personalization to new individuals given a few sample images (Ruiz et al., 2023; Casanova et al., 2021; Nitzan et al., 2022; Gal et al., 2023a; Kumari et al., 2023), allowing the generation of images featuring specific subjects in novel scenes, styles, and actions. However, existing subject-driven text-to-image generation methods still suffer from two key limitations: the cost of fine-tuning for personalization and identity blending for multiple subjects. Personalization is costly because these methods typically require fine-tuning the model for each new subject to achieve good fidelity. The computational overhead and high hardware demands of model tuning, largely due to the memory consumption (Chen et al., 2016) and computation of backpropagation, constrain the applicability of these models across various platforms. Furthermore, although some methods have been proposed to reduce the per-subject fine-tuning cost, existing tuning-free techniques struggle with multi-subject generation (Fig. 1) because of the “identity blending” issue (Fig. 2 left), in which the model mixes the distinct characteristics of different subjects (subject A looks like subject B and vice versa).

We propose FastComposer, a tuning-free, personalized multi-subject text-to-image generation method. Our key idea is to replace the generic word tokens, such as “person”, with an embedding that captures an individual’s unique identity in the text conditioning. We use a vision encoder to derive this identity embedding from a referenced image and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning. Our design allows the generation of images featuring specified subjects with only forward passes and can be further integrated with model compression techniques (Xiao et al., 2022; Bolya et al., 2023; Han et al., 2016) to boost deployment efficiency.

Fig. 2

Two challenges faced by existing subject-driven image generation methods. First, current methods blend the distinct characteristics of different subjects (identity blending), as shown in the left panel, where Newton resembles Einstein. Cross-attention localization (Sect. 4.2) solves this problem. Second, they suffer from subject overfitting, where they overfit to the input image and ignore the text instruction. Delayed subject conditioning (Sect. 4.3) addresses this issue

To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary cause (Fig. 4). When the text includes two “person” tokens, each token’s attention map attends to both persons in the image rather than linking each token to a distinct person. To address this, we propose supervising the cross-attention maps of subjects with segmentation masks during training (i.e., cross-attention localization), using standard segmentation tools (Cheng et al., 2022). This supervision explicitly guides the model to map subject features to distinct and non-overlapping regions of the image, thereby facilitating the generation of high-quality multi-subject images (Fig. 2 left). We note that segmentation and cross-attention localization are only required during the training phase.

Naively applying subject-augmented conditioning leads to subject overfitting (Fig. 2 right), restricting the user’s ability to edit subjects based on textual directives. To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Fig. 5).

FastComposer enables inference-only generation of multiple-subject images across diverse scenarios (Fig. 1). FastComposer achieves 300\(\times \)–2500\(\times \) speedup and 2.8\(\times \)–6.7\(\times \) memory saving compared to fine-tuning-based methods, requiring zero extra storage for new subjects. FastComposer paves the way for low-cost, personalized, and versatile text-to-image generation.

2 Related Work

2.1 Subject-Driven Image Generation

Subject-driven image generation aims to render a particular subject unseen at the initial training stage. Given a limited number of example images of the subject, it seeks to synthesize novel renditions in diverse contexts. DreamBooth (Ruiz et al., 2023), textual-inversion (Gal et al., 2023a), and custom-diffusion (Kumari et al., 2023) use optimization-based methods to embed subjects into diffusion models. This is achieved by either fine-tuning the model weights (Ruiz et al., 2023; Kumari et al., 2023) or inverting the subject image into a text token that encodes the subject identity (Gal et al., 2023a). Recently, tuning-encoder (Roich et al., 2022) reduces the total number of fine-tuning steps by first generating a set of inverted latent codes using a pre-trained encoder and then refining these codes through several fine-tuning steps to better preserve subject identities. However, all these tuning-based methods (Gal et al., 2023b; Kumari et al., 2023; Gal et al., 2023a; Ruiz et al., 2023) require resource-intensive backpropagation, and the hardware must be capable of fine-tuning the model, which is neither feasible on edge devices such as smartphones nor scalable for cloud-based applications. In contrast, our FastComposer amortizes the costly subject tuning during the training phase, enabling instantaneous personalization of multiple subjects using simple feedforward passes at test time.

A number of concurrent works have explored tuning-free methods. X&Fuse (Kirstain et al., 2023) concatenates the reference image with the noisy latent for image conditioning. ELITE (Wei et al., 2023) and InstantBooth (Shi et al., 2023) use global and local mapping networks to project reference images into word embeddings and inject reference image patch features into cross-attention layers to enhance local details. IP-Adapter (Ye et al., 2023) introduces a small module for fine-tuning specific subjects and employs a decoupled cross-attention strategy for adapter reuse. Despite impressive results for single-object customization, their architecture design restricts their applicability to multiple-subject settings, as they rely on global interactions between the generated image and the reference input image. UMM-Diffusion (Ma et al., 2023) and Face0 (Valevski et al., 2023) share an architecture similar to ours. InstantID (Wang et al., 2024) uses a face encoder, an image prompt module, and IdentityNet to generate identity-preserving images without modifying the pre-trained model. More recent works, such as W+Adapter (Li et al., 2023b) and PhotoMaker (Li et al., 2023c), enhance our method by using better retrieval-based training sets and architectural optimizations to inject subject information. However, these works still struggle to generate multiple subjects in a single image. In comparison, our method supports multi-subject composition via a cross-attention localization supervision mechanism (Sect. 4.2).

2.2 Multi-Subject Image Generation

Custom-Diffusion (Kumari et al., 2023) enables multi-concept composition by jointly fine-tuning the diffusion model for multiple concepts. However, it typically handles concepts with clear semantic distinctions, such as animals and their related accessories or backgrounds. The method encounters challenges when dealing with subjects within similar categories, often generating the same person twice when composing two different individuals (Fig. 1). SpaText (Avrahami et al., 2023b), eDiff-I (Balaji et al., 2022), Paint by Word (Andonian et al., 2021), and Collage Diffusion (Sarukkai et al., 2023) enable multi-object composition through a layout-to-image generation process. A user-provided segmentation mask determines the final layout, which is then transformed into a high-resolution image using a diffusion model, often with attention modulation techniques. Nevertheless, these techniques either compose generic objects without customization (Avrahami et al., 2023b) or demand the costly textual-inversion process to encode instance-specific details (Sarukkai et al., 2023). Additionally, while these techniques offer precise control over object locations, they require users to provide detailed layouts, which can be challenging for complex scenes with rich interactions. In contrast, our approach simplifies the creation process by generating multi-subject conditioned images from just a text input, significantly reducing the burden on users and facilitating an easier design process.

Lately, several works have further enhanced the performance of multi-subject image generation, though they often necessitate prolonged tuning for each subject. Among them, CelebA-basis (Yuan et al., 2023) constructs a basis in the CLIP text embedding space; given a new reference subject, the model then optimizes coefficients of this basis to accurately match the target face. Break-a-scene (Avrahami et al., 2023a) takes a different approach by first segmenting reference images into several concepts and optimizing a textual embedding for each. Mix-of-show (Gu et al., 2023) trains individual LoRAs for each subject and then composes them to generate images through region-aware cross-attention. SVDiff (Han et al., 2023) fine-tunes the singular values of diffusion model weight matrices to enable personalization. It further innovates in multi-subject generation by employing a Cut-Mix-Unmix data augmentation strategy, where multiple subjects are concatenated into a single training example, helping the network correlate each customized embedding with its specific object. However, this pseudo-dataset approach can deviate significantly from real natural images, often limiting the diversity of the generated images.

2.3 Attention in Diffusion Models

Our cross-attention localization technique for mitigating identity blending in multi-subject image generation builds upon prior research into the role of attention mechanisms within diffusion models (Hertz et al., 2023; Parmar et al., 2023; Tumanyan et al., 2022; Liu et al., 2024; Patashnik et al., 2023; Cao et al., 2023). Notable studies, such as Prompt-to-Prompt (Hertz et al., 2023) and Pix2Pix-Zero (Parmar et al., 2023), enable layout-preserving image editing or translation by modifying the text prompt while keeping the cross-attention maps unchanged.

Fig. 3

Training and inference pipeline of FastComposer. Given a text description and images of multiple subjects, FastComposer uses an image encoder to extract the features of the subjects and augments the corresponding text tokens. The diffusion model is trained to generate multi-subject images with augmented conditioning. We use cross-attention localization (Sect. 4.2) to boost multi-subject generation quality, and delayed subject conditioning to avoid subject overfitting (Sect. 4.3)

3 Preliminaries

3.1 Stable Diffusion

We use the state-of-the-art StableDiffusion (SD) model as our backbone network. The SD model consists of three components: a variational autoencoder (VAE), a U-Net, and a text encoder. The VAE encoder \(\mathcal {E}\) compresses the image x to a smaller latent representation z, which is subsequently perturbed by Gaussian noise \(\varepsilon \) in the forward diffusion process. The U-Net, parameterized by \(\theta \), denoises the noisy latent representation by predicting the noise. This denoising process can be conditioned on text prompts through the cross-attention mechanism, where the text encoder \(\psi \) maps the text prompt \(\mathcal {P}\) to conditional embeddings \(\psi (\mathcal {P})\). During training, the network is optimized to minimize the loss function given by the equation below:

$$\begin{aligned} \mathcal {L}_{\text {noise}} = \mathbb {E}_{z\sim \mathcal {E}(x),\mathcal {P},\varepsilon \sim \mathcal {N}(0,1),t} \left[ || \varepsilon - \varepsilon _\theta (z_t, t, \psi (\mathcal {P})) ||_2^2 \right] , \end{aligned}$$
(1)

where \(z_t\) is the latent code at time step t. At inference time, random noise \(z_T\) is sampled from \(\mathcal {N}(0,1)\) and iteratively denoised by the U-Net into the clean latent representation \(z_0\). Finally, the VAE decoder \(\mathcal {D}\) generates the final image by mapping the latent code back to pixel space: \(\hat{x} = \mathcal {D}(z_0)\).
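For concreteness, below is a minimal sketch of the training objective in Eq. 1, written against diffusers-style interfaces (AutoencoderKL, UNet2DConditionModel, DDPMScheduler, and a CLIP text encoder). The function name and the 0.18215 latent scaling factor are illustrative assumptions, not a reproduction of the actual training code.

```python
import torch
import torch.nn.functional as F

def sd_denoising_loss(vae, unet, text_encoder, scheduler, pixel_values, input_ids):
    """Single training step of Eq. 1: predict the noise added to a latent."""
    # Encode the image into the latent space (scaled as in Stable Diffusion v1).
    latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215

    # Sample Gaussian noise and a random timestep per example.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)

    # Forward diffusion: perturb the clean latents with noise at timestep t.
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Text conditioning psi(P) from the CLIP text encoder.
    cond = text_encoder(input_ids)[0]

    # The U-Net predicts the noise; the loss is the MSE in Eq. 1.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    return F.mse_loss(noise_pred, noise)
```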

3.2 Text-Conditioning via Cross-Attention Mechanism

In the SD model, the U-Net employs a cross-attention mechanism to denoise the latent code conditioned on text prompts. For simplicity, we use the single-head attention mechanism in our discussion. Let \(\mathcal {P}\) represent the text prompts with n tokens and \(\psi \) denote the text encoder, which is typically a pre-trained CLIP text encoder. The encoder converts \(\mathcal {P}\) into a list of d-dimensional embeddings, \(\psi (\mathcal {P}) = c \in \mathbb {R}^{n\times d}\). The cross-attention layer accepts the spatial latent code \(z \in \mathbb {R}^{(h \times w) \times f}\) and the text embeddings c as inputs, where h and w are the height and width of the 2-D latent code, and f is the number of dimensions of the latent space. It then projects the latent code and text embeddings into Query, Key, and Value matrices: \(Q = W^q z\), \(K = W^k c\), and \(V = W^v c\). Here, \(W^q \in \mathbb {R}^{f \times d'}, W^k, W^v \in \mathbb {R}^{d \times d'}\) represent the weight matrices of the three linear layers, and \(d'\) is the dimension of the Query, Key, and Value embeddings. The cross-attention layer then computes the attention scores \(A = \text {Softmax}(\frac{QK^T}{\sqrt{d'}}) \in [0,1]^{(h \times w) \times n}\), and takes a weighted sum over the Value matrix to obtain the cross-attention output \(z_\text {attn} = AV \in \mathbb {R}^{(h \times w) \times d'}\). Intuitively, the cross-attention mechanism “scatters” textual information to the 2D latent code space, and A[i, j, k] represents the amount of information flow from the k-th text token to the (i, j) latent pixel. Our method is based on this semantic interpretation of the cross-attention map, and we will discuss it in detail in Sect. 4.2.
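The computation above can be summarized by the following minimal single-head sketch; the function and variable names mirror the notation in this section and are not tied to any specific library.

```python
import torch

def cross_attention(z, c, W_q, W_k, W_v):
    """Single-head cross-attention from Sect. 3.2.

    z: (h*w, f) spatial latent code; c: (n, d) text embeddings;
    W_q: (f, d'); W_k, W_v: (d, d') projection weights.
    """
    Q = z @ W_q                      # (h*w, d')
    K = c @ W_k                      # (n, d')
    V = c @ W_v                      # (n, d')
    d_prime = Q.shape[-1]
    # A[i, k]: information flow from the k-th text token to latent pixel i.
    A = torch.softmax(Q @ K.T / d_prime ** 0.5, dim=-1)   # (h*w, n)
    return A @ V                     # (h*w, d'), the cross-attention output
```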

Fig. 4

In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merges their identities. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more clearly separated

4 FastComposer

4.1 Tuning-Free Subject-Driven Image Generation with an Image Encoder

4.1.1 Augmenting Text Representation with Subject Embedding

To achieve tuning-free subject-driven image generation, we propose to augment text prompts with visual features extracted from reference subject images. Given a text prompt \(\mathcal {P} = \{w_1, w_2, \dots w_n\}\), a list of reference subject images \(\mathcal {S} = \{s_1, s_2, \dots s_m\}\), and an index list indicating which subject corresponds to which word in the text prompt \(\mathcal {I} = \{i_1, i_2, \dots i_m\}, i_j \in \{1, 2, \dots , n\}\), we first encode the text prompt \(\mathcal {P}\) and reference subjects \(\mathcal {S}\) into embeddings using the pre-trained CLIP text and image encoders \(\psi \) and \(\phi \), respectively. Next, we employ a multilayer perceptron (MLP) to augment the text embeddings with visual features extracted from the reference subjects. We concatenate (represented by ||) the word embeddings with the visual features and feed the resulting augmented embeddings into the MLP. This process yields the final conditioning embeddings \(c' \in \mathbb {R}^{n\times d}\), defined as follows:

$$\begin{aligned} c'_{i} = {\left\{ \begin{array}{ll} \psi (\mathcal {P})_i, & i \notin \mathcal {I} \\ \text {MLP}(\psi (\mathcal {P})_i || \phi (s_{j})), & i = i_j\in \mathcal {I} \end{array}\right. } \end{aligned}$$
(2)

Figure 3 gives a concrete example of our augmentation approach.
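A minimal sketch of Eq. 2 is given below. The module name, MLP depth, and embedding dimensions (768 for the SD v1.5 text tokens, 1024 for the CLIP vision features) are illustrative assumptions rather than the exact architecture used in FastComposer.

```python
import torch
import torch.nn as nn

class SubjectAugmenter(nn.Module):
    """Sketch of Eq. 2: fuse a subject's image feature into its word embedding."""

    def __init__(self, text_dim=768, image_dim=1024, hidden_dim=768):
        super().__init__()
        # MLP mapping [word embedding || image feature] back to the text dimension.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, text_embeds, subject_feats, subject_token_idx):
        """
        text_embeds:       (n, d)     token embeddings psi(P)
        subject_feats:     (m, d_img) image features phi(s_j)
        subject_token_idx: list of m token positions (the index list I)
        """
        c_aug = text_embeds.clone()
        for j, i in enumerate(subject_token_idx):
            fused = torch.cat([text_embeds[i], subject_feats[j]], dim=-1)
            c_aug[i] = self.mlp(fused)          # augmented token c'_i
        return c_aug
```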

4.1.2 Subject-Driven Image Generation Training

To enable inference-only subject-driven image generation, we train the image encoder, the MLP module, and the U-Net with the denoising loss (Fig. 3). We create a subject-augmented image-text paired dataset to train our model, where noun phrases from image captions are paired with subject segments appearing in the target images. We first use a dependency parsing model to chunk all noun phrases (e.g., “a woman”) in image captions and a panoptic segmentation model to segment all subjects present in the image. We then pair these subject segments with corresponding noun phrases in the captions with a greedy matching algorithm based on text and image similarity (Radford et al., 2021; Reimers & Gurevych, 2019). The process of constructing the subject-augmented image-text dataset is detailed in Sect. 5.1.1. In the training phase, we employ subject-augmented conditioning, as outlined in Eq. 2, to denoise the perturbed target image. We also mask the subjects’ backgrounds with random noise before encoding, preventing overfitting to the subjects’ backgrounds. Consequently, FastComposer can directly use natural subject images during inference without explicit background segmentation.

4.2 Localizing Cross-Attention Maps with Subject Segmentation Masks

We observe that traditional cross-attention maps tend to attend to all subjects at the same time, which leads to identity blending in multi-subject image generation (Fig. 4top). We propose to localize cross-attention maps with subject segmentation masks during training to solve this issue.

Fig. 5

Effects of using different ratios of timesteps for subject conditioning. A ratio between 0.6 and 0.8 yields good results and achieves a balance between prompt consistency and identity preservation

4.2.1 Understanding the Identity Blending in Diffusion Models

Prior research (Hertz et al., 2023) shows that the cross-attention mechanism within diffusion models governs the layout of generated images. The scores in cross-attention maps represent “the amount of information flow from a text token to a latent pixel.” We hypothesize that identity blending arises from the unrestricted cross-attention mechanism, as a single latent pixel can attend to all text tokens. If one subject’s region attends to multiple reference subjects, identity blending occurs. In Fig. 4, we confirm our hypothesis by visualizing the average cross-attention map within the second up-sampling layer of the U-Net of the diffusion model. In the unregularized model, two reference subject tokens often influence the same generated person at the same time, causing a mix of features from both subjects. We argue that proper cross-attention maps should resemble an instance segmentation of the target image, clearly separating the features related to different subjects. To achieve this, we add a regularization term to the subject cross-attention maps during training to encourage focusing on specific instance areas. Segmentation maps and cross-attention regularization are only used during training, not at test time.

4.2.2 Localizing Cross-Attention with Segmentation Masks

As discussed in Sect. 3.2, a cross-attention map \(A \in [0,1]^{(h \times w) \times n}\) connects latent pixels to conditional embeddings at each layer, where A[i, j, k] denotes the information flow from the k-th conditional token to the (i, j) latent pixel. Ideally, the subject token’s attention map should focus solely on the subject region rather than spreading throughout the entire image, preventing identity blending among subjects. To accomplish this, we propose localizing the cross-attention map using the reference subject’s segmentation mask. Let \(\mathcal {M} = \{M_1, M_2, \dots M_m\}\) represent the reference subjects’ segmentation masks, \(\mathcal {I} = \{i_1, i_2, \dots i_m\}\) be the index list indicating which subject corresponds to each word in the text prompt, and \(A_{i} = A[:,:,i] \in [0,1]^{(h \times w)}\) be the cross-attention map of the i-th subject token. We supervise the cross-attention map \(A_{i_j}\) to be close to the segmentation mask \(M_j\) of the j-th subject, i.e., \(A_{i_j} \approx M_j\). We employ a balanced L1 loss to minimize the distance between the cross-attention map and the segmentation mask:

$$\begin{aligned} \mathcal {L}_{\text {loc}} = \frac{1}{m}\sum _{j=1}^m (\text {mean}(A_{i_j}[\bar{M}_j]) - \text {mean}(A_{i_j}[M_j])), \end{aligned}$$
(3)

where \(\bar{M}_j\) denotes the complement of the segmentation mask \(M_j\) (0s become 1s and vice versa). The final training objective of FastComposer is given by:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {noise}} + \lambda \mathcal {L}_{\text {loc}}, \end{aligned}$$
(4)

where the localization loss is weighted by the hyperparameter \(\lambda = 0.001\). Motivated by prior work (Hertz et al., 2023; Chefer et al., 2023), we apply the localization loss to the downsampled cross-attention maps, i.e., the middle 5 blocks of the U-Net, which are known to contain more semantic information. As illustrated in Fig. 4, our localization technique enables the model to precisely allocate attention to reference subjects at test time, which prevents identity blending between subjects.
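A minimal sketch of the localization objective in Eqs. 3 and 4 is shown below, assuming the subject-token cross-attention maps have already been gathered and resized to the mask resolution; variable names are illustrative.

```python
import torch

def balanced_l1_localization_loss(attn_maps, masks):
    """Eq. 3: pull each subject token's attention inside its segmentation mask.

    attn_maps: (m, h, w) cross-attention maps A_{i_j} for the m subject tokens
    masks:     (m, h, w) binary segmentation masks M_j at the same resolution
    """
    # Mean attention inside the mask (rewarded) and outside the mask (penalized).
    inside = (attn_maps * masks).sum(dim=(1, 2)) / masks.sum(dim=(1, 2)).clamp(min=1)
    outside = (attn_maps * (1 - masks)).sum(dim=(1, 2)) / (1 - masks).sum(dim=(1, 2)).clamp(min=1)
    return (outside - inside).mean()

# Final objective (Eq. 4): loss = noise_loss + 0.001 * balanced_l1_localization_loss(...)
```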

4.3 Delayed Subject Conditioning in Iterative Denoising

During inference, using the augmented text representation directly often leads to images that closely resemble the subjects while ignoring the textual directives. This occurs because the image layout forms at the early phases of the denoising process, and premature augmentation from the reference image causes the resulting image to stray from the text instructions. Prior methods (Gal et al., 2023b; Roich et al., 2022) mitigate this issue by generating an initial latent code and refining it through iterative model fine-tuning. However, this process is resource-intensive and requires high-end devices for model fine-tuning. Inspired by Style Mixing (Karras et al., 2019), we propose a simple delayed subject conditioning scheme, which enables inference-only subject conditioning while striking a balance between identity preservation and editability.

Specifically, we perform image augmentation only after the layout has been created using a text-only prompt. In this framework, our time-dependent noise prediction model can be represented as:

$$\begin{aligned} \epsilon _t = {\left\{ \begin{array}{ll} \epsilon _\theta (z_t, t, c) & \text {if } t > \alpha T, \\ \epsilon _\theta (z_t, t, c') & \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

where c denotes the original text embedding, \(c'\) denotes the text embedding augmented with the input image embedding, and \(\alpha \) is a hyperparameter indicating the ratio of subject conditioning. We ablate the effect of using different \(\alpha \) values in Fig. 5. Empirically, \(\alpha \in [0.6, 0.8]\) yields good results that balance prompt consistency and identity preservation, though it can be easily tuned for specific instances.
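The following is a minimal sketch of Eq. 5 inside a diffusers-style sampling loop; the number of inference steps and the function names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def denoise_with_delayed_conditioning(unet, scheduler, z_T, c_text, c_subject, alpha=0.7):
    """Eq. 5: text-only conditioning early (layout), subject-augmented late (appearance)."""
    scheduler.set_timesteps(50)
    T = scheduler.config.num_train_timesteps
    z = z_T
    for t in scheduler.timesteps:
        # Early steps (large t > alpha*T) use the text-only embedding c;
        # later steps switch to the subject-augmented embedding c'.
        cond = c_text if t > alpha * T else c_subject
        noise_pred = unet(z, t, encoder_hidden_states=cond).sample
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z
```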

5 Experiments

5.1 Setup

5.1.1 Dataset Construction

We build a subject-augmented image-text paired dataset based on the FFHQ-wild (Karras et al., 2019) dataset to train our models. First, we use the BLIP-2 model (Li et al., 2023a) blip2-opt-6.7b-coco to generate captions for all images. Next, we employ the Mask2Former model (Cheng et al., 2022) mask2former-swin-large-coco-panoptic to generate panoptic segmentation masks for each image. We then leverage the spaCy (Honnibal & Montani, 2017) library to chunk all noun phrases in the image captions and expand numbered plural phrases (e.g., “two women”) into singular phrases connected by “and” (e.g., “a woman and a woman”). Finally, we use a greedy matching algorithm to match noun phrases with image segments, scoring each pair by the product of the image-text similarity from the OpenCLIP model (Ilharco et al., 2021) CLIP-ViT-H-14-laion2B-s32B-b79K and the label-text similarity from the Sentence-Transformer (Reimers & Gurevych, 2019) model stsb-mpnet-base-v2, as sketched below. We reserve 1000 images for validation and testing purposes.
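A minimal sketch of the greedy matching step follows. The two scoring functions stand in for the OpenCLIP image-text similarity and the Sentence-Transformer label-text similarity named above; the tie-breaking details are simplified assumptions.

```python
def greedy_match(phrases, segments, image_text_sim, label_text_sim):
    """Greedily pair caption noun phrases with image segments by combined similarity.

    image_text_sim(phrase, segment): similarity of the phrase and the segment crop.
    label_text_sim(phrase, segment): similarity of the phrase and the segment label.
    """
    # Score every (phrase, segment) pair by the product of the two similarities.
    candidates = [
        (image_text_sim(p, s) * label_text_sim(p, s), i, j)
        for i, p in enumerate(phrases)
        for j, s in enumerate(segments)
    ]
    candidates.sort(reverse=True)

    matches, used_phrases, used_segments = [], set(), set()
    for score, i, j in candidates:
        if i in used_phrases or j in used_segments:
            continue                       # each phrase / segment matched at most once
        matches.append((phrases[i], segments[j], score))
        used_phrases.add(i)
        used_segments.add(j)
    return matches
```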

5.1.2 Training Details

We start training from the StableDiffusion v1-5 (Rombach et al., 2022) model. To encode the visual inputs, we use OpenAI’s clip-vit-large-patch14 vision model, which serves as the partner model of the text encoder in SDv1-5. During training, we freeze the text encoder and only train the U-Net, the MLP module, and the last two transformer blocks of the vision encoder. We train our models for 150k steps on 8 NVIDIA A6000 GPUs, with a constant learning rate of 1e-5 and a batch size of 128. We only augment segments whose COCO (Lin et al., 2014) label is “person” and set a maximum of 4 reference subjects during training, with each subject having a 10% chance of being dropped. To maintain the model’s capability for text-only generation, we train on text-only conditioning for 10% of the samples. To facilitate classifier-free guidance sampling (Ho & Salimans, 2022), we train without any condition on 10% of the instances. For half of the training samples, we apply the denoising loss only in the subject region to enhance the generation quality in the subject area. The conditioning-dropout scheme is sketched below.
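In the following sketch, the probabilities match the description above, while the helper name and code organization are illustrative assumptions.

```python
import random

def sample_conditioning(text_embeds, augmented_embeds, null_embeds,
                        p_uncond=0.1, p_text_only=0.1):
    """Choose the conditioning for one training sample.

    10% of samples are unconditional (for classifier-free guidance),
    10% use text-only conditioning, and the rest use subject-augmented
    conditioning (Eq. 2, with each subject independently dropped 10% of the time).
    """
    r = random.random()
    if r < p_uncond:
        return null_embeds          # unconditional training
    if r < p_uncond + p_text_only:
        return text_embeds          # text-only conditioning
    return augmented_embeds         # subject-augmented conditioning
```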

Table 1 Comparison between our method and baseline approaches on single-subject image generation

5.1.3 Evaluation Metrics

We evaluate image generation quality on identity preservation and prompt consistency. Identity preservation is determined by detecting faces in the reference and generated images using MTCNN (Zhang et al., 2016), and then calculating a pairwise identity similarity using FaceNet (Schroff et al., 2015). For multi-subject evaluation, we identify all faces within the generated images and use a greedy matching procedure between the generated faces and reference subjects. The minimum similarity value among all subjects measures overall identity preservation. We evaluate the prompt consistency using the average CLIP-L/14 image-text similarity following textual-inversion (Gal et al., 2023a). For efficiency evaluation, we consider the total time for customization, including fine-tuning (for tuning-based methods) and inference. We also measure peak memory usage during the entire procedure.
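A sketch of the identity-preservation metric is shown below. It assumes the facenet-pytorch implementations of MTCNN and FaceNet with their default settings; only the underlying detection and embedding methods are specified above, so the library choice is an assumption.

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                                 # face detector
facenet = InceptionResnetV1(pretrained='vggface2').eval()     # face embedding model

def face_embedding(pil_image):
    """Detect a face and return its FaceNet embedding (None if no face found)."""
    face = mtcnn(pil_image)
    if face is None:
        return None
    with torch.no_grad():
        return facenet(face.unsqueeze(0))[0]

def identity_similarity(reference_img, generated_img):
    """Cosine similarity between reference and generated face embeddings."""
    e_ref, e_gen = face_embedding(reference_img), face_embedding(generated_img)
    if e_ref is None or e_gen is None:
        return 0.0
    return torch.nn.functional.cosine_similarity(e_ref, e_gen, dim=0).item()
```

For multi-subject images, all detected faces are greedily matched to the reference subjects with this similarity, and the minimum matched similarity is reported, as described above.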

Fig. 6

Comparison of different methods on single subject image generation. For text-only methods (i.e., StableDiffusion and Midjourney), we use names of scientists or generic terms like man and woman in the text prompt

5.2 Single-Subject Image Generation

Our first evaluation targets the performance of single-subject image generation. We benchmark our approach against leading optimization-based approaches, including DreamBooth (Ruiz et al., 2023), Textual-Inversion (Gal et al., 2023a), and Custom Diffusion (Kumari et al., 2023). Additionally, we include a comparison with ELITE (Wei et al., 2023), a concurrent tuning-free method. We use the implementations from the diffusers library (von Platen et al., 2022) and provide the detailed hyperparameters in the appendix. We assess the capabilities of these methods in generating personalized content for unseen subjects derived from the Celeb-A dataset (Liu et al., 2015). To construct our evaluation benchmark, we develop a broad range of text prompts encapsulating a wide spectrum of scenarios, such as recontextualization, stylization, accessorization, and diverse actions. The entire test set comprises 15 subjects, with 30 unique text prompts allocated to each. An exhaustive list of text prompts is available in the appendix. We use five images per subject to fine-tune the optimization-based methods, given our observation that these methods overfit and simply reproduce the reference image when a single reference image is used. In contrast, our model employs a single randomly selected image for each subject. As shown in Table 1, FastComposer surpasses all baselines, delivering superior identity preservation and prompt consistency. Remarkably, in comparison to optimization-based methods, it achieves a 300\(\times \)–1200\(\times \) speedup and a 2.8\(\times \) reduction in memory usage. Figure 6 shows qualitative results of single-subject personalization, employing different approaches across an array of prompts. Significantly, our model matches the text consistency of text-only methods and exceeds all baseline strategies in terms of identity preservation, using only a single reference image and forward passes.

Table 2 Comparison between our method and baseline approaches on multiple-subject image generation

5.3 Multi-Subject Image Generation

We then consider a more complex setting: multi-object, subject-driven image generation. We examine the quality of multi-subject generation by using all possible combinations (105 pairs in total) formed from the 15 subjects described in Sect. 5.2, allocating 21 prompts to each pair for assessment. Table 2 shows a quantitative analysis contrasting FastComposer with the baseline methods. Optimization-based methods (Kumari et al., 2023; Ruiz et al., 2023; Gal et al., 2023a) frequently falter in maintaining identity preservation, often generating generic images or images that blend identities among different reference subjects. FastComposer, on the other hand, preserves the unique features of different subjects, yielding a significantly improved identity preservation score. Furthermore, our prompt consistency is on par with tuning-based approaches (Gal et al., 2023a; Kumari et al., 2023). Qualitative comparisons are shown in Fig. 1.

5.4 Generated Image Quality and Diversity

5.4.1 Automatic Evaluation

In Tables 3 and 4, we compare the quality and diversity of the generated images between our method and the baseline methods in the single- and multi-subject settings, respectively. We use ImageReward (Xu et al., 2024) and Aesthetic Score (Schuhmann et al., 2022) to measure image quality. To evaluate diversity, we adopt a diversity score from Kang et al. (2024), generating four images from the same prompt and measuring the inter-seed pairwise distance. Our approach achieves image quality and diversity comparable to previous baselines, while enabling significantly better subject-driven generation (see Tables 1 and 2).

Table 3 Comparison of image quality and diversity between our method and baseline approaches on single-subject image generation
Table 4 Comparison of image quality and diversity between our method and baseline approaches on multi-subject image generation
Fig. 7

User study comparing our model with competing baselines (Wei et al., 2023; Kumari et al., 2023) for single subject image generation

Fig. 8

User study comparing our model with competing baselines (Kumari et al., 2023; Gal et al., 2023a) for multiple subject image generation

5.4.2 Human Evaluation

To further verify our method’s effectiveness, we conducted an extensive user study comparing our model’s outputs with those from competing methods. We used a subset of 100 (subject, prompt) pairs from our evaluation dataset. For each comparison, a random set of three evaluators was asked to choose the image that better represents the text prompt or the reference subjects. Detailed information about the human evaluation is included in Appendix D.

As shown in Figs. 7 and 8, our model achieves much higher subject consistency while maintaining comparable prompt alignment. Notably, in multi-subject generation, our model outperforms competing methods like custom diffusion (Kumari et al., 2023) and textual inversion (Gal et al., 2023a) over 80% of the time. While Custom Diffusion (Kumari et al., 2023) exhibits higher prompt alignment than our approach, it suffers from significantly worse identity preservation. This suggests that the improved alignment is mostly due to random outputs that ignore the reference.

5.5 Extending FastComposer Beyond Human Faces

In Figs. 9 and 10, we apply FastComposer on the LSUN-cat (Yu et al., 2015) dataset and successfully generate high-quality single and multiple cat images without subject-specific fine-tuning.

Fig. 9

FastComposer can generate subjects beyond people. Here we show single-cat image generation results

Fig. 10

FastComposer’s multiple-cat image generation results

Table 5 Comparison between our method and baseline approaches on single-cat image generation
Table 6 Comparison between our method and baseline approaches on multiple-cat image generation

In Tables 5 and 6, we show the quantitative results of single- and multiple-cat image generation, comparing against baselines. To assess identity preservation, we compute the pairwise DINO distance (Caron et al., 2021) between the cats detected in the generated images and the reference subjects. We use six cats from the Custom Diffusion (Kumari et al., 2023) evaluation dataset as the testing subjects. The results demonstrate that our method excels in generating cat images without the need for model fine-tuning at deployment time. Previous methods often overfit to the reference subjects (e.g., DreamBooth) or fail to effectively learn subject conditioning (e.g., Custom Diffusion). Additionally, these methods frequently either merge attributes from different subjects or completely ignore the second subject in multi-subject image generation. Our approach strikes a favorable balance between preserving identity and adhering to prompts. With additional computational resources and training data, extending to more categories of subjects is possible, which we leave for future work.

5.6 Combining with ControlNet

In Fig. 11, we combine FastComposer with a pre-trained ControlNet model (Zhang et al., 2023), illustrating our technique’s ability to generate subject-specific images guided by specified poses. This integration highlights the flexibility and additional control that FastComposer offers.

Fig. 11

FastComposer combined with ControlNet can generate subject images following pose guides without fine-tuning

5.7 Taking Multiple Inputs for a Subject

Figure 12 shows that FastComposer can take multiple reference inputs for a single subject, which facilitates the creation of images with a broader spectrum of variations while mitigating the risk of over-reliance on a singular reference point. By averaging the subject embeddings derived from various reference images, our approach yields images with pronounced local variations. Notably, taking averaged subject embeddings results in discernible differences in aspects such as gaze direction and head positioning, demonstrating the nuanced capabilities of our model when supplied with an enriched data input.
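A minimal sketch of this embedding averaging is shown below, assuming an image_encoder callable that returns the identity feature \(\phi (s_j)\) for one reference image; the function name is illustrative.

```python
import torch

def average_subject_embedding(image_encoder, reference_images):
    """Average the identity embeddings of several references of the same subject."""
    feats = torch.stack([image_encoder(img) for img in reference_images], dim=0)
    return feats.mean(dim=0)   # used in place of phi(s_j) in Eq. 2
```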

Fig. 12

FastComposer supports taking multiple inputs of a single subject. Multiple inputs lead to enhanced local variations, such as differences in gaze direction (upper row) and head pose (bottom row)

Table 7 Ablation studies on the cross-attention localization supervision

5.8 Ablation Study

5.8.1 Delayed Subject Conditioning

Figure 5 shows the impact of varying the ratio of timesteps devoted to subject conditioning, a hyperparameter in our delayed subject conditioning approach. As this ratio increases, the model improves in identity preservation but loses editability. A ratio between 0.6 and 0.8 achieves a favorable balance on the tradeoff curve.

5.8.2 Cross-Attention Localization Loss

Table 7 presents the ablation studies on our proposed cross-attention localization loss. The baseline is trained in the same setting but excludes the localization loss. Our method demonstrates a substantial enhancement of the identity preservation score. Figure 4 shows the qualitative comparisons. Incorporating the localization loss allows the model to focus on particular reference subjects, thereby avoiding identity blending.

Table 8 Ablation studies on architectural configurations of FastComposer

5.8.3 Architectural Configurations

Table 8 presents the ablation study results, illustrating the impact of various architectural configurations on the performance of FastComposer. The configurations evaluated include: (1) training only the cross-attention module in the U-Net of the StableDiffusion model, (2) freezing the image encoder during training, and (3) replacing the subject token entirely with the image token instead of using a multi-layer perceptron (MLP) module to fuse the information. We use single-person image generation to measure model performance across these configurations. The experimental results show that all alternative configurations consistently underperform compared to the final architecture of FastComposer. Specifically, training only the cross-attention module yields moderate identity preservation but lower prompt consistency. Freezing the image encoder further reduces identity preservation without significant gains in prompt consistency. Replacing the subject token with the image token drastically diminishes identity preservation, indicating the importance of the MLP module for effective information fusion.

5.8.4 Number of Subjects

Beyond generating two subjects, we test the limits of FastComposer by generating 3 and 4 subjects in one image; the results are shown in Figs. 13 and 14. We find that FastComposer can generate at most three subjects in a single image, and generating more than three subjects fails (e.g., the output might only display 3 subjects, with one of them exhibiting blended features from two subjects). This limitation arises from the long-tail distribution of the number of people per image in the FFHQ-wild dataset, where images with more than three people are rare (less than 3%). We believe training on a larger dataset with a more balanced distribution of subject counts can mitigate this issue.

Fig. 13

Generating images with three subjects

Fig. 14

Generating images with four subjects. Our method fails due to subject fusion (merging multiple subjects into one) almost all the time

6 Conclusion

We propose FastComposer, a tuning-free method for personalized, multi-subject text-to-image generation. We achieve tuning-free subject-driven image generation by utilizing a pre-trained vision encoder, making this process efficient and accessible across various platforms. We also propose a novel delayed subject conditioning technique to balance the preservation of subject identity and the flexibility of image editability. Additionally, FastComposer effectively tackles the identity blending issue in multi-subject generation by supervising cross-attention maps during pre-training. FastComposer enables the creation of high-quality, personalized, and complex images, opening up new possibilities in the field of text-to-image generation.