Abstract
Denoising diffusion probabilistic models (DDPMs) have proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. However, they remain vulnerable to overfitting when fine-tuned on limited data. Existing works have explored subject-driven generation with text-to-image (T2I) models using only a few samples, but effective and stable data-efficient methods for synthesizing images in specific domains (e.g., styles or properties) are still lacking; the task remains challenging due to the ambiguities inherent in natural language and out-of-distribution effects. This paper introduces DomainStudio, a few-shot fine-tuning approach for domain-driven image generation that retains the subjects provided by the prior knowledge of pre-trained models and adapts them to the domain characterized by the training data, pursuing both high quality and great diversity. We propose to preserve the image-level relative distances between adapted samples and to enhance the learning of high-frequency details from both pre-trained models and training samples. DomainStudio is compatible with both unconditional and T2I DDPMs. It achieves better results than current state-of-the-art GAN-based approaches on unconditional few-shot image generation, and outperforms existing few-shot fine-tuning methods for modern large-scale T2I diffusion models, such as Textual Inversion and DreamBooth, in synthesizing samples from specific domains characterized by few-shot training data.
References
Ahn, N., Lee, J., Lee, C., Kim, K., Kim, D., Nam, S.-H., & Hong, K. (2024) Dreamstyler: Paint by style inversion with text-to-image diffusion models. In: AAAI
Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., & Catanzaro, B., et al. (2022) ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324
Bar-Tal, O., Yariv, L., Lipman, Y., & Dekel, T. (2023) Multidiffusion: Fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning
Brock, A., Donahue, J., & Simonyan, K. (2019) Large scale GAN training for high fidelity natural image synthesis. In: ICLR
Cai, M., Zhang, H., Huang, H., Geng, Q., Li, Y., & Huang, G. (2021) Frequency domain image translation: More photo-realistic, better identity-preserving. In: ICCV, pp. 13930–13940
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W.T., & Rubinstein, M., et al. (2023) Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704
Chang, H., Zhang, H., Jiang, L., Liu, C., & Freeman, W.T. (2022) Maskgit: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325
Chen, M., Laina, I., & Vedaldi, A. (2024) Training-free layout control with cross-attention guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5343–5353
Cheng, J., Liang, X., Shi, X., He, T., Xiao, T., & Li, M. (2023) Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation. arXiv preprint arXiv:2302.08908
Chong, M.J., & Forsyth, D. (2022) Jojogan: One shot face stylization. In: European Conference on Computer Vision, pp. 128–152. Springer
Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., & Raff, E. (2022) Vqgan-clip: Open domain image generation and editing with natural language guidance. In: Proceedings of the European Conference on Computer Vision, pp. 88–105. Springer
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5), 961–1005.
Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., & Xu, C. (2022) Stytr2: Image style transfer with transformers. In: CVPR, pp. 11326–11336
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., & Yang, H. (2021). Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 19822–19835.
Everaert, M. N., Bocchio, M., Arpa, S., Süsstrunk, S., & Achanta, R. (2023) Diffusion in style. In: ICCV, pp. 2251–2261
Frenkel, Y., Vinker, Y., Shamir, A., & Cohen-Or, D. (2024) Implicit style-content separation using b-lora. In: European Conference on Computer Vision, pp. 181–198. Springer
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022) Make-a-scene: Scene-based text-to-image generation with human priors. In: Proceedings of the European Conference on Computer Vision, pp. 89–106. Springer
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2023) An image is worth one word: Personalizing text-to-image generation using textual inversion. In: The Eleventh International Conference on Learning Representations
Gal, R., Hochberg, D. C., Bermano, A., & Cohen-Or, D. (2021). Swagan: A style-based wavelet-driven generative model. ACM Transactions on Graphics (TOG), 40(4), 1–11.
Gal, R., Patashnik, O., Maron, H., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4), 1–13.
Gatys, L.A., Ecker, A.S., & Bethge, M. (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576
Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., & Shlens, J. (2017) Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27
Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., & Wu, W., et al. (2024) Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems 36
Hertz, A., Voynov, A., Fruchter, S., & Cohen-Or, D. (2024) Style aligned image generation via shared attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4775–4785
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30
Hinz, T., Heinrich, S., & Wermter, S. (2020). Semantic object accuracy for generative text-to-image synthesis. IEEE transactions on pattern analysis and machine intelligence, 44(3), 1552–1565.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021) Lora: Low-rank adaptation of large language models. In: ICLR
Hu, T., Zhang, J., Liu, L., Yi, R., Kou, S., Zhu, H., Chen, X., Wang, Y., Wang, C., & Ma, L. (2023) Phasic content fusing diffusion model with directional distribution consistency for few-shot model adaption. In: ICCV, pp. 2406–2415
Huang, J., Cui, K., Guan, D., Xiao, A., Zhan, F., Lu, S., Liao, S., & Xing, E. (2022). Masked generative adversarial networks are data-efficient generation learners. Advances in Neural Information Processing Systems, 35, 2154–2167.
Karras, T., Laine, S., & Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In: CVPR, pp. 4401–4410
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020) Analyzing and improving the image quality of stylegan. In: CVPR, pp. 8110–8119
Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020). Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33, 12104–12114.
Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., & Aila, T. (2021). Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 852–863.
Kim, G., Kwon, T., & Ye, J. C. (2022) Diffusionclip: Text-guided diffusion models for robust image manipulation. In: CVPR, pp. 2426–2435
Kingma, D. P., & Welling, M. (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114
Kingma, D., Salimans, T., Poole, B., & Ho, J. (2021). Variational diffusion models. Advances in Neural Information Processing Systems, 34, 21696–21707.
Krizhevsky, A., & Hinton, G. (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., & Zhu, J.-Y. (2023) Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1931–1941
Kwon, G., & Ye, J.C. (2023) One-shot adaptation of gan in just one clip. TPAMI
Li, J., Li, D., Xiong, C., & Hoi, S. (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR
Li, Y., Liu, H., Wen, Y., & Lee, Y. J. (2023) Generate anything anywhere in any scene. arXiv preprint arXiv:2306.17154
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., & Lee, Y.J. (2023) Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521
Li, B., Qi, X., Lukasiewicz, T., & Torr, P. (2019) Controllable text-to-image generation. Advances in Neural Information Processing Systems 32
Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., & Gao, J. (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12174–12182
Li, Y., Zhang, R., Lu, J., & Shechtman, E. (2020). Few-shot image generation with elastic weight consolidation. Advances in Neural Information Processing Systems, 33, 15885–15896.
Liang, H., Zhang, W., Li, W., Yu, J., & Xu, L. (2024) Intergen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision, 1–21
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022) Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471
Ma, J., Liang, J., Chen, C., & Lu, H. (2024) Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In: ACM SIGGRAPH 2024 Conference Papers, pp. 1–12
Mo, S., Cho, M., & Shin, J. (2020) Freeze the discriminator: A simple baseline for fine-tuning gans. In: CVPR AI for Content Creation Workshop
Moon, S.-J., Kim, C., & Park, G.-M. (2022) Wagi: wavelet-based gan inversion for preserving high-frequency image details. arXiv preprint arXiv:2210.09655
Nichol, A. Q., & Dhariwal, P. (2021) Improved denoising diffusion probabilistic models. In: ICML, pp. 8162–8171. PMLR
Noguchi, A., & Harada, T. (2019) Image generation from small datasets via batch statistics adaptation. In: ICCV, pp. 2750–2758
Ojha, U., Li, Y., Lu, J., Efros, A. A., Lee, Y. J., Shechtman, E., & Zhang, R. (2021) Few-shot image generation via cross-domain correspondence. In: CVPR, pp. 10743–10752
Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., & Graves, A., et al. (2016) Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems 29
Park, J., & Kim, Y. (2022) Styleformer: Transformer based generative adversarial networks with style vector. In: CVPR, pp. 8983–8992
Phung, H., Dao, Q., & Tran, A. (2023) Wavelet diffusion models are fast and scalable image generators. In: CVPR, pp. 10199–10208
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2024) Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: International Conference on Learning Representations
Qiao, T., Zhang, J., Xu, D., & Tao, D. (2019) Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in neural information processing systems 32
Radford, A., Kim, J. W., Hallacy, C., et al. (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR
Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., & Barron, J. (2023) Dreambooth3d: Subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2349–2359
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021) Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR, pp. 22500–22510
Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., & Aberman, K. (2024) Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6527–6536
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., & Salimans, T. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4713–4726.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., & Wortsman, M. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, 25278–25294.
Schwarz, K., Liao, Y., & Geiger, A. (2021). On the frequency bias of generative models. Advances in Neural Information Processing Systems, 34, 18126–18136.
Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., & Jampani, V. (2024) Ziplora: Any subject in any style by effectively merging loras. In: European Conference on Computer Vision, pp. 422–438. Springer
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, pp. 2256–2265. PMLR
Sohn, K., Ruiz, N., Lee, K., Chin, D. C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., & Li, Y. (2023) Styledrop: text-to-image generation in any style. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 66860–66889
Sohn, K., Shaw, A., Hao, Y., Zhang, H., Polania, L., Chang, H., Jiang, L., & Essa, I. (2023) Learning disentangled prompts for compositional image synthesis. arXiv preprint arXiv:2306.00763
Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., & Bao, B. (2020) Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865
Tran, N.-T., Tran, V.-H., Nguyen, N.-B., Nguyen, T.-K., & Cheung, N.-M. (2021). On data augmentation for gan training. IEEE TIP, 30, 1882–1897.
Tumanyan, N., Bar-Tal, O., Bagon, S., & Dekel, T. (2022) Splicing vit features for semantic appearance transfer. In: CVPR, pp. 10748–10757
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017) Attention is all you need. Advances in neural information processing systems 30
Wang, Y., Gonzalez-Garcia, A., Berga, D., Herranz, L., Khan, F. S., & Weijer, J.v.d. (2020) Minegan: Effective knowledge transfer from gans to target domains with few images. In: CVPR, pp. 9332–9341
Wang, H., Spinelli, M., Wang, Q., Bai, X., Qin, Z., & Chen, A. (2024) Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733
Wang, Y., Wu, C., Herranz, L., Weijer, J., Gonzalez-Garcia, A., & Raducanu, B. (2018) Transferring gans: Generating images from limited data. In: ECCV, pp. 218–234
Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024) Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 1–21
Wang, Z., Chi, Z., & Zhang, Y. (2022). Fregan: exploiting frequency components for training gans under limited data. Advances in Neural Information Processing Systems, 35, 33387–33399.
Xiao, J., Li, L., Wang, C., Zha, Z.-J., & Huang, Q. (2022) Few shot generative model adaption via relaxed spatial structural alignment. In: CVPR, pp. 11204–11213
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M.Z. (2023) Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452–7461
Xie, S., Zhang, Z., Lin, Z., Hinz, T., & Zhang, K. (2023) Smartbrush: Text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22428–22437
Xu, Y., Tang, F., Cao, J., Zhang, Y., Deussen, O., Dong, W., Li, J., & Lee, T.-Y. (2024) Break-for-make: Modular low-rank adaptations for composable content-style customization. arXiv preprint arXiv:2403.19456
Xu, Y., Wang, Z., Xiao, J., Liu, W., & Chen, L. (2024) Freetuner: Any subject in any style with training-free diffusion. arXiv preprint arXiv:2405.14201
Xue, H., Huang, Z., Sun, Q., Song, L., & Zhang, W. (2023) Freestyle layout-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14256–14266
Yang, C., Shen, Y., Zhang, Z., Xu, Y., Zhu, J., Wu, Z., & Zhou, B. (2023) One-shot generative domain adaptation. In: ICCV, pp. 7733–7742
Yang, M., Wang, Z., Chi, Z., & Feng, W. (2022) Wavegan: Frequency-aware gan for high-fidelity few-shot image generation. In: ECCV, pp. 1–17. Springer
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., & Xiao, J. (2015) Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365
Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., & Xu, C. (2023) Inversion-based style transfer with diffusion models. In: CVPR, pp. 10146–10156
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR, pp. 586–595
Zhang, Z., Liu, Y., Han, C., Guo, T., Yao, T., & Mei, T. (2022) Generalized one-shot domain adaption of generative adversarial networks. Advances in Neural Information Processing Systems
Zhang, L., Rao, A., & Agrawala, M. (2023) Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847
Zhang, Z., Xie, Y., & Yang, L. (2018) Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6199–6208
Zhang, Y., Yao, M., Wei, Y., Ji, Z., Bai, J., & Zuo, W. (2022) Towards diverse and faithful one-shot adaption of generative adversarial networks. Advances in Neural Information Processing Systems
Zhao, Y., Chandrasegaran, K., Abdollahzadeh, M., & Cheung, N.-M. (2022) Few-shot image generation via adaptation-aware kernel modulation. Advances in Neural Information Processing Systems
Zhao, Y., Ding, H., Huang, H., & Cheung, N. -M. (2022) A closer look at few-shot image generation. In: CVPR, pp. 9140–9150
Zhao, Y., Du, C., Abdollahzadeh, M., Pang, T., Lin, M., Yan, S., & Cheung, N.-M. (2023) Exploring incompatible knowledge transfer in few-shot image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Zhao, Z., Zhang, Z., Chen, T., Singh, S., & Zhang, H. (2020) Image augmentations for gan training. arXiv preprint arXiv:2006.02595
Zhao, S., Liu, Z., Lin, J., Zhu, J.-Y., & Han, S. (2020). Differentiable augmentation for data-efficient gan training. Advances in Neural Information Processing Systems, 33, 7559–7570.
Zheng, G., Zhou, X., Li, X., Qi, Z., Shan, Y., & Li, X. (2023) Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22490–22499
Zhu, P., Abdal, R., Femiani, J., & Wonka, P. (2021) Mind the gap: Domain gap control for single shot domain adaptation for generative adversarial networks. In: ICLR
Zhu, J., Li, S., Liu, Y., Huang, P., Shan, J., Ma, H., & Yuan, J. (2024) Odgen: Domain-specific object detection data generation with diffusion models. In: Proceedings of the 38th International Conference on Neural Information Processing Systems
Appendices
Appendix A Unconditional Source Models
We train DDPMs from scratch on FFHQ \(256^2\) (Karras et al., 2020) and LSUN Church \(256^2\) (Yu et al., 2015) for 300K and 250K iterations, respectively, as source models for DDPM adaptation; training takes 5 days and 22 hours for the former and 4 days and 22 hours for the latter on 8 NVIDIA RTX A6000 GPUs. We randomly sample 1000 images from these two models to evaluate their generation diversity using the average pairwise LPIPS (Zhang et al., 2018) metric, as shown in Table 9.
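To make the diversity metric concrete, the following is a minimal sketch (not the exact evaluation script used here) of computing average pairwise LPIPS over a set of generated samples with the publicly available lpips package (Zhang et al., 2018); the sampler is left as a hypothetical placeholder, and images are assumed to be float tensors in \([-1, 1]\).

```python
# Minimal sketch: average pairwise LPIPS over generated samples using the
# public `lpips` package. Not the exact evaluation script used in this work.
import itertools
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone

@torch.no_grad()
def average_pairwise_lpips(images: torch.Tensor) -> float:
    """images: (N, 3, H, W) float tensor with values in [-1, 1]."""
    dists = []
    for i, j in itertools.combinations(range(images.shape[0]), 2):
        dists.append(loss_fn(images[i:i + 1], images[j:j + 1]).item())
    return sum(dists) / len(dists)

# Usage (hypothetical sampler): score = average_pairwise_lpips(samples)
```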
For comparison, we also evaluate the generation diversity of the source StyleGAN2 (Karras et al., 2020) models used by GAN-based baselines (Wang et al., 2018; Karras et al., 2020; Mo et al., 2020; Wang et al., 2020; Li et al., 2020; Ojha et al., 2021; Zhao et al., 2022a). DDPMs trained on FFHQ \(256^2\) and LSUN Church \(256^2\) achieve generation diversity similar to the widely-used StyleGAN2 models. Besides, we sample 5000 images to evaluate the generation quality of the source models using FID (Heusel et al., 2017). As shown in Table 10, DDPM-based source models achieve FID results similar to StyleGAN2 on the source datasets FFHQ \(256^2\) and LSUN Church \(256^2\).
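For completeness, FID can be computed with off-the-shelf tooling; the sketch below uses torchmetrics' FrechetInceptionDistance and is an illustrative assumption rather than the exact pipeline used for Table 10. Images are assumed to be uint8 tensors.

```python
# Minimal sketch: FID between real and generated image sets with torchmetrics.
# Illustrative only; not necessarily the pipeline used for Table 10.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception pool3 features

def compute_fid(real_batches, fake_batches) -> float:
    """Each batch: uint8 tensor of shape (B, 3, H, W)."""
    for batch in real_batches:
        fid.update(batch, real=True)
    for batch in fake_batches:
        fid.update(batch, real=False)
    return fid.compute().item()
```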
Appendix B Additional Ablation Analysis
We provide detailed ablation analysis of the weight coefficients of \(\mathcal {L}_{img}\), \(\mathcal {L}_{hf}\), and \(\mathcal {L}_{hfmse}\) using 10-shot FFHQ \(\rightarrow \) Babies (unconditional) as an example. Intra-LPIPS and FID are employed for quantitative evaluation.
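For reference, these coefficients enter a weighted objective of the form \(\mathcal {L} = \mathcal {L}_{simple} + \lambda _2\,\mathcal {L}_{img} + \lambda _3\,\mathcal {L}_{hf} + \lambda _4\,\mathcal {L}_{hfmse}\); this is a sketch assembled from the terms ablated in this appendix, and any additional terms used in the main text are omitted here.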
We first ablate \(\lambda _2\), the weight coefficient of \(\mathcal {L}_{img}\), by adapting the source model to 10-shot Babies without \(\mathcal {L}_{hf}\) and \(\mathcal {L}_{hfmse}\). The quantitative results are listed in Table 11, and the corresponding generated samples are shown in Fig. 16. When \(\lambda _2\) is set to 0.0, the directly fine-tuned model produces coarse results lacking high-frequency details and diversity. With an appropriate choice of \(\lambda _2\), the adapted model achieves greater generation diversity and learns the target distribution better under the guidance of \(\mathcal {L}_{img}\). Excessively large values of \(\lambda _2\) cause \(\mathcal {L}_{img}\) to overwhelm \(\mathcal {L}_{simple}\) and prevent the adapted model from learning the target distribution, degrading both generation quality and diversity. The adapted model with \(\lambda _2\) set to 2.5 produces unnatural samples even though it achieves the best FID result. Weighing the qualitative and quantitative results together, we recommend \(\lambda _2\) values between 0.1 and 1.0 for the unconditional adaptation setups used in this paper.
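To make the role of \(\mathcal {L}_{img}\) more concrete, the sketch below implements one plausible form of an image-level relative-distance loss: probability distributions over pairwise cosine similarities of the adapted model's denoised predictions are matched (via KL divergence) to those of the frozen source model. The function names and the softmax/KL formulation are illustrative assumptions, not necessarily the exact definition used in the main text.

```python
# Hedged sketch of an image-level relative-distance loss: the distribution of
# pairwise cosine similarities among the adapted model's denoised predictions
# is encouraged to match that of the frozen source model.
import torch
import torch.nn.functional as F

def pairwise_log_probs(x0: torch.Tensor) -> torch.Tensor:
    """x0: (N, C, H, W) denoised predictions -> (N, N-1) log-probabilities."""
    n = x0.shape[0]
    flat = x0.reshape(n, -1)
    rows = []
    for i in range(n):
        others = torch.cat([flat[:i], flat[i + 1:]], dim=0)
        sims = F.cosine_similarity(flat[i:i + 1], others, dim=1)
        rows.append(F.log_softmax(sims, dim=0))
    return torch.stack(rows)

def image_level_distance_loss(x0_source: torch.Tensor,
                              x0_adapted: torch.Tensor) -> torch.Tensor:
    p_source = pairwise_log_probs(x0_source).exp().detach()  # source model is frozen
    log_p_adapted = pairwise_log_probs(x0_adapted)
    return F.kl_div(log_p_adapted, p_source, reduction="batchmean")
```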
Next, we ablate \(\lambda _3\), the weight coefficient of \(\mathcal {L}_{hf}\), with \(\lambda _2\) set to 0.5. The quantitative results are listed in Table 12, and the corresponding generated samples are shown in Fig. 17. \(\mathcal {L}_{hf}\) guides the adapted model to preserve the diverse high-frequency details learned from source samples, yielding more realistic results: it enhances details such as clothes and hairstyles and achieves better FID and Intra-LPIPS, indicating improved quality and diversity. Excessively large values of \(\lambda _3\) make the adapted model pay too much attention to high-frequency components and fail to produce realistic results that follow the target distribution. We recommend \(\lambda _3\) values between 0.1 and 1.0 for the unconditional adaptation setups used in this paper.
Finally, we ablate \(\lambda _4\), the weight coefficient of \(\mathcal {L}_{hfmse}\), with \(\lambda _2\) and \(\lambda _3\) set to 0.5. The quantitative results are listed in Table 13, and the corresponding generated samples are shown in Fig. 18. \(\mathcal {L}_{hfmse}\) guides the adapted model to learn more high-frequency details from the limited training data, and an appropriate choice of \(\lambda _4\) helps it generate diverse results with rich details. Moreover, the full DomainStudio approach achieves state-of-the-art FID and Intra-LPIPS results on 10-shot FFHQ \(\rightarrow \) Babies (see Tables 1 and 2). As with \(\lambda _2\) and \(\lambda _3\), excessively large values of \(\lambda _4\) lead to unreasonable results that deviate from the target distribution. We recommend \(\lambda _4\) values between 0.01 and 0.08 for the unconditional adaptation setups in this paper. The results in Figs. 16, 17, and 18 are synthesized from fixed noise inputs.
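The high-frequency components that \(\mathcal {L}_{hf}\) and \(\mathcal {L}_{hfmse}\) operate on can be obtained with a single-level 2D Haar wavelet decomposition, as sketched below. Summing the LH, HL, and HH sub-bands and comparing high-frequency parts with a mean squared error are illustrative assumptions consistent with the description above, not a verbatim reproduction of our implementation.

```python
# Hedged sketch: extract high-frequency components with a single-level Haar
# wavelet transform (PyWavelets) and compare two images' high-frequency parts
# with an MSE, in the spirit of L_hf / L_hfmse.
import numpy as np
import pywt

def high_frequency(img: np.ndarray) -> np.ndarray:
    """img: (H, W) or (C, H, W) array; returns summed LH + HL + HH sub-bands."""
    _, (lh, hl, hh) = pywt.dwt2(img, "haar", axes=(-2, -1))
    return lh + hl + hh

def hf_mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
    return float(np.mean((high_frequency(img_a) - high_frequency(img_b)) ** 2))
```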
Appendix C Additional Visualized Samples
As illustrated in Sec. 4, DomainStudio is capable of adapting the subjects specified in text prompts to the style of the few-shot training samples, whereas baselines like DreamBooth (Ruiz et al., 2023) and Textual Inversion (Gal et al., 2023) fail to produce reasonable samples. We employ LoRA (Hu et al., 2021) as another baseline for T2I generation and provide qualitative results in Fig. 19. Like DreamBooth, LoRA also suffers from overfitting or underfitting in domain-driven generation tasks. Fig. 20 shows additional results of DomainStudio on T2I generation using a single image as training data.
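For context, LoRA keeps the pre-trained weight \(W\) frozen and learns a low-rank update \(\Delta W = BA\). The sketch below shows this idea for a single linear layer; the rank, scaling, and initialization are illustrative choices, not the configuration used in the LoRA baseline reported here.

```python
# Hedged sketch of a LoRA-augmented linear layer (Hu et al., 2021): the frozen
# base weight is combined with a trainable low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```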