DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation Using Limited Data

Published in: International Journal of Computer Vision

Abstract

Denoising diffusion probabilistic models (DDPMs) have been proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. However, they remain vulnerable to overfitting when fine-tuned on limited data. Existing works have explored subject-driven generation with text-to-image (T2I) models using a few samples, but there is still a lack of effective and stable data-efficient methods to synthesize images in specific domains (e.g., styles or properties), which remains challenging due to ambiguities inherent in natural language and out-of-distribution effects. This paper introduces a few-shot fine-tuning approach named DomainStudio as a domain-driven image generation paradigm. It is designed to retain the subjects provided by the prior knowledge of pre-trained models and adapt them to the domain extracted from the training data, pursuing both high quality and diversity. We propose to preserve the image-level relative distances between adapted samples and to enhance the learning of high-frequency details from both pre-trained models and training samples. DomainStudio is compatible with both unconditional and T2I DDPMs. The proposed method achieves better results than current state-of-the-art GAN-based approaches on unconditional few-shot image generation, and it outperforms existing few-shot fine-tuning methods for modern large-scale T2I diffusion models, such as Textual Inversion and DreamBooth, on synthesizing samples in specific domains characterized by few-shot training data.


Data Availability

This work depends on open-source datasets like FFHQ (Karras et al., 2020) and LSUN (Yu et al., 2015), and few-shot datasets provided by DreamBooth (Ruiz et al., 2023). The code of our work is publicly available at https://github.com/bbzhu-jy16/DomainStudio.

References

  • Ahn, N., Lee, J., Lee, C., Kim, K., Kim, D., Nam, S.-H., & Hong, K. (2024) Dreamstyler: Paint by style inversion with text-to-image diffusion models. In: AAAI

  • Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., & Catanzaro, B., et al. (2022) ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324

  • Bar-Tal, O., Yariv, L., Lipman, Y., & Dekel, T. (2023) Multidiffusion: Fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning

  • Brock, A., Donahue, J., & Simonyan, K. (2019) Large scale GAN training for high fidelity natural image synthesis. In: ICLR

  • Cai, M., Zhang, H., Huang, H., Geng, Q., Li, Y., & Huang, G. (2021) Frequency domain image translation: More photo-realistic, better identity-preserving. In: ICCV, pp. 13930–13940

  • Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W.T., & Rubinstein, M., et al. (2023) Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704

  • Chang, H., Zhang, H., Jiang, L., Liu, C., & Freeman, W.T. (2022) Maskgit: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325

  • Chen, M., Laina, I., & Vedaldi, A. (2024) Training-free layout control with cross-attention guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5343–5353

  • Cheng, J., Liang, X., Shi, X., He, T., Xiao, T., & Li, M. (2023) Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation. arXiv preprint arXiv:2302.08908

  • Chong, M.J., & Forsyth, D. (2022) Jojogan: One shot face stylization. In: European Conference on Computer Vision, pp. 128–152. Springer

  • Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., & Raff, E. (2022) Vqgan-clip: Open domain image generation and editing with natural language guidance. In: Proceedings of the European Conference on Computer Vision, pp. 88–105. Springer

  • Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5), 961–1005.

  • Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., & Xu, C. (2022) Stytr2: Image style transfer with transformers. In: CVPR, pp. 11326–11336

  • Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.

  • Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., & Yang, H. (2021). Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 19822–19835.

  • Everaert, M. N., Bocchio, M., Arpa, S., Süsstrunk, S., & Achanta, R. (2023) Diffusion in style. In: ICCV, pp. 2251–2261

  • Frenkel, Y., Vinker, Y., Shamir, A., & Cohen-Or, D. (2024) Implicit style-content separation using b-lora. In: European Conference on Computer Vision, pp. 181–198. Springer

  • Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022) Make-a-scene: Scene-based text-to-image generation with human priors. In: Proceedings of the European Conference on Computer Vision, pp. 89–106. Springer

  • Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-or, D. (2023) An image is worth one word: Personalizing text-to-image generation using textual inversion. In: The Eleventh International Conference on Learning Representations

  • Gal, R., Hochberg, D. C., Bermano, A., & Cohen-Or, D. (2021). Swagan: A style-based wavelet-driven generative model. ACM Transactions on Graphics (TOG), 40(4), 1–11.

  • Gal, R., Patashnik, O., Maron, H., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4), 1–13.

  • Gatys, L.A., Ecker, A.S., & Bethge, M. (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576

  • Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., & Shlens, J. (2017) Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27

  • Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., & Wu, W., et al. (2024) Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems 36

  • Hertz, A., Voynov, A., Fruchter, S., & Cohen-Or, D. (2024) Style aligned image generation via shared attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4775–4785

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30

  • Hinz, T., Heinrich, S., & Wermter, S. (2020). Semantic object accuracy for generative text-to-image synthesis. IEEE transactions on pattern analysis and machine intelligence, 44(3), 1552–1565.

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.

  • Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021) Lora: Low-rank adaptation of large language models. In: ICLR

  • Hu, T., Zhang, J., Liu, L., Yi, R., Kou, S., Zhu, H., Chen, X., Wang, Y., Wang, C., & Ma, L. (2023) Phasic content fusing diffusion model with directional distribution consistency for few-shot model adaption. In: ICCV, pp. 2406–2415

  • Huang, J., Cui, K., Guan, D., Xiao, A., Zhan, F., Lu, S., Liao, S., & Xing, E. (2022). Masked generative adversarial networks are data-efficient generation learners. Advances in Neural Information Processing Systems, 35, 2154–2167.

  • Karras, T., Laine, S., & Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In: CVPR, pp. 4401–4410

  • Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020) Analyzing and improving the image quality of stylegan. In: CVPR, pp. 8110–8119

  • Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020). Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33, 12104–12114.

  • Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., & Aila, T. (2021). Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 852–863.

  • Kim, G., Kwon, T., & Ye, J. C. (2022) Diffusionclip: Text-guided diffusion models for robust image manipulation. In: CVPR, pp. 2426–2435

  • Kingma, D. P., & Welling, M. (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114

  • Kingma, D., Salimans, T., Poole, B., & Ho, J. (2021). Variational diffusion models. Advances in Neural Information Processing Systems, 34, 21696–21707.

  • Krizhevsky, A., & Hinton, G., et al. (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto

  • Kumari, N., Zhang, B., Zhang, R., Shechtman, E., & Zhu, J.-Y. (2023) Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1931–1941

  • Kwon, G., & Ye, J.C. (2023) One-shot adaptation of gan in just one clip. TPAMI

  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR

  • Li, Y., Liu, H., Wen, Y., & Lee, Y. J. (2023) Generate anything anywhere in any scene. arXiv preprint arXiv:2306.17154

  • Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., & Lee, Y.J. (2023) Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521

  • Li, B., Qi, X., Lukasiewicz, T., & Torr, P. (2019) Controllable text-to-image generation. Advances in Neural Information Processing Systems 32

  • Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., & Gao, J. (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12174–12182

  • Li, Y., Zhang, R., Lu, J., & Shechtman, E. (2020). Few-shot image generation with elastic weight consolidation. Advances in Neural Information Processing Systems, 33, 15885–15896.

  • Liang, H., Zhang, W., Li, W., Yu, J., & Xu, L. (2024) Intergen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision, 1–21

  • Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022) Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471

  • Ma, J., Liang, J., Chen, C., & Lu, H. (2024) Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In: ACM SIGGRAPH 2024 Conference Papers, pp. 1–12

  • Mo, S., Cho, M., & Shin, J. (2020) Freeze the discriminator: A simple baseline for fine-tuning gans. In: CVPR AI for Content Creation Workshop

  • Moon, S.-J., Kim, C., & Park, G.-M. (2022) Wagi: wavelet-based gan inversion for preserving high-frequency image details. arXiv preprint arXiv:2210.09655

  • Nichol, A. Q., & Dhariwal, P. (2021) Improved denoising diffusion probabilistic models. In: ICML, pp. 8162–8171. PMLR

  • Noguchi, A., & Harada, T. (2019) Image generation from small datasets via batch statistics adaptation. In: ICCV, pp. 2750–2758

  • Ojha, U., Li, Y., Lu, J., Efros, A. A., Lee, Y. J., Shechtman, E., & Zhang, R. (2021) Few-shot image generation via cross-domain correspondence. In: CVPR, pp. 10743–10752

  • Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., & Graves, A., et al. (2016) Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems 29

  • Park, J., & Kim, Y. (2022) Styleformer: Transformer based generative adversarial networks with style vector. In: CVPR, pp. 8983–8992

  • Phung, H., Dao, Q., & Tran, A. (2023) Wavelet diffusion models are fast and scalable image generators. In: CVPR, pp. 10199–10208

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2024) Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: International Conference on Learning Representations

  • Qiao, T., Zhang, J., Xu, D., & Tao, D. (2019) Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in neural information processing systems 32

  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR

  • Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., & Barron, J. (2023) Dreambooth3d: Subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2349–2359

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125

  • Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021) Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695

  • Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510

  • Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., & Aberman, K. (2024) Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6527–6536

  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., & Salimans, T. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.

  • Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4713–4726.

  • Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., & Wortsman, M. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, 25278–25294.

  • Schwarz, K., Liao, Y., & Geiger, A. (2021). On the frequency bias of generative models. Advances in Neural Information Processing Systems, 34, 18126–18136.

  • Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., & Jampani, V. (2024) Ziplora: Any subject in any style by effectively merging loras. In: European Conference on Computer Vision, pp. 422–438. Springer

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, pp. 2256–2265

  • Sohn, K., Ruiz, N., Lee, K., Chin, D. C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., & Li, Y. (2023) Styledrop: text-to-image generation in any style. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 66860–66889

  • Sohn, K., Shaw, A., Hao, Y., Zhang, H., Polania, L., Chang, H., Jiang, L., & Essa, I. (2023) Learning disentangled prompts for compositional image synthesis. arXiv preprint arXiv:2306.00763

  • Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., & Bao, B. (2020) Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865

  • Tran, N.-T., Tran, V.-H., Nguyen, N.-B., Nguyen, T.-K., & Cheung, N.-M. (2021). On data augmentation for gan training. IEEE TIP, 30, 1882–1897.

  • Tumanyan, N., Bar-Tal, O., Bagon, S., & Dekel, T. (2022) Splicing vit features for semantic appearance transfer. In: CVPR, pp. 10748–10757

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017) Attention is all you need. Advances in neural information processing systems 30

  • Wang, Y., Gonzalez-Garcia, A., Berga, D., Herranz, L., Khan, F. S., & Weijer, J.v.d. (2020) Minegan: Effective knowledge transfer from gans to target domains with few images. In: CVPR, pp. 9332–9341

  • Wang, H., Spinelli, M., Wang, Q., Bai, X., Qin, Z., & Chen, A. (2024) Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733

  • Wang, Y., Wu, C., Herranz, L., Weijer, J., Gonzalez-Garcia, A., & Raducanu, B. (2018) Transferring gans: Generating images from limited data. In: ECCV, pp. 218–234

  • Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024) Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 1–21

  • Wang, Z., Chi, Z., & Zhang, Y. (2022). Fregan: exploiting frequency components for training gans under limited data. Advances in Neural Information Processing Systems, 35, 33387–33399.

  • Xiao, J., Li, L., Wang, C., Zha, Z.-J., & Huang, Q. (2022) Few shot generative model adaption via relaxed spatial structural alignment. In: CVPR, pp. 11204–11213

  • Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M.Z. (2023) Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452–7461

  • Xie, S., Zhang, Z., Lin, Z., Hinz, T., & Zhang, K. (2023) Smartbrush: Text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22428–22437

  • Xu, Y., Tang, F., Cao, J., Zhang, Y., Deussen, O., Dong, W., Li, J., & Lee, T.-Y. (2024) Break-for-make: Modular low-rank adaptations for composable content-style customization. arXiv preprint arXiv:2403.19456

  • Xu, Y., Wang, Z., Xiao, J., Liu, W., & Chen, L. (2024) Freetuner: Any subject in any style with training-free diffusion. arXiv preprint arXiv:2405.14201

  • Xue, H., Huang, Z., Sun, Q., Song, L., & Zhang, W. (2023) Freestyle layout-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14256–14266

  • Yang, C., Shen, Y., Zhang, Z., Xu, Y., Zhu, J., Wu, Z., & Zhou, B. (2023) One-shot generative domain adaptation. In: ICCV, pp. 7733–7742

  • Yang, M., Wang, Z., Chi, Z., & Feng, W. (2022) Wavegan: Frequency-aware gan for high-fidelity few-shot image generation. In: ECCV, pp. 1–17. Springer

  • Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., & Xiao, J. (2015) Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365

  • Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., & Xu, C. (2023) Inversion-based style transfer with diffusion models. In: CVPR, pp. 10146–10156

  • Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR, pp. 586–595

  • Zhang, Z., Liu, Y., Han, C., Guo, T., Yao, T., & Mei, T. (2022) Generalized one-shot domain adaption of generative adversarial networks. Advances in Neural Information Processing Systems

  • Zhang, L., Rao, A., & Agrawala, M. (2023) Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847

  • Zhang, Z., Xie, Y., & Yang, L. (2018) Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6199–6208

  • Zhang, Y., Yao, M., Wei, Y., Ji, Z., Bai, J., & Zuo, W. (2022) Towards diverse and faithful one-shot adaption of generative adversarial networks. Advances in Neural Information Processing Systems

  • Zhao, Y., Chandrasegaran, K., Abdollahzadeh, M., & Cheung, N.-M. (2022) Few-shot image generation via adaptation-aware kernel modulation. Advances in Neural Information Processing Systems

  • Zhao, Y., Ding, H., Huang, H., & Cheung, N. -M. (2022) A closer look at few-shot image generation. In: CVPR, pp. 9140–9150

  • Zhao, Y., Du, C., Abdollahzadeh, M., Pang, T., Lin, M., YAN, S., & Cheung, N.-M. (2023) Exploring incompatible knowledge transfer in few-shot image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  • Zhao, Z., Zhang, Z., Chen, T., Singh, S., & Zhang, H. (2020) Image augmentations for gan training. arXiv preprint arXiv:2006.02595

  • Zhao, S., Liu, Z., Lin, J., Zhu, J.-Y., & Han, S. (2020). Differentiable augmentation for data-efficient gan training. Advances in Neural Information Processing Systems, 33, 7559–7570.

  • Zheng, G., Zhou, X., Li, X., Qi, Z., Shan, Y., & Li, X. (2023) Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22490–22499

  • Zhu, P., Abdal, R., Femiani, J., & Wonka, P. (2021) Mind the gap: Domain gap control for single shot domain adaptation for generative adversarial networks. In: ICLR

  • Zhu, J., Li, S., Liu, Y., Huang, P., Shan, J., Ma, H., & Yuan, J. (2024) Odgen: Domain-specific object detection data generation with diffusion models. In: Proceedings of the 38th International Conference on Neural Information Processing Systems

Author information

Corresponding author

Correspondence to Huimin Ma.

Additional information

Communicated by Bolei Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Unconditional Source Models

We train DDPMs on FFHQ \(256^2\) (Karras et al., 2020) and LSUN Church \(256^2\) (Yu et al., 2015) from scratch for 300K and 250K iterations, respectively, as source models for DDPM adaptation, which takes 5 days 22 hours and 4 days 22 hours on \(8\times \) NVIDIA RTX A6000 GPUs. We randomly sample 1000 images from each of these two models to evaluate their generation diversity using the average pairwise LPIPS (Zhang et al., 2018) metric, as shown in Table 9.
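For reference, average pairwise LPIPS can be computed with the publicly available lpips package (Zhang et al., 2018). The following Python sketch is illustrative only: the directory name, backbone choice, and pair enumeration are assumptions rather than the exact evaluation script used for Table 9.

import itertools
from pathlib import Path

import lpips
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
loss_fn = lpips.LPIPS(net="alex").to(device)  # AlexNet backbone, the common default

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # LPIPS expects inputs in [-1, 1]
])

def load(path: Path) -> torch.Tensor:
    return to_tensor(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

# e.g. 1000 generated samples; for large sets, a random subset of pairs
# gives a close estimate at a fraction of the cost
paths = sorted(Path("samples_ffhq").glob("*.png"))

distances = []
with torch.no_grad():
    for p, q in itertools.combinations(paths, 2):
        distances.append(loss_fn(load(p), load(q)).item())

print(f"average pairwise LPIPS: {sum(distances) / len(distances):.4f}")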

Table 9 Average pairwise LPIPS (\(\uparrow \)) results of 1000 samples produced by StyleGAN2 and DDPMs trained on FFHQ \(256^2\) and LSUN Church \(256^2\)

For comparison, we also evaluate the generation diversity of the source StyleGAN2 (Karras et al., 2020) models used by GAN-based baselines (Wang et al., 2018; Karras et al., 2020; Mo et al., 2020; Wang et al., 2020; Li et al., 2020; Ojha et al., 2021; Zhao et al., 2022a). DDPMs trained on FFHQ \(256^2\) and LSUN Church \(256^2\) achieve generation diversity similar to that of the widely used StyleGAN2 models. In addition, we sample 5000 images to evaluate the generation quality of the source models using FID (Heusel et al., 2017). As shown in Table 10, the DDPM-based source models achieve FID results comparable to StyleGAN2 on the source datasets FFHQ \(256^2\) and LSUN Church \(256^2\).
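FID between generated and reference images can likewise be reproduced with an off-the-shelf implementation. The snippet below uses the clean-fid package and hypothetical directory names; the paper follows Heusel et al. (2017), and the exact implementation it relies on may differ.

from cleanfid import fid

# 5000 generated samples vs. the corresponding source dataset (paths are placeholders)
score = fid.compute_fid("samples_ffhq", "ffhq_256_reference")
print(f"FID: {score:.2f}")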

Table 10 FID (\(\downarrow \)) results of StyleGAN2 and DDPMs trained on FFHQ \(256^2\) and LSUN Church \(256^2\)

Appendix B Additional Ablation Analysis

We provide a detailed ablation analysis of the weight coefficients of \(\mathcal {L}_{img}\), \(\mathcal {L}_{hf}\), and \(\mathcal {L}_{hfmse}\), using 10-shot FFHQ \(\rightarrow \) Babies (unconditional) as an example. Intra-LPIPS and FID are employed for quantitative evaluation.

We first ablate \(\lambda _2\), the weight coefficient of \(\mathcal {L}_{img}\). We adapt the source model to 10-shot Babies without \(\mathcal {L}_{hf}\) and \(\mathcal {L}_{hfmse}\). The quantitative results are listed in Table 11, and the corresponding generated samples are shown in Fig. 16. When \(\lambda _2\) is set to 0.0, the directly fine-tuned model produces coarse results lacking high-frequency details and diversity. With an appropriate choice of \(\lambda _2\), the adapted model achieves greater generation diversity and better learning of the target distribution under the guidance of \(\mathcal {L}_{img}\). Excessively large values of \(\lambda _2\) let \(\mathcal {L}_{img}\) overwhelm \(\mathcal {L}_{simple}\) and prevent the adapted model from learning the target distribution, leading to degraded generation quality and diversity. The adapted model with \(\lambda _2\) set to 2.5 produces unnatural samples even though it achieves the best FID result. Based on a comprehensive consideration of the qualitative and quantitative evaluation, we recommend \(\lambda _2\) values ranging from 0.1 to 1.0 for the unconditional adaptation setups used in this paper.

Table 11 Intra-LPIPS (\(\uparrow \)) and FID (\(\downarrow \)) results of adapted models trained on 10-shot FFHQ \(\rightarrow \) Babies with different \(\lambda _2\), the weight coefficient of \(\mathcal {L}_{img}\)
Fig. 16 Visualized ablations of \(\lambda _2\), the weight coefficient of \(\mathcal {L}_{img}\), on 10-shot FFHQ \(\rightarrow \) Babies
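To make the role of \(\mathcal {L}_{img}\) concrete, the sketch below shows one way an image-level relative-distance loss can be implemented, assuming it follows the common pairwise-similarity formulation (a softmax over pairwise cosine similarities, matched between source and adapted samples with a KL divergence). The function names and the exact formulation are illustrative and not necessarily identical to the paper's.

import torch
import torch.nn.functional as F

def pairwise_similarity_dist(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) images; returns an (N, N-1) softmax over pairwise cosine similarities."""
    flat = x.flatten(start_dim=1)                                            # (N, C*H*W)
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)  # (N, N)
    mask = ~torch.eye(len(x), dtype=torch.bool, device=x.device)
    sim = sim[mask].view(len(x), -1)                                         # drop self-similarity
    return F.softmax(sim, dim=-1)

def image_level_loss(x_adapted: torch.Tensor, x_source: torch.Tensor) -> torch.Tensor:
    """Keep the relative distances among source-model samples in the adapted model."""
    p_src = pairwise_similarity_dist(x_source)
    p_ada = pairwise_similarity_dist(x_adapted)
    return F.kl_div(p_ada.log(), p_src, reduction="batchmean")

# training step (sketch): total = loss_simple + lambda_2 * image_level_loss(x0_adapted, x0_source)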

Next, we ablate \(\lambda _3\), the weight coefficient of \(\mathcal {L}_{hf}\), with \(\lambda _2\) set to 0.5. The quantitative results are listed in Table 12, and the corresponding generated samples are shown in Fig. 17. \(\mathcal {L}_{hf}\) guides the adapted model to keep the diverse high-frequency details learned from source samples, enhancing details such as clothes and hairstyles and achieving better FID and Intra-LPIPS, which indicates improved quality and diversity. Excessively large values of \(\lambda _3\) cause the adapted model to focus too heavily on high-frequency components and fail to produce realistic results following the target distribution. We recommend \(\lambda _3\) values ranging from 0.1 to 1.0 for the unconditional adaptation setups used in this paper.

Table 12 Intra-LPIPS (\(\uparrow \)) and FID (\(\downarrow \)) results of adapted models trained on 10-shot FFHQ \(\rightarrow \) Babies with different \(\lambda _3\), the weight coefficient of \(\mathcal {L}_{hf}\)
Fig. 17 Visualized ablations of \(\lambda _3\), the weight coefficient of \(\mathcal {L}_{hf}\), on 10-shot FFHQ \(\rightarrow \) Babies
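\(\mathcal {L}_{hf}\) and \(\mathcal {L}_{hfmse}\) operate on high-frequency components. One standard way to obtain such components is a single-level Haar wavelet decomposition (cf. Daubechies, 1990), sketched below with PyWavelets; whether this matches the paper's exact frequency decomposition is an assumption.

import numpy as np
import pywt

def high_frequency(img: np.ndarray) -> np.ndarray:
    """img: (H, W) array; returns the stacked horizontal, vertical, and diagonal detail bands."""
    _, (lh, hl, hh) = pywt.dwt2(img, "haar")  # discard the low-frequency approximation band
    return np.stack([lh, hl, hh], axis=0)

img = np.random.rand(256, 256).astype(np.float32)  # placeholder for a real sample
print(high_frequency(img).shape)                   # (3, 128, 128)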

Table 13 Intra-LPIPS (\(\uparrow \)) and FID (\(\downarrow \)) results of adapted models trained on 10-shot FFHQ \(\rightarrow \) Babies with different \(\lambda _4\), the weight coefficient of \(\mathcal {L}_{hfmse}\)
Fig. 18 Visualized ablations of \(\lambda _4\), the weight coefficient of \(\mathcal {L}_{hfmse}\), on 10-shot FFHQ \(\rightarrow \) Babies

Fig. 19 Qualitative comparison between LoRA, DreamBooth, and DomainStudio

Finally, we ablate \(\lambda _4\), the weight coefficient of \(\mathcal {L}_{hfmse}\), with \(\lambda _2\) and \(\lambda _3\) set to 0.5. The quantitative results are listed in Table 13, and the corresponding generated samples are shown in Fig. 18. \(\mathcal {L}_{hfmse}\) guides the adapted model to learn more high-frequency details from the limited training data, and an appropriate choice of \(\lambda _4\) helps it generate diverse results containing rich details. Moreover, the full DomainStudio approach achieves state-of-the-art FID and Intra-LPIPS results on 10-shot FFHQ \(\rightarrow \) Babies (see Tables 1 and 2). Similar to \(\lambda _2\) and \(\lambda _3\), excessively large values of \(\lambda _4\) lead to unreasonable results that deviate from the target distribution. We recommend \(\lambda _4\) values ranging from 0.01 to 0.08 for the unconditional adaptation setups in this paper. The results in Figs. 16, 17, and 18 are synthesized from fixed noise inputs.
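As a summary of the ablations above, the weighted terms can be combined around the standard denoising objective as in the sketch below, using coefficients inside the recommended ranges. The helper losses are placeholders for the terms ablated in this appendix, and the full training objective may contain additional terms not shown here.

lambda_2, lambda_3, lambda_4 = 0.5, 0.5, 0.05  # within the recommended ranges

def domainstudio_objective(loss_simple, loss_img, loss_hf, loss_hfmse):
    """All inputs are scalar tensors computed elsewhere in the training step (sketch only)."""
    return loss_simple + lambda_2 * loss_img + lambda_3 * loss_hf + lambda_4 * loss_hfmse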

Fig. 20 1-shot T2I generation samples of DomainStudio given different text prompts

Appendix C Additional Visualized Samples

As illustrated in Sec. 4, DomainStudio is capable of adapting the subjects named in text prompts to the style of the few-shot training samples, whereas baselines like DreamBooth (Ruiz et al., 2023) and Textual Inversion (Gal et al., 2023) fail to produce reasonable samples. We employ LoRA (Hu et al., 2021) as another baseline for T2I generation and provide qualitative results in Fig. 19. Like DreamBooth, LoRA also suffers from overfitting or underfitting in domain-driven generation tasks. Fig. 20 shows additional results of DomainStudio on T2I generation using a single image as training data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, J., Ma, H., Chen, J. et al. DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation Using Limited Data. Int J Comput Vis 133, 7012–7036 (2025). https://doi.org/10.1007/s11263-025-02498-0
