Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, named HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The framework consists of two novel modules: a Stylized Memory Retrieval (SMR) module and a Multi-scale Feature Mapping (MFM) module. We first encode the text with the vision-language pretrained CLIP model to obtain coarse features of the local human appearance. The SMR module then refines these coarse features by retrieving clothing texture details from an external database; with this refinement, the HIG task can be performed with arbitrary text inputs, and the range of expressible styles is greatly expanded. The MFM module, embedded in the diffusion backbone, learns fine-grained appearance features, achieving precise semantic alignment between different body parts and their appearance features and thus an accurate expression of the desired human appearance. The seamless combination of these modules in HumanDiffusion enables free-style, high-accuracy text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
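To illustrate the retrieval idea described above, the following is a minimal sketch assuming a precomputed memory bank of CLIP features for clothing-texture examples. The names (refine_with_memory, memory_bank, top_k, alpha) are hypothetical and do not reflect the authors' actual SMR implementation; the random tensors merely stand in for real CLIP embeddings.

```python
# Hypothetical sketch: refine a coarse CLIP appearance feature by retrieving
# nearest neighbours from an external texture memory bank (the general idea
# behind the SMR module, not the paper's actual implementation).
import torch
import torch.nn.functional as F

def refine_with_memory(coarse_feat: torch.Tensor,
                       memory_bank: torch.Tensor,
                       top_k: int = 4,
                       alpha: float = 0.5) -> torch.Tensor:
    """coarse_feat: (B, D) CLIP features of the text-described appearance.
    memory_bank: (N, D) CLIP features of stored clothing-texture examples."""
    # Cosine similarity between each query and every memory entry.
    q = F.normalize(coarse_feat, dim=-1)          # (B, D)
    m = F.normalize(memory_bank, dim=-1)          # (N, D)
    sim = q @ m.t()                               # (B, N)

    # Retrieve the top-k most similar texture entries per query.
    weights, idx = sim.topk(top_k, dim=-1)        # (B, k)
    weights = weights.softmax(dim=-1)             # attention-like weights
    retrieved = memory_bank[idx]                  # (B, k, D)

    # Blend the weighted retrieved details with the coarse feature.
    detail = (weights.unsqueeze(-1) * retrieved).sum(dim=1)  # (B, D)
    return alpha * coarse_feat + (1.0 - alpha) * detail

# Toy usage with random stand-ins for CLIP features.
bank = torch.randn(1000, 512)     # external clothing-texture database
query = torch.randn(2, 512)       # coarse features from the text encoder
refined = refine_with_memory(query, bank)
print(refined.shape)              # torch.Size([2, 512])
```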
Data Availability
The data that support the findings of this study are available from Text2Human: Text-Driven Controllable Human Image Generation at https://github.com/yumingj/DeepFashion-MultiModal, and from StyleGAN-Human: A Data-Centric Odyssey of Human Generation at https://github.com/styleganhuman/StyleGAN-Human/blob/main/docs/Dataset.md.
Change history
07 August 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11263-024-02200-w
References
Albahar, B., Lu, J., Yang, J., Shu, Z., & Shechtman, E. (2021). Pose with style: Detail-preserving pose-guided image synthesis with conditional stylegan. ACM Transactions on Graphics (TOG), 40(6), 1–11.
Avrahami, O., Lischinski, D., & Fried, O. (2022). Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218.
Blattmann, A., Rombach, R., Oktay, K., & Ommer, B. (2022). Retrieval-augmented diffusion models. arXiv preprint arXiv:2204.11824
Cheong, S. Y., Mustafa, A., & Gilbert, A. (2022). Pose guided multi-person image generation from text. arXiv preprint arXiv:2203.04907.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., & Bougares, F. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chunseong Park, C., Kim, B., & Kim, G. (2017). Attend to you: Personalized image captioning with context sequence memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 895–903.
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
Fu, J., Li, S., Jiang, Y., Lin, K.-Y., Qian, C., Loy, C. C., Wu, W., & Liu, Z. (2022). Stylegan-human: A data-centric odyssey of human generation. In European Conference on Computer Vision, pp. 1–19, Springer.
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131
Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., & Yuan, L. (2022). Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10696–10706.
He, S., Song, Y.-Z., & Xiang, T. (2022). Style-based global appearance flow for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3470–3479.
Heusel, M., Ramsauer, H., Unterthiner, T., & Nessler, B. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30.
Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Huang, H., Chai, Z., He, R., & Tan, T. (2021). Selective wavelet attention learning for single image deraining. International Journal of Computer Vision, 129, 1282–1300.
Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C. C., & Liu, Z. (2022). Text2human: Text-driven controllable human image generation. ACM Transactions on Graphics (TOG), 41(4), 1–11.
Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., & Irani, M. (2022). Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276
Kim, G., & Kwon, T. (2022). Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2426–2435.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kong, Y., & Fu, Y. (2022). Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5), 1366–1401.
Lewis, K. M., Varadharajan, S., & Kemelmacher-Shlizerman, I. (2021). Tryongan: Body-aware try-on via layered interpolation. ACM Transactions on Graphics (TOG), 40(4), 1–10.
Li, Y., Huang, C., & Loy, C.C. (2019). Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., & Damania, P., et al. (2020). Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704
Liu, Z., Luo, P., Qiu, S., Wang, X., & Tang, X. (2016). Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liu, X., Park, D.H., Azadi, S., Zhang, G., Chopikyan, A., Hu, Y., Shi, H., Rohrbach, A., & Darrell, T. (2021). More control for free! image synthesis with semantic diffusion guidance. arXiv preprint arXiv:2112.05744
Liu, D., Wu, L., Zheng, F., Liu, L., & Wang, M. (2022). Verbal-person nets: Pose-guided multi-granularity language-to-person generation. IEEE Transactions on Neural Networks and Learning Systems.
Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., & Schiele, B. (2018). Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108.
Miao, J., & Wei, Y. (2020). Memory aggregation networks for efficient interactive video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10366–10375.
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., & Mishkin, P. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
Park, T., & Liu, M.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346.
Qi, X., Liu, C., Sun, M., Li, L., Fan, C., & Yu, X. (2023). Diverse 3d hand gesture prediction from body dynamics by bilateral hand disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4616–4626.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., & Chen, M. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831.
Ren, Y., Fan, X., Li, G., Liu, S., & Li, T. H. (2022). Neural texture extraction and distribution for controllable person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13535–13544.
Rombach, R., Blattmann, A., Lorenz, D., & Esser, P. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.
Sarkar, K., Liu, L., Golyanik, V., & Theobalt, C. (2021). Humangan: A generative model of human images. In 2021 International Conference on 3D Vision (3DV), pp. 258–267.
Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., & Taigman, Y. (2022). Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849
Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems.
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456
Sun, J., Deng, Q., Li, Q., Sun, M., Ren, M., & Sun, Z. (2022). Anyface: Free-style text-to-face synthesis and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18687–18696.
Sun, M., Wang, J., Liu, J., Li, J., Chen, T., & Sun, Z. (2022). A unified framework for biphasic facial age translation with noisy-semantic guided generative adversarial networks. IEEE Transactions on Information Forensics and Security, 17, 1513–1527.
Tan, Z., Yang, Y., Wan, J., Guo, G., & Li, S. Z. (2020). Relation-aware pedestrian attribute recognition with graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12055–12062.
Tan, Z., Yang, Y., Wan, J., Hang, H., Guo, G., & Li, S. Z. (2019). Attention-based pedestrian attribute analysis. IEEE Transactions on Image Processing, 28(12), 6126–6140.
Wang, Z., & Simoncelli, E. P. (2003). Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–1402.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
Wang, Z., Qi, X., Yuan, K., & Sun, M. (2022). Self-supervised correlation mining network for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7703–7712.
Xu, X., & Chen, Y.-C. (2021). Text-guided human image manipulation via image-text shared space. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yang, Y., Tan, Z., Tiwari, P., Pandey, H. M., Wan, J., Lei, Z., Guo, G., & Li, S. Z. (2021). Cascaded split-and-aggregate learning with feature recombination for pedestrian attribute recognition. International Journal of Computer Vision, 129, 2731–2744.
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., & Sang, N. (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, 129, 3051–3068.
Zhang, L., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543
Zhang, P., & Yang, L. (2022). Exploring dual-task correlation for pose guided person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7713–7722.
Zhang, S., & Zhao, W. (2021). Keypoint-graph-driven learning framework for object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1065–1073.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
Zhang, J., Li, K., & Lai, Y.-K. (2021). Pise: Person image synthesis and editing with decoupled gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7982–7990.
Zhang, J., Siarohin, A., Tang, H., Chen, J., Sangineto, E., Wang, W., & Sebe, N. (2021). Controllable person image synthesis with spatially-adaptive warped normalization. arXiv preprint arXiv:2105.14739
Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., & Sun, T. (2021). Lafite: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792
Zhu, M., & Pan, P. (2019). Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810.
Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., & Bai, X. (2019). Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2347–2356.
Acknowledgements
This research was partly supported by the National Natural Science Foundation of China under Grants 62071468, 62306309, and U23B2054, and by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA27010600.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Sergio Escalera
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: the corresponding author, the Acknowledgements section, and Tables 2, 8 and 9 have been updated.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, K., Sun, M., Sun, J. et al. Open-Vocabulary Text-Driven Human Image Generation. Int J Comput Vis 132, 4379–4397 (2024). https://doi.org/10.1007/s11263-024-02079-7
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11263-024-02079-7