Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, named HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The framework consists of two novel modules: a Stylized Memory Retrieval (SMR) module and a Multi-scale Feature Mapping (MFM) module. We first encode the text with the vision-language pretrained CLIP model to obtain coarse features of the local human appearance. The SMR module then refines these coarse features by retrieving clothing texture details from an external database; with this refinement, the HIG task can be performed with arbitrary text inputs, and the range of expressible styles is greatly expanded. The MFM module, embedded in the diffusion backbone, learns fine-grained appearance features, achieving precise semantic alignment between different body parts and their appearance features and thus an accurate expression of the desired human appearance. The seamless combination of these modules in HumanDiffusion enables free-style, high-accuracy text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
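To illustrate the retrieval idea described above, the following is a minimal sketch assuming a precomputed memory bank of CLIP features for clothing-texture examples. The names (refine_with_memory, memory_bank, top_k, alpha) are hypothetical and do not reflect the authors' actual SMR implementation; the random tensors merely stand in for real CLIP embeddings.

```python
# Hypothetical sketch: refine a coarse CLIP appearance feature by retrieving
# nearest neighbours from an external texture memory bank (the general idea
# behind the SMR module, not the paper's actual implementation).
import torch
import torch.nn.functional as F

def refine_with_memory(coarse_feat: torch.Tensor,
                       memory_bank: torch.Tensor,
                       top_k: int = 4,
                       alpha: float = 0.5) -> torch.Tensor:
    """coarse_feat: (B, D) CLIP features of the text-described appearance.
    memory_bank: (N, D) CLIP features of stored clothing-texture examples."""
    # Cosine similarity between each query and every memory entry.
    q = F.normalize(coarse_feat, dim=-1)          # (B, D)
    m = F.normalize(memory_bank, dim=-1)          # (N, D)
    sim = q @ m.t()                               # (B, N)

    # Retrieve the top-k most similar texture entries per query.
    weights, idx = sim.topk(top_k, dim=-1)        # (B, k)
    weights = weights.softmax(dim=-1)             # attention-like weights
    retrieved = memory_bank[idx]                  # (B, k, D)

    # Blend the weighted retrieved details with the coarse feature.
    detail = (weights.unsqueeze(-1) * retrieved).sum(dim=1)  # (B, D)
    return alpha * coarse_feat + (1.0 - alpha) * detail

# Toy usage with random stand-ins for CLIP features.
bank = torch.randn(1000, 512)     # external clothing-texture database
query = torch.randn(2, 512)       # coarse features from the text encoder
refined = refine_with_memory(query, bank)
print(refined.shape)              # torch.Size([2, 512])
```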
Data Availability
The data that support the findings of this study are available from Text2Human: Text-Driven Controllable Human Image Generation at https://github.com/yumingj/DeepFashion-MultiModal, and from StyleGAN-Human: A Data-Centric Odyssey of Human Generation at https://github.com/styleganhuman/StyleGAN-Human/blob/main/docs/Dataset.md.
Change history
07 August 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11263-024-02200-w
References
Albahar, B., Lu, J., Yang, J., Shu, Z., & Shechtman, E. (2021). Pose with style: Detail-preserving pose-guided image synthesis with conditional stylegan. ACM Transactions on Graphics (TOG), 40(6), 1–11.
Avrahami, O., Lischinski, D., & Fried, O. (2022). Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218.
Blattmann, A., Rombach, R., Oktay, K., & Ommer, B. (2022). Retrieval-augmented diffusion models. arXiv preprint arXiv:2204.11824
Cheong, S. Y., Mustafa, A., & Gilbert, A. (2022). Pose guided multi-person image generation from text. arXiv preprint arXiv:2203.04907.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., & Bougares, F. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chunseong Park, C., Kim, B., & Kim, G. (2017). Attend to you: Personalized image captioning with context sequence memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 895–903.
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
Fu, J., Li, S., Jiang, Y., Lin, K.-Y., Qian, C., Loy, C. C., Wu, W., & Liu, Z. (2022). Stylegan-human: A data-centric odyssey of human generation. In European Conference on Computer Vision, pp. 1–19, Springer.
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131
Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., & Yuan, L. (2022). Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10696–10706.
He, S., Song, Y.-Z., & Xiang, T. (2022). Style-based global appearance flow for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3470–3479.
Heusel, M., Ramsauer, H., Unterthiner, T., & Nessler, B. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30.
Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Huang, H., Chai, Z., He, R., & Tan, T. (2021). Selective wavelet attention learning for single image deraining. International Journal of Computer Vision, 129, 1282–1300.
Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C. C., & Liu, Z. (2022). Text2human: Text-driven controllable human image generation. ACM Transactions on Graphics (TOG), 41(4), 1–11.
Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., & Irani, M. (2022). Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276
Kim, G., & Kwon, T. (2022). Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2426–2435.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kong, Y., & Fu, Y. (2022). Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5), 1366–1401.
Lewis, K. M., Varadharajan, S., & Kemelmacher-Shlizerman, I. (2021). Tryongan: Body-aware try-on via layered interpolation. ACM Transactions on Graphics (TOG), 40(4), 1–10.
Li, Y., Huang, C., & Loy, C.C. (2019). Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., & Damania, P., et al. (2020). Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704
Liu, Z., Luo, P., Qiu, S., Wang, X., & Tang, X. (2016). Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liu, X., Park, D.H., Azadi, S., Zhang, G., Chopikyan, A., Hu, Y., Shi, H., Rohrbach, A., & Darrell, T. (2021). More control for free! image synthesis with semantic diffusion guidance. arXiv preprint arXiv:2112.05744
Liu, D., Wu, L., Zheng, F., Liu, L., & Wang, M. (2022). Verbal-person nets: Pose-guided multi-granularity language-to-person generation. IEEE Transactions on Neural Networks and Learning Systems.
Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., & Schiele, B. (2018). Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108.
Miao, J., & Wei, Y. (2020). Memory aggregation networks for efficient interactive video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10366–10375.
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., & Mishkin, P. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
Park, T., & Liu, M.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346.
Qi, X., Liu, C., Sun, M., Li, L., Fan, C., & Yu, X. (2023). Diverse 3d hand gesture prediction from body dynamics by bilateral hand disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4616–4626.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., & Chen, M. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831.
Ren, Y., Fan, X., Li, G., Liu, S., & Li, T. H. (2022). Neural texture extraction and distribution for controllable person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13535–13544.
Rombach, R., Blattmann, A., Lorenz, D., & Esser, P. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.
Sarkar, K., Liu, L., Golyanik, V., & Theobalt, C. (2021). Humangan: A generative model of human images. In 2021 International Conference on 3D Vision (3DV), pp. 258–267.
Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., & Taigman, Y. (2022). Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849
Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems.
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456
Sun, J., Deng, Q., Li, Q., Sun, M., Ren, M., & Sun, Z. (2022). Anyface: Free-style text-to-face synthesis and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18687–18696.
Sun, M., Wang, J., Liu, J., Li, J., Chen, T., & Sun, Z. (2022). A unified framework for biphasic facial age translation with noisy-semantic guided generative adversarial networks. IEEE Transactions on Information Forensics and Security, 17, 1513–1527.
Tan, Z., Yang, Y., Wan, J., Guo, G., & Li, S. Z. (2020). Relation-aware pedestrian attribute recognition with graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12055–12062.
Tan, Z., Yang, Y., Wan, J., Hang, H., Guo, G., & Li, S. Z. (2019). Attention-based pedestrian attribute analysis. IEEE Transactions on Image Processing, 28(12), 6126–6140.
Wang, Z., & Simoncelli, E. P. (2003). Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–1402.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
Wang, Z., Qi, X., Yuan, K., & Sun, M. (2022). Self-supervised correlation mining network for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7703–7712.
Xu, X., & Chen, Y.-C. (2021). Text-guided human image manipulation via image-text shared space. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yang, Y., Tan, Z., Tiwari, P., Pandey, H. M., Wan, J., Lei, Z., Guo, G., & Li, S. Z. (2021). Cascaded split-and-aggregate learning with feature recombination for pedestrian attribute recognition. International Journal of Computer Vision, 129, 2731–2744.
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., & Sang, N. (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, 129, 3051–3068.
Zhang, L., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543
Zhang, P., & Yang, L. (2022). Exploring dual-task correlation for pose guided person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7713–7722.
Zhang, S., & Zhao, W. (2021). Keypoint-graph-driven learning framework for object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1065–1073.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
Zhang, J., Li, K., & Lai, Y.-K. (2021). Pise: Person image synthesis and editing with decoupled gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7982–7990.
Zhang, J., Siarohin, A., Tang, H., Chen, J., Sangineto, E., Wang, W., & Sebe, N. (2021). Controllable person image synthesis with spatially-adaptive warped normalization. arXiv preprint arXiv:2105.14739
Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., & Sun, T. (2021). Lafite: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792
Zhu, M., & Pan, P. (2019). Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810.
Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., & Bai, X. (2019). Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2347–2356.
Acknowledgements
This research was partly supported by the National Natural Science Foundation of China under Grants 62071468, 62306309, and U23B2054, and by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA27010600.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Sergio Escalera
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: the corresponding author, the Acknowledgements section, and Tables 2, 8 and 9 have been updated.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, K., Sun, M., Sun, J. et al. Open-Vocabulary Text-Driven Human Image Generation. Int J Comput Vis 132, 4379–4397 (2024). https://doi.org/10.1007/s11263-024-02079-7
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11263-024-02079-7