Abstract
We address the challenging problem of fine-grained text-driven human motion generation. Existing methods produce imprecise motions that fail to capture the relationships specified in text, owing to (1) the lack of effective parsing of detailed semantic cues about body parts, and (2) incomplete modeling of the linguistic structure between words, which prevents a comprehensive understanding of the text. To tackle these limitations, we propose Fg-T2M++, a novel fine-grained framework consisting of (1) an LLM-based semantic parsing module that extracts body-part descriptions and semantics from text, (2) a hyperbolic text representation module that encodes relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module that hierarchically fuses text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms state-of-the-art methods, validating its ability to generate motions that accurately adhere to the full semantics of the text.
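To make the hyperbolic text representation concrete, the following is a minimal, self-contained sketch (our illustration, not the authors' implementation) of the two operations such a module rests on: the exponential map at the origin of the Poincaré ball, which projects Euclidean word features into hyperbolic space, and the geodesic distance used to relate dependency-linked words. The feature dimensions and the toy dependency graph are hypothetical.

```python
# Minimal sketch: embedding words of a syntactic dependency graph into the
# unit Poincare ball (curvature c = 1). Illustrative only; all names, shapes,
# and the toy sentence are assumptions, not the paper's actual code.
import torch

def expmap0(v: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Exponential map at the origin: sends Euclidean vectors into the ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm  # tanh(||v||) < 1 keeps points inside

def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Geodesic distance between two points strictly inside the unit ball."""
    sq = ((x - y) ** 2).sum(dim=-1)
    denom = (1 - (x ** 2).sum(dim=-1)) * (1 - (y ** 2).sum(dim=-1))
    return torch.acosh(1 + 2 * sq / denom.clamp_min(eps))

# Toy dependency edges (head, dependent) for "person raises left arm":
# raises -> person, raises -> arm, arm -> left
edges = [(1, 0), (1, 3), (3, 2)]
feats = torch.randn(4, 64) * 0.1   # hypothetical Euclidean word features
ball = expmap0(feats)              # one hyperbolic embedding per word
for head, dep in edges:
    print(head, dep, poincare_dist(ball[head], ball[dep]).item())
```

Distances in the ball grow rapidly toward the boundary, which is what makes hyperbolic space well suited to tree-like structures such as dependency graphs: near-root words can sit close to the origin while leaves spread toward the boundary without crowding.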
Data Availability
This work uses two publicly available datasets, HumanML3D and KIT-ML. They can be obtained at https://github.com/EricGuo5513/HumanML3D and https://drive.google.com/drive/folders/1MnixfyGfujSP-4t8w_2QvjtTVpEKr97t, respectively.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Project Number: 62272019).
Additional information
Communicated by Svetlana Lazebnik.
About this article
Cite this article
Wang, Y., Li, M., Liu, J. et al. Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation. Int J Comput Vis 133, 4277–4293 (2025). https://doi.org/10.1007/s11263-025-02392-9