Abstract
We address the challenging problem of fine-grained text-driven human motion generation. Existing methods produce imprecise motions that fail to capture the relationships specified in text, owing to (1) the lack of effective parsing of detailed semantic cues about body parts, and (2) incomplete modeling of the linguistic structure between words, which prevents a comprehensive understanding of the text. To tackle these limitations, we propose Fg-T2M++, a novel fine-grained framework consisting of (1) an LLM-based semantic parsing module that extracts body-part descriptions and semantics from text, (2) a hyperbolic text representation module that encodes relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module that hierarchically fuses text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms state-of-the-art methods, validating its ability to generate motions that accurately adhere to the full semantics of the text.
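To make the hyperbolic text representation concrete, the following is a minimal, self-contained sketch (our illustration, not the authors' implementation) of the two operations such a module rests on: the exponential map at the origin of the Poincaré ball, which projects Euclidean word features into hyperbolic space, and the geodesic distance used to relate dependency-linked words. The feature dimensions and the toy dependency graph are hypothetical.

```python
# Minimal sketch: embedding words of a syntactic dependency graph into the
# unit Poincare ball (curvature c = 1). Illustrative only; all names, shapes,
# and the toy sentence are assumptions, not the paper's actual code.
import torch

def expmap0(v: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Exponential map at the origin: sends Euclidean vectors into the ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm  # tanh(||v||) < 1 keeps points inside

def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Geodesic distance between two points strictly inside the unit ball."""
    sq = ((x - y) ** 2).sum(dim=-1)
    denom = (1 - (x ** 2).sum(dim=-1)) * (1 - (y ** 2).sum(dim=-1))
    return torch.acosh(1 + 2 * sq / denom.clamp_min(eps))

# Toy dependency edges (head, dependent) for "person raises left arm":
# raises -> person, raises -> arm, arm -> left
edges = [(1, 0), (1, 3), (3, 2)]
feats = torch.randn(4, 64) * 0.1   # hypothetical Euclidean word features
ball = expmap0(feats)              # one hyperbolic embedding per word
for head, dep in edges:
    print(head, dep, poincare_dist(ball[head], ball[dep]).item())
```

Distances in the ball grow rapidly toward the boundary, which is what makes hyperbolic space well suited to tree-like structures such as dependency graphs: near-root words can sit close to the origin while leaves spread toward the boundary without crowding.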
Data Availability
This work uses two publicly available datasets, HumanML3D and KIT-ML. They can be obtained at https://github.com/EricGuo5513/HumanML3D and https://drive.google.com/drive/folders/1MnixfyGfujSP-4t8w_2QvjtTVpEKr97t, respectively.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Project Number: 62272019).
Additional information
Communicated by Svetlana Lazebnik.
About this article
Cite this article
Wang, Y., Li, M., Liu, J. et al. Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation. Int J Comput Vis 133, 4277–4293 (2025). https://doi.org/10.1007/s11263-025-02392-9