
CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15087)

Included in the following conference series:

  • ECCV: European Conference on Computer Vision

Abstract

Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as “left knee slightly bent”. Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities. Project page: https://yh2371.github.io/como/.
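
In outline, the abstract describes a three-stage pipeline: encode motion as interpretable per-body-part pose codes, generate code sequences autoregressively from text, and let an LLM edit the codes before decoding to 3D motion. The following minimal Python sketch illustrates only that data flow; every name and interface below is a hypothetical placeholder, not the authors' implementation (CoMo itself uses learned networks at each stage).

# Hypothetical sketch of the data flow described in the abstract.
# All names are illustrative placeholders, not the authors' released code.
from dataclasses import dataclass
from typing import List

@dataclass
class PoseCode:
    """One discrete, human-readable code for one body part in one frame,
    e.g. body_part="left leg", state="knee slightly bent"."""
    body_part: str
    state: str

def generate_pose_codes(text: str) -> List[List[PoseCode]]:
    """Stand-in for the autoregressive text-to-pose-code generator:
    returns one list of body-part codes per motion frame."""
    return [[PoseCode("left leg", "knee slightly bent")]]  # stub output

def edit_pose_codes(codes: List[List[PoseCode]],
                    instruction: str) -> List[List[PoseCode]]:
    """Stand-in for the LLM editing step: because pose codes are
    interpretable text, an LLM can rewrite them per the instruction."""
    for frame in codes:
        for code in frame:
            if code.body_part in instruction:
                code.state = "knee fully bent"  # stub edit
    return codes

def decode_to_motion(codes: List[List[PoseCode]]) -> List[dict]:
    """Stand-in for the decoder that maps pose codes back to 3D poses."""
    return [{"frame": i, "pose": [c.state for c in frame]}
            for i, frame in enumerate(codes)]

codes = generate_pose_codes("a person walks forward")
codes = edit_pose_codes(codes, "bend the left leg further")
motion = decode_to_motion(codes)
print(motion)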


Notes

  1. Following [40], we set \(\lambda \) to 0.5.

  2. Following the method for processing image patches in Vision Transformers [7], a linear layer projects the K-hot vectors before they are input into the transformer architecture (see the sketch after this list).

  3. The 10 body parts are the head, torso, left arm, right arm, left hand, right hand, left leg, right leg, left foot, and right foot.

  4. The complete prompts we use are available in the Appendix.

  5. Details of metric calculations are provided in the Appendix.

  6. The definitions of all pose codes are listed in the Appendix.

  7. Prompts and updated descriptions are available in the Appendix.
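
As a concrete illustration of notes 2 and 3, here is a minimal PyTorch sketch of the K-hot projection, assuming a pose-code vocabulary of size V with K = 10 active codes per frame (one per body part). The vocabulary size, embedding width, layer count, and random code indices are illustrative assumptions, not values from the paper.

# Minimal sketch of the K-hot projection (notes 2 and 3); dimensions
# and code ids below are illustrative, not the paper's values.
import torch
import torch.nn as nn

V, K, d_model, T = 512, 10, 256, 64  # vocab size, hot codes, embed dim, frames

# One K-hot vector per frame: K entries of the V-dim vector set to 1,
# one per body part. Random indices stand in for real pose-code ids.
khot = torch.zeros(T, V)
active = torch.randint(0, V, (T, K))
khot.scatter_(1, active, 1.0)

# As in ViT patch embedding [7], a single linear layer maps each K-hot
# vector to a transformer token before the usual encoder stack.
to_token = nn.Linear(V, d_model)
tokens = to_token(khot).unsqueeze(0)  # (1, T, d_model)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)  # (1, T, d_model)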

References

  1. Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: TEACH: temporal action compositions for 3D humans. In: International Conference on 3D Vision (3DV) (2022)

  2. Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: SINC: spatial composition of 3D human motions for simultaneous action generation. In: ICCV (2023)

  3. Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18000–18010 (2023)

  4. Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: MoFusion: a framework for denoising-diffusion-based motion synthesis. In: Computer Vision and Pattern Recognition (CVPR) (2023)

  5. Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: PoseScript: 3D human poses from natural language (2022)

  6. Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G.: PoseFix: correcting 3D human poses with natural language (2023)

  7. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

  8. Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: CASE: learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia 2023 Conference Papers, pp. 1–11 (2023)

  9. Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative motion editing with natural language (2023)

  10. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5152–5161 (2022)

  11. Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: European Conference on Computer Vision, pp. 580–597. Springer (2022)

  12. Guo, C., et al.: Action2Motion: conditioned generation of 3D human motions. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2021–2029 (2020)

  13. Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. Association for Computing Machinery, New York, NY, USA, 1st edn. (2023). https://doi.org/10.1145/3596711.3596789

  14. Huang, S., et al.: Diffusion-based generation, optimization, and planning in 3D scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  15. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: human motion as a foreign language. Advances in Neural Information Processing Systems 36 (2024)

  16. Jin, P., Wu, Y., Fan, Y., Sun, Z., Wei, Y., Yuan, L.: Act as you wish: fine-grained control of motion diffusion model with hierarchical semantic graphs. In: NeurIPS (2023)

  17. Kalakonda, S.S., Maheshwari, S., Sarvadevabhatla, R.K.: Action-GPT: leveraging large-scale language models for improved and generalized action generation (2023)

  18. Kim, J., Kim, J., Choi, S.: FLAME: free-form language-based motion synthesis & editing. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI'23). AAAI Press (2023). https://doi.org/10.1609/aaai.v37i7.25996

  19. CMU Graphics Lab: CMU graphics lab motion capture database (2004)

  20. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (2015)

  21. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: International Conference on Computer Vision, pp. 5442–5451 (2019)

  22. Mandery, C., Terlemez, O., Do, M., Vahrenkamp, N., Asfour, T.: Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Trans. Rob. 32(4), 796–809 (2016)

  23. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. CoRR abs/1711.00937 (2017). http://arxiv.org/abs/1711.00937

  24. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  25. Petrovich, M., Black, M.J., Varol, G.: TEMOS: generating diverse human motions from textual descriptions. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 480–497. Springer, Heidelberg (2022)

  26. Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data 4(4), 236–252 (2016)

  27. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  28. Ren, J., Yu, C., Chen, S., Ma, X., Pan, L., Liu, Z.: DiffMimic: efficient motion mimicking with differentiable physics. In: ICLR (2023)

  29. Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior (2023)

  30. Shi, X., Luo, C., Peng, J., Zhang, H., Sun, Y.: Generating fine-grained human motions using ChatGPT-refined descriptions (2023)

  31. Siyao, L., et al.: Bailando: 3D dance generation via actor-critic GPT with choreographic memory. In: CVPR (2022)

  32. Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: MotionCLIP: exposing human motion generation to CLIP space. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 358–374. Springer (2022)

  33. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

  34. Tseng, J., Castellon, R., Liu, K.: EDGE: editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 448–458 (2023)

  35. Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: TLControl: trajectory and language control for human motion synthesis (2023)

  36. Wan, W., et al.: DiffusionPhase: motion diffusion in frequency domain. arXiv preprint arXiv:2312.04036 (2023)

  37. Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: control any joint at any time for human motion generation (2023)

  38. Yi, H., et al.: Generating holistic 3D human motion from speech. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 469–480 (2023)

  39. Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: PhysDiff: physics-guided human motion diffusion model. arXiv preprint arXiv:2212.02500 (2022)

  40. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: generating human motion from textual descriptions with discrete representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  41. Zhang, M., et al.: MotionDiffuse: text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)

  42. Zhang, M., et al.: ReMoDiffuse: retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116 (2023)

  43. Zhang, M., Li, H., Cai, Z., Ren, J., Yang, L., Liu, Z.: FineMoGen: fine-grained spatio-temporal motion generation and editing. In: NeurIPS (2023)

  44. Zhang, X., Bhatnagar, B.L., Starke, S., Guzov, V., Pons-Moll, G.: COUCH: towards controllable human-chair interactions. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pp. 518–535. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-031-20065-6_30

  45. Zhang, Y., et al.: MotionGPT: finetuned LLMs are general-purpose motion generators (2023)

  46. Zhao, K., Zhang, Y., Wang, S., Beeler, T., Tang, S.: Synthesizing diverse human motions in 3D indoor scenes. In: International Conference on Computer Vision (ICCV) (2023)

  47. Zhou, W., et al.: EMDM: efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)

  48. Zhou, Z., Wang, B.: UDE: a unified driving engine for human motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5641 (2023)

  49. Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., Yu, L.: Taming diffusion models for audio-driven co-speech gesture generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10544–10553 (2023)


Acknowledgment

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Author information

Corresponding author

Correspondence to Yiming Huang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 91654 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., Liu, L. (2025). CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73397-0_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73396-3

  • Online ISBN: 978-3-031-73397-0

  • eBook Packages: Computer Science; Computer Science (R0)

