Abstract
Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as “left knee slightly bent”. Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models, while in human studies it substantially surpasses previous work in motion editing. Project page: https://yh2371.github.io/como/.
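To make the described pipeline concrete, below is a minimal sketch in PyTorch-style Python. The module names, vocabulary sizes, and shapes (e.g., PoseCodeGenerator, CODES_PER_PART, build_edit_prompt) are illustrative assumptions, not the authors' implementation; the actual architecture, pose-code vocabulary, and decoding details follow the paper.

```python
# Hypothetical sketch of a CoMo-style pipeline: text -> discrete pose codes -> motion,
# with editing done by letting an LLM rewrite the interpretable pose codes.
import torch
import torch.nn as nn

NUM_BODY_PARTS = 10      # head, torso, arms, hands, legs, feet
CODES_PER_PART = 32      # assumed size of each body part's pose-code vocabulary
VOCAB = NUM_BODY_PARTS * CODES_PER_PART
TEXT_DIM, HIDDEN = 512, 512


class PoseCodeGenerator(nn.Module):
    """Autoregressive transformer that predicts the next discrete pose code
    conditioned on a text embedding and previously generated codes."""

    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(TEXT_DIM, HIDDEN)
        self.code_embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.next_code = nn.Linear(HIDDEN, VOCAB)

    def forward(self, text_emb, prev_codes):
        # text_emb: (B, TEXT_DIM) sentence embedding; prev_codes: (B, T) code indices
        tokens = torch.cat(
            [self.text_proj(text_emb).unsqueeze(1), self.code_embed(prev_codes)], dim=1
        )
        hidden = self.backbone(tokens)           # (B, T + 1, HIDDEN)
        return self.next_code(hidden[:, -1])     # logits over the pose-code vocabulary


def build_edit_prompt(pose_code_names, instruction):
    """Because pose codes are interpretable (e.g., 'left knee slightly bent'),
    editing can be phrased as asking an LLM to rewrite this list of codes."""
    return (
        "Current pose codes:\n" + "\n".join(pose_code_names)
        + f"\nInstruction: {instruction}\nReturn the updated pose codes."
    )
```

In such a setup, inference would sample one pose code at a time and hand the completed sequence to a separate motion decoder (omitted here) that reconstructs the 3D motion; editing would amount to having an LLM rewrite the interpretable code list and re-decoding.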
Notes
1. Following [40], we set \(\lambda\) to 0.5.
2. Following the method for processing image patches in Vision Transformers [7], a linear layer projects the K-hot vectors before they are input into the transformer architecture (see the sketch after these notes).
3. The 10 body parts are head, torso, left arm, right arm, left hand, right hand, left leg, right leg, left foot, right foot.
4. The complete prompts we use are available in the Appendix.
5. Details of metric calculations are provided in the Appendix.
6. The definitions of all pose codes are listed in the Appendix.
7. Prompts and updated descriptions are available in the Appendix.
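As a rough illustration of note 2 (the vocabulary size, hidden width, and active code indices below are assumptions rather than values from the paper), a single frame's K-hot pose-code vector can be mapped to a transformer token with one linear layer, mirroring ViT patch embedding:

```python
import torch
import torch.nn as nn

VOCAB = 320    # total number of pose codes across all body parts (assumed)
HIDDEN = 512   # transformer width (assumed)

# One active code per body part -> a K-hot vector with K = 10 ones.
k_hot = torch.zeros(1, VOCAB)
k_hot[0, [3, 45, 78, 120, 150, 181, 210, 240, 270, 300]] = 1.0

to_token = nn.Linear(VOCAB, HIDDEN)   # the projection mentioned in note 2
frame_token = to_token(k_hot)         # (1, HIDDEN) token fed to the transformer
```

The same projection would be applied to every frame, turning a motion into a sequence of such tokens.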
References
Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: Teach: temporal action compositions for 3d humans. In: International Conference on 3D Vision (3DV), September 2022
Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: SINC: spatial composition of 3D human motions for simultaneous action generation. In: ICCV (2023)
Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18000–18010 (2023)
Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: Mofusion: framework for denoising-diffusion-based motion synthesis. In: Computer Vision and Pattern Recognition (CVPR) (2023)
Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: Posescript: 3d human poses from natural language (2022)
Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G.: Posefix: correcting 3d human poses with natural language (2023)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy
Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: Case: learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia 2023 Conference Papers, pp. 1–11 (2023)
Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative motion editing with natural language (2023)
Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5152–5161, June 2022
Guo, C., Zuo, X., Wang, S., Cheng, L.: Tm2t: stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In: European Conference on Computer Vision, pp. 580–597. Springer (2022)
Guo, C., et al.: Action2motion: conditioned generation of 3d human motions. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2021–2029 (2020)
Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. Association for Computing Machinery, New York, NY, USA, 1 edn. (2023). https://doi.org/10.1145/3596711.3596789
Huang, S., et al.: Diffusion-based generation, optimization, and planning in 3d scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: human motion as a foreign language. Advances in Neural Information Processing Systems 36 (2024)
Jin, P., Wu, Y., Fan, Y., Sun, Z., Wei, Y., Yuan, L.: Act as you wish: fine-grained control of motion diffusion model with hierarchical semantic graphs. In: NeurIPS (2023)
Kalakonda, S.S., Maheshwari, S., Sarvadevabhatla, R.K.: Action-gpt: leveraging large-scale language models for improved and generalized action generation (2023)
Kim, J., Kim, J., Choi, S.: Flame: free-form language-based motion synthesis and editing. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence. AAAI’23/IAAI’23/EAAI’23, AAAI Press (2023). https://doi.org/10.1609/aaai.v37i7.25996
CMU Graphics Lab: CMU graphics lab motion capture database (2004)
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: International Conference on Computer Vision, pp. 5442–5451 (Oct 2019)
Mandery, C., Terlemez, O., Do, M., Vahrenkamp, N., Asfour, T.: Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Trans. Rob. 32(4), 796–809 (2016)
van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. CoRR abs/1711.00937 (2017). http://arxiv.org/abs/1711.00937
OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Petrovich, M., Black, M.J., Varol, G.: Temos: generating diverse human motions from textual descriptions. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 480–497. Springer, Heidelberg (2022)
Plappert, M., Mandery, C., Asfour, T.: The kit motion-language dataset. Big Data 4(4), 236–252 (2016)
Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)
Ren, J., Yu, C., Chen, S., Ma, X., Pan, L., Liu, Z.: Diffmimic: efficient motion mimicking with differentiable physics. In: ICLR (2023)
Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior (2023)
Shi, X., Luo, C., Peng, J., Zhang, H., Sun, Y.: Generating fine-grained human motions using chatgpt-refined descriptions (2023)
Siyao, L., et al.: Bailando: 3d dance generation via actor-critic gpt with choreographic memory. In: CVPR (2022)
Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: Motionclip: exposing human motion generation to clip space. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 358–374. Springer (2022)
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
Tseng, J., Castellon, R., Liu, K.: Edge: editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 448–458 (2023)
Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Tlcontrol: trajectory and language control for human motion synthesis (2023)
Wan, W., et al.: Diffusionphase: motion diffusion in frequency domain. arXiv preprint arXiv:2312.04036 (2023)
Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: control any joint at any time for human motion generation (2023)
Yi, H., et al.: Generating holistic 3d human motion from speech. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 469–480 (June 2023)
Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: Physdiff: physics-guided human motion diffusion model. arXiv preprint arXiv:2212.02500 (2022)
Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2m-gpt: generating human motion from textual descriptions with discrete representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Zhang, M., et al.: Motiondiffuse: text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
Zhang, M., et al.: Remodiffuse: retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116 (2023)
Zhang, M., Li, H., Cai, Z., Ren, J., Yang, L., Liu, Z.: Finemogen: fine-grained spatio-temporal motion generation and editing. In: NeurIPS (2023)
Zhang, X., Bhatnagar, B.L., Starke, S., Guzov, V., Pons-Moll, G.: Couch: towards controllable human-chair interactions. In: Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part V, pp. 518–535. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-031-20065-6_30
Zhang, Y., et al.: Motiongpt: finetuned llms are general-purpose motion generators (2023)
Zhao, K., Zhang, Y., Wang, S., Beeler, T., Tang, S.: Synthesizing diverse human motions in 3d indoor scenes. In: International Conference on Computer Vision (ICCV) (2023)
Zhou, W., et al.: Emdm: efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)
Zhou, Z., Wang, B.: Ude: a unified driving engine for human motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5641 (June 2023)
Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., Yu, L.: Taming diffusion models for audio-driven co-speech gesture generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10544–10553 (2023)
Acknowledgment
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., Liu, L. (2025). CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_11
DOI: https://doi.org/10.1007/978-3-031-73397-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73396-3
Online ISBN: 978-3-031-73397-0
eBook Packages: Computer Science, Computer Science (R0)