
CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15087)

Included in the following conference series:

  • ECCV: European Conference on Computer Vision

Abstract

Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as “left knee slightly bent”. Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities. Project page: https://yh2371.github.io/como/.
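
In outline, the abstract describes a three-stage pipeline: encode motion as interpretable per-body-part pose codes, generate code sequences autoregressively from text, and let an LLM edit the codes before decoding to 3D motion. The following minimal Python sketch illustrates only that data flow; every name and interface below is a hypothetical placeholder, not the authors' implementation (CoMo itself uses learned networks at each stage).

# Hypothetical sketch of the data flow described in the abstract.
# All names are illustrative placeholders, not the authors' released code.
from dataclasses import dataclass
from typing import List

@dataclass
class PoseCode:
    """One discrete, human-readable code for one body part in one frame,
    e.g. body_part="left leg", state="knee slightly bent"."""
    body_part: str
    state: str

def generate_pose_codes(text: str) -> List[List[PoseCode]]:
    """Stand-in for the autoregressive text-to-pose-code generator:
    returns one list of body-part codes per motion frame."""
    return [[PoseCode("left leg", "knee slightly bent")]]  # stub output

def edit_pose_codes(codes: List[List[PoseCode]],
                    instruction: str) -> List[List[PoseCode]]:
    """Stand-in for the LLM editing step: because pose codes are
    interpretable text, an LLM can rewrite them per the instruction."""
    for frame in codes:
        for code in frame:
            if code.body_part in instruction:
                code.state = "knee fully bent"  # stub edit
    return codes

def decode_to_motion(codes: List[List[PoseCode]]) -> List[dict]:
    """Stand-in for the decoder that maps pose codes back to 3D poses."""
    return [{"frame": i, "pose": [c.state for c in frame]}
            for i, frame in enumerate(codes)]

codes = generate_pose_codes("a person walks forward")
codes = edit_pose_codes(codes, "bend the left leg further")
motion = decode_to_motion(codes)
print(motion)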


Notes

  1. Following [40], we set \(\lambda \) to 0.5.

  2. Following the method for processing image patches in Vision Transformers [7], a linear layer projects the K-hot vectors before they are input into the transformer architecture (see the sketch after this list).

  3. The 10 body parts are the head, torso, left arm, right arm, left hand, right hand, left leg, right leg, left foot, and right foot.

  4. The complete prompts we use are available in the Appendix.

  5. Details of metric calculations are provided in the Appendix.

  6. The definitions of all pose codes are listed in the Appendix.

  7. Prompts and updated descriptions are available in the Appendix.
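
As a concrete illustration of notes 2 and 3, here is a minimal PyTorch sketch of the K-hot projection, assuming a pose-code vocabulary of size V with K = 10 active codes per frame (one per body part). The vocabulary size, embedding width, layer count, and random code indices are illustrative assumptions, not values from the paper.

# Minimal sketch of the K-hot projection (notes 2 and 3); dimensions
# and code ids below are illustrative, not the paper's values.
import torch
import torch.nn as nn

V, K, d_model, T = 512, 10, 256, 64  # vocab size, hot codes, embed dim, frames

# One K-hot vector per frame: K entries of the V-dim vector set to 1,
# one per body part. Random indices stand in for real pose-code ids.
khot = torch.zeros(T, V)
active = torch.randint(0, V, (T, K))
khot.scatter_(1, active, 1.0)

# As in ViT patch embedding [7], a single linear layer maps each K-hot
# vector to a transformer token before the usual encoder stack.
to_token = nn.Linear(V, d_model)
tokens = to_token(khot).unsqueeze(0)  # (1, T, d_model)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)  # (1, T, d_model)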

References

  1. Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: TEACH: temporal action compositions for 3D humans. In: International Conference on 3D Vision (3DV) (2022)

  2. Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: SINC: spatial composition of 3D human motions for simultaneous action generation. In: ICCV (2023)

  3. Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18000–18010 (2023)

  4. Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: MoFusion: a framework for denoising-diffusion-based motion synthesis. In: Computer Vision and Pattern Recognition (CVPR) (2023)

  5. Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: PoseScript: 3D human poses from natural language (2022)

  6. Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G.: PoseFix: correcting 3D human poses with natural language (2023)

  7. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

  8. Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: CASE: learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia 2023 Conference Papers, pp. 1–11 (2023)

  9. Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative motion editing with natural language (2023)

  10. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5152–5161 (2022)

  11. Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: European Conference on Computer Vision, pp. 580–597. Springer (2022)

  12. Guo, C., et al.: Action2Motion: conditioned generation of 3D human motions. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2021–2029 (2020)

  13. Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. Association for Computing Machinery, New York, NY, USA, 1st edn. (2023). https://doi.org/10.1145/3596711.3596789

  14. Huang, S., et al.: Diffusion-based generation, optimization, and planning in 3D scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  15. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: human motion as a foreign language. Advances in Neural Information Processing Systems 36 (2024)

  16. Jin, P., Wu, Y., Fan, Y., Sun, Z., Wei, Y., Yuan, L.: Act as you wish: fine-grained control of motion diffusion model with hierarchical semantic graphs. In: NeurIPS (2023)

  17. Kalakonda, S.S., Maheshwari, S., Sarvadevabhatla, R.K.: Action-GPT: leveraging large-scale language models for improved and generalized action generation (2023)

  18. Kim, J., Kim, J., Choi, S.: FLAME: free-form language-based motion synthesis & editing. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI'23). AAAI Press (2023). https://doi.org/10.1609/aaai.v37i7.25996

  19. CMU Graphics Lab: CMU graphics lab motion capture database (2004)

  20. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (2015)

  21. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: International Conference on Computer Vision, pp. 5442–5451 (2019)

  22. Mandery, C., Terlemez, O., Do, M., Vahrenkamp, N., Asfour, T.: Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Trans. Rob. 32(4), 796–809 (2016)

  23. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. CoRR abs/1711.00937 (2017). http://arxiv.org/abs/1711.00937

  24. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  25. Petrovich, M., Black, M.J., Varol, G.: TEMOS: generating diverse human motions from textual descriptions. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 480–497. Springer, Heidelberg (2022)

  26. Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data 4(4), 236–252 (2016)

  27. Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)

  28. Ren, J., Yu, C., Chen, S., Ma, X., Pan, L., Liu, Z.: DiffMimic: efficient motion mimicking with differentiable physics. In: ICLR (2023)

  29. Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior (2023)

  30. Shi, X., Luo, C., Peng, J., Zhang, H., Sun, Y.: Generating fine-grained human motions using ChatGPT-refined descriptions (2023)

  31. Siyao, L., et al.: Bailando: 3D dance generation via actor-critic GPT with choreographic memory. In: CVPR (2022)

  32. Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: MotionCLIP: exposing human motion generation to CLIP space. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 358–374. Springer (2022)

  33. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

  34. Tseng, J., Castellon, R., Liu, K.: EDGE: editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 448–458 (2023)

  35. Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: TLControl: trajectory and language control for human motion synthesis (2023)

  36. Wan, W., et al.: DiffusionPhase: motion diffusion in frequency domain. arXiv preprint arXiv:2312.04036 (2023)

  37. Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: control any joint at any time for human motion generation (2023)

  38. Yi, H., et al.: Generating holistic 3D human motion from speech. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 469–480 (2023)

  39. Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: PhysDiff: physics-guided human motion diffusion model. arXiv preprint arXiv:2212.02500 (2022)

  40. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: generating human motion from textual descriptions with discrete representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  41. Zhang, M., et al.: MotionDiffuse: text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)

  42. Zhang, M., et al.: ReMoDiffuse: retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116 (2023)

  43. Zhang, M., Li, H., Cai, Z., Ren, J., Yang, L., Liu, Z.: FineMoGen: fine-grained spatio-temporal motion generation and editing. In: NeurIPS (2023)

  44. Zhang, X., Bhatnagar, B.L., Starke, S., Guzov, V., Pons-Moll, G.: COUCH: towards controllable human-chair interactions. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pp. 518–535. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-031-20065-6_30

  45. Zhang, Y., et al.: MotionGPT: finetuned LLMs are general-purpose motion generators (2023)

  46. Zhao, K., Zhang, Y., Wang, S., Beeler, T., Tang, S.: Synthesizing diverse human motions in 3D indoor scenes. In: International Conference on Computer Vision (ICCV) (2023)

  47. Zhou, W., et al.: EMDM: efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)

  48. Zhou, Z., Wang, B.: UDE: a unified driving engine for human motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5641 (2023)

  49. Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., Yu, L.: Taming diffusion models for audio-driven co-speech gesture generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10544–10553 (2023)


Acknowledgment

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Author information

Corresponding author

Correspondence to Yiming Huang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 91654 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., Liu, L. (2025). CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73397-0_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73396-3

  • Online ISBN: 978-3-031-73397-0

  • eBook Packages: Computer Science; Computer Science (R0)

