
Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication

International Journal of Computer Vision

Abstract

In this paper, we introduce a new task focused on human communication: generating 3D holistic human motions for both speakers and listeners. Central to our approach is a factorization that decouples audio features and combines them with textual semantic information, which facilitates the creation of more realistic and coordinated movements. We train separate VQ-VAEs for the holistic motions of the speaker and the listener. To account for the real-time mutual influence between the speaker and the listener, we propose a novel chain-like, transformer-based auto-regressive model that characterizes real-world communication scenarios and generates the motions of the speaker and the listener simultaneously. These designs ensure that the generated results are both coordinated and diverse. Our approach achieves state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the HoCo holistic communication dataset, a valuable resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.
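
To make the approach concrete, the following is a minimal sketch, based only on the description above, of the chain-like auto-regressive idea: speaker and listener motions are assumed to be quantized into discrete codes by two separate VQ-VAEs, and a causal transformer predicts the interleaved speaker/listener token streams so that each agent's next motion token is conditioned on both agents' history. All class names, vocabulary sizes, and dimensions are illustrative assumptions, and the audio and text conditioning described in the paper is omitted.

```python
# Minimal sketch (not the authors' implementation): speaker and listener motions are
# assumed to be quantized into discrete codes by two separate VQ-VAEs; a causal
# transformer then predicts the interleaved token stream [spk_0, lst_0, spk_1, ...],
# so each agent's next token is conditioned on both agents' history. Audio/text
# conditioning is omitted; all sizes and names are placeholders.
import torch
import torch.nn as nn

class ChainedDyadicPrior(nn.Module):
    def __init__(self, n_speaker_codes=512, n_listener_codes=512,
                 d_model=256, n_heads=4, n_layers=4, max_len=1024):
        super().__init__()
        self.vocab = n_speaker_codes + n_listener_codes   # shared index space
        self.tok_emb = nn.Embedding(self.vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, self.vocab)

    def forward(self, tokens):                            # tokens: (B, T), interleaved
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), 1)
        h = self.backbone(x, mask=causal)                 # each step sees only the past
        return self.head(h)                               # logits over both codebooks

# usage: given the dyadic history, predict the next (speaker or listener) motion code
model = ChainedDyadicPrior()
history = torch.randint(0, model.vocab, (1, 10))
next_logits = model(history)[:, -1]
next_code = next_logits.argmax(-1)
```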


Data availability

We commit to releasing the dataset and code upon acceptance. The datasets generated and analyzed during the current study will be available in our open-source repository.

Notes

  1. https://www.youtube.com.

  2. https://pypi.org/project/moviepy/.

  3. https://github.com/openai/whisper.


Acknowledgements

This work was supported by the National Natural Science Foundation of China under contract No. 62171256 and Meituan.

Author information


Corresponding author

Correspondence to Ruqi Huang.

Additional information

Communicated by Shengfeng He.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 58654 KB)

Appendix

1.1 Emotion Editing via Our Audio Decoupling

To verify the effectiveness of our audio decoupling method, we design the editing experiment shown in Fig. 9. First, we select two audio segments from the MEAD dataset, conveying happy and sad emotions respectively. The texts are "The revolution now underway in materials handling makes it much easier" and "No, the man was not drunk, he wondered how he got up tied up with a stranger". The generated motions are shown in the top-left and bottom-left of Fig. 9. Note the opposite arm-movement directions in the two left panels, reflecting the difference in emotion.

Thanks to our audio decoupling design, our generation framework allows for emotion-based editing. We first compute the style features \(F^A_{happy}\) and \(F^A_{sad}\) of the two input audio clips and then swap them between the clips. Namely, following Eq. 1, we construct:

$$\begin{aligned} F^m_{happy2sad} = [F^A_{sad}; F^W_{happy}], \quad F^f_{happy2sad} = [F^A_{sad}; F^P_{happy}], \end{aligned} \tag{8}$$

as the input of our trained model. Similarly, we can edit the sad audio clip to a happy one by reversing the above construction.
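
For concreteness, below is a minimal sketch of this feature swap, assuming the decoupled features are plain vectors and \([\cdot;\cdot]\) denotes concatenation. The feature dimensions, and the exact content of \(F^W\) and \(F^P\) (defined by Eq. 1, which is not reproduced in this preview), are placeholders rather than the authors' implementation.

```python
import numpy as np

# Placeholder decoupled features for the two MEAD clips. F_A is the audio style
# (emotion) feature; F_W and F_P are the remaining content features from Eq. 1,
# treated here as opaque vectors. Dimensions are arbitrary for illustration.
rng = np.random.default_rng(0)
F_A_happy, F_W_happy, F_P_happy = rng.normal(size=64), rng.normal(size=128), rng.normal(size=128)
F_A_sad,   F_W_sad,   F_P_sad   = rng.normal(size=64), rng.normal(size=128), rng.normal(size=128)

def build_edited_inputs(style_from_other, F_W_own, F_P_own):
    """Pair one clip's style feature with the other clip's untouched content,
    mirroring Eq. 8: F^m = [F^A; F^W], F^f = [F^A; F^P]."""
    F_m = np.concatenate([style_from_other, F_W_own])
    F_f = np.concatenate([style_from_other, F_P_own])
    return F_m, F_f

# happy -> sad: keep the happy clip's content, inject the sad clip's style
F_m_h2s, F_f_h2s = build_edited_inputs(F_A_sad, F_W_happy, F_P_happy)
# sad -> happy: reverse the roles
F_m_s2h, F_f_s2h = build_edited_inputs(F_A_happy, F_W_sad, F_P_sad)
```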

We remark that, in the above editing, the audio texts are untouched. Top-right of Fig. 9 shows the editing result of making the happy audio sad, and the bottom-right of Fig. 9 shows the opposite. Again, we check the arm movement directions, which confirms the editing effect.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sun, M., Xu, C., Jiang, X. et al. Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication. Int J Comput Vis 133, 2910–2926 (2025). https://doi.org/10.1007/s11263-024-02300-7
