Abstract
Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, models optimized to distinguish individual self-augmented samples struggle with the numerous similar positive instances that arise when action categories are limited. In this work, we tackle these problems by introducing a general inter- and intra-modal mutual distillation (\(\hbox {I}^2\)MD) framework. In \(\hbox {I}^2\)MD, we first re-formulate the cross-modal interaction as a cross-modal mutual distillation (CMD) process. Unlike existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the intra-modal mutual distillation (IMD) strategy. In IMD, we first introduce the dynamic neighbors aggregation (DNA) mechanism, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly-correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.
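As a rough illustration of the cross-modal mutual distillation idea described above, the sketch below aligns the neighbor-similarity distributions of two modalities with a bidirectional KL term, each modality in turn acting as teacher for the other. This is a minimal pure-Python sketch under stated assumptions: the similarity scores to a shared set of neighbor anchors are assumed precomputed per modality, and the function names and temperature values are illustrative, not the paper's actual implementation.

```python
import math

def softmax(sims, temperature):
    """Turn similarity scores into a probability distribution over neighbors."""
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cmd_loss(sims_a, sims_b, t_teacher=0.05, t_student=0.1):
    """Bidirectional distillation between two modalities' neighbor-similarity
    distributions: modality A teaches B and B teaches A. A sharper (lower)
    temperature is used on the teacher side; in a real training loop,
    gradients would flow only into the student side of each KL term."""
    p_a_teacher = softmax(sims_a, t_teacher)
    p_a_student = softmax(sims_a, t_student)
    p_b_teacher = softmax(sims_b, t_teacher)
    p_b_student = softmax(sims_b, t_student)
    return kl(p_a_teacher, p_b_student) + kl(p_b_teacher, p_a_student)
```

Because both KL terms are non-negative, the loss is zero only when the two modalities rank their neighbors identically under the respective temperatures; otherwise each modality is pulled toward the other's view of the feature neighborhood.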
Data Availability
The NTU-RGB+D dataset (Shahroudy et al., 2016; Liu et al., 2020a) and the PKU-MMD (Chunhui et al., 2017) dataset used in this study are well-recognized public benchmarks in skeleton-based action recognition. The code for data processing has been made publicly available at https://github.com/maoyunyao/CMD.
Change history
07 April 2025
The repeated text in all tables has been removed.
References
Abbasi Koohpayegani, S., Tejankar, A., & Pirsiavash, H. (2020). CompRess: Self-supervised learning by compressing representations. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 33, 12980–12992.
Avola, D., Cascio, M., Cinque, L., Foresti, G. L., Massaroni, C., & Rodolà, E. (2020). 2-D skeleton-based action recognition via two-branch stacked LSTM-RNNs. IEEE Transactions on Multimedia (TMM), 22(10), 2481–2496. https://doi.org/10.1109/TMM.2019.2960588
Ballard, D. H. (1987). Modular learning in neural networks. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 647, 279–284.
Caetano, C., Bremond, F., & Schwartz, W. (2019). Skeleton image representation for 3D action recognition based on tree structure and reference joints. In: SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp 16–23.
Cao, Z., Hidalgo, G., Simon, T., Wei, S., & Sheikh, Y. (2021). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1), 172–186.
Chen, X., Fan, H., Girshick, R., & He, K. (2020b). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML), pp 1597–1607.
Chen, X., Xie, S., & He, K. (2021a). An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 9640–9649.
Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., & Hu, W. (2021b). Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13359–13368.
Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 183–192.
Chunhui, L., Yueyu, H., Yanghao, L., Sijie, S., & Jiaying, L. (2017). PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 248–255.
Deng, J., Yang, Z., Liu, D., Chen, T., Zhou, W., Zhang, Y., Li, H., & Ouyang, W. (2022). TransVG++: End-to-end visual grounding with language conditioned vision transformer. arXiv preprint arXiv:2206.06619
Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803
Du, Y., Wang, W., & Wang, L. (2015). Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1110–1118.
Fang, Z., Wang, J., Wang, L., Zhang, L., Yang, Y., & Liu, Z. (2021). SEED: Self-supervised distillation for visual representation. In: Proceedings of the International Conference on Learning Representations (ICLR).
Fang, H.S., Xie, S., Tai, Y.W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 2334–2343.
Franco, L., Mandica, P., Munjal, B., & Galasso, F. (2023). Hyperbolic self-paced learning for self-supervised skeleton-based action representations. arXiv preprint arXiv:2303.06242
Gao, X., Yang, Y., Zhang, Y., Li, M., Yu, J. G., & Du, S. (2023). Efficient spatio-temporal contrastive learning for skeleton-based 3D action recognition. IEEE Transactions on Multimedia (TMM), 25, 405–417. https://doi.org/10.1109/TMM.2021.3127040
Gupta, P., Thatipelli, A., Aggarwal, A., Maheshwari, S., Trivedi, N., Das, S., & Sarvadevabhatla, R. K. (2021). Quo vadis, skeleton action recognition? International Journal of Computer Vision (IJCV), 129(7), 2097–2112.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 16000–16009.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9729–9738.
Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456.
Ke, Q., Bennamoun, M., An, S., Sohel, F., & Boussaid, F. (2017). A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3288–3297.
Kim, B., Chang, H.J., Kim, J., & Choi, J.Y. (2022). Global-local motion transformer for unsupervised skeleton-based action learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 209–225.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 2556–2563.
Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., & Tian, Q. (2019). Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3595–3603.
Li, T., Ke, Q., Rahmani, H., Ho, R.E., Ding, H., & Liu, J. (2021c). Else-Net: Elastic semantic network for continual action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13434–13443.
Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., & Hoi, S. (2021a). Align before fuse: Vision and language representation learning with momentum distillation. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Li, L., Wang, M., Ni, B., Wang, H., Yang, J., & Zhang, W. (2021b). 3D human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4741–4750.
Liang, D., Fan, G., Lin, G., Chen, W., Pan, X., & Zhu, H. (2019). Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 934–940.
Lin, L., Song, S., Yang, W., & Liu, J. (2020). MS2L: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pp 2490–2498.
Lin, L., Zhang, J., & Liu, J. (2023). Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2363–2372
Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020b). Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 143–152.
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., & Tang, J. (2021). Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering (TKDE).
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L. Y., & Kot, A. C. (2020). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10), 2684–2701.
Mao, Y., Zhou, W., Lu, Z., Deng, J., & Li, H. (2022). CMD: Self-supervised 3D action representation learning with cross-modal mutual distillation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 734–752.
Misra, I., Zitnick, C.L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 527–544.
Nie, Q., Liu, Z., & Liu, Y. (2020). Unsupervised 3D human pose representation with viewpoint and pose disentanglement. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118.
Nie, Q., & Liu, Y. (2021). View transfer on human skeleton pose: Automatically disentangle the view-variant and view-invariant information for pose representation learning. International Journal of Computer Vision (IJCV), 129(1), 1–22.
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 69–84.
Ouyang, J., Wu, H., Wang, M., Zhou, W., & Li, H. (2021). Contextual similarity aggregation with self-attention for visual re-ranking. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Park, W., Kim, D., Lu, Y., & Cho, M. (2019). Relational knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3967–3976.
Passalis, N., & Tefas, A. (2018). Learning deep representations with probabilistic knowledge transfer. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 268–284.
Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y., Zhou, S., & Zhang, Z. (2019). Correlation congruence for knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 5007–5016.
Rao, H., Leung, C., & Miao, C. (2023). Hierarchical skeleton meta-prototype contrastive learning with hard skeleton mining for unsupervised person re-identification. International Journal of Computer Vision (IJCV), pp 1–23.
Rao, H., Xu, S., Hu, X., Cheng, J., & Hu, B. (2021). Augmented skeleton based contrastive action learning with momentum LSTM for unsupervised action recognition. Information Sciences, 569, 90–109.
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for thin deep nets. In: Proceedings of the International Conference on Learning Representations (ICLR).
Shah, A., Roy, A., Shah, K., Mishra, S., Jacobs, D., Cherian, A., & Chellappa, R. (2023). HaLP: Hallucinating latent positives for skeleton-based self-supervised learning of actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 18846–18856.
Shahroudy, A., Liu, J., Ng, T.T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1010–1019.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019a). Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7912–7921.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019b). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 12026–12035.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2021). AdaSGN: Adapting joint number and model size for efficient skeleton-based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13413–13422.
Si, C., Chen, W., Wang, W., Wang, L., & Tan, T. (2019). An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1227–1236.
Si, C., Nie, X., Wang, W., Wang, L., Tan, T., & Feng, J. (2020). Adversarial self-supervised learning for semi-supervised 3D action recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 35–51.
Soomro, K., Zamir, A.R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
Su, K., Liu, X., & Shlizerman, E. (2020). PREDICT & CLUSTER: Unsupervised skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9631–9640.
Tejankar, A., Koohpayegani, S.A., Pillai, V., Favaro, P., & Pirsiavash, H. (2021). ISD: Self-supervised learning by iterative similarity distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 9609–9618.
Thoker, F.M., Doughty, H., & Snoek, C.G. (2021). Skeleton-contrastive 3D action representation learning. In: Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pp 1655–1663.
Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., & Ding, R. (2022). Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Tung, F., & Mori, G. (2019). Similarity-preserving knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 1365–1374.
Van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In: Proceedings of the International Conference on Machine Learning (ICML), pp 1747–1756.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(11), 2579–2605.
Van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Wang, M., Ni, B., & Yang, X. (2020). Learning multi-view interactional skeleton graph for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Wang, N., Zhou, W., & Li, H. (2021). Contrastive transformation for self-supervised correspondence learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 10174–10182.
Wu, H., Wang, M., Zhou, W., Li, H., & Tian, Q. (2022). Contextual similarity distillation for asymmetric image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9489–9498.
Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., & Zhang, W. (2020). Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 899–908.
Xu, S., Rao, H., Hu, X., Cheng, J., & Hu, B. (2023). Prototypical contrast and reverse prediction: Unsupervised skeleton based action recognition. IEEE Transactions on Multimedia (TMM), 25, 624–634. https://doi.org/10.1109/TMM.2021.3129616
Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 7444–7452.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., & Le, Q.V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Yang, S., Liu, J., Lu, S., Er, M.H., & Kot, A.C. (2021b). Skeleton cloud colorization for unsupervised 3D action representation learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13423–13433.
Yang, S., Liu, J., Lu, S., Hwa, E.M., Hu, Y., & Kot, A.C. (2023). Self-supervised 3D action representation learning with skeleton cloud colorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Yang, J., Liu, W., Yuan, J., & Mei, T. (2021). Hierarchical soft quantization for skeleton-based human action recognition. IEEE Transactions on Multimedia (TMM), 23, 883–898. https://doi.org/10.1109/TMM.2020.2990082
Zhang, H., Hou, Y., Zhang, W., & Li, W. (2022). Contrastive positive mining for unsupervised 3D action representation learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 36–51.
Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., & Zheng, N. (2020a). Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1112–1121.
Zhang, J., Lin, L., & Liu, J. (2023a). Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 3427–3435.
Zhang, X., Xu, C., & Tao, D. (2020b). Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 14333–14342.
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., & Zheng, N. (2019). View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(8), 1963–1978.
Zhang, S., Wang, C., Nie, L., Yao, H., Huang, Q., & Tian, Q. (2023). Learning enriched hop-aware correlation for robust 3D human pose estimation. International Journal of Computer Vision (IJCV), 131(6), 1566–1583.
Zheng, N., Wen, J., Liu, R., Long, L., Dai, J., & Gong, Z. (2018). Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 2644–2651.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Contracts U20A20183 and 62021001, and by the Youth Innovation Promotion Association CAS. It was also supported by the GPU cluster built by the MCC Lab of Information Science and Technology Institution, USTC, and by the Supercomputing Center of USTC.
Additional information
Communicated by Minsu Cho.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yunyao Mao and Jiajun Deng have contributed equally to this work and should be considered co-first authors.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mao, Y., Deng, J., Zhou, W. et al. \(\hbox {I}^2\)MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation. Int J Comput Vis 133, 4944–4961 (2025). https://doi.org/10.1007/s11263-025-02415-5