Abstract
Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, models optimized to distinguish individual self-augmented samples struggle with the numerous similar positive instances that arise when action categories are limited. In this work, we tackle these problems by introducing a general inter- and intra-modal mutual distillation (\(\hbox {I}^2\)MD) framework. In \(\hbox {I}^2\)MD, we first re-formulate the cross-modal interaction as a cross-modal mutual distillation (CMD) process. Unlike existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the intra-modal mutual distillation (IMD) strategy. In IMD, we first introduce the dynamic neighbors aggregation (DNA) mechanism, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly-correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.
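As a rough illustration of the cross-modal mutual distillation idea described above, the sketch below aligns the neighbor-similarity distributions of two modalities with a bidirectional KL term, each modality in turn acting as teacher for the other. This is a minimal pure-Python sketch under stated assumptions: the similarity scores to a shared set of neighbor anchors are assumed precomputed per modality, and the function names and temperature values are illustrative, not the paper's actual implementation.

```python
import math

def softmax(sims, temperature):
    """Turn similarity scores into a probability distribution over neighbors."""
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cmd_loss(sims_a, sims_b, t_teacher=0.05, t_student=0.1):
    """Bidirectional distillation between two modalities' neighbor-similarity
    distributions: modality A teaches B and B teaches A. A sharper (lower)
    temperature is used on the teacher side; in a real training loop,
    gradients would flow only into the student side of each KL term."""
    p_a_teacher = softmax(sims_a, t_teacher)
    p_a_student = softmax(sims_a, t_student)
    p_b_teacher = softmax(sims_b, t_teacher)
    p_b_student = softmax(sims_b, t_student)
    return kl(p_a_teacher, p_b_student) + kl(p_b_teacher, p_a_student)
```

Because both KL terms are non-negative, the loss is zero only when the two modalities rank their neighbors identically under the respective temperatures; otherwise each modality is pulled toward the other's view of the feature neighborhood.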
Data Availability
The NTU-RGB+D dataset (Shahroudy et al., 2016; Liu et al., 2020a) and the PKU-MMD (Chunhui et al., 2017) dataset used in this study are well-recognized public benchmarks in skeleton-based action recognition. The code for data processing has been made publicly available at https://github.com/maoyunyao/CMD.
Change history
07 April 2025
The repeated text in all tables has been removed.
References
Abbasi Koohpayegani, S., Tejankar, A., & Pirsiavash, H. (2020). CompRess: Self-supervised learning by compressing representations. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 33, 12980–12992.
Avola, D., Cascio, M., Cinque, L., Foresti, G. L., Massaroni, C., & Rodolà, E. (2020). 2-D skeleton-based action recognition via two-branch stacked LSTM-RNNs. IEEE Transactions on Multimedia (TMM), 22(10), 2481–2496. https://doi.org/10.1109/TMM.2019.2960588
Ballard, D. H. (1987). Modular learning in neural networks. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 647, 279–284.
Caetano, C., Bremond, F., & Schwartz, W. (2019). Skeleton image representation for 3D action recognition based on tree structure and reference joints. In: SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp 16–23.
Cao, Z., Hidalgo, G., Simon, T., Wei, S., & Sheikh, Y. (2021). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1), 172–186.
Chen, X., Fan, H., Girshick, R., & He, K. (2020b). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML), pp 1597–1607.
Chen, X., Xie, S., & He, K. (2021a). An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 9640–9649.
Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., & Hu, W. (2021b). Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13359–13368.
Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 183–192.
Chunhui, L., Yueyu, H., Yanghao, L., Sijie, S., & Jiaying, L. (2017). PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 248–255.
Deng, J., Yang, Z., Liu, D., Chen, T., Zhou, W., Zhang, Y., Li, H., & Ouyang, W. (2022). TransVG++: End-to-end visual grounding with language conditioned vision transformer. arXiv preprint arXiv:2206.06619
Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803
Du, Y., Wang, W., & Wang, L. (2015). Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1110–1118.
Fang, Z., Wang, J., Wang, L., Zhang, L., Yang, Y., & Liu, Z. (2021). SEED: Self-supervised distillation for visual representation. In: Proceedings of the International Conference on Learning Representations (ICLR).
Fang, H.S., Xie, S., Tai, Y.W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 2334–2343.
Franco, L., Mandica, P., Munjal, B., & Galasso, F. (2023). Hyperbolic self-paced learning for self-supervised skeleton-based action representations. arXiv preprint arXiv:2303.06242
Gao, X., Yang, Y., Zhang, Y., Li, M., Yu, J. G., & Du, S. (2023). Efficient spatio-temporal contrastive learning for skeleton-based 3D action recognition. IEEE Transactions on Multimedia (TMM), 25, 405–417. https://doi.org/10.1109/TMM.2021.3127040
Gupta, P., Thatipelli, A., Aggarwal, A., Maheshwari, S., Trivedi, N., Das, S., & Sarvadevabhatla, R. K. (2021). Quo vadis, skeleton action recognition? International Journal of Computer Vision (IJCV), 129(7), 2097–2112.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 16000–16009.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9729–9738.
Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456.
Ke, Q., Bennamoun, M., An, S., Sohel, F., & Boussaid, F. (2017). A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3288–3297.
Kim, B., Chang, H.J., Kim, J., & Choi, J.Y. (2022). Global-local motion transformer for unsupervised skeleton-based action learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 209–225.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 2556–2563.
Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., & Tian, Q. (2019). Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3595–3603.
Li, T., Ke, Q., Rahmani, H., Ho, R.E., Ding, H., & Liu, J. (2021c). Else-Net: Elastic semantic network for continual action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13434–13443.
Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., & Hoi, S. (2021a). Align before fuse: Vision and language representation learning with momentum distillation. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Li, L., Wang, M., Ni, B., Wang, H., Yang, J., & Zhang, W. (2021b). 3D human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4741–4750.
Liang, D., Fan, G., Lin, G., Chen, W., Pan, X., & Zhu, H. (2019). Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 934–940.
Lin, L., Song, S., Yang, W., & Liu, J. (2020). MS2L: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pp 2490–2498.
Lin, L., Zhang, J., & Liu, J. (2023). Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2363–2372
Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020b). Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 143–152.
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., & Tang, J. (2021). Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering (TKDE).
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L. Y., & Kot, A. C. (2020). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10), 2684–2701.
Mao, Y., Zhou, W., Lu, Z., Deng, J., & Li, H. (2022). CMD: Self-supervised 3D action representation learning with cross-modal mutual distillation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 734–752.
Misra, I., Zitnick, C.L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 527–544.
Nie, Q., Liu, Z., & Liu, Y. (2020). Unsupervised 3D human pose representation with viewpoint and pose disentanglement. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118.
Nie, Q., & Liu, Y. (2021). View transfer on human skeleton pose: Automatically disentangle the view-variant and view-invariant information for pose representation learning. International Journal of Computer Vision (IJCV), 129(1), 1–22.
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 69–84.
Ouyang, J., Wu, H., Wang, M., Zhou, W., & Li, H. (2021). Contextual similarity aggregation with self-attention for visual re-ranking. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Park, W., Kim, D., Lu, Y., & Cho, M. (2019). Relational knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3967–3976.
Passalis, N., & Tefas, A. (2018). Learning deep representations with probabilistic knowledge transfer. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 268–284.
Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y., Zhou, S., & Zhang, Z. (2019). Correlation congruence for knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 5007–5016.
Rao, H., Leung, C., & Miao, C. (2023). Hierarchical skeleton meta-prototype contrastive learning with hard skeleton mining for unsupervised person re-identification. International Journal of Computer Vision (IJCV), pp 1–23.
Rao, H., Xu, S., Hu, X., Cheng, J., & Hu, B. (2021). Augmented skeleton based contrastive action learning with momentum LSTM for unsupervised action recognition. Information Sciences, 569, 90–109.
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for thin deep nets. In: Proceedings of the International Conference on Learning Representations (ICLR).
Shah, A., Roy, A., Shah, K., Mishra, S., Jacobs, D., Cherian, A., & Chellappa, R. (2023). HaLP: Hallucinating latent positives for skeleton-based self-supervised learning of actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 18846–18856.
Shahroudy, A., Liu, J., Ng, T.T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1010–1019.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019a). Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7912–7921.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019b). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 12026–12035.
Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2021). AdaSGN: Adapting joint number and model size for efficient skeleton-based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13413–13422.
Si, C., Chen, W., Wang, W., Wang, L., & Tan, T. (2019). An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1227–1236.
Si, C., Nie, X., Wang, W., Wang, L., Tan, T., & Feng, J. (2020). Adversarial self-supervised learning for semi-supervised 3D action recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 35–51.
Soomro, K., Zamir, A.R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
Su, K., Liu, X., & Shlizerman, E. (2020). PREDICT & CLUSTER: Unsupervised skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9631–9640.
Tejankar, A., Koohpayegani, S.A., Pillai, V., Favaro, P., & Pirsiavash, H. (2021). ISD: Self-supervised learning by iterative similarity distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 9609–9618.
Thoker, F.M., Doughty, H., & Snoek, C.G. (2021). Skeleton-contrastive 3D action representation learning. In: Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pp 1655–1663.
Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., & Ding, R. (2022). Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Tung, F., & Mori, G. (2019). Similarity-preserving knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 1365–1374.
Van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In: Proceedings of the International Conference on Machine Learning (ICML), pp 1747–1756.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(11), 2579–2605.
Van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Wang, M., Ni, B., & Yang, X. (2020). Learning multi-view interactional skeleton graph for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Wang, N., Zhou, W., & Li, H. (2021). Contrastive transformation for self-supervised correspondence learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 10174–10182.
Wu, H., Wang, M., Zhou, W., Li, H., & Tian, Q. (2022). Contextual similarity distillation for asymmetric image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 9489–9498.
Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., & Zhang, W. (2020). Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 899–908.
Xu, S., Rao, H., Hu, X., Cheng, J., & Hu, B. (2023). Prototypical contrast and reverse prediction: Unsupervised skeleton based action recognition. IEEE Transactions on Multimedia (TMM), 25, 624–634. https://doi.org/10.1109/TMM.2021.3129616
Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 7444–7452.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., & Le, Q.V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS).
Yang, S., Liu, J., Lu, S., Er, M.H., & Kot, A.C. (2021b). Skeleton cloud colorization for unsupervised 3D action representation learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 13423–13433.
Yang, S., Liu, J., Lu, S., Hwa, E.M., Hu, Y., & Kot, A.C. (2023). Self-supervised 3D action representation learning with skeleton cloud colorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Yang, J., Liu, W., Yuan, J., & Mei, T. (2021). Hierarchical soft quantization for skeleton-based human action recognition. IEEE Transactions on Multimedia (TMM), 23, 883–898. https://doi.org/10.1109/TMM.2020.2990082
Zhang, H., Hou, Y., Zhang, W., & Li, W. (2022). Contrastive positive mining for unsupervised 3D action representation learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 36–51.
Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., & Zheng, N. (2020a). Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1112–1121.
Zhang, J., Lin, L., & Liu, J. (2023a). Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 3427–3435.
Zhang, X., Xu, C., & Tao, D. (2020b). Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 14333–14342.
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., & Zheng, N. (2019). View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(8), 1963–1978.
Zhang, S., Wang, C., Nie, L., Yao, H., Huang, Q., & Tian, Q. (2023). Learning enriched hop-aware correlation for robust 3D human pose estimation. International Journal of Computer Vision (IJCV), 131(6), 1566–1583.
Zheng, N., Wen, J., Liu, R., Long, L., Dai, J., & Gong, Z. (2018). Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp 2644–2651.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Contracts U20A20183 and 62021001, and by the Youth Innovation Promotion Association CAS. It was also supported by the GPU cluster built by the MCC Lab of Information Science and Technology Institution, USTC, and by the Supercomputing Center of USTC.
Additional information
Communicated by Minsu Cho.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yunyao Mao and Jiajun Deng have contributed equally to this work and should be considered co-first authors.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mao, Y., Deng, J., Zhou, W. et al. \(\hbox {I}^2\)MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation. Int J Comput Vis 133, 4944–4961 (2025). https://doi.org/10.1007/s11263-025-02415-5