Abstract
Thanks to its ability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increasing attention from researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion patterns during comparison and under-explore feature statistics for describing video dynamics. As a result, they struggle to handle the challenging temporal misalignment in video dynamics, particularly when using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely \(\text{A}^2\text{M}^2\)-Net, which describes the latent video dynamics with a collection of powerful representation candidates and adaptively aligns them in an instance-guided manner. To this end, our \(\text{A}^2\text{M}^2\)-Net comprises two core components: an adaptive alignment (\(\text{A}^2\)) module for matching and a multi-scale second-order moment (\(\text{M}^2\)) block for strong representation. Specifically, the \(\text{M}^2\) block builds a collection of semantic second-order descriptors at multiple spatio-temporal scales. The \(\text{A}^2\) module then adaptively selects informative candidate descriptors while accounting for the individual motion pattern of each instance. In this way, our \(\text{A}^2\text{M}^2\)-Net handles the challenging temporal misalignment problem by establishing an adaptive alignment protocol over strong representations. Notably, the proposed method generalizes well to various few-shot settings and diverse metrics. Experiments are conducted on five widely used FSAR benchmarks, and the results show that our \(\text{A}^2\text{M}^2\)-Net achieves highly competitive performance compared with state-of-the-art methods, demonstrating its effectiveness and generalization ability.
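The exact formulation of the \(\text{M}^2\) block is given in the main text; purely as an illustration, the sketch below shows one plausible way to form second-order (covariance) descriptors from 2D-backbone features over several spatio-temporal windows. All function names, window sizes, and tensor shapes are hypothetical choices for readability and are not taken from the paper.

```python
# Illustrative sketch only (not the authors' code): covariance-style second-order
# pooling of per-frame 2D-backbone features over several spatio-temporal windows.
# All names, window sizes, and dimensions below are hypothetical.
import torch
import torch.nn.functional as F


def covariance_descriptor(x: torch.Tensor) -> torch.Tensor:
    """Second-order descriptor of a set of features.

    x: (N, C) feature vectors gathered from one spatio-temporal region.
    Returns a (C, C) covariance matrix.
    """
    x = x - x.mean(dim=0, keepdim=True)           # centre the features
    return x.t() @ x / max(x.shape[0] - 1, 1)     # sample covariance


def multi_scale_second_order(feats: torch.Tensor,
                             scales=((8, 7), (4, 7), (2, 7))) -> torch.Tensor:
    """Collect second-order descriptors at multiple spatio-temporal scales.

    feats:  (T, C, H, W) per-frame feature maps from a 2D backbone.
    scales: (temporal window, spatial output size) pairs; each pair contributes
            one group of candidate descriptors.
    Returns a (K, C, C) stack of candidate descriptors.
    """
    T, C, _, _ = feats.shape
    candidates = []
    for t_win, s_out in scales:
        for t0 in range(0, T - t_win + 1, t_win):
            clip = feats[t0:t0 + t_win]                         # (t_win, C, H, W)
            clip = F.adaptive_avg_pool2d(clip, s_out)           # (t_win, C, s, s)
            region = clip.permute(0, 2, 3, 1).reshape(-1, C)    # (t_win*s*s, C)
            candidates.append(covariance_descriptor(region))
    return torch.stack(candidates)


# Toy usage: 8 frames of 64-channel 7x7 feature maps.
frames = torch.randn(8, 64, 7, 7)
descriptors = multi_scale_second_order(frames)
print(descriptors.shape)   # torch.Size([7, 64, 64]) for the windows above
```

In an actual episode, candidate descriptors of a query video would then be matched against those of the support prototypes under an instance-guided weighting, which is the role the \(\text{A}^2\) module plays in the paper.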
Data Availability
All experiments are conducted on publicly available datasets. Specifically, the SSV2 dataset can be found at https://developer.qualcomm.com/software/ai-datasets/something-something, the Kinetics dataset is available at http://deepmind.com/kinetics, the HMDB-51 dataset is at https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database, and the UCF-101 dataset is accessible at https://www.crcv.ucf.edu/data/UCF101.php.
Notes
Here, the notation \(3^2\) is shorthand for a spatial size of \(3\times 3\).
For convenience, here we abbreviate \(t_0+i\) to \(t_i\).
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 62471083, Grant 61971086, Grant 62276186, and Grant 61925602; in part by the Haihe Lab of ITAI under Grant 22HHXCJC00002; and in part by the Science and Technology Development Program Project of Jilin Province under Grant 20230201111GX.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Dima Damen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gao, Z., Wang, Q., Zhang, B. et al. \(\text {A}^2\text {M}^2\)-Net: Adaptively Aligned Multi-scale Moment for Few-Shot Action Recognition. Int J Comput Vis 133, 5363–5378 (2025). https://doi.org/10.1007/s11263-025-02432-4