
\(\text {A}^2\text {M}^2\)-Net: Adaptively Aligned Multi-scale Moment for Few-Shot Action Recognition

Published in: International Journal of Computer Vision

Abstract

Thanks to its capability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increasing attention from researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion patterns during comparison and under-explore feature statistics for video dynamics. As a result, they struggle to handle the challenging temporal misalignment in video dynamics, particularly when using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely \(\text {A}^2\text {M}^2\)-Net, which describes the latent video dynamics with a collection of powerful representation candidates and adaptively aligns them in an instance-guided manner. To this end, our \(\text {A}^2\text {M}^2\)-Net involves two core components: an adaptive alignment (\(\text {A}^2\)) module for matching, and a multi-scale second-order moment (\(\text {M}^2\)) block for strong representation. Specifically, the \(\text {M}^2\) block develops a collection of semantic second-order descriptors at multiple spatio-temporal scales. Furthermore, the \(\text {A}^2\) module adaptively selects informative candidate descriptors while accounting for the individual motion pattern. In this way, our \(\text {A}^2\text {M}^2\)-Net handles the challenging temporal misalignment problem by establishing an adaptive alignment protocol over strong representations. Notably, the proposed method generalizes well to various few-shot settings and diverse metrics. Experiments are conducted on five widely used FSAR benchmarks, and the results show that our \(\text {A}^2\text {M}^2\)-Net achieves very competitive performance compared to state-of-the-art methods, demonstrating its effectiveness and generalization.
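To make the two components concrete, the snippet below gives a minimal, illustrative sketch of the ideas described in the abstract: covariance (second-order moment) descriptors pooled over several spatio-temporal windows, followed by an instance-guided soft selection of candidates when matching a query video against a support video. This is not the authors' implementation; the window sizes (`scales`), the cosine-similarity matching, and all function names are assumptions introduced purely for illustration.

```python
# Minimal, illustrative sketch (assumed names and window sizes, not the authors'
# implementation) of: (1) multi-scale second-order (covariance) descriptors and
# (2) instance-guided soft selection of candidates for query/support matching.
import torch
import torch.nn.functional as F


def second_order_moment(x: torch.Tensor) -> torch.Tensor:
    """Flattened covariance descriptor of feature vectors x with shape (N, C)."""
    x = x - x.mean(dim=0, keepdim=True)              # center the features
    cov = x.t() @ x / max(x.shape[0] - 1, 1)         # (C, C) sample covariance
    return cov.flatten()                             # (C * C,)


def multi_scale_descriptors(feat: torch.Tensor,
                            scales=((2, 7), (4, 7), (8, 3))) -> torch.Tensor:
    """Candidate second-order descriptors at several spatio-temporal scales.

    feat: (T, C, H, W) frame-level feature maps from a 2D backbone.
    Each (t_win, s_win) pair pools over t_win consecutive frames and an
    s_win x s_win spatial grid before computing a covariance descriptor.
    In practice the channel dimension C would be reduced first to keep C*C small.
    """
    T, C, _, _ = feat.shape
    candidates = []
    for t_win, s_win in scales:
        pooled = F.adaptive_avg_pool2d(feat, s_win)  # (T, C, s_win, s_win)
        windows = pooled.unfold(0, t_win, t_win)     # (T // t_win, C, s, s, t_win)
        for win in windows:                          # one spatio-temporal window
            vecs = win.permute(1, 2, 3, 0).reshape(-1, C)   # (s*s*t_win, C)
            candidates.append(second_order_moment(vecs))
    return torch.stack(candidates)                   # (num_candidates, C * C)


def adaptive_match(query_cands: torch.Tensor, support_cands: torch.Tensor) -> torch.Tensor:
    """Instance-guided matching score: weight each query candidate by how well it
    aligns with its best-matching support candidate, then aggregate."""
    q = F.normalize(query_cands, dim=-1)
    s = F.normalize(support_cands, dim=-1)
    sim = q @ s.t()                                  # (Kq, Ks) cosine similarities
    best = sim.max(dim=1).values                     # best support match per query candidate
    weights = F.softmax(best, dim=0)                 # emphasize informative candidates
    return (weights * best).sum()                    # scalar query-support score


# Toy usage: two 8-frame videos with 64-channel 7x7 feature maps.
query = multi_scale_descriptors(torch.randn(8, 64, 7, 7))
support = multi_scale_descriptors(torch.randn(8, 64, 7, 7))
print(adaptive_match(query, support))
```

In the actual \(\text {A}^2\text {M}^2\)-Net the descriptor construction and the alignment are learned end to end; the sketch only mirrors the interface of the two components.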


Data Availability

All experiments are conducted on publicly available datasets. Specifically, the SSV2 dataset can be found at https://developer.qualcomm.com/software/ai-datasets/something-something, the Kinetics dataset is available at http://deepmind.com/kinetics, the HMDB-51 dataset is available at https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database, and the UCF-101 dataset is accessible at https://www.crcv.ucf.edu/data/UCF101.php.

Notes

  1. Here, the notation \(3^2\) is shorthand for a spatial size of \(3\times 3\).

  2. For convenience, here we abbreviate \(t_0+i\) to \(t_i\).


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62471083, Grant 61971086, Grant 62276186, and Grant 61925602; in part by the Haihe Lab of ITAI under Grant 22HHXCJC00002; and in part by the Science and Technology Development Program Project of Jilin Province under Grant 20230201111GX.

Author information

Corresponding author

Correspondence to Peihua Li.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 16666 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gao, Z., Wang, Q., Zhang, B. et al. \(\text {A}^2\text {M}^2\)-Net: Adaptively Aligned Multi-scale Moment for Few-Shot Action Recognition. Int J Comput Vis 133, 5363–5378 (2025). https://doi.org/10.1007/s11263-025-02432-4
