Abstract
Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data for action recognition. It requires accurately classifying human actions in videos using only a few labeled examples per class. Compared with few-shot learning on images, few-shot action recognition is more challenging because of the intrinsic complexity of video data. Numerous approaches have driven significant advances in few-shot action recognition, underscoring the need for a comprehensive survey. Unlike early surveys that focus on few-shot image or text classification, this survey is grounded in the unique challenges of few-shot action recognition. We provide a comprehensive review of recent methods and introduce a novel, systematic taxonomy of existing approaches, accompanied by detailed analysis. We categorize the methods into generative-based and meta-learning frameworks, and further elaborate on the methods within the meta-learning framework along three aspects: video instance representation, category prototype learning, and generalized video alignment. In addition, the survey presents the commonly used benchmarks and discusses relevant advanced topics and promising future directions. We hope this survey can serve as a valuable resource for researchers, offering essential guidance to newcomers and fresh insights to seasoned researchers.
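To make the problem setting concrete: in the N-way K-shot episodic protocol that underlies the meta-learning methods surveyed here, each episode provides K labeled support videos for each of N classes, and each query video must be assigned to one of those N classes. The sketch below illustrates a single episode with prototypical-network-style nearest-prototype matching over pre-computed video embeddings; the function name, the 512-dimensional embeddings, and the random toy data are illustrative assumptions, not components of any particular surveyed method.

```python
import torch

def classify_episode(support, support_labels, queries, n_way):
    """Nearest-prototype classification for one N-way episode.

    support:        (n_way * k_shot, d) embedded support videos
    support_labels: (n_way * k_shot,)   integer class ids in [0, n_way)
    queries:        (n_query, d)        embedded query videos
    """
    # Category prototype = mean embedding of that class's support videos.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )
    # Score each query against each prototype by negative Euclidean distance.
    scores = -torch.cdist(queries, prototypes)
    return scores.argmax(dim=1)  # predicted episode-local class id per query

# Toy 5-way 1-shot episode with random 512-d "video embeddings".
support = torch.randn(5, 512)
labels = torch.arange(5)
queries = torch.randn(10, 512)
print(classify_episode(support, labels, queries, n_way=5))
```

In practice the embeddings would come from a video backbone, and much of the surveyed work concerns how to build them (instance representation), how to aggregate them (prototype learning), and how to compare them across time (video alignment).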
Data Availability
This survey introduces commonly used datasets for few-shot action recognition (FSAR), summarized in Section 3.3. These publicly available datasets include HMDB (https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database), UCF101 (https://www.crcv.ucf.edu/data/UCF101.php), Kinetics (https://github.com/cvdfoundation/kinetics-dataset), SSv2 (https://20bn.com/datasets/something-something), and EPIC-Kitchens (https://epic-kitchens.github.io/2021).
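For readers assembling these benchmarks, a minimal sketch of episode construction is given below. It assumes the videos have been organized one directory per class (root/<class_name>/<clip>); that layout and the function name are our assumptions for illustration, while the actual FSAR train/val/test class splits for each dataset are those summarized in Section 3.3.

```python
import random
from pathlib import Path

def sample_episode(root, n_way=5, k_shot=1, n_query=5):
    """Draw one N-way K-shot episode from root/<class_name>/<clip>."""
    class_dirs = [p for p in Path(root).iterdir() if p.is_dir()]
    episode_classes = random.sample(class_dirs, n_way)
    support, query = [], []
    for label, cls in enumerate(episode_classes):
        clips = random.sample(sorted(cls.iterdir()), k_shot + n_query)
        support += [(path, label) for path in clips[:k_shot]]
        query += [(path, label) for path in clips[k_shot:]]
    return support, query  # lists of (clip_path, episode-local label)
```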
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants U23A20387 and 62322212, in part by the Pengcheng Laboratory Research Project under Grant PCL2023A08, in part by the Alibaba Innovative Research Program, and in part by the CAS Project for Young Scientists in Basic Research (YSBR-116).
Additional information
Communicated by Yoichi Sato.
About this article
Cite this article
Wanyan, Y., Yang, X., Dong, W. et al. A Comprehensive Review of Few-Shot Action Recognition. Int J Comput Vis 133, 6832–6859 (2025). https://doi.org/10.1007/s11263-025-02503-6