Abstract
Large-scale contrastive language-image pre-training such as CLIP has recently shown remarkable success on a wide range of downstream tasks, but it remains under-explored for the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate inaccurate prototype estimation caused by data scarcity, a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework, CLIP-FSAR, which consists of two key components: a video-text contrastive objective and a prototype modulation module. The former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos with their corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. In this way, CLIP-FSAR takes full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of our proposed method: CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and models are publicly available at https://github.com/alibaba-mmai-research/CLIP-FSAR.
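To make the two components concrete, the minimal PyTorch sketch below illustrates the general idea rather than the released implementation: class-name text features from CLIP's text encoder are prepended to frame-level visual features and fused by a temporal Transformer to modulate the visual prototypes, while a symmetric video-text contrastive loss aligns pooled video features with the class texts. All names (`PrototypeModulator`, `video_text_contrastive_loss`), dimensions, and hyper-parameters are illustrative assumptions, not taken from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeModulator(nn.Module):
    """Sketch of prototype modulation: fuse class-text features with
    support frame features via a temporal Transformer encoder."""

    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (N, T, D) frame-level CLIP visual features
        # text_feats:  (N, D)    CLIP text features of the class names
        tokens = torch.cat([text_feats.unsqueeze(1), frame_feats], dim=1)
        fused = self.temporal_transformer(tokens)
        # Discard the prepended text token; the remaining frame tokens
        # serve as the text-modulated visual prototype.
        return fused[:, 1:, :]


def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled video features and class-text
    features, mirroring CLIP's image-text objective (illustrative only)."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, T, D = 5, 8, 512                               # 5-way episode, 8 frames per video
    frame_feats = torch.randn(B, T, D)                # stand-in for CLIP visual features
    text_feats = torch.randn(B, D)                    # stand-in for CLIP text features
    modulator = PrototypeModulator(dim=D)
    prototypes = modulator(frame_feats, text_feats)   # (5, 8, 512) refined prototypes
    loss = video_text_contrastive_loss(frame_feats.mean(dim=1), text_feats)
    print(prototypes.shape, loss.item())
```

In a full episode, the refined support prototypes would still be matched against query frame features with a temporal metric (e.g., an alignment-based distance); that matching step is omitted here for brevity.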
Data Availability
The datasets generated during and/or analysed during the current study are available in our open source repository.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant U22B2053 and by Alibaba Group through the Alibaba Research Intern Program.
Additional information
Communicated by Xiaohua Zhai.
About this article
Cite this article
Wang, X., Zhang, S., Cen, J. et al. CLIP-guided Prototype Modulating for Few-shot Action Recognition. Int J Comput Vis 132, 1899–1912 (2024). https://doi.org/10.1007/s11263-023-01917-4