CLIP-guided Prototype Modulating for Few-shot Action Recognition

Published in: International Journal of Computer Vision

Abstract

Learning from large-scale contrastive language-image pre-training such as CLIP has recently shown remarkable success on a wide range of downstream tasks, but it remains under-explored for the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate inaccurate prototype estimation caused by data scarcity, a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and a prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos with their corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. In this way, CLIP-FSAR can take full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of the proposed method: CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and models are publicly available at https://github.com/alibaba-mmai-research/CLIP-FSAR.
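
To make the two components concrete, below is a minimal PyTorch-style sketch of (i) a video-text contrastive objective and (ii) prototype modulation of support features with a temporal Transformer, as described in the abstract. All names, shapes, and hyper-parameters here (e.g., `PrototypeModulator`, feature dimension 512, temperature 0.07) are illustrative assumptions and do not reproduce the authors' released implementation; see the linked repository for the official code.

```python
# Hypothetical sketch only: component names, shapes, and hyper-parameters are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeModulator(nn.Module):
    """Refines support-frame features with class-text features via a temporal Transformer."""

    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, support_feats, text_feats):
        # support_feats: (n_class, n_frames, dim) frame features from the CLIP visual encoder
        # text_feats:    (n_class, dim) class-description features from the CLIP text encoder
        tokens = torch.cat([text_feats.unsqueeze(1), support_feats], dim=1)
        tokens = self.temporal_transformer(tokens)
        # Drop the prepended text token and average over time to get modulated prototypes.
        return tokens[:, 1:].mean(dim=1)  # (n_class, dim)


def video_text_contrastive_loss(video_feats, text_feats, labels, temperature=0.07):
    # video_feats: (batch, dim) temporally pooled video features
    # text_feats:  (n_class, dim) class-text features; labels: (batch,) class indices
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = video_feats @ text_feats.t() / temperature
    return F.cross_entropy(logits, labels)
```

At test time, query-video features would be matched against the modulated prototypes (e.g., via cosine similarity on pooled or frame-level features) to produce few-shot predictions; the exact matching metric is a design choice not specified in this sketch.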

Data Availability

The datasets generated and/or analysed during the current study are available in our open-source repository.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant U22B2053 and by Alibaba Group through the Alibaba Research Intern Program.

Author information

Corresponding authors

Correspondence to Shiwei Zhang or Nong Sang.

Additional information

Communicated by Xiaohua Zhai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Zhang, S., Cen, J. et al. CLIP-guided Prototype Modulating for Few-shot Action Recognition. Int J Comput Vis 132, 1899–1912 (2024). https://doi.org/10.1007/s11263-023-01917-4
