Animal-CLIP: A Dual-Prompt Enhanced Vision-Language Model for Animal Action Recognition

  • Published in: International Journal of Computer Vision

Abstract

Animal action recognition has a wide range of applications. With the rise of vision-language pretraining models (VLMs), new possibilities have emerged for action recognition. However, while current VLMs perform well on human-centric videos, they still struggle with animal videos, primarily because domain-specific knowledge is absent during model training and intra-class variation is more pronounced than in humans. To address these issues, we introduce Animal-CLIP, a specialized and efficient animal action recognition framework built upon existing VLMs. To compensate for the missing domain knowledge, we leverage the extensive expertise of large language models (LLMs) to automatically generate external prompts, thereby expanding the semantic scope of labels and enhancing the model’s generalization capability. To effectively integrate this external knowledge into the model, we propose a knowledge-enhanced internal prompt fine-tuning approach and design a text feature refinement module to reduce potential recognition inconsistencies. Furthermore, to handle the high intra-class variation in animal actions, a novel category-specific prompting method generates adaptive prompts that optimize the alignment between text and video features, facilitating a more precise partitioning of the action space. Experimental results demonstrate that our method outperforms six previous action recognition methods across three large-scale multi-species, multi-action datasets and exhibits strong generalization to unseen animals.
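The pipeline summarized above builds on CLIP-style text-video matching, so a small, hedged sketch may help make the basic mechanism concrete. The snippet below is not the authors' implementation: the action labels, the LLM-style prompt texts, the ViT-B/32 backbone, and the mean-pooling over sampled frames are all illustrative assumptions, and it omits the learnable internal prompts, text feature refinement module, and category-specific prompting that Animal-CLIP adds on top of this baseline.

```python
# Minimal sketch of zero-shot, CLIP-style text-video matching with
# LLM-generated "external" prompts. Illustrative only; not the
# authors' implementation of Animal-CLIP.

import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# External prompts: richer, LLM-style descriptions of each action label
# (in the paper these are generated automatically by an LLM; the texts
# here are hypothetical examples).
external_prompts = {
    "grooming": "an animal cleaning its fur or feathers with its tongue, paws, or beak",
    "foraging": "an animal searching the ground or vegetation for food",
    "running":  "an animal moving rapidly with all limbs repeatedly leaving the ground",
}

# Encode the prompts once; they serve as class anchors in the shared space.
with torch.no_grad():
    text_tokens = clip.tokenize(list(external_prompts.values())).to(device)
    text_feats = model.encode_text(text_tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify_video(frames):
    """frames: list of PIL.Image objects sampled from a video clip."""
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_feats = model.encode_image(images)
        # Simple temporal aggregation: mean-pool the per-frame features.
        video_feat = frame_feats.mean(dim=0, keepdim=True)
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity between the video and every class prompt.
        sims = (video_feat @ text_feats.T).squeeze(0)
    labels = list(external_prompts.keys())
    return labels[sims.argmax().item()], sims.softmax(dim=-1).tolist()
```

In the full framework, the fixed prompt embeddings above would be replaced by knowledge-enhanced, learnable prompts, and the plain mean-pooling and cosine matching by the category-specific alignment described in the abstract.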


Data Availability

All datasets used in this study are open-access and have been cited in the paper. The code for this work will be made available in the GitHub repository: https://github.com/PRIS-CV/Animal-CLIP.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62476029, 62225601, and U23B2052, by the Beijing Nova Program, by the BUPT Excellent Ph.D. Students Foundation under Grant No. CX20241086, and by a scholarship from the China Scholarship Council (CSC) under Grant No. 202406470082.

Author information

Corresponding author

Correspondence to Kongming Liang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Communicated by Anna Zamansky.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jing, Y., Liang, K., Zhang, R. et al. Animal-CLIP: A Dual-Prompt Enhanced Vision-Language Model for Animal Action Recognition. Int J Comput Vis 133, 5062–5082 (2025). https://doi.org/10.1007/s11263-025-02408-4

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11263-025-02408-4
