Abstract
Animal action recognition has a wide range of applications. The rise of vision-language pretrained models (VLMs) has opened new possibilities for action recognition. However, while current VLMs perform well on human-centric videos, they still struggle with animal videos, primarily because domain-specific knowledge is lacking during model training and because intra-class variation is more pronounced than in human actions. To address these issues, we introduce Animal-CLIP, a specialized and efficient animal action recognition framework built upon existing VLMs. To compensate for the lack of domain-specific knowledge about animal actions, we leverage the extensive expertise of large language models (LLMs) to automatically generate external prompts, expanding the semantic scope of labels and enhancing the model's generalization capability. To integrate this external knowledge into the model effectively, we propose a knowledge-enhanced internal prompt fine-tuning approach and design a text feature refinement module to reduce potential recognition inconsistencies. Furthermore, to handle the high intra-class variation in animal actions, a novel category-specific prompting method generates adaptive prompts that optimize the alignment between text and video features, facilitating a more precise partitioning of the action space. Experimental results demonstrate that our method outperforms six previous action recognition methods across three large-scale multi-species, multi-action datasets and exhibits strong generalization to unseen animals.
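To make the dual-prompt idea concrete, the sketch below illustrates, in PyTorch, how LLM-generated external prompts for an action label and learnable internal prompt tokens could feed a CLIP-style cosine-similarity classifier over text and video features. This is a minimal, hypothetical illustration only: the names (expand_label_with_llm, PromptedTextEncoder, classify) and all architectural details are assumptions made for exposition and are not the paper's actual implementation.

```python
# Illustrative sketch only: CLIP-style text-video alignment with
# LLM-expanded label prompts and learnable internal prompt tokens.
# All names and shapes are hypothetical, not the Animal-CLIP code.

import torch
import torch.nn.functional as F


def expand_label_with_llm(action: str) -> list[str]:
    """Hypothetical stand-in for querying an LLM for external prompts
    that broaden a label's semantics (the paper automates this step)."""
    # In practice these descriptions would be generated by an LLM.
    return [
        f"a video of an animal {action}",
        f"an animal is {action}, moving its body accordingly",
    ]


class PromptedTextEncoder(torch.nn.Module):
    """Toy text encoder with learnable internal prompt tokens prepended
    to the token embeddings, echoing internal prompt fine-tuning."""

    def __init__(self, vocab_size: int = 10000, dim: int = 128, n_prompt: int = 4):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.prompt = torch.nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(token_ids)                        # (B, L, D)
        prompts = self.prompt.expand(tokens.size(0), -1, -1)  # (B, P, D)
        x = torch.cat([prompts, tokens], dim=1).mean(dim=1)   # crude pooling
        return F.normalize(self.proj(x), dim=-1)              # unit-norm text feature


def classify(video_feat: torch.Tensor, class_text_feats: torch.Tensor,
             temperature: float = 0.01) -> torch.Tensor:
    """Cosine-similarity classification between video features and
    per-class text features, as in CLIP-style zero-shot recognition."""
    video_feat = F.normalize(video_feat, dim=-1)
    logits = video_feat @ class_text_feats.t() / temperature
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    # Toy usage: 3 action classes, random token ids standing in for tokenized prompts.
    enc = PromptedTextEncoder()
    class_text_feats = enc(torch.randint(0, 10000, (3, 8)))  # (3, D)
    video_feat = torch.randn(1, 128)                          # stand-in video feature
    print(classify(video_feat, class_text_feats))             # (1, 3) probabilities
```

In such a setup, the external prompts would diversify the text side before encoding, while the internal prompt tokens and the temperature-scaled similarity would be tuned to sharpen the partitioning of the action space; the paper's refinement and category-specific prompting modules go beyond this toy example.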
Data Availability
All datasets used in this study are open-access and are cited in the paper. The code for this work will be made available in the GitHub repository at https://github.com/PRIS-CV/Animal-CLIP.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62476029, 62225601, and U23B2052, by the Beijing Nova Program, by the BUPT Excellent Ph.D. Students Foundation under Grant No. CX20241086, and by a scholarship from the China Scholarship Council (CSC) under Grant No. 202406470082.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Communicated by Anna Zamansky.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jing, Y., Liang, K., Zhang, R. et al. Animal-CLIP: A Dual-Prompt Enhanced Vision-Language Model for Animal Action Recognition. Int J Comput Vis 133, 5062–5082 (2025). https://doi.org/10.1007/s11263-025-02408-4