Abstract
Current image-based keypoint detection methods for animal (including human) bodies and faces generally fall into fully supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotation, which makes it difficult to extend keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires annotated support images for reference at test time. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which uses text prompts to identify arbitrary keypoints across any species. In pursuit of this goal, we develop a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM), which combines vision and language models to create an interplay between language features and local keypoint visual features. KDSM further integrates Domain Distribution Matrix Matching (DDMM) and other dedicated modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, improving the framework's generalizability and overall performance. Comprehensive experiments demonstrate that KDSM significantly outperforms the baseline and achieves remarkable success on the OVKD task. Impressively, despite operating in a zero-shot fashion, our method yields results comparable to state-of-the-art few-shot species class-agnostic keypoint detection methods. Code and data are available at https://github.com/zhanghao5201/KDSM.
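To make the core idea of semantic-feature matching concrete, the sketch below scores each text-prompted keypoint against every spatial location of a dense visual feature map, yielding one localization heatmap per prompt. This is a minimal illustration only: the function name `match_keypoints`, the tensor shapes, and the cosine-similarity formulation are assumptions for exposition, not KDSM's actual modules (the full framework adds DDMM and VKRA on top of this matching step).

```python
# Minimal sketch of semantic-feature matching for OVKD (illustrative, not KDSM's API):
# text embeddings of keypoint prompts are correlated with dense visual features
# to produce one localization heatmap per keypoint.
import torch
import torch.nn.functional as F

def match_keypoints(text_emb: torch.Tensor,      # (K, C): one embedding per keypoint prompt
                    visual_feat: torch.Tensor):  # (B, C, H, W): dense image features
    """Return per-keypoint similarity heatmaps of shape (B, K, H, W)."""
    # L2-normalize both modalities so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    visual_feat = F.normalize(visual_feat, dim=1)
    # Correlate every text embedding with every spatial location.
    return torch.einsum("kc,bchw->bkhw", text_emb, visual_feat)

# Toy usage: 17 keypoint prompts, 512-d features, a 64x64 feature map.
if __name__ == "__main__":
    prompts = torch.randn(17, 512)        # e.g. CLIP text features of "left eye", ...
    feats = torch.randn(2, 512, 64, 64)   # backbone output for a batch of 2 images
    heatmaps = match_keypoints(prompts, feats)
    # The argmax over (H, W) gives a coarse location for each prompted keypoint.
    print(heatmaps.shape)                 # torch.Size([2, 17, 64, 64])
```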
Data availability
The MP-100 dataset used in this study can be downloaded from https://github.com/luminxu/Pose-for-Everything. Our reorganized and partitioned dataset, MP-78, is released together with our source code.
Notes
We refer to the method of "Few-shot keypoint detection with uncertainty learning for unseen species" (Lu & Koniusz, 2022) as FS-ULUS.
References
Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3686–3693)
Bangalath, H., Maaz, M., Khattak, M. U., Khan, S. H., & Shahbaz Khan, F. (2022). Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35, 33781–33794.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part I 16 (pp. 213–229)
Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., & Lin, D. (2023). Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Fang, H.-S., Xie, S., Tai, Y.-W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 2334–2343)
Feighelstein, M., Shimshoni, I., Finka, L. R., Luna, S. P. L., Mills, D. S., & Zamansky, A. (2022). Automated recognition of pain in cats. Scientific Reports, 12(1), 9575.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning (pp. 1126–1135). PMLR
Graving, J. M., Chae, D., Naik, H., Li, L., Koger, B., Costelloe, B. R., & Couzin, I. D. (2019). Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. Elife, 8, e47994.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
Hu, S., Zheng, C., Zhou, Z., Chen, C., & Sukthankar, G. (2023). Lamp: Leveraging language prompts for multi-person pose estimation. In 2023 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 3759–3766). IEEE
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, H., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning (pp. 4904–4916). PMLR
Khan, M. H., McDonagh, J., Khan, S., Shahabuddin, M., Arora, A., Khan, F. S., Shao, L., & Tzimiropoulos, G. (2020). Animalweb: A large-scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6939–6948)
Koestinger, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops) (pp. 2144–2151). IEEE
Kumar, A., Marks, T. K., Mou, W., Wang, Y., Jones, M., Cherian, A., Koike-Akino, T., Liu, X., & Feng, C. (2020). Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8236–8246)
Labuguen, R., Matsumoto, J., Negrete, S. B., Nishimaru, H., Nishijo, H., Takada, M., Go, Y., Inoue, K., & Shibata, T. (2021). Macaquepose: A novel "in the wild" macaque monkey pose dataset for markerless motion capture. Frontiers in Behavioral Neuroscience, 14, 581154.
Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022). Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546
Li, D., Li, J., & Hoi, S. (2024). Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36.
Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Zhang, J., Ning, M., & Yuan, L. (2024). Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V 13 (pp. 740–755)
Lu, C., & Koniusz, P. (2022). Few-shot keypoint detection with uncertainty learning for unseen species. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 19416–19426)
Martvel, G., Farhat, N., Shimshoni, I., & Zamansky, A. (2023). Catflw: Cat facial landmarks in the wild dataset. arXiv preprint arXiv:2305.04232
Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In Computer vision—ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, part VIII 14 (pp. 483–499)
Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). Expanding language-image pretrained models for general video recognition. In Computer vision—ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, proceedings, part IV (pp. 1–18)
Pan, Y., Yao, T., Li, Y., & Mei, T. (2020). X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10971–10980)
Patel, M., Gu, Y., Carstensen, L. C., Hasselmo, M. E., & Betke, M. (2023). Animal pose tracking: 3D multimodal dataset and token-based pose optimization. International Journal of Computer Vision, 131(2), 514–530.
Pereira, T. D., Aldarondo, D. E., Willmore, L., Kislin, M., Wang, S.S.-H., Murthy, M., & Shaevitz, J. W. (2019). Fast animal pose estimation using deep neural networks. Nature Methods, 16(1), 117–125.
Pessanha, F., Salah, A. A., van Loon, T. J. P. A. M., & Veltkamp, R. C. (2023). Facial image-based automatic assessment of equine pain. IEEE Transactions on Affective Computing, 14(3), 2064–2076.
Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C. P., Wang, X. Z., & Wu, Q. J. (2022). A review of generalized zero-shot learning methods. IEEE Transactions on Pattern Analysis and Machine Intelligence
Qian, R., Li, Y., Xu, Z., Yang, M.-H., Belongie, S., & Cui, Y. (2022). Multimodal open-vocabulary video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520).
Shi, M., Huang, Z., Ma, X., Hu, X., & Cao, Z. (2023). Matching is not enough: A two-stage framework for category-agnostic pose estimation. In IEEE/CVF conference on computer vision and pattern recognition, CVPR 2023, Vancouver, BC, Canada, June 17–24, 2023 (pp. 7308–7317). IEEE
Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR.
Tu, J., Wu, G., & Wang, L. (2023). Dual graph networks for pose estimation in crowded scenes. International Journal of Computer Vision, 1–21.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.
Wang, Y., Peng, C., & Liu, Y. (2018). Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Transactions on Circuits and Systems for Video Technology, 29(11), 3258–3268.
Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). Caltech-UCSD birds 200.
Weng, T., Xiao, J., Pan, H., & Jiang, H. (2023). PartCom: Part composition learning for 3d open-set recognition. International Journal of Computer Vision, 1–24.
Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., & Zhou, Q. (2018). Look at boundary: A boundary-aware face alignment algorithm. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2129–2138).
Xiao, B., Wu, H., & Wei, Y. (2018). Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV) (pp. 466–481).
Xu, L., Jin, S., Zeng, W., Liu, W., Qian, C., Ouyang, W., Luo, P., & Wang, X. (2022). Pose for everything: Towards category-agnostic pose estimation. In Computer vision—ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, proceedings, part VI (pp. 398–416)
Xu, M., Zhang, Z., Wei, F., Hu, H., & Bai, X. (2023). Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2945–2954).
Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2024). Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2), 1212–1230.
Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., & Xu, H. (2022). Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. arXiv preprint arXiv:2209.09407
Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., & Tao, D. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617
Zhang, H., Lai, S., Wang, Y., Da, Z., Dun, Y., & Qian, X. (2023). Scgnet: Shifting and cascaded group network. IEEE Transactions on Circuits and Systems for Video Technology
Zhang, H., Dun, Y., Pei, Y., Lai, S., Liu, C., Zhang, K., & Qian, X. (2024a). HF-HRNet: A simple hardware friendly high-resolution network. IEEE Transactions on Circuits and Systems for Video Technology. https://doi.org/10.1109/TCSVT.2024.3377365
Zhang, H., Shao, W., Liu, H., Ma, Y., Luo, P., Qiao, Y., & Zhang, K. (2024b). AVIbench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions. arXiv preprint arXiv:2403.09346
Zhou, Z., Li, H., Liu, H., Wang, N., Yu, G., & Ji, R. (2023). Star loss: Reducing semantic ambiguity in facial landmark detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15475–15484).
Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z., Zhang, S., & Gao, P. (2023). Pointclip v2: Prompting CLIP and GPT for powerful 3D open-world learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2639–2650).
Acknowledgements
This work was supported in part by the National Science Foundation of China (Grant No. 62088102) and in part by the National Key R&D Program of China (Grant No. 2022ZD0160101).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Hong Liu
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was done during Hao Zhang’s internship at Shanghai Artificial Intelligence Laboratory.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Xu, L., Lai, S. et al. Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching. Int J Comput Vis 132, 5741–5758 (2024). https://doi.org/10.1007/s11263-024-02126-3