
Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching

  • Published:
International Journal of Computer Vision

Abstract

Current image-based keypoint detection methods for animal (including human) bodies and faces generally fall into fully supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotation, which makes it difficult to extend keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires annotated support images for reference during testing. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which uses text prompts to identify arbitrary keypoints across any species. In pursuit of this goal, we develop a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework synergistically combines vision and language models, creating an interplay between language features and local keypoint visual features. KDSM further integrates Domain Distribution Matrix Matching (DDMM) and other dedicated modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, to improve generalizability and overall performance. Comprehensive experiments demonstrate that KDSM significantly outperforms the baseline and achieves remarkable success on the OVKD task. Impressively, our method, operating in a zero-shot fashion, still yields results comparable to state-of-the-art few-shot species-class-agnostic keypoint detection methods. Code and data are available at https://github.com/zhanghao5201/KDSM.
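To make the semantic-feature-matching idea concrete, the sketch below shows, in plain PyTorch, how keypoint text-prompt embeddings can be matched against a dense visual feature map to produce one heatmap per prompt. This is a minimal illustration under our own assumptions, not the authors' implementation: ToyTextEncoder, ToyImageEncoder, and match_keypoints are hypothetical stand-ins for the pretrained vision-language encoders and the dedicated modules used in KDSM.

# Minimal sketch (not the authors' implementation) of semantic-feature matching
# for open-vocabulary keypoint detection: embed each keypoint text prompt,
# embed the image into a dense feature map, and treat the per-location
# similarity between the two as a keypoint heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen language encoder: one embedding per prompt."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings

    def forward(self, token_ids, offsets):
        return F.normalize(self.embed(token_ids, offsets), dim=-1)  # (K, dim)

class ToyImageEncoder(nn.Module):
    """Stand-in for a visual backbone producing a dense, L2-normalized feature map."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, images):
        return F.normalize(self.net(images), dim=1)  # (B, dim, H', W')

def match_keypoints(text_emb, feat_map, temperature=0.07):
    """Dot-product matching: one spatial heatmap per keypoint prompt."""
    B, _, H, W = feat_map.shape
    logits = torch.einsum("kc,bchw->bkhw", text_emb, feat_map) / temperature
    heatmaps = logits.flatten(2).softmax(dim=-1).view(B, -1, H, W)
    idx = heatmaps.flatten(2).argmax(dim=-1)            # (B, K) flat indices
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W
    return heatmaps, torch.stack([xs, ys], dim=-1)      # coordinates in feature-map cells

if __name__ == "__main__":
    prompts = ["left eye of a cat", "right front paw of a cat"]
    # Toy tokenization: hash words into a small vocabulary (illustration only).
    tokens = [[hash(w) % 1000 for w in p.split()] for p in prompts]
    flat = torch.tensor([t for ts in tokens for t in ts])
    offsets = torch.tensor([0] + [len(t) for t in tokens[:-1]]).cumsum(0)

    text_emb = ToyTextEncoder()(flat, offsets)                 # (2, 256)
    feat_map = ToyImageEncoder()(torch.randn(1, 3, 256, 256))  # (1, 256, 64, 64)
    heatmaps, coords = match_keypoints(text_emb, feat_map)
    print(heatmaps.shape, coords)

In KDSM itself, the toy encoders above are replaced by pretrained vision and language models, and the DDMM and VKRA modules mentioned in the abstract are added on top of this matching to improve generalization to unseen keypoints and species.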



Data availability

The MP-100 dataset used in this study can be downloaded at: https://github.com/luminxu/Pose-for-Everything. Our reorganized and partitioned dataset MP-78 is released together with our source code.

Notes

  1. We refer to the method “Few-shot keypoint detection with uncertainty learning for unseen species” as FS-ULUS.

References

  • Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3686–3693)

  • Bangalath, H., Maaz, M., Khattak, M. U., Khan, S. H., & Shahbaz Khan, F. (2022). Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35, 33781–33794.

  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part I 16 (pp. 213–229)

  • Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., & Lin, D. (2023). Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • Fang, H.-S., Xie, S., Tai, Y.-W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In ICCV (pp. 2334–2343)

  • Feighelstein, M., Shimshoni, I., Finka, L. R., Luna, S. P. L., Mills, D. S., & Zamansky, A. (2022). Automated recognition of pain in cats. Scientific Reports, 12(1), 9575.

  • Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning (pp. 1126–1135). PMLR

  • Graving, J. M., Chae, D., Naik, H., Li, L., Koger, B., Costelloe, B. R., & Couzin, I. D. (2019). Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. Elife, 8, e47994.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)

  • Hu, S., Zheng, C., Zhou, Z., Chen, C., & Sukthankar, G. (2023). Lamp: Leveraging language prompts for multi-person pose estimation. In 2023 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 3759–3766). IEEE

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, H., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning (pp. 4904–4916). PMLR

  • Khan, M. H., McDonagh, J., Khan, S., Shahabuddin, M., Arora, A., Khan, F. S., Shao, L., & Tzimiropoulos, G. (2020). Animalweb: A large-scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6939–6948)

  • Koestinger, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops) (pp. 2144–2151). IEEE

  • Kumar, A., Marks, T. K., Mou, W., Wang, Y., Jones, M., Cherian, A., Koike-Akino, T., Liu, X., & Feng, C. (2020). Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8236–8246)

  • Labuguen, R., Matsumoto, J., Negrete, S. B., Nishimaru, H., Nishijo, H., Takada, M., Go, Y., Inoue, K., & Shibata, T. (2021). Macaquepose: A novel “in the wild” macaque monkey pose dataset for markerless motion capture. Frontiers in Behavioral Neuroscience, 14, 581154.

  • Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022). Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546

  • Li, D., Li, J., & Hoi, S. (2024). Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36.

  • Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Zhang, J., Ning, M., & Yuan, L. (2024). Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part V 13 (pp. 740–755)

  • Lu, C., & Koniusz, P. (2022). Few-shot keypoint detection with uncertainty learning for unseen species. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 19416–19426)

  • Martvel, G., Farhat, N., Shimshoni, I., & Zamansky, A. (2023). Catflw: Cat facial landmarks in the wild dataset. arXiv preprint arXiv:2305.04232

  • Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In Computer vision—ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, part VIII 14 (pp. 483–499)

  • Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). Expanding language-image pretrained models for general video recognition. In Computer vision—ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, proceedings, part IV (pp. 1–18)

  • Pan, Y., Yao, T., Li, Y., & Mei, T. (2020). X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10971–10980)

  • Patel, M., Gu, Y., Carstensen, L. C., Hasselmo, M. E., & Betke, M. (2023). Animal pose tracking: 3D multimodal dataset and token-based pose optimization. International Journal of Computer Vision, 131(2), 514–530.

  • Pereira, T. D., Aldarondo, D. E., Willmore, L., Kislin, M., Wang, S.S.-H., Murthy, M., & Shaevitz, J. W. (2019). Fast animal pose estimation using deep neural networks. Nature Methods, 16(1), 117–125.

  • Pessanha, F., Salah, A. A., van Loon, T. J. P. A. M., & Veltkamp, R. C. (2023). Facial image-based automatic assessment of equine pain. IEEE Transactions on Affective Computing, 14(3), 2064–2076.

  • Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C. P., Wang, X. Z., & Wu, Q. J. (2022). A review of generalized zero-shot learning methods. IEEE Transactions on Pattern Analysis and Machine Intelligence

  • Qian, R., Li, Y., Xu, Z., Yang, M.-H., Belongie, S., & Cui, Y. (2022). Multimodal open-vocabulary video classification via pre-trained vision and language models. arXiv preprint arXiv:2207.07646

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G., Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B.. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520).

  • Shi, M., Huang, Z., Ma, X., Hu, X., & Cao, Z. (2023). Matching is not enough: A two-stage framework for category-agnostic pose estimation. In IEEE/CVF conference on computer vision and pattern recognition, CVPR 2023, Vancouver, BC, Canada, June 17–24, 2023 (pp. 7308–7317). IEEE

  • Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114).

  • Tu, J., Wu, G., & Wang, L. (2023). Dual graph networks for pose estimation in crowded scenes. International Journal of Computer Vision, 1–21.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

  • Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Yadong, M., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.

  • Wang, Y., Peng, C., & Liu, Y. (2018). Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Transactions on Circuits and Systems for Video Technology, 29(11), 3258–3268.

  • Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). Caltech-UCSD birds 200.

  • Weng, T., Xiao, J., Pan, H., & Jiang, H. (2023). PartCom: Part composition learning for 3d open-set recognition. International Journal of Computer Vision, 1–24.

  • Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., & Zhou, Q. (2018). Look at boundary: A boundary-aware face alignment algorithm. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2129–2138).

  • Xiao, B., Wu, H., & Wei, Y. (2018). Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV) (pp. 466–481).

  • Xu, L., Jin, S., Zeng, W., Liu, W., Qian, C., Ouyang, W., Luo, P., & Wang, X. (2022). Pose for everything: Towards category-agnostic pose estimation. In Computer vision—ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, proceedings, part VI (pp. 398–416)

  • Xu, M., Zhang, Z., Wei, F., Hu, H., & Bai, X. (2023). Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2945–2954).

  • Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2024). Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2), 1212–1230.

  • Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., & Xu, H. (2022). Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. arXiv preprint arXiv:2209.09407

  • Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., & Tao, D. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617

  • Zhang, H., Lai, S., Wang, Y., Da, Z., Dun, Y., & Qian, X. (2023). Scgnet: Shifting and cascaded group network. IEEE Transactions on Circuits and Systems for Video Technology

  • Zhang, H., Dun, Y., Pei, Y., Lai, S., Liu, C., Zhang, K., & Qian, X. (2024). HF-HRNet: A simple hardware friendly high-resolution network. IEEE Transactions on Circuits and Systems for Video Technology. https://doi.org/10.1109/TCSVT.2024.3377365

  • Zhang, H., Shao, W., Liu, H., Ma, Y., Luo, P., Qiao, Y., & Zhang, K. (2024b). AVIbench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions. arXiv preprint arXiv:2403.09346

  • Zhou, Z., Li, H., Liu, H., Wang, N., Yu, G., & Ji, R. (2023). Star loss: Reducing semantic ambiguity in facial landmark detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15475–15484).

  • Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z., Zhang, S., & Gao, P. (2023). Pointclip v2: Prompting clip and GPT for powerful 3d open-world learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2639–2650).

Acknowledgements

This work was supported in part by the National Science Foundation of China (Grant No. 62088102) and in part by the National Key R&D Program of China (No. 2022ZD0160101).

Author information

Corresponding authors

Correspondence to Nanning Zheng or Kaipeng Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Hong Liu

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was done during Hao Zhang’s internship at Shanghai Artificial Intelligence Laboratory.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Xu, L., Lai, S. et al. Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching. Int J Comput Vis 132, 5741–5758 (2024). https://doi.org/10.1007/s11263-024-02126-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11263-024-02126-3

Keywords