
Towards Training-Free Open-World Segmentation via Image Prompt Foundation Models

International Journal of Computer Vision

Abstract

Computer vision has witnessed a paradigm shift with the advent of foundation models, mirroring the transformative influence of large language models in natural language processing. This paper explores open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundation models. At the core of IPSeg lies a training-free paradigm that capitalizes on image prompting: a single image containing a subjective visual concept serves as a flexible prompt for querying vision foundation models such as DINOv2 and Stable Diffusion. Our approach extracts robust features from the prompt and input images, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts that highlight target objects in the input image. These point prompts in turn guide the Segment Anything Model to segment the target object in the input image. By eliminating the need for exhaustive training sessions, the proposed method offers a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
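To make the pipeline concrete, the following is a minimal, self-contained sketch of the image-prompt matching idea described above. It is not the IPSeg implementation: the function names are illustrative, the patch extractor is a toy average-pooling stand-in for the frozen foundation-model encoders (DINOv2, Stable Diffusion), and the argmax selection stands in for the paper's feature interaction module. The resulting point would then be passed to a Segment Anything-style model.

```python
import torch
import torch.nn.functional as F

def extract_patch_features(image: torch.Tensor, patch: int = 14) -> torch.Tensor:
    # Toy stand-in for a frozen foundation-model encoder (e.g. DINOv2):
    # average-pool each patch and flatten to (num_patches, channels).
    # A real implementation would return semantic patch embeddings instead.
    feats = F.avg_pool2d(image.unsqueeze(0), kernel_size=patch, stride=patch)
    return feats.flatten(2).squeeze(0).T  # (num_patches, C)

def point_prompt_from_image_prompt(prompt_img, input_img, patch=14):
    # Match every input-image patch against every prompt-image patch by cosine
    # similarity and return a normalized (x, y) point at the best-matching patch.
    p_feats = F.normalize(extract_patch_features(prompt_img, patch), dim=-1)
    i_feats = F.normalize(extract_patch_features(input_img, patch), dim=-1)
    sim = i_feats @ p_feats.T                  # (N_input, N_prompt)
    score = sim.max(dim=-1).values             # best prompt match per input patch
    idx = int(score.argmax())
    grid_h = input_img.shape[-2] // patch
    grid_w = input_img.shape[-1] // patch
    y, x = divmod(idx, grid_w)
    return (x + 0.5) / grid_w, (y + 0.5) / grid_h  # point prompt for a SAM-style model

# Example with random tensors, just to exercise the matching logic.
prompt_img, input_img = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
print(point_prompt_from_image_prompt(prompt_img, input_img))
```

IPSeg's actual feature interaction module is more elaborate than a single argmax (the released code should be treated as authoritative), but the sketch illustrates the key point: a point prompt can emerge purely from feature matching between a prompt image and an input image, without any training.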



Data Availability Statement

All data used in this study are available from third-party institutions. Researchers can access the data by following the instructions in the original publications of the corresponding datasets. However, researchers should follow the specific regulations stated by these datasets and use them for academic purposes only.

Code Availability

The code of this work is released at https://github.com/luckybird1994/IPSeg.


Author information

Corresponding author

Correspondence to Bo Li.

Additional information

Communicated by Hong Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tang, L., Jiang, PT., Xiao, H. et al. Towards Training-Free Open-World Segmentation via Image Prompt Foundation Models. Int J Comput Vis 133, 1–15 (2025). https://doi.org/10.1007/s11263-024-02185-6


Keywords