Abstract
The realm of computer vision has witnessed a paradigm shift with the advent of foundation models, mirroring the transformative influence of large language models in natural language processing. This paper explores open-world segmentation and presents a novel approach, Image Prompt Segmentation (IPSeg), that harnesses the power of vision foundation models. At the core of IPSeg lies a training-free paradigm that capitalizes on image prompt techniques. Specifically, IPSeg uses a single image containing a subjective visual concept as a flexible prompt to query vision foundation models such as DINOv2 and Stable Diffusion. Our approach extracts robust features from the prompt image and the input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts that highlight target objects in the input image. The generated point prompts are further used to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg’s efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
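To make the extract-match-prompt-segment flow described above concrete, below is a minimal, hedged sketch of a training-free image-prompt segmentation loop. It is an illustration, not the authors' implementation: it uses only DINOv2 patch features (omitting the Stable Diffusion branch), replaces IPSeg's feature interaction module with a plain cosine-similarity matching step, and assumes hypothetical choices such as the 518-pixel input resolution, the number of positive points k, and the checkpoint path. The DINOv2 torch.hub entry point and the segment_anything SamPredictor API are the publicly released ones; for the actual method, see the code linked under Code Availability.

```python
# A minimal sketch of the extract-match-prompt-segment flow described in the
# abstract. Hypothetical simplifications: only DINOv2 features (no Stable
# Diffusion branch), cosine similarity in place of IPSeg's feature interaction
# module, and assumed values for the resolution, point count, and checkpoint.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from segment_anything import sam_model_registry, SamPredictor  # official SAM package

PATCH = 14  # DINOv2 ViT patch size


def dino_patch_features(model, img, size=518):
    """Return L2-normalized DINOv2 patch tokens and the token-grid shape."""
    tf = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    x = tf(img).unsqueeze(0)
    with torch.no_grad():
        tokens = model.forward_features(x)["x_norm_patchtokens"][0]  # (N, C)
    grid = size // PATCH
    return F.normalize(tokens, dim=-1), (grid, grid)


# 1) Extract robust features for the prompt image and the input image.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
prompt_img = Image.open("prompt.jpg").convert("RGB")  # image conveying the visual concept
input_img = Image.open("input.jpg").convert("RGB")
f_prompt, _ = dino_patch_features(dino, prompt_img)
f_input, (gh, gw) = dino_patch_features(dino, input_img)

# 2) Stand-in for the feature interaction module: cosine similarity between
#    every input patch and every prompt patch, reduced to one score per patch.
sim = f_input @ f_prompt.T                      # (N_input, N_prompt)
score = sim.max(dim=1).values.reshape(gh, gw)   # best prompt match per input patch

# 3) Convert the top-k matching patches into point prompts in pixel coordinates.
k = 3                                           # number of positive points (assumed)
idx = torch.topk(score.flatten(), k).indices
ys, xs = (idx // gw).float(), (idx % gw).float()
W, H = input_img.size
points = torch.stack([(xs + 0.5) * W / gw, (ys + 0.5) * H / gh], dim=1).numpy()

# 4) Guide the Segment Anything Model with the generated point prompts.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(input_img))        # HWC uint8 RGB
masks, mask_scores, _ = predictor.predict(
    point_coords=points.astype(np.float32),
    point_labels=np.ones(k, dtype=np.int64),
)
target_mask = masks[np.argmax(mask_scores)]     # best-scoring binary mask
```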
Data Availability Statement
All the data used in this study are available from third-party institutions. Researchers can access the data by following the instructions in the original works of the corresponding datasets. However, researchers should comply with the specific regulations stated by these datasets and use them only for academic purposes.
Code Availability
The code of this work is released at https://github.com/luckybird1994/IPSeg.
References
Angus, M., Czarnecki, K., & Salay, R. (2019). Efficacy of pixel-level ood detection for semantic segmentation. arXiv preprint arXiv:1911.02897
Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Bucher, M., Vu, T. H., Cord, M., et al. (2019). Zero-shot semantic segmentation. Advances in Neural Information Processing Systems. https://doi.org/10.48550/arXiv.1906.00817
Caesar, H., Uijlings, J. R. R., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, Computer vision foundation / IEEE computer society, pp 1209–1218
Cen, J., Yun, P., Cai, J., et al. (2021). Deep metric learning for open world semantic segmentation. In International conference on computer vision, pp 15,333–15,342
Cen, J., Zhou, Z., Fang, J., et al. (2023). Segment anything in 3d with nerfs. arXiv preprint arXiv:2304.12308
Cha, J., Mun, J., & Roh, B. (2023). Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Conference on computer vision and pattern recognition IEEE, pp 11,165–11,174
Chen, L. C., Papandreou, G., Kokkinos, I., et al. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Chen, T., Mai, Z., Li, R., et al. (2023). Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.05803
Cheng, J., Nandi, S., Natarajan, P., et al. (2021). SIGN: Spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In International conference on computer vision IEEE, pp 9536–9546
Cheng, Y., Li, L., Xu, Y., et al. (2023). Segment and track anything. arXiv preprint arXiv:2305.06558
Chowdhery, A., Narang, S., Devlin, J., et al. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.
Cui, Z., Longshi, W., & Wang, R. (2020). Open set semantic segmentation with statistical test and adaptive threshold. In 2020 IEEE International conference on multimedia and expo (ICME), IEEE, pp 1–6
Dai, W., Li, J., Li, D., et al. (2023). Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500
Devlin, J., Chang, MW., Lee, K., et al. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Everingham, M., Gool, L. V., Williams, C. K. I., et al. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Ghiasi, G., Gu, X., Cui, Y., et al. (2022). Scaling open-vocabulary image segmentation with image-level labels. In European conference on computer vision, vol 13696. Springer, pp 540–557
Gu, Z., Zhou, S., Niu, L., et al. (2020). Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM international conference on multimedia, pp 1921–1929
Guo, J., Hao, Z., Wang, C., et al. (2024). Data-efficient large vision models through sequential autoregression. arXiv preprint arXiv:2402.04841
Gupta, A., Dollár, P., & Girshick, R. B. (2019). LVIS: A dataset for large vocabulary instance segmentation. In CVPR, Computer vision foundation / IEEE, pp 5356–5364
Hammam, A., Bonarens, F., Ghobadi, SE., et al. (2023). Identifying out-of-domain objects with dirichlet deep neural networks. In International conference on computer vision, pp 4560–4569
He, K., Chen, X., Xie, S., et al. (2022). Masked autoencoders are scalable vision learners. In Conference on computer vision and pattern recognition, pp 16,000–16,009
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Jiang, PT., & Yang, Y. (2023). Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.01275
Kirillov, A., Mintun, E., Ravi, N., et al. (2023). Segment anything. arXiv preprint arXiv:2304.02643
Li, J., Li, D., Xiong, C., et al. (2022). BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, Proceedings of Machine Learning Research, PMLR, vol 162, pp 12,888–12,900
Li, X., Wei, T., Chen, YP., et al. (2020). FSS-1000: A 1000-class dataset for few-shot segmentation. In Conference on computer vision and pattern recognition, Computer vision foundation / IEEE, pp 2866–2875
Liang, F., Wu, B., Dai, X., et al. (2023). Open-vocabulary semantic segmentation with mask-adapted CLIP. In Conference on computer vision and pattern recognition. IEEE, pp 7061–7070
Liu, H., Li, C., Wu, Q., et al. (2023a). Visual instruction tuning. CoRR abs/2304.08485
Liu, Q., Wen, Y., Han, J., et al. (2022). Open-world semantic segmentation via contrasting and clustering vision-language embedding. European conference on computer vision, Springer, pp. 275–292
Liu, Y., Zhu, M., Li, H., et al. (2023b). Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:2305.13310
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Conference on computer vision and pattern recognition, pp 3431–3440
Lu, J., Batra, D., Parikh, D., et al. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32
Luo, H., Bao, J., Wu, Y., et al. (2023). Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. International Conference on Machine Learning, PMLR, pp. 23033–23044
Ma, C., Yang, Y., Wang, Y., et al. (2022). Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:2210.15138
Morabia, K., Arora, J., & Vijaykumar, T. (2020). Attention-based joint detection of object and semantic part. CoRR abs/2007.02419
Mottaghi, R., Chen, X., Liu, X., et al. (2014). The role of context for object detection and semantic segmentation in the wild. In Conference on computer vision and pattern recognition. IEEE Computer Society, pp 891–898
Nguyen, K., & Todorovic, S. (2019). Feature weighting and boosting for few-shot segmentation. In International conference on computer vision, IEEE, pp 622–631
Qin, J., Wu, J., Yan, P., et al. (2023). Freeseg: Unified, universal and open-vocabulary image segmentation. In Conference on computer vision and pattern recognition. IEEE, pp 19,446–19,455
Oquab, M., Darcet, T., Moutakanni, T., et al. (2023). Dinov2: Learning robust visual features without supervision. CoRR abs/2304.07193
Pont-Tuset, J., Perazzi, F., Caelles, S., et al. (2017). The 2017 DAVIS challenge on video object segmentation. CoRR abs/1704.00675
Qi, L., Kuen, J., Wang, Y., et al. (2022). Open world entity segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 8743–8756.
Radford, A., Narasimhan, K., Salimans, T., et al. (2018). Improving language understanding by generative pre-training. OpenAI
Radford, A., Wu, J., Child, R., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Radford, A., Kim, JW., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on machine learning, Proceedings of Machine Learning Research, PMLR, vol 139, pp 8748–8763
Ramanathan, V., Kalia, A., Petrovic, V., et al. (2023). PACO: parts and attributes of common objects. In CVPR, IEEE, pp 7141–7151
Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision transformers for dense prediction. In Conference on computer vision and pattern Recognition, pp 12,179–12,188
Rombach, R., Blattmann, A., Lorenz, D., et al. (2022). High-resolution image synthesis with latent diffusion models. In Conference on computer vision and pattern recognition
Shen, Q., Yang, X., & Wang, X. (2023). Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261
Tang, L., Xiao, H., & Li, B. (2023). Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709
Touvron, H., Lavril, T., Izacard, G., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Wang, X., Wang, W., Cao, Y., et al. (2023a). Images speak in images: A generalist painter for in-context visual learning. In Conference on computer vision and pattern recognition. IEEE, pp 6830–6839
Wang, X., Zhang, X., Cao, Y., et al. (2023b). Seggpt: Towards segmenting everything in context. In International conference on computer vision. IEEE, pp 1130–1140
Xia, Y., Zhang, Y., Liu, F., et al. (2020). Synthesize then compare: Detecting failures and anomalies for semantic segmentation. European conference on computer vision, Springer, pp. 145–161
Xian, Y., Choudhury, S., He, Y., et al. (2019). Semantic projection network for zero-and few-label semantic segmentation. In Conference on computer vision and pattern recognition, pp 8256–8265
Xu, J., Mello, SD., Liu, S., et al. (2022a). Groupvit: Semantic segmentation emerges from text supervision. In Conference on computer vision and pattern recognition. IEEE, pp 18,113–18,123
Xu, M., Zhang, Z., Wei, F., et al. (2022). A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. European conference on computer vision, Springer, pp. 736–753
Xu, M., Zhang, Z., Wei, F., et al. (2023). Side adapter network for open-vocabulary semantic segmentation. In Conference on Computer Vision and Pattern Recognition, pp 2945–2954
Yang, J., Gao, M., Li, Z., et al. (2023). Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968
Zhang, K., & Liu, D. (2023). Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785
Zhang, R., Han, J., Zhou, A., et al. (2023a). Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199
Zhang, R., Jiang, Z., Guo, Z., et al. (2023b). Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048
Zhang, S., Roller, S., Goyal, N., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Zhao, W., Rao, Y., Liu, Z., et al. (2023). Unleashing text-to-image diffusion models for visual perception. CoRR
Zhou, C., Loy, C. C., & Dai, B. (2022). Extract free dense labels from clip. European conference on computer vision, Springer, pp. 696–712
Zhou, H., Chen, P., Yang, L., et al. (2023a). Activation to saliency: Forming high-quality labels for unsupervised salient object detection. IEEE Transactions on Circuits and Systems for Video Technology, 33(2), 743–755.
Zhou, H., Qiao, B., Yang, L., et al. (2023b). Texture-guided saliency distilling for unsupervised salient object detection. In Conference on computer vision and pattern recognition
Zhou, Z., Lei, Y., Zhang, B., et al. (2023c). Zegclip: Towards adapting clip for zero-shot semantic segmentation. In IEEE conference on computer vision and pattern recognition, pp 11,175–11,185
Zhu, D., Chen, J., Shen, X., et al. (2023a). Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
Zhu, J., Chen, Z., Hao, Z., et al. (2023b). Tracking anything in high quality. arXiv preprint arXiv:2307.13974
Additional information
Communicated by Hong Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, L., Jiang, PT., Xiao, H. et al. Towards Training-Free Open-World Segmentation via Image Prompt Foundation Models. Int J Comput Vis 133, 1–15 (2025). https://doi.org/10.1007/s11263-024-02185-6