Abstract
While promising results have been achieved in weakly supervised semantic segmentation (WSSS), the limited supervision from image-level tags inevitably induces a reliance on discriminative regions and spurious correlations between target classes and background regions. As a result, the Class Activation Map (CAM) tends to activate only the most discriminative object regions while falsely including many class-related background regions. Without pixel-level supervision, it is difficult to enlarge the foreground activation and suppress such false background activations. In this paper, we propose a novel framework, Cross Language Image Matching with Automatic Context Discovery (CLIMS++), built on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and to suppress class-related background regions in CAM. In particular, we design object, background-region, and text-label matching losses that guide the model to activate more reasonable object regions for each category. In addition, we propose to automatically discover spurious correlations between foreground categories and backgrounds, from which a background suppression loss is derived to suppress the activation of class-related backgrounds. Together, these designs enable the proposed CLIMS++ to generate more complete and compact activation maps for target objects. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets show that CLIMS++ significantly outperforms previous state-of-the-art methods.
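The loss design summarized above can be illustrated with a minimal sketch. Everything here is a stand-in for exposition only: `encode` is a toy pooling function in place of the CLIP image encoder, `text_fg`/`text_bg` stand in for CLIP text embeddings of the class label and a class-related background phrase, and the shapes and names are hypothetical, not the paper's implementation. The sketch shows the three terms: reward matching between the CAM-masked foreground and the label text, penalize matching between the CAM's complement and the label text, and penalize matching between the foreground and class-related background text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def encode(img):
    # Toy image "encoder": per-channel mean pooling.
    # In the actual method this would be the frozen CLIP image encoder.
    return img.reshape(img.shape[0], -1).mean(axis=1)

def clims_losses(cam, image, text_fg, text_bg, encode):
    """Sketch of the object/background matching and suppression losses.

    cam:      (H, W) activation map in [0, 1] for one class
    image:    (C, H, W) input image
    text_fg:  embedding of the class-label text (e.g. "a photo of a train")
    text_bg:  embedding of a class-related background text (e.g. "railroad")
    """
    fg_feat = encode(cam * image)          # CAM-masked foreground
    bg_feat = encode((1.0 - cam) * image)  # complement (background) region
    s_otm = cosine(fg_feat, text_fg)  # object-text matching: maximize
    s_btm = cosine(bg_feat, text_fg)  # background-text matching: minimize
    s_cbs = cosine(fg_feat, text_bg)  # class-related background: minimize
    eps = 1e-8
    return (-np.log(sigmoid(s_otm) + eps)
            - np.log(1.0 - sigmoid(s_btm) + eps)
            - np.log(1.0 - sigmoid(s_cbs) + eps))

# Toy data with hypothetical shapes (3-channel 4x4 image).
rng = np.random.default_rng(0)
image = rng.random((3, 4, 4))
cam = rng.random((4, 4))
text_fg = rng.random(3)
text_bg = rng.random(3)
loss = clims_losses(cam, image, text_fg, text_bg, encode)
```

Minimizing such an objective pushes the CAM to cover regions that match the label text (enlarging foreground activation) while leaving out regions that match class-related background text (suppressing spurious activation).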
Data Availability
The datasets adopted in this study are available from the PASCAL VOC 2012 (http://host.robots.ox.ac.uk/pascal/VOC/voc2012) and the MS COCO 2014 (https://cocodataset.org).
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2024YFF0618403), the National Natural Science Foundation of China under Grant 82261138629, the Guangdong-Macao Science and Technology Innovation Joint Foundation under Grant 2024A0505090003, the Guangdong Provincial Key Laboratory under Grant 2023B1212060076, and the Shenzhen Municipal Science and Technology Innovation Council under Grant JCYJ20220531101412030.
Communicated by Bryan Allen Plummer.
About this article
Cite this article
Xie, J., Deng, S., Hou, X. et al. CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation. Int J Comput Vis 133, 5569–5588 (2025). https://doi.org/10.1007/s11263-025-02442-2