Abstract
Contrastive language-image pre-training (CLIP) has achieved great success in various computer vision tasks, and its large-scale pre-trained knowledge presents an opportune avenue for enhancing weakly-supervised image understanding. Weakly-supervised semantic segmentation (WSSS) reduces the reliance on pixel-level human annotation by refining the class activation map (CAM) into high-quality pseudo masks, but existing refinement methods rely heavily on inductive biases such as hand-crafted priors and digital image processing. We instead propose a novel text-to-pixel matching paradigm for WSSS that leverages a vision-language pre-trained model, i.e., CLIP. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling needed to fully utilize the pre-trained knowledge, and (3) insufficient detail owing to the \(\frac{1}{16}\) down-sampled resolution of ViT features. We therefore propose WeakCLIP to address these problems and transfer CLIP's pre-trained knowledge to WSSS. Specifically, we first bridge the task gap with a pyramid adapter and learnable prompts that extract WSSS-specific representations. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and a text-guided decoder gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge for CAM refinement. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014.
The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
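The text-to-pixel matching idea sketched in the abstract can be illustrated with a minimal, hedged example: scoring every pixel feature against per-class text embeddings by cosine similarity and normalizing over classes. This is not the paper's co-attention matching module; the function name, temperature value, and random inputs below are illustrative assumptions only.

```python
import numpy as np

def text_to_pixel_similarity(pixel_feats, text_embeds, tau=0.07):
    """Toy text-to-pixel matching (illustrative, not WeakCLIP's module).

    pixel_feats: (H*W, D) per-pixel visual features.
    text_embeds: (C, D) one embedding per class text prompt.
    Returns a (H*W, C) map of per-pixel class probabilities.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = (p @ t.T) / tau  # tau is a CLIP-style temperature (assumed value)
    # Softmax over the class axis yields soft pseudo-mask seeds per pixel.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
probs = text_to_pixel_similarity(rng.standard_normal((16, 8)),
                                 rng.standard_normal((3, 8)))
print(probs.shape)  # (16, 3): 16 pixels scored against 3 class prompts
```

Thresholding or argmax over the class axis of such a map is one simple way to obtain pseudo-mask seeds from text guidance.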
Additional information
Communicated by Gunhee Kim.
Cite this article
Zhu, L., Wang, X., Feng, J. et al. WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation. Int J Comput Vis 133, 1085–1105 (2025). https://doi.org/10.1007/s11263-024-02224-2