Abstract
Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets has become a method of choice for achieving strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, popular datasets often have discrepant granularities. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes, etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method improves within-dataset and cross-dataset generalization, and offers the opportunity to learn visual concepts that are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.
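The core idea of treating a dataset label as a union of universal visual concepts can be illustrated with a minimal sketch. This is our own illustrative reconstruction, not the authors' released code: we assume each dataset label maps to the set of universal concept indices it subsumes, and that the likelihood of a coarse label is obtained by summing the posteriors of its member concepts before taking the negative log-likelihood.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def partial_label_nll(logits, label_to_universal, targets):
    """NLL of dataset labels viewed as unions of universal concepts.

    logits:  (N, U) array of per-pixel scores over U universal concepts
    targets: length-N sequence of dataset-level label ids
    label_to_universal: dict mapping a dataset label id to the universal
        concept indices it subsumes, e.g. a coarse "road" label covering
        {road surface, marking, manhole} (the indices here are illustrative).
    """
    probs = softmax(logits)
    losses = []
    for p, t in zip(probs, targets):
        union = label_to_universal[t]
        # P(coarse label) = sum of posteriors of the concepts it subsumes
        losses.append(-np.log(p[union].sum() + 1e-9))
    return float(np.mean(losses))
```

Under this formulation, a pixel labeled with a coarse class incurs a small loss as long as the model concentrates probability anywhere within the subsumed concepts, which is what allows finer concepts to be learned from coarser supervision.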
Data Availability Statement
We perform our experiments on the following publicly available datasets: ADE20k Zhou et al. (2017), BDD Yu et al. (2018), Camvid Badrinarayanan et al. (2017), Cityscapes Cordts et al. (2016), COCO Lin et al. (2014), IDD Varma et al. (2019), KITTI Geiger et al. (2013), MSeg Lambert et al. (2020), SUN RGBD Song et al. (2015), Scannet Dai et al. (2017), Viper Richter et al. (2017), Vistas Neuhold et al. (2017), and WildDash 2 Zendel et al. (2018). Our universal taxonomy for these datasets is available online Bevandic (2022).
References
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.
Bevandic, P. (2022). Universal taxonomies for semantic segmentation (source code). Accessed 02 Dec 2022. https://github.com/UNIZG-FER-D307/universal_taxonomies.
Bevandic, P., & Segvic, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. In: BMVC
Bevandić, P., Oršić, M., Grubišić, I., Šarić, J., & Šegvić, S. (2022). Multi-domain semantic segmentation with overlapping labels. In: Proceedings of the IEEE/CVF Winter conference on applications of computer vision (WACV), pp. 2615–2624.
Bevandić, P., Krešo, I., Oršić, M., & Šegvić, S. (2022). Dense open-set recognition based on training with noisy negative images. Image and Vision Computing, 124, 104490. https://doi.org/10.1016/j.imavis.2022.104490
Biase, G.D., Blum, H., Siegwart, R., & Cadena, C. (2021). Pixel-wise anomaly detection in complex driving scenes. In: Computer vision and pattern recognition, CVPR
Blum, H., Sarlin, P., Nieto, J. I., Siegwart, R., & Cadena, C. (2021). The fishyscapes benchmark: Measuring blind spots in semantic segmentation. International Journal of Computer Vision, 129(11), 3119–3135.
Chan, R., Lis, K., Uhlemeyer, S., Blum, H., Honari, S., Siegwart, R., et al. (2021). SegmentMeIfYouCan: A benchmark for anomaly segmentation. In: Vanschoren, J., Yeung, S. (Eds.) NeurIPS
Chan, R., Rottmann, M., & Gottschalk, H. (2021). Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In: International conference on computer vision, ICCV
Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., et al. (2020). Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12475–12485.
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289.
Cheng, B., Schwing, A.G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. (2016). The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223.
Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. The Journal of Machine Learning Research, 12, 1501–1536.
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In: CVPR
Everingham, M., Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Fourure, D., Emonet, R., Fromont, E., Muselet, D., Neverova, N., Trémeau, A., et al. (2017). Multi-task, multi-domain learning: application to semantic segmentation and pose regression. Neurocomputing.
Galleguillos, C., & Belongie, S. (2010). Context based object categorization: A critical survey. Computer Vision and Image Understanding, 06(114), 712–722. https://doi.org/10.1016/j.cviu.2010.02.004
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. Int J Robotics Res., 32(11), 1231–1237.
Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Huang, G., Liu, Z., Pleiss, G., Van Der Maaten, L., & Weinberger, K. (2019). Convolutional networks with dense connectivity. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kalluri, T., Varma, G., Chandraker, M., & Jawahar, C. (2019). Universal semi-supervised semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, pp. 5259–5270.
Kim, D., Tsai, Y., Suh, Y., Faraki, M., Garg, S., Chandraker, M., et al. (2022). Learning semantic segmentation from multiple datasets with label shifts. In: ECCV
Krešo, I., Krapac, J., & Šegvić, S. (2021). Efficient ladder-style DenseNets for semantic segmentation of large images. IEEE Transactions on Intelligent Transportation Systems, 22(8), 4951–4961.
Lambert, J., Liu, Z., Sener, O., Hays, J., Koltun, V. (2020). MSeg: A composite dataset for multi-domain semantic segmentation. In: CVPR
Lee, D.H. (2013). Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop on challenges in representation learning (WREPL)
Li, L., Zhou, T., Wang, W., Li, J., & Yang, Y. (2022). Deep hierarchical semantic segmentation. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 1236–1247.
Liang, X., Zhou, H., & Xing, E. (2018). Dynamic-structured semantic propagation network. In: CVPR, pp. 752–761.
Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common Objects in Context. In: ECCV, pp. 740–755.
Liu, Y., Ge, P., Liu, Q., Fan, S., & Wang, Y. (2022). An empirical study on multi-domain robust semantic segmentation. arXiv preprint arXiv:2212.04221.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
Masaki, S., Hirakawa, T., Yamashita, T., & Fujiyoshi, H. (2021). Multi-domain semantic segmentation using multi-head model. In: 2021 IEEE international intelligent transportation systems conference (ITSC), pp. 2802–2807.
McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. In: NAACL, pp. 152–159.
Meletis, P., & Dubbelman, G. (2018). Training of Convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation. In: Intelligent vehicles symposium, pp. 1045–1050.
Mohan, R., & Valada, A. (2021). EfficientPS: Efficient panoptic segmentation. International Journal of Computer Vision, 129, 1551–1579.
Neuhold, G., Ollmann, T., Rota Bulò, S., & Kontschieder, P. (2017). Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In: ICCV, pp. 5000–5009.
Oršić, M., & Šegvić, S. (2021). Efficient semantic segmentation with pyramidal fusion. Pattern Recognition, 107611.
Oršić, M., Bevandić, P., Grubišić, I., Šarić, J., & Šegvić, S. (2020). Multi-domain semantic segmentation with pyramidal fusion. arXiv preprint arXiv:2009.01636, CVPRW RVC.
Porzi, L., Bulò, S.R., Colovic, A., & Kontschieder, P. (2019). Seamless scene segmentation. In: CVPR, pp. 8277–8286.
Richter, S. R., Hayder, Z., & Koltun, V. (2017). Playing for Benchmarks. In: ICCV, pp. 2232–2241.
Robust Vision Challenge. Accessed 02 Dec 2022. http://www.robustvision.net/index.php.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp. 234–241
Rota Bulò, S., Porzi, L., & Kontschieder, P. (2018). In-place activated batchnorm for memory-optimized training of DNNS. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5639–5647.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520.
Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54–63.
Shelhamer, E., Long, J., & Darrell, T. (2017). Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In: CVPR, pp. 567–576.
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV, pp. 843–852.
Uijlings, J.R.R., Mensink, T., Ferrari, V. (2022). The missing link: Finding label relations across datasets. In: ECCV, pp. 540–556.
Varma, G., Subramanian, A., Namboodiri, A.M., Chandraker, M., & Jawahar, C.V. (2019). IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: WACV, pp. 1743–1751.
Xiao, J., Xu, Z., Lan, S., Yu, Z., Yuille, A., & Anandkumar, A. (2022). 1st place solution of the robust vision challenge 2022 semantic segmentation track. CoRR. abs/2210.12852.
Yin, W., Liu, Y., Shen, C., van den Hengel, A., & Sun, B. (2022). The devil is in the labels: Semantic segmentation from sentences. CoRR. abs/2202.02002.
Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In: International conference on learning representations (ICLR)
Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., et al. (2018). BDD100K: A diverse driving video database with scalable annotation tooling. arXiv:1805.04687.
Zendel, O., Honauer, K., Murschitz, M., Steininger, D., & Fernandez Dominguez, G. (2018). WildDash - Creating hazard-aware benchmarks. In: ECCV
Zendel, O., Schörghuber, M., Rainer, B., Murschitz, M., & Beleznai, C. (2022). Unifying panoptic segmentation for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 21351–21360.
Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., (Eds.), Proceedings of 15th European conference computer vision—ECCV 2018, Munich, Germany, September 8-14, 2018, Part III. vol. 11207 of Lecture Notes in Computer Science. Springer; pp. 418–434.
Zhao, X., Schulter, S., Sharma, G., Tsai, Y., Chandraker, M., & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In: ECCV, pp. 178–193.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.
Zhen, M., Wang, J., Zhou, L., Fang, T., & Quan, L. (2019). Learning fully dense neural networks for image semantic segmentation. In: AAAI
Zhou, X., Koltun, V., & Krähenbühl, P. (2022). Simple multi-dataset detection. In: CVPR
Zhou, T., Wang, W., Konukoglu, E., & Van Gool, L. (2022). Rethinking semantic segmentation: A prototype view. In: CVPR, pp. 2572–2583.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In: CVPR, pp. 633–641.
Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., et al. (2019). Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8856–8865.
Zlateski, A., Jaroensri, R., Sharma, P., & Durand F. (2018). On the importance of label quality for semantic segmentation. In: CVPR, pp. 1479–1487.
Acknowledgements
This work has been supported by Croatian Science Foundation grant IP-2020-02-5851 ADEPT, by the NVIDIA Academic Hardware Grant Program, by European Regional Development Fund grant KK.01.1.1.01.0009 DATACROSS, and by VSITE College for Information Technologies, which provided access to six Tesla V100 32 GB GPUs.
Additional information
Communicated by Oliver Zendel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Bevandić, P., Oršić, M., Šarić, J. et al. Weakly Supervised Training of Universal Visual Concepts for Multi-domain Semantic Segmentation. Int J Comput Vis 132, 2450–2472 (2024). https://doi.org/10.1007/s11263-024-01986-z