Abstract
Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets has become a method of choice for achieving strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, popular datasets often have discrepant granularities. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes, etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method improves within-dataset and cross-dataset generalization, and offers the opportunity to learn visual concepts that are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.
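The core idea of treating a dataset label as a union of universal visual concepts can be illustrated with a minimal sketch. This is our own illustrative reconstruction, not the authors' released code: we assume each dataset label maps to the set of universal concept indices it subsumes, and that the likelihood of a coarse label is obtained by summing the posteriors of its member concepts before taking the negative log-likelihood.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def partial_label_nll(logits, label_to_universal, targets):
    """NLL of dataset labels viewed as unions of universal concepts.

    logits:  (N, U) array of per-pixel scores over U universal concepts
    targets: length-N sequence of dataset-level label ids
    label_to_universal: dict mapping a dataset label id to the universal
        concept indices it subsumes, e.g. a coarse "road" label covering
        {road surface, marking, manhole} (the indices here are illustrative).
    """
    probs = softmax(logits)
    losses = []
    for p, t in zip(probs, targets):
        union = label_to_universal[t]
        # P(coarse label) = sum of posteriors of the concepts it subsumes
        losses.append(-np.log(p[union].sum() + 1e-9))
    return float(np.mean(losses))
```

Under this formulation, a pixel labeled with a coarse class incurs a small loss as long as the model concentrates probability anywhere within the subsumed concepts, which is what allows finer concepts to be learned from coarser supervision.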
Data Availability Statement
We perform our experiments on the following publicly available datasets: ADE20k Zhou et al. (2017), BDD Yu et al. (2018), Camvid Badrinarayanan et al. (2017), Cityscapes Cordts et al. (2016), COCO Lin et al. (2014), IDD Varma et al. (2019), KITTI Geiger et al. (2013), MSeg Lambert et al. (2020), SUN RGBD Song et al. (2015), Scannet Dai et al. (2017), Viper Richter et al. (2017), Vistas Neuhold et al. (2017), and WildDash 2 Zendel et al. (2018). Our universal taxonomy for these datasets is available online Bevandic (2022).
References
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.
Bevandic, P. (2022). Universal taxonomies for semantic segmentation (source code). Accessed 02 Dec 2022. https://github.com/UNIZG-FER-D307/universal_taxonomies.
Bevandic, P., & Segvic, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. In: BMVC
Bevandić, P., Oršić, M., Grubišić, I., Šarić, J., & Šegvić, S. (2022). Multi-domain semantic segmentation with overlapping labels. In: Proceedings of the IEEE/CVF Winter conference on applications of computer vision (WACV), pp. 2615–2624.
Bevandić, P., Krešo, I., Oršić, M., & Šegvić, S. (2022). Dense open-set recognition based on training with noisy negative images. Image and Vision Computing, 124, 104490. https://doi.org/10.1016/j.imavis.2022.104490
Biase, G.D., Blum, H., Siegwart, R., & Cadena, C. (2021). Pixel-wise anomaly detection in complex driving scenes. In: Computer vision and pattern recognition, CVPR
Blum, H., Sarlin, P., Nieto, J. I., Siegwart, R., & Cadena, C. (2021). The fishyscapes benchmark: Measuring blind spots in semantic segmentation. International Journal of Computer Vision, 129(11), 3119–3135.
Chan, R., Lis, K., Uhlemeyer, S., Blum, H., Honari, S., Siegwart, R., et al. (2021). SegmentMeIfYouCan: A benchmark for anomaly segmentation. In: Vanschoren, J., Yeung, S. (Eds.) NeurIPS
Chan, R., Rottmann, M., & Gottschalk, H. (2021). Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In: International conference on computer vision, ICCV
Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., et al. (2020). Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12475–12485.
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289.
Cheng, B., Schwing, A.G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. (2016). The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223.
Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. The Journal of Machine Learning Research, 12, 1501–1536.
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In: CVPR
Everingham, M., Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Fourure, D., Emonet, R., Fromont, E., Muselet, D., Neverova, N., Trémeau, A., et al. (2017). Multi-task, multi-domain learning: application to semantic segmentation and pose regression. Neurocomputing.
Galleguillos, C., & Belongie, S. (2010). Context based object categorization: A critical survey. Computer Vision and Image Understanding, 06(114), 712–722. https://doi.org/10.1016/j.cviu.2010.02.004
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. Int J Robotics Res., 32(11), 1231–1237.
Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Huang, G., Liu, Z., Pleiss, G., Van Der Maaten, L., & Weinberger, K. (2019). Convolutional networks with dense connectivity. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kalluri, T., Varma, G., Chandraker, M., & Jawahar, C. (2019). Universal semi-supervised semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, pp. 5259–5270.
Kim, D., Tsai, Y., Suh, Y., Faraki, M., Garg, S., Chandraker, M., et al. (2022). Learning semantic segmentation from multiple datasets with label shifts. In: ECCV
Krešo, I., Krapac, J., & Šegvić, S. (2021). Efficient ladder-style DenseNets for semantic segmentation of large images. IEEE Transactions on Intelligent Transportation Systems, 22(8), 4951–4961.
Lambert, J., Liu, Z., Sener, O., Hays, J., Koltun, V. (2020). MSeg: A composite dataset for multi-domain semantic segmentation. In: CVPR
Lee, D.H. (2013). Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop on challenges in representation learning (WREPL)
Li, L., Zhou, T., Wang, W., Li, J., & Yang, Y. (2022). Deep hierarchical semantic segmentation. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 1236–1247.
Liang, X., Zhou, H., & Xing, E. (2018). Dynamic-structured semantic propagation network. In: CVPR, pp. 752–761.
Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common Objects in Context. In: ECCV, pp. 740–755.
Liu, Y., Ge, P., Liu, Q., Fan, S., & Wang, Y. (2022). An empirical study on multi-domain robust semantic segmentation. arXiv preprint arXiv:2212.04221.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
Masaki, S., Hirakawa, T., Yamashita, T., & Fujiyoshi, H. (2021). Multi-domain semantic segmentation using multi-head model. In: 2021 IEEE international intelligent transportation systems conference (ITSC), pp. 2802–2807.
McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. In: NAACL, pp. 152–159.
Meletis, P., & Dubbelman, G. (2018). Training of Convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation. In: Intelligent vehicles symposium, pp. 1045–1050.
Mohan, R., & Valada, A. (2021). EfficientPS: Efficient panoptic segmentation. International Journal of Computer Vision, 129, 1551–1579.
Neuhold, G., Ollmann, T., Rota Bulò, S., & Kontschieder, P. (2017). Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In: ICCV, pp. 5000–5009.
Oršić, M., & Šegvić, S. (2021). Efficient semantic segmentation with pyramidal fusion. Pattern Recognition, 107611.
Oršić, M., Bevandić, P., Grubišić, I., Šarić, J., & Šegvić, S. (2020). Multi-domain semantic segmentation with pyramidal fusion. arXiv preprint arXiv:2009.01636, CVPRW RVC.
Porzi, L., Bulò, S.R., Colovic, A., & Kontschieder, P. (2019). Seamless scene segmentation. In: CVPR, pp. 8277–8286.
Richter, S. R., Hayder, Z., & Koltun, V. (2017). Playing for Benchmarks. In: ICCV, pp. 2232–2241.
Robust Vision Challenge. Accessed 02 Dec 2022. http://www.robustvision.net/index.php.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp. 234–241
Rota Bulò, S., Porzi, L., & Kontschieder, P. (2018). In-place activated batchnorm for memory-optimized training of DNNS. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5639–5647.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520.
Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54–63.
Shelhamer, E., Long, J., & Darrell, T. (2017). Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In: CVPR, pp. 567–576.
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV, pp. 843–852.
Uijlings, J.R.R., Mensink, T., Ferrari, V. (2022). The missing link: Finding label relations across datasets. In: ECCV, pp. 540–556.
Varma, G., Subramanian, A., Namboodiri, A.M., Chandraker, M., & Jawahar, C.V. (2019). IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: WACV, pp. 1743–1751.
Xiao, J., Xu, Z., Lan, S., Yu, Z., Yuille, A., & Anandkumar, A. (2022). 1st place solution of the robust vision challenge 2022 semantic segmentation track. CoRR. abs/2210.12852.
Yin, W., Liu, Y., Shen, C., van den Hengel, A., & Sun, B. (2022). The devil is in the labels: Semantic segmentation from sentences. CoRR. abs/2202.02002.
Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In: International conference on learning representations (ICLR)
Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., et al. (2018). BDD100K: A diverse driving video database with scalable annotation tooling. arXiv:1805.04687.
Zendel, O., Honauer, K., Murschitz, M., Steininger, D., & Fernandez Dominguez, G. (2018). WildDash - Creating hazard-aware benchmarks. In: ECCV
Zendel, O., Schörghuber, M., Rainer, B., Murschitz, M., & Beleznai, C. (2022). Unifying panoptic segmentation for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 21351–21360.
Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., (Eds.), Proceedings of 15th European conference computer vision—ECCV 2018, Munich, Germany, September 8-14, 2018, Part III. vol. 11207 of Lecture Notes in Computer Science. Springer; pp. 418–434.
Zhao, X., Schulter, S., Sharma, G., Tsai, Y., Chandraker, M., & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In: ECCV, pp. 178–193.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.
Zhen, M., Wang, J., Zhou, L., Fang, T., & Quan, L. (2019). Learning fully dense neural networks for image semantic segmentation. In: AAAI
Zhou, X., Koltun, V., & Krähenbühl, P. (2022). Simple multi-dataset detection. In: CVPR
Zhou, T., Wang, W., Konukoglu, E., & Van Gool, L. (2022). Rethinking semantic segmentation: A prototype view. In: CVPR, pp. 2572–2583.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In: CVPR, pp. 633–641.
Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., et al. (2019). Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8856–8865.
Zlateski, A., Jaroensri, R., Sharma, P., & Durand F. (2018). On the importance of label quality for semantic segmentation. In: CVPR, pp. 1479–1487.
Acknowledgements
This work has been supported by Croatian Science Foundation grant IP-2020-02-5851 ADEPT, by the NVIDIA Academic Hardware Grant Program, by European Regional Development Fund grant KK.01.1.1.01.0009 DATACROSS, and by VSITE College for Information Technologies, which provided access to six Tesla V100 32 GB GPUs.
Additional information
Communicated by Oliver Zendel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Bevandić, P., Oršić, M., Šarić, J. et al. Weakly Supervised Training of Universal Visual Concepts for Multi-domain Semantic Segmentation. Int J Comput Vis 132, 2450–2472 (2024). https://doi.org/10.1007/s11263-024-01986-z