Abstract
The task of Novel Class Discovery (NCD) in semantic segmentation involves training a model to accurately segment unlabelled (novel) classes, using the supervision available from annotated (base) classes. The NCD task within the 3D point cloud domain is novel, and it is characterised by assumptions and challenges absent in its 2D counterpart. This paper advances the analysis of point cloud data in four directions. Firstly, it introduces the novel task of NCD for point cloud semantic segmentation. Secondly, it demonstrates that directly applying an existing NCD method for 2D image semantic segmentation to 3D data yields limited results. Thirdly, it presents a new NCD approach based on online clustering, uncertainty estimation, and semantic distillation. Lastly, it proposes a novel evaluation protocol to rigorously assess the performance of NCD in point cloud semantic segmentation. Through comprehensive evaluations on the SemanticKITTI, SemanticPOSS, and S3DIS datasets, our approach shows superior performance compared to the considered baselines.
Additional information
Communicated by Ming-Hsuan Yang.
Project page: https://luigiriz.github.io/SNOPS_website/. This project has received funding from the European Union’s Horizon Europe research and innovation programme under the projects AI-PRISM (grant agreement No. 101058589) and FEROX (grant agreement No. 101070440). This work was also partially sponsored by the PRIN project LEGO-AI (Prot. 2020TA3K9N), EU ISFP PRECRISIS (ISFP-2022-TFI-AG-PROTECT-02-101100539), PNRR ICSC National Research Centre for HPC, Big Data and Quantum Computing (CN00000013) and the FAIR - Future AI Research (PE00000013), funded by NextGeneration EU. It was carried out in the Vision and Learning joint laboratory of FBK and UNITN.
About this article
Cite this article
Riz, L., Saltori, C., Wang, Y. et al. Novel Class Discovery Meets Foundation Models for 3D Semantic Segmentation. Int J Comput Vis 133, 527–548 (2025). https://doi.org/10.1007/s11263-024-02180-x
DOI: https://doi.org/10.1007/s11263-024-02180-x