Abstract
DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, its instance segmentation performance is noticeably inferior to that of previous works. By diving into the details, we observe that instances in sparse point clouds are relatively small compared to the whole scene and often have similar geometry while lacking distinctive appearance for segmentation, phenomena that are rare in the image domain. Considering that instances in 3D are characterized more by their positional information, we emphasize this role during modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 2.7% and 1.2% PQ on the SemanticKITTI and nuScenes datasets, respectively. The source code and models are available at https://github.com/OpenRobotLab/P3Former.
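The abstract does not specify how the Mixed-parameterized Positional Embedding combines coordinate parameterizations; the following is a toy sketch under the assumption that each point's Cartesian (x, y, z) and polar (r, θ, z) descriptions are linearly projected and summed into a single per-point embedding. The function name and the random stand-ins for learned projections are illustrative, not the paper's implementation.

```python
import numpy as np

def mixed_parameterized_pe(points, dim=32, rng=None):
    """Toy sketch of a mixed-parameterized positional embedding:
    describe each point in both Cartesian and polar coordinates,
    project each parameterization, and fuse them by summation."""
    rng = np.random.default_rng(0) if rng is None else rng
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cart = points                                  # (N, 3): x, y, z
    polar = np.stack([np.hypot(x, y),              # radial distance
                      np.arctan2(y, x),            # azimuth angle
                      z], axis=1)                  # (N, 3): r, theta, z
    # Random stand-ins for learned linear projections.
    W_cart = rng.standard_normal((3, dim))
    W_polar = rng.standard_normal((3, dim))
    return cart @ W_cart + polar @ W_polar         # (N, dim)

pts = np.array([[1.0, 2.0, 0.5], [3.0, -1.0, 0.2]])
emb = mixed_parameterized_pe(pts)
print(emb.shape)  # (2, 32)
```

In the paper's pipeline such an embedding would be added to backbone features and reused when predicting masks and updating queries; here it only illustrates the idea that two coordinate parameterizations of the same point can be fused into one positional code.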
Data Availability
The datasets that support the findings of this study are all publicly available for research purposes.
Additional information
Communicated by Takayuki Okatani.
About this article
Cite this article
Xiao, Z., Zhang, W., Wang, T. et al. Position-Guided Point Cloud Panoptic Segmentation Transformer. Int J Comput Vis 133, 275–290 (2025). https://doi.org/10.1007/s11263-024-02162-z