Abstract
LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization in 3D real-world space. However, existing methods rely heavily on large-scale labeled LiDAR data, whose annotation is time-consuming and laborious, posing a bottleneck to both performance improvement and practical deployment. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed CMAE-3D, a promising solution for alleviating label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to exploit their complementary strengths: global semantic representation and local spatial perception. From the MAE perspective, we develop Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distributions, and design Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while avoiding the redundant reconstruction of low-level details. From the CL perspective, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating negative-sample mismatch at both the voxel and frame levels. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes).
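To make the combined objective concrete, the following is a minimal, hypothetical PyTorch sketch of how a masked latent-feature reconstruction loss and a contrastive loss can be trained jointly in one pre-training objective. It is not the authors' implementation: it assumes random voxel masking in place of GSHM, a single feature scale in place of MLFR's multi-scale targets, and plain voxel-level InfoNCE in place of HRCL; all names (masked_latent_reconstruction_loss, info_nce_loss, the tensor shapes) are illustrative.

import torch
import torch.nn.functional as F

def masked_latent_reconstruction_loss(student_feats, teacher_feats, mask):
    # Reconstruct latent (teacher) features only at the masked voxels.
    # student_feats: (N, C) decoder outputs for the masked input view.
    # teacher_feats: (N, C) target features from an unmasked encoder pass.
    # mask: (N,) bool, True where the voxel was masked out of the input.
    return F.smooth_l1_loss(student_feats[mask], teacher_feats[mask].detach())

def info_nce_loss(query, key, temperature=0.07):
    # Voxel-level InfoNCE: the i-th voxel of each view forms a positive
    # pair; every other voxel in the batch serves as a negative.
    q = F.normalize(query, dim=-1)                  # (N, C)
    k = F.normalize(key, dim=-1)                    # (N, C)
    logits = q @ k.t() / temperature                # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors stand in for voxel features from the encoders.
N, C = 512, 128
student = torch.randn(N, C, requires_grad=True)     # online encoder-decoder
teacher = torch.randn(N, C)                         # target encoder (no grad)
mask = torch.rand(N) < 0.7                          # e.g. mask 70% of voxels
loss = masked_latent_reconstruction_loss(student, teacher, mask) + info_nce_loss(student, teacher)
loss.backward()

In practice the two losses would be weighted against each other, and the teacher features would typically come from a momentum-updated copy of the encoder, as is common in contrastive MAE frameworks.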
Data Availability
The KITTI (Geiger et al., 2012), Waymo (Sun et al., 2020) and nuScenes (Caesar et al., 2020) datasets used in this manuscript are available in publicly accessible repositories: http://www.cvlibs.net/datasets/kitti, https://waymo.com/open/data/perception and https://www.nuscenes.org/nuscenes.
Change history
15 January 2025
The affiliation of the first author has been corrected
22 January 2025
A Correction to this paper has been published: https://doi.org/10.1007/s11263-025-02359-w
References
Bao, H., Dong, L., Piao, S., & Wei, F. (2022). Beit: Bert pre-training of image transformers. in International Conference on Learning Representations.
Boulch, A., Sautier, C., Michele, B., Puy, G., & Marlet, R. (2023). Also: Automotive lidar self-supervision by occupancy estimation. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13455–13465.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, X., & He, K. (2021). Exploring simple siamese representation learning. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.
Chen, C., Chen, Z., Zhang, J., & Tao, D. (2022a). Sasa: Semantics-augmented set abstraction for point-based 3d object detection. in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 221–229.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. in International Conference on Machine Learning, PMLR, pp. 1597–1607.
Chen, Y., Liu, S., Shen, X., & Jia, J. (2019). Fast point r-cnn. in Proceedings of the IEEE/CVF International conference on Computer Vision, pp. 9775–9784.
Chen, R., Mu, Y., Xu, R., Shao, W., Jiang, C., Xu, H., Li, Z., & Luo, P. (2022b). Co^3: Cooperative unsupervised 3d representation learning for autonomous driving. arXiv:2206.04028
Du, B., Gao, X., Hu, W., & Li, X. (2021). Self-contrastive learning with hard negative sampling for self-supervised point cloud learning. in Proceedings of the 29th ACM International Conference on Multimedia, pp. 3133–3142.
Fan, L., Pang, Z., Zhang, T., Wang, Y. X., Zhao, H., Wang, F., Wang, N., & Zhang, Z. (2022). Embracing single stride 3d object detector with sparse transformer. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8458–8468.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The kitti vision benchmark suite. in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3354–3361.
Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022b). Masked autoencoders are scalable vision learners. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020b). Momentum contrast for unsupervised visual representation learning. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
He, C., Li, R., Li, S., & Zhang, L. (2022a). Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8417–8427.
He, C., Zeng, H., Huang, J., Hua, X. S., & Zhang, L. (2020a). Structure aware single-stage 3d object detection from point cloud. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882.
Hess, G., Jaxing, J., Svensson, E., Hagerman, D., Petersson, C., & Svensson, L. (2023). Masked autoencoder for self-supervised pre-training on lidar point clouds. in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 350–359.
Hou, J., Graham, B., Nießner, M., & Xie, S. (2021). Exploring data-efficient 3d scene understanding with contrastive scene contexts. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15587–15597.
Huang, S., Xie, Y., Zhu, S. C., & Zhu, Y. (2021). Spatio-temporal self-supervised representation learning for 3d point clouds. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6535–6545.
Krispel, G., Schinagl, D., Fruhwirth-Reisinger, C., Possegger, H., & Bischof, H. (2022). Maeli-masked autoencoder for large-scale lidar point clouds. arXiv:2212.07207
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). Pointpillars: Fast encoders for object detection from point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
Liang, H., Jiang, C., Feng, D., Chen, X., Xu, H., Liang, X., Zhang, W., Li, Z., & Van Gool, L. (2021). Exploring geometry-aware contrast and clustering harmonization for self-supervised 3d object detection. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3293–3302.
Lin, Z., & Wang, Y. (2022). Bev-mae: Bird’s eye view masked autoencoders for outdoor point cloud pre-training. arXiv:2212.05758
Liu, H., Cai, M., & Lee, Y. J. (2022). Masked discrimination for self-supervised learning on point clouds. in European Conference on Computer Vision, Springer, pp. 657–675.
Lu, Z., Dai, Y., Li, W., & Su, Z. (2023). Joint data and feature augmentation for self-supervised representation learning on point clouds. Graphical Models, 129, 101188.
Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., et al. (2021a). One million scenes for autonomous driving: Once dataset. arXiv:2106.11037.
Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., & Xu, C. (2021b). Voxel transformer for 3d object detection. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173.
Mao, J., Shi, S., Wang, X., & Li, H. (2023). 3d object detection for autonomous driving: A comprehensive survey. International Journal of Computer Vision, 1, 1–55.
Min, C., Xiao, L., Zhao, D., Nie, Y., & Dai, B. (2023). Occupancy-mae: Self-supervised pre-training large-scale lidar point clouds with masked occupancy autoencoders. IEEE Transactions on Intelligent Vehicles.
Noh, J., Lee, S., & Ham, B. (2021). Hvpr: Hybrid voxel-point representation for single-stage 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14605–14614.
Pang, Y., Wang, W., Tay, F. E., Liu, W., Tian, Y., & Yuan, L. (2022). Masked autoencoders for point cloud self-supervised learning. in European Conference on Computer Vision, Springer, 604–621.
Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum pointnets for 3d object detection from rgb-d data. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017a). Pointnet: Deep learning on point sets for 3d classification and segmentation. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Sanghi, A. (2020). Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. in European Conference on Computer Vision, Springer, pp. 626–642.
Sautier, C., Puy, G., Boulch, A., Marlet, R., & Lepetit, V. (2023). Bevcontrast: Self-supervision in bev space for automotive lidar point clouds. arXiv:2310.17281
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., & Li, H. (2020a). Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538.
Shi, S., Wang, X., & Li, H. (2019). Pointrcnn: 3d object proposal generation and detection from point cloud. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779.
Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., & Li, H. (2023). Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2), 531–551.
Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2020b). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8), 2647–2664.
Shrout, O., Nitzan, O., Ben-Shabat, Y., & Tal, A. (2023). Patchcontrast: Self-supervised pre-training for 3d object detection. arXiv:2308.06985
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454.
Tian, X., Ran, H., Wang, Y., & Zhao, H. (2023). Geomae: Masked geometric target prediction for self-supervised point cloud pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13570–13580.
Wang, Y., Chen, X., You, Y., Li, L. E., Hariharan, B., Campbell, M., Weinberger, K. Q., & Chao, W. L. (2020). Train in Germany, test in the USA: Making 3d object detectors generalize. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11713–11723.
Wang, Y., Mao, Q., Zhu, H., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2023). Multi-modal 3d object detection in autonomous driving: A survey. International Journal of Computer Vision, pp. 1–31.
Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14668–14678.
Xie, S., Gu, J., Guo, D., Qi, C. R., Guibas, L., & Litany, O. (2020). Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, pp. 574–591.
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9653–9663.
Xu, R., Wang, T., Zhang, W., Chen, R., Cao, J., Pang, J., & Lin, D. (2023). Mv-jar: Masked voxel jigsaw and reconstruction for lidar-based self-supervised pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13445–13454.
Yang, H., He, T., Liu, J., Chen, H., Wu, B., Lin, B., He, X., & Ouyang, W. (2023). Gd-mae: generative decoder for mae pre-training on lidar point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9403–9414.
Yang, J., Shi, S., Wang, Z., Li, H., & Qi, X. (2021). St3d: Self-training for unsupervised domain adaptation on 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Yang, Z., Sun, Y., Liu, S., & Jia, J. (2020). 3dssd: Point-based 3d single stage object detector. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11040–11048.
Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2019). Std: Sparse-to-dense 3d object detector for point cloud. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960.
Yan, Y., Mao, Y., & Li, B. (2018). Second: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Yin, T., Zhou, X., & Krahenbuhl, P. (2021). Center-based 3d object detection and tracking. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793.
Yin, J., Zhou, D., Zhang, L., Fang, J., Xu, C. Z., Shen, J., & Wang, W. (2022). Proposalcontrast: Unsupervised pre-training for lidar-based 3d object detection. in European Conference on Computer Vision, Springer, pp. 17–33.
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-bert: Pre-training 3d point cloud transformers with masked point modeling. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19313–19322.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. in International Conference on Machine Learning, PMLR, pp. 12310–12320.
Zhang, L., & Zhu, Z. (2019). Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks. in International Conference on 3D Vision (3DV). IEEE, pp. 395–404.
Zhang, Z., Girdhar, R., Joulin, A., & Misra, I. (2021b). Self-supervised pretraining of 3d features on any point-cloud. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10252–10263.
Zhang, Y., Hou, J., & Yuan, Y. (2023a). A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks. International Journal of Computer Vision, pp. 1–33.
Zhang, Y., Hu, Q., Xu, G., Ma, Y., Wan, J., & Guo, Y. (2022b). Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18953–18962.
Zhang, Y., Huang, D., & Wang, Y. (2021a). Pc-rgnn: Point cloud completion and graph neural network for 3d object detection. in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3430–3437.
Zhang, Y., Lin, J., He, C., Chen, Y., Jia, K., & Zhang, L. (2022c). Masked surfel prediction for self-supervised point cloud learning. arXiv:2207.03111
Zhang, R., Guo, Z., Gao, P., Fang, R., Zhao, B., Wang, D., Qiao, Y., & Li, H. (2022). Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in Neural Information Processing Systems, 35, 27061–27074.
Zhang, Y., Zhang, Q., Zhu, Z., Hou, J., & Yuan, Y. (2023). Glenet: Boosting 3d object detectors with generative label uncertainty estimation. International Journal of Computer Vision, 131(12), 3332–3352.
Zheng, W., Chen, W., Huang, Y., Zhang, B., Duan, Y., & Lu, J. (2023). Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv:2311.16038
Zhou, Y., & Tuzel, O. (2018). Voxelnet: End-to-end learning for point cloud based 3d object detection. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.
Zhou, C., Zhang, Y., Chen, J., & Huang, D. (2023). Octr: Octree-based transformer for 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5166–5175.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China (62022011), the Research Program of State Key Laboratory of Critical Software Environment, and the Fundamental Research Funds for the Central Universities.
Additional information
Communicated by Seon Joo Kim.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Chen, J. & Huang, D. CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection. Int J Comput Vis 133, 2783–2804 (2025). https://doi.org/10.1007/s11263-024-02313-2