Abstract
LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization in 3D real-world space. However, existing methods rely heavily on large-scale labeled LiDAR data, whose annotation is time-consuming and laborious, posing a bottleneck to both performance improvement and practical deployment. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed CMAE-3D, a promising solution for alleviating label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to exploit their complementary strengths: global semantic representation and local spatial perception. From the MAE perspective, we develop Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distributions, and design Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while avoiding the redundant reconstruction of low-level details. From the CL perspective, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating negative-sample mismatch at both the voxel and frame levels. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes).
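To make the combined objective concrete, the following is a minimal, hypothetical PyTorch sketch of how a masked latent-feature reconstruction loss and a contrastive loss can be trained jointly in one pre-training objective. It is not the authors' implementation: it assumes random voxel masking in place of GSHM, a single feature scale in place of MLFR's multi-scale targets, and plain voxel-level InfoNCE in place of HRCL; all names (masked_latent_reconstruction_loss, info_nce_loss, the tensor shapes) are illustrative.

import torch
import torch.nn.functional as F

def masked_latent_reconstruction_loss(student_feats, teacher_feats, mask):
    # Reconstruct latent (teacher) features only at the masked voxels.
    # student_feats: (N, C) decoder outputs for the masked input view.
    # teacher_feats: (N, C) target features from an unmasked encoder pass.
    # mask: (N,) bool, True where the voxel was masked out of the input.
    return F.smooth_l1_loss(student_feats[mask], teacher_feats[mask].detach())

def info_nce_loss(query, key, temperature=0.07):
    # Voxel-level InfoNCE: the i-th voxel of each view forms a positive
    # pair; every other voxel in the batch serves as a negative.
    q = F.normalize(query, dim=-1)                  # (N, C)
    k = F.normalize(key, dim=-1)                    # (N, C)
    logits = q @ k.t() / temperature                # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors stand in for voxel features from the encoders.
N, C = 512, 128
student = torch.randn(N, C, requires_grad=True)     # online encoder-decoder
teacher = torch.randn(N, C)                         # target encoder (no grad)
mask = torch.rand(N) < 0.7                          # e.g. mask 70% of voxels
loss = masked_latent_reconstruction_loss(student, teacher, mask) + info_nce_loss(student, teacher)
loss.backward()

In practice the two losses would be weighted against each other, and the teacher features would typically come from a momentum-updated copy of the encoder, as is common in contrastive MAE frameworks.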
Data Availability
The KITTI (Geiger et al., 2012), Waymo (Sun et al., 2020) and nuScenes (Caesar et al., 2020) datasets used in this manuscript are available in publicly accessible repositories: http://www.cvlibs.net/datasets/kitti, https://waymo.com/open/data/perception and https://www.nuscenes.org/nuscenes.
Change history
15 January 2025
The affiliation of the first author has been corrected
22 January 2025
A Correction to this paper has been published: https://doi.org/10.1007/s11263-025-02359-w
References
Bao, H., Dong, L., Piao, S., & Wei, F. (2022). Beit: Bert pre-training of image transformers. in International Conference on Learning Representations.
Boulch, A., Sautier, C., Michele, B., Puy, G., & Marlet, R. (2023). Also: Automotive lidar self-supervision by occupancy estimation. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13455–13465.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, X., & He, K. (2021). Exploring simple siamese representation learning. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.
Chen, C., Chen, Z., Zhang, J., & Tao, D. (2022a). Sasa: Semantics-augmented set abstraction for point-based 3d object detection. in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 221–229.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. in International Conference on Machine Learning, PMLR, pp. 1597–1607.
Chen, Y., Liu, S., Shen, X., & Jia, J. (2019). Fast point r-cnn. in Proceedings of the IEEE/CVF International conference on Computer Vision, pp. 9775–9784.
Chen, R., Mu, Y., Xu, R., Shao, W., Jiang, C., Xu, H., Li, Z., & Luo, P. (2022b). Co^3: Cooperative unsupervised 3d representation learning for autonomous driving. arXiv:2206.04028
Du, B., Gao, X., Hu, W., & Li, X. (2021). Self-contrastive learning with hard negative sampling for self-supervised point cloud learning. in Proceedings of the 29th ACM International Conference on Multimedia, pp. 3133–3142.
Fan, L., Pang, Z., Zhang, T., Wang, Y. X., Zhao, H., Wang, F., Wang, N., & Zhang, Z. (2022). Embracing single stride 3d object detector with sparse transformer. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8458–8468.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The kitti vision benchmark suite. in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3354–3361.
Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022b). Masked autoencoders are scalable vision learners. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020b). Momentum contrast for unsupervised visual representation learning. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
He, C., Li, R., Li, S., & Zhang, L. (2022a). Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8417–8427.
He, C., Zeng, H., Huang, J., Hua, X. S., & Zhang, L. (2020a). Structure aware single-stage 3d object detection from point cloud. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882.
Hess, G., Jaxing, J., Svensson, E., Hagerman, D., Petersson, C., & Svensson, L. (2023). Masked autoencoder for self-supervised pre-training on lidar point clouds. in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 350–359.
Hou, J., Graham, B., Nießner, M., & Xie, S. (2021). Exploring data-efficient 3d scene understanding with contrastive scene contexts. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15587–15597.
Huang, S., Xie, Y., Zhu, S. C., & Zhu, Y. (2021). Spatio-temporal self-supervised representation learning for 3d point clouds. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6535–6545.
Krispel, G., Schinagl, D., Fruhwirth-Reisinger, C., Possegger, H., & Bischof, H. (2022). Maeli-masked autoencoder for large-scale lidar point clouds. arXiv:2212.07207
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). Pointpillars: Fast encoders for object detection from point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
Liang, H., Jiang, C., Feng, D., Chen, X., Xu, H., Liang, X., Zhang, W., Li, Z., & Van Gool, L. (2021). Exploring geometry-aware contrast and clustering harmonization for self-supervised 3d object detection. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3293–3302.
Lin, Z., & Wang, Y. (2022). Bev-mae: Bird’s eye view masked autoencoders for outdoor point cloud pre-training. arXiv:2212.05758
Liu, H., Cai, M., & Lee, Y. J. (2022). Masked discrimination for self-supervised learning on point clouds. in European Conference on Computer Vision, Springer, pp. 657–675.
Lu, Z., Dai, Y., Li, W., & Su, Z. (2023). Joint data and feature augmentation for self-supervised representation learning on point clouds. Graphical Models, 129, 101188.
Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., et al. (2021a). One million scenes for autonomous driving: Once dataset. arXiv:2106.11037.
Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., & Xu, C. (2021b). Voxel transformer for 3d object detection. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173.
Mao, J., Shi, S., Wang, X., & Li, H. (2023). 3d object detection for autonomous driving: A comprehensive survey. International Journal of Computer Vision, 1, 1–55.
Min, C., Xiao, L., Zhao, D., Nie, Y., & Dai, B. (2023). Occupancy-mae: Self-supervised pre-training large-scale lidar point clouds with masked occupancy autoencoders. IEEE Transactions on Intelligent Vehicles.
Noh, J., Lee, S., & Ham, B. (2021). Hvpr: Hybrid voxel-point representation for single-stage 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14605–14614.
Pang, Y., Wang, W., Tay, F. E., Liu, W., Tian, Y., & Yuan, L. (2022). Masked autoencoders for point cloud self-supervised learning. in European Conference on Computer Vision, Springer, 604–621.
Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum pointnets for 3d object detection from rgb-d data. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017a). Pointnet: Deep learning on point sets for 3d classification and segmentation. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Sanghi, A. (2020). Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. in European Conference on Computer Vision, Springer, pp. 626–642.
Sautier, C., Puy, G., Boulch, A., Marlet, R., & Lepetit, V. (2023). Bevcontrast: Self-supervision in bev space for automotive lidar point clouds. arXiv:2310.17281
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., & Li, H. (2020a). Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538.
Shi, S., Wang, X., & Li, H. (2019). Pointrcnn: 3d object proposal generation and detection from point cloud. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779.
Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., & Li, H. (2023). Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2), 531–551.
Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2020b). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8), 2647–2664.
Shrout, O., Nitzan, O., Ben-Shabat, Y., & Tal, A. (2023). Patchcontrast: Self-supervised pre-training for 3d object detection. arXiv:2308.06985
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454.
Tian, X., Ran, H., Wang, Y., & Zhao, H. (2023). Geomae: Masked geometric target prediction for self-supervised point cloud pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13570–13580.
Wang, Y., Chen, X., You, Y., Li, L. E., Hariharan, B., Campbell, M., Weinberger, K. Q., & Chao, W. L. (2020). Train in Germany, test in the USA: Making 3d object detectors generalize. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11713–11723.
Wang, Y., Mao, Q., Zhu, H., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2023). Multi-modal 3d object detection in autonomous driving: A survey. International Journal of Computer Vision, pp. 1–31.
Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14668–14678.
Xie, S., Gu, J., Guo, D., Qi, C. R., Guibas, L., & Litany, O. (2020). Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, pp. 574–591.
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9653–9663.
Xu, R., Wang, T., Zhang, W., Chen, R., Cao, J., Pang, J., & Lin, D. (2023). Mv-jar: Masked voxel jigsaw and reconstruction for lidar-based self-supervised pre-training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13445–13454.
Yang, H., He, T., Liu, J., Chen, H., Wu, B., Lin, B., He, X., & Ouyang, W. (2023). Gd-mae: generative decoder for mae pre-training on lidar point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9403–9414.
Yang, J., Shi, S., Wang, Z., Li, H., & Qi, X. (2021). St3d: Self-training for unsupervised domain adaptation on 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Yang, Z., Sun, Y., Liu, S., & Jia, J. (2020). 3dssd: Point-based 3d single stage object detector. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11040–11048.
Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2019). Std: Sparse-to-dense 3d object detector for point cloud. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960.
Yan, Y., Mao, Y., & Li, B. (2018). Second: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.
Yin, T., Zhou, X., & Krahenbuhl, P. (2021). Center-based 3d object detection and tracking. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793.
Yin, J., Zhou, D., Zhang, L., Fang, J., Xu, C. Z., Shen, J., & Wang, W. (2022). Proposalcontrast: Unsupervised pre-training for lidar-based 3d object detection. in European Conference on Computer Vision, Springer, pp. 17–33.
Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., & Lu, J. (2022). Point-bert: Pre-training 3d point cloud transformers with masked point modeling. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19313–19322.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. in International Conference on Machine Learning, PMLR, pp. 12310–12320.
Zhang, L., & Zhu, Z. (2019). Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks. in International Conference on 3D Vision (3DV). IEEE, pp. 395–404.
Zhang, Z., Girdhar, R., Joulin, A., & Misra, I. (2021b). Self-supervised pretraining of 3d features on any point-cloud. in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10252–10263.
Zhang, Y., Hou, J., & Yuan, Y. (2023a). A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks. International Journal of Computer Vision, pp. 1–33.
Zhang, Y., Hu, Q., Xu, G., Ma, Y., Wan, J., & Guo, Y. (2022b). Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18953–18962.
Zhang, Y., Huang, D., & Wang, Y. (2021a). Pc-rgnn: Point cloud completion and graph neural network for 3d object detection. in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3430–3437.
Zhang, Y., Lin, J., He, C., Chen, Y., Jia, K., & Zhang, L. (2022c). Masked surfel prediction for self-supervised point cloud learning. arXiv:2207.03111
Zhang, R., Guo, Z., Gao, P., Fang, R., Zhao, B., Wang, D., Qiao, Y., & Li, H. (2022). Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in Neural Information Processing Systems, 35, 27061–27074.
Zhang, Y., Zhang, Q., Zhu, Z., Hou, J., & Yuan, Y. (2023). Glenet: Boosting 3d object detectors with generative label uncertainty estimation. International Journal of Computer Vision, 131(12), 3332–3352.
Zheng, W., Chen, W., Huang, Y., Zhang, B., Duan, Y., & Lu, J. (2023). Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv:2311.16038
Zhou, Y., & Tuzel, O. (2018). Voxelnet: End-to-end learning for point cloud based 3d object detection. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.
Zhou, C., Zhang, Y., Chen, J., & Huang, D. (2023). Octr: Octree-based transformer for 3d object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5166–5175.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China (62022011), the Research Program of State Key Laboratory of Critical Software Environment, and the Fundamental Research Funds for the Central Universities.
Additional information
Communicated by Seon Joo Kim.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Chen, J. & Huang, D. CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection. Int J Comput Vis 133, 2783–2804 (2025). https://doi.org/10.1007/s11263-024-02313-2