Abstract
The recently proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR's slow convergence is largely attributable to the difficulty of matching object queries to relevant regions, caused by the unaligned semantics between object queries and encoded image features. Motivated by this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR's convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can be easily matched to relevant regions with similar semantics. In addition, SAM-DETR++ searches for multiple representative keypoints and exploits their features for semantic-aligned matching with enhanced representation capacity. Furthermore, building on the designed semantic-aligned matching, SAM-DETR++ can effectively fuse multi-scale features in a coarse-to-fine manner. Extensive experiments show that the proposed SAM-DETR++ achieves superior convergence speed and competitive detection accuracy. Additionally, as a plug-and-play method, SAM-DETR++ complements existing DETR convergence solutions to achieve even better performance, reaching 44.8% AP with merely 12 training epochs and 49.1% AP with 50 training epochs on COCO val2017 with ResNet-50. Code is available at https://github.com/ZhangGongjie/SAM-DETR.
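To make the matching mechanism concrete, below is a minimal PyTorch sketch of the semantic-aligned matching idea described above. It is our own illustration under stated assumptions, not the authors' released implementation; all class and variable names (e.g., SemanticAlignedMatching, query_proj) are hypothetical. The sketch shows only the core step: projecting object queries and encoded image features into a shared embedding space, then matching each query to spatial locations via dot-product similarity in that space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticAlignedMatching(nn.Module):
        """Illustrative sketch of semantic-aligned matching (hypothetical names).

        Projects object queries and encoded image features into a shared
        embedding space, then matches each query to image regions via
        dot-product similarity, as described in the abstract.
        """

        def __init__(self, d_model: int = 256):
            super().__init__()
            # Projections into the shared embedding space.
            self.query_proj = nn.Linear(d_model, d_model)
            self.feature_proj = nn.Linear(d_model, d_model)

        def forward(self, queries: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
            # queries:  (num_queries, batch, d_model)  object queries
            # features: (hw, batch, d_model)           flattened encoded image features
            q = self.query_proj(queries)      # align query semantics
            f = self.feature_proj(features)   # align image-feature semantics
            # Scaled dot-product similarity between each query and each location.
            attn = torch.einsum('qbd,kbd->bqk', q, f) / q.shape[-1] ** 0.5
            attn = F.softmax(attn, dim=-1)
            # Aggregate the matched region features for each query.
            return torch.einsum('bqk,kbd->qbd', attn, f)

The full method goes further than this sketch: it additionally searches for multiple representative keypoints per query and performs coarse-to-fine multi-scale feature fusion on top of the aligned matching, both of which are omitted here for brevity.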
Data Availability Statement
The datasets used in this study, MS-COCO and Pascal VOC, are publicly available on their respective official websites. The code implementation of our proposed methods, SAM-DETR and SAM-DETR++, along with the associated trained model weights and training logs, is also publicly available for non-commercial use at https://github.com/ZhangGongjie/SAM-DETR. These resources are provided to facilitate reproducibility of our results and to encourage further research in the field.
Acknowledgements
This study is supported under the RIE 2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
Additional information
Communicated by Ming-Hsuan Yang
About this article
Cite this article
Zhang, G., Luo, Z., Huang, J. et al. Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion. Int J Comput Vis 132, 2825–2844 (2024). https://doi.org/10.1007/s11263-024-02005-x