Abstract
The recently proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR's slow convergence is largely attributable to the difficulty of matching object queries to relevant regions, caused by the unaligned semantics between object queries and encoded image features. Motivated by this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR's convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can be easily matched to relevant regions with similar semantics. In addition, SAM-DETR++ searches for multiple representative keypoints and exploits their features for semantic-aligned matching with enhanced representation capacity. Furthermore, building on the designed semantic-aligned matching, SAM-DETR++ can effectively fuse multi-scale features in a coarse-to-fine manner. Extensive experiments show that the proposed SAM-DETR++ achieves superior convergence speed and competitive detection accuracy. Additionally, as a plug-and-play method, SAM-DETR++ complements existing DETR convergence solutions to achieve even better performance, reaching 44.8% AP with merely 12 training epochs and 49.1% AP with 50 training epochs on COCO val2017 with ResNet-50. Code is available at https://github.com/ZhangGongjie/SAM-DETR.
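To make the matching mechanism concrete, below is a minimal PyTorch sketch of the semantic-aligned matching idea described above. It is our own illustration under stated assumptions, not the authors' released implementation; all class and variable names (e.g., SemanticAlignedMatching, query_proj) are hypothetical. The sketch shows only the core step: projecting object queries and encoded image features into a shared embedding space, then matching each query to spatial locations via dot-product similarity in that space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticAlignedMatching(nn.Module):
        """Illustrative sketch of semantic-aligned matching (hypothetical names).

        Projects object queries and encoded image features into a shared
        embedding space, then matches each query to image regions via
        dot-product similarity, as described in the abstract.
        """

        def __init__(self, d_model: int = 256):
            super().__init__()
            # Projections into the shared embedding space.
            self.query_proj = nn.Linear(d_model, d_model)
            self.feature_proj = nn.Linear(d_model, d_model)

        def forward(self, queries: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
            # queries:  (num_queries, batch, d_model)  object queries
            # features: (hw, batch, d_model)           flattened encoded image features
            q = self.query_proj(queries)      # align query semantics
            f = self.feature_proj(features)   # align image-feature semantics
            # Scaled dot-product similarity between each query and each location.
            attn = torch.einsum('qbd,kbd->bqk', q, f) / q.shape[-1] ** 0.5
            attn = F.softmax(attn, dim=-1)
            # Aggregate the matched region features for each query.
            return torch.einsum('bqk,kbd->qbd', attn, f)

The full method goes further than this sketch: it additionally searches for multiple representative keypoints per query and performs coarse-to-fine multi-scale feature fusion on top of the aligned matching, both of which are omitted here for brevity.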
Data Availability Statement
The datasets used in this study, MS-COCO and Pascal VOC, are publicly available on their respective official websites. The code implementation of our proposed methods, SAM-DETR and SAM-DETR++, along with the associated trained model weights and training logs, is also publicly available for non-commercial use at https://github.com/ZhangGongjie/SAM-DETR. These resources are provided to facilitate reproducibility of our results and to encourage further research in the field.
Acknowledgements
This study is supported under the RIE 2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
Additional information
Communicated by Ming-Hsuan Yang
About this article
Cite this article
Zhang, G., Luo, Z., Huang, J. et al. Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion. Int J Comput Vis 132, 2825–2844 (2024). https://doi.org/10.1007/s11263-024-02005-x