
Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion

International Journal of Computer Vision

Abstract

The recently proposed DEtection TRansformer (DETR) establishes a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR's slow convergence is largely attributable to the difficulty of matching object queries to relevant regions, caused by the unaligned semantics between object queries and encoded image features. Motivated by this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR's convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can easily be matched to relevant regions with similar semantics. In addition, SAM-DETR++ searches for multiple representative keypoints and exploits their features for semantic-aligned matching with enhanced representation capacity. Furthermore, building on the designed semantic-aligned matching, SAM-DETR++ effectively fuses multi-scale features in a coarse-to-fine manner. Extensive experiments show that the proposed SAM-DETR++ achieves superior convergence speed and competitive detection accuracy. Additionally, as a plug-and-play method, SAM-DETR++ complements existing DETR convergence solutions for even better performance, achieving 44.8% AP with merely 12 training epochs and 49.1% AP with 50 training epochs on COCO val2017 with ResNet-50. Code is available at https://github.com/ZhangGongjie/SAM-DETR.
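
To make the core idea concrete, below is a minimal, illustrative PyTorch sketch of semantic-aligned matching as described above: object queries and encoded image features are projected into a shared embedding space, and each query is matched to locations with similar semantics via dot-product similarity. All class and variable names here are hypothetical, and the sketch deliberately omits the paper's multiple-keypoint search and coarse-to-fine multi-scale fusion; it is not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SemanticAlignedMatching(nn.Module):
    """Minimal sketch of semantic-aligned matching (names hypothetical).

    Object queries and encoded image features are projected into one
    shared embedding space; each query is then re-sampled as a
    similarity-weighted sum of the features, i.e., matched to regions
    whose semantics resemble its own.
    """

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Separate linear projections map both inputs into the same
        # embedding space, aligning their semantics.
        self.query_proj = nn.Linear(d_model, d_model)
        self.feat_proj = nn.Linear(d_model, d_model)

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # queries: (num_queries, batch, d_model)
        # feats:   (H*W, batch, d_model), a flattened encoder feature map
        q = self.query_proj(queries)
        f = self.feat_proj(feats)
        # Scaled dot-product similarity in the shared space serves as the
        # matching score between every query and every spatial location.
        scores = torch.einsum("qbd,pbd->bqp", q, f) / (q.shape[-1] ** 0.5)
        weights = scores.softmax(dim=-1)
        # Each query becomes a weighted sum over semantically similar
        # locations, shape (num_queries, batch, d_model).
        return torch.einsum("bqp,pbd->qbd", weights, f)


# Example usage with dummy tensors:
module = SemanticAlignedMatching(d_model=256)
queries = torch.randn(300, 2, 256)   # 300 object queries, batch size 2
feats = torch.randn(1900, 2, 256)    # e.g., a 38x50 feature map, flattened
aligned = module(queries, feats)     # -> torch.Size([300, 2, 256])
```

Per the abstract, a module of this kind is applied as a plug-and-play component ahead of the decoder's cross-attention, so that queries arrive already aligned with the encoded image features they attend to.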

Data Availability Statement

The datasets used in this study, MS-COCO and Pascal VOC, are publicly available on the official websites of the respective datasets. The code implementation of our proposed methods, SAM-DETR and SAM-DETR++, along with the associated trained model weights and training logs, is also publicly available for non-commercial use at https://github.com/ZhangGongjie/SAM-DETR. These resources are provided to facilitate the reproducibility of our results and to encourage further research in the field.

Acknowledgements

This study is supported under the RIE 2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Author information

Corresponding author

Correspondence to Shijian Lu.

Additional information

Communicated by Ming-Hsuan Yang

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, G., Luo, Z., Huang, J. et al. Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion. Int J Comput Vis 132, 2825–2844 (2024). https://doi.org/10.1007/s11263-024-02005-x
