Abstract
Over the past few years, there has been growing interest in developing broad, general-purpose computer vision systems. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains, and this universality is crucial for practical, real-world applications. In this study, we focus on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity of handling hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method achieves strong performance, ranking second in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe this comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
Data Availability Statement
The authors confirm that the data supporting the findings of this work are available within the article or its supplementary materials.
Notes
www.robustvision.net/leaderboard.php?benchmark=object, see the IFFF_RVC entry on the leaderboard.
The object detectors used here are built upon Cascade R-CNN enhanced with NAS-FPN (\(\times \)7) and Cascade RPN, with SEER-RegNet32gf as the backbone. As detailed in Sect. 4.2, the universal object detector is trained for 1.15M iterations with a batch size of 16. Given the re-sampled training set size of approximately 2.3M images, this corresponds to 8 training epochs; the sketch below makes the arithmetic explicit. Consequently, for Individual OD on OID, the number of training epochs is set to 8. On a comparable scale, for Individual OD on COCO, the object detector is trained for 12 epochs, following the default configuration in the mmdetection codebase. Lastly, for Individual OD on MVD, the object detector is also trained for 12 epochs, initializing its network parameters from the COCO detector mentioned above.
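For concreteness, a minimal sketch of this epoch arithmetic (all values are taken from the note above; the variable names are our own):

```python
# Epoch arithmetic for the universal object detector (values from the note above).
total_iterations = 1_150_000    # 1.15M training iterations
batch_size = 16                 # images per iteration
resampled_set_size = 2_300_000  # ~2.3M images after re-sampling

images_processed = total_iterations * batch_size  # 18.4M images seen in training
epochs = images_processed / resampled_set_size    # 18.4M / 2.3M = 8.0
print(f"training epochs = {epochs:.0f}")          # -> training epochs = 8
```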
References
Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., & Deaton, J., et al. (2021). Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478–3488).
Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T.-Y., & Zoph, B. (2021). Revisiting resnets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34, 22614–22627.
Bevandić, P., & Šegvić, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. arXiv preprint arXiv:2207.08445.
Bodla, N., Singh, B., Chellappa, R. & Davis, L.S. (2017). Soft-nms: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
Bu, X., Peng, J., Yan, J., Tan, T. & Zhang, Z. (2021). Gaia: A transfer learning system of object detection that fits your needs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 274–283).
Cai, L., Zhang, Z., Zhu, Y., Zhang, L., Li, M. & Xue, X. (2022). Bigdetection: A large-scale benchmark for improved object detector pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4777–4787).
Cai, Z., & Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6154–6162).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213–229).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., et al. (2019). Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Gao, M., Yu, R., Li, A., Morariu, V.I. & Davis, L.S. (2018). Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6926–6935).
Ghiasi, G., Lin, T.-Y. & Le, Q.V. (2019). Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7036–7045).
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Gong, R., Dai, D., Chen, Y., Li, W. & Van Gool, L. (2021). mdalu: Multi-source domain adaptation and label unification with partial datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8876–8885).
Goyal, P., Duval, Q., Seessel, I., Caron, M., Singh, M., Misra, I., & Bojanowski, P. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
Gupta, A., Dollar, P. & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5356–5364).
Gupta, T., Kamath, A., Kembhavi, A. & Hoiem, D. (2022). Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16399–16409).
Hasan, I., Liao, S., Li, J., Akram, S.U. & Shao, L. (2021). Generalizable pedestrian detection: The elephant in the room. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11328–11337).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738).
He, Y., Huang, G., Chen, S., Teng, J., Wang, K., Yin, Z. & Shao, J. (2022). X-learner: Learning cross sources and tasks for universal visual representation. In European Conference on Computer Vision (pp. 509–528).
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., & Fathi, A., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311).
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H. & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904–4916).
Joulin, A., Maaten, L.v.d., Jabri, A. & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In European Conference on Computer Vision (pp. 67–84).
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D. & Kembhavi, A. (2022). Webly supervised concept expansion for general purpose vision models. In European Conference on Computer Vision (pp. 662–681).
Kolesnikov, A., Zhai, X. & Beyer, L. (2019). Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1920–1929).
Kornblith, S., Shlens, J. & Le, Q.V. (2019). Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2661–2671).
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., et al. (2020). The open images dataset v4. International Journal of Computer Vision, 128(7), 1956–1981.
Lambert, J., Liu, Z., Sener, O., Hays, J. & Koltun, V. (2020). Mseg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2879–2888).
Lin, F., Xu, H., Li, H., Xiong, H. & Qi, G.-J. (2021). Auto-encoding transformations in reparameterized lie groups for unsupervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 8610–8617).
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D. & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European Conference on Computer Vision (pp. 740–755).
Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. (2018). Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8759–8768).
Liu, Y., Wang, Y., Wang, S., Liang, T., Zhao, Q., Tang, Z. & Ling, H. (2020). Cbnet: A novel composite backbone network architecture for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 11653–11660).
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12009–12019).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Lu, J., Clark, C., Zellers, R., Mottaghi, R. & Kembhavi, A. (2022). Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
Meng, L., Dai, X., Chen, Y., Zhang, P., Chen, D., Liu, M., & Jiang, Y.-G. (2022). Detection hub: Unifying object detection datasets via query adaptation on language embedding. arXiv preprint arXiv:2206.03484.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., et al. (2018). Mixed precision training. In International Conference on Learning Representations.
Neuhold, G., Ollmann, T., Rota Bulo, S. & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4990–4999).
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W. & Lin, D. (2019). Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K. & Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–37.
Ren, S., He, K., Girshick, R. & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Shao, J., Chen, S., Li, Y., Wang, K., Yin, Z., He, Y., et al. (2021). Intern: A new learning paradigm towards general vision. arXiv preprint arXiv:2111.08687.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X. & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8430–8439).
Singh, B., & Davis, L.S. (2018). An analysis of scale invariance in object detection snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3578–3587).
Sun, C., Shrivastava, A., Singh, S. & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (pp. 843–852).
Tan, M., Pang, R. & Le, Q.V. (2020). Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790).
Vasconcelos, C., Birodkar, V. & Dumoulin, V. (2022). Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13628–13637).
Vu, T., Jang, H., Pham, T.X. & Yoo, C. (2019). Cascade rpn: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems, 32.
Wang, X., Cai, Z., Gao, D. & Vasconcelos, N. (2019). Towards universal object detection by domain attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7289–7298).
Xu, H., Fang, L., Liang, X., Kang, W. & Li, Z. (2020). Universal-rcnn: Universal object detector via transferable graph r-cnn. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12492–12499).
Xu, H., Zhang, X., Li, H., Xie, L., Dai, W., Xiong, H., & Tian, Q. (2022). Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3753–67.
Yu, J., Jiang, Y., Wang, Z., Cao, Z. & Huang, T. (2016). Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520).
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
Zhang, S., Benenson, R. & Schiele, B. (2017). Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213–3221).
Zhao, X., Schulter, S., Sharma, G., Tsai, Y.-H., Chandraker, M. & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In European Conference on Computer Vision (pp. 178–193).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A. & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 633–641).
Zhou, X., Koltun, V. & Krähenbühl, P. (2022). Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7571–7580).
Additional information
Communicated by Bernhard Egger.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
RVC Submission
We provide an in-depth presentation of our final RVC submission, which deviates slightly from the configuration outlined in the main text. The submission employed a customized variant of the proposed detector built upon SEER-RegNet256gf. To meet the RVC deadline, the following simplifications were incorporated into the model:
- Instead of the default \(C_2\), \(C_3\), \(C_4\), \(C_5\) inputs to NAS-FPN, we employed the side-outputs \(C_3\), \(C_4\), \(C_5\) and applied a \(2\times \) downsampling on \(C_5\) twice, resulting in a 5-level feature pyramid (see the sketch after this list). While this simplification reduced accuracy on small objects, it significantly accelerated training.
- The basic anchor scale in Cascade RPN was reduced to 5.04 (\(4\times 2^{1/3}\)). This adjustment aligns with the changes in NAS-FPN and reduces missed detections of small objects.
- The model was trained for 720k iterations, with the learning rate decayed by a factor of 0.1 at 600k iterations.
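The first two simplifications can be made concrete with a short PyTorch-style sketch; the stride-2 max pooling used for the \(2\times \) downsampling and the function name are our own illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def build_five_level_pyramid(c3, c4, c5):
    """Assemble a 5-level feature pyramid from the C3-C5 side-outputs.

    C5 is downsampled by 2x twice to produce two extra coarse levels,
    replacing the default C2-C5 inputs of NAS-FPN. Stride-2 max pooling
    is an assumption here; the actual downsampling operator may differ.
    """
    p6 = F.max_pool2d(c5, kernel_size=1, stride=2)  # first 2x downsampling of C5
    p7 = F.max_pool2d(p6, kernel_size=1, stride=2)  # second 2x downsampling
    return [c3, c4, c5, p6, p7]

# Basic anchor scale in Cascade RPN, reduced to 4 * 2^(1/3) ≈ 5.04.
anchor_scale = 4 * 2 ** (1 / 3)

# Example: side-outputs at strides 8/16/32 for a 512x512 input.
c3, c4, c5 = (torch.randn(1, 256, s, s) for s in (64, 32, 16))
levels = build_five_level_pyramid(c3, c4, c5)
print([tuple(t.shape[-2:]) for t in levels])  # [(64, 64), ..., (4, 4)]
print(f"anchor scale = {anchor_scale:.2f}")   # anchor scale = 5.04
```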
During the dataset-agnostic inference procedure, Soft-NMS (Bodla et al., 2017) was performed with an IoU threshold of 0.6 and a score threshold of 0.001. Then, per dataset (see the configuration sketch after this list):
- for COCO, we limited the maximum number of predictions per image to 100, and the short edge of the input image was resized to 800 pixels;
- for MVD, we limited the maximum number of predictions per image to 300, and the short edge of the input image was resized to 2048 pixels;
- for OID, we limited the maximum number of predictions per image to 300, and the short edge of the input image was resized to 800 pixels.
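These per-dataset settings can be collected into a small configuration sketch; the dictionary layout and key names below are our own illustration rather than the exact structure of our mmdetection configs:

```python
# Dataset-agnostic Soft-NMS settings used at inference (values from the text
# above; the key names and layout are illustrative, not our exact config).
soft_nms = dict(iou_threshold=0.6, score_threshold=0.001)

# Per-dataset overrides: prediction cap and input short-edge size (pixels).
inference_cfgs = {
    "COCO": dict(max_dets_per_image=100, short_edge=800),
    "MVD":  dict(max_dets_per_image=300, short_edge=2048),
    "OID":  dict(max_dets_per_image=300, short_edge=800),
}
```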
We did not employ any advanced inference techniques, such as multi-scale test augmentation. The performance of our submission (IFFF_RVC) on the three datasets is summarized in Table 4. For comparison with the results presented in this paper, we evaluated our RVC submission model on the validation sets with at most 300 predictions per image, using standard Non-Maximum Suppression (NMS). All other testing configurations remained consistent with those used on the test sets.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Lin, F., Hu, W., Wang, Y. et al. Universal Object Detection with Large Vision Model. Int J Comput Vis 132, 1258–1276 (2024). https://doi.org/10.1007/s11263-023-01929-0