Abstract
Over the past few years, there has been growing interest in developing broad, general-purpose computer vision systems. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains, and this universality is crucial for practical, real-world applications. In this study, we focus on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity of handling hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method achieves strong performance, ranking second in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe this comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
Data Availability Statement
The authors confirm that the data supporting the findings of this work are available within the article or its supplementary materials.
Notes
www.robustvision.net/leaderboard.php?benchmark=object, see the IFFF_RVC entry on the leaderboard.
The object detectors used here are built upon Cascade R-CNN enhanced with NAS-FPN (\(\times \)7) and Cascade RPN, with SEER-RegNet32gf as the backbone. As detailed in Sect. 4.2, the universal object detector is trained for 1.15M iterations with a batch size of 16. Given the re-sampled training set size of approximately 2.3M images, this corresponds to 8 training epochs; the sketch below makes the arithmetic explicit. Consequently, for Individual OD on OID, the number of training epochs is set to 8. On a comparable scale, for Individual OD on COCO, the object detector is trained for 12 epochs, following the default configuration in the mmdetection codebase. Lastly, for Individual OD on MVD, the object detector is also trained for 12 epochs, initializing its network parameters from the COCO detector mentioned above.
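For concreteness, a minimal sketch of this epoch arithmetic (all values are taken from the note above; the variable names are our own):

```python
# Epoch arithmetic for the universal object detector (values from the note above).
total_iterations = 1_150_000    # 1.15M training iterations
batch_size = 16                 # images per iteration
resampled_set_size = 2_300_000  # ~2.3M images after re-sampling

images_processed = total_iterations * batch_size  # 18.4M images seen in training
epochs = images_processed / resampled_set_size    # 18.4M / 2.3M = 8.0
print(f"training epochs = {epochs:.0f}")          # -> training epochs = 8
```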
References
Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., & Deaton, J., et al. (2021). Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478–3488).
Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T.-Y., & Zoph, B. (2021). Revisiting resnets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34, 22614–22627.
Bevandić, P., & Šegvić, S. (2022). Automatic universal taxonomies for multi-domain semantic segmentation. arXiv preprint arXiv:2207.08445.
Bodla, N., Singh, B., Chellappa, R. & Davis, L.S. (2017). Soft-nms: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
Bu, X., Peng, J., Yan, J., Tan, T. & Zhang, Z. (2021). Gaia: A transfer learning system of object detection that fits your needs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 274–283).
Cai, L., Zhang, Z., Zhu, Y., Zhang, L., Li, M. & Xue, X. (2022). Bigdetection: A large-scale benchmark for improved object detector pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4777–4787).
Cai, Z., & Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6154–6162).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213–229).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., et al. (2019). Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Gao, M., Yu, R., Li, A., Morariu, V.I. & Davis, L.S. (2018). Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6926–6935).
Ghiasi, G., Lin, T.-Y. & Le, Q.V. (2019). Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7036–7045).
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Gong, R., Dai, D., Chen, Y., Li, W. & Van Gool, L. (2021). mdalu: Multi-source domain adaptation and label unification with partial datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8876–8885).
Goyal, P., Duval, Q., Seessel, I., Caron, M., Singh, M., Misra, I., & Bojanowski, P. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
Gupta, A., Dollar, P. & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5356–5364).
Gupta, T., Kamath, A., Kembhavi, A. & Hoiem, D. (2022). Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16399–16409).
Hasan, I., Liao, S., Li, J., Akram, S.U. & Shao, L. (2021). Generalizable pedestrian detection: The elephant in the room. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11328–11337).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738).
He, Y., Huang, G., Chen, S., Teng, J., Wang, K., Yin, Z. & Shao, J. (2022). X-learner: Learning cross sources and tasks for universal visual representation. In European Conference on Computer Vision (pp. 509–528).
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., & Fathi, A., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311).
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H. & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904–4916).
Joulin, A., Maaten, L.v.d., Jabri, A. & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In European Conference on Computer Vision (pp. 67–84).
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D. & Kembhavi, A. (2022). Webly supervised concept expansion for general purpose vision models. In European Conference on Computer Vision (pp. 662–681).
Kolesnikov, A., Zhai, X. & Beyer, L. (2019). Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1920–1929).
Kornblith, S., Shlens, J. & Le, Q.V. (2019). Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2661–2671).
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., et al. (2020). The open images dataset v4. International Journal of Computer Vision, 128(7), 1956–1981.
Lambert, J., Liu, Z., Sener, O., Hays, J. & Koltun, V. (2020). Mseg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2879–2888).
Lin, F., Xu, H., Li, H., Xiong, H. & Qi, G.-J. (2021). Auto-encoding transformations in reparameterized lie groups for unsupervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 8610–8617).
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D. & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European Conference on Computer Vision (pp. 740–755).
Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. (2018). Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8759–8768).
Liu, Y., Wang, Y., Wang, S., Liang, T., Zhao, Q., Tang, Z. & Ling, H. (2020). Cbnet: A novel composite backbone network architecture for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 11653–11660).
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12009–12019).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Lu, J., Clark, C., Zellers, R., Mottaghi, R. & Kembhavi, A. (2022). Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
Meng, L., Dai, X., Chen, Y., Zhang, P., Chen, D., Liu, M., & Jiang, Y.-G. (2022). Detection hub: Unifying object detection datasets via query adaptation on language embedding. arXiv preprint arXiv:2206.03484.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., et al. (2018). Mixed precision training. In International Conference on Learning Representations.
Neuhold, G., Ollmann, T., Rota Bulo, S. & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4990–4999).
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W. & Lin, D. (2019). Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763).
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K. & Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1623–37.
Ren, S., He, K., Girshick, R. & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Shao, J., Chen, S., Li, Y., Wang, K., Yin, Z., He, Y., et al. (2021). Intern: A new learning paradigm towards general vision. arXiv preprint arXiv:2111.08687.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X. & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8430–8439).
Singh, B., & Davis, L.S. (2018). An analysis of scale invariance in object detection snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3578–3587).
Sun, C., Shrivastava, A., Singh, S. & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (pp. 843–852).
Tan, M., Pang, R. & Le, Q.V. (2020). Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790).
Vasconcelos, C., Birodkar, V. & Dumoulin, V. (2022). Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13628–13637).
Vu, T., Jang, H., Pham, T.X. & Yoo, C. (2019). Cascade rpn: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems, 32.
Wang, X., Cai, Z., Gao, D. & Vasconcelos, N. (2019). Towards universal object detection by domain attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7289–7298).
Xu, H., Fang, L., Liang, X., Kang, W. & Li, Z. (2020). Universal-rcnn: Universal object detector via transferable graph r-cnn. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12492–12499).
Xu, H., Zhang, X., Li, H., Xie, L., Dai, W., Xiong, H., & Tian, Q. (2022). Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3753–67.
Yu, J., Jiang, Y., Wang, Z., Cao, Z. & Huang, T. (2016). Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520).
Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
Zhang, S., Benenson, R. & Schiele, B. (2017). Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213–3221).
Zhao, X., Schulter, S., Sharma, G., Tsai, Y.-H., Chandraker, M. & Wu, Y. (2020). Object detection with a unified label space from multiple datasets. In European Conference on Computer Vision (pp. 178–193).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A. & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 633–641).
Zhou, X., Koltun, V. & Krähenbühl, P. (2022). Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7571–7580).
Additional information
Communicated by Bernhard Egger.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
RVC Submission
We provide an in-depth presentation of our final RVC submission, which deviates slightly from the configuration outlined in the main text. The submission employed a customized variant of the proposed detector built upon SEER-RegNet256gf. To meet the RVC deadline, the following simplifications were incorporated into the model:
- Instead of the default \(C_2\), \(C_3\), \(C_4\), \(C_5\) inputs to NAS-FPN, we employed the side-outputs \(C_3\), \(C_4\), \(C_5\) and applied a \(2\times \) downsampling on \(C_5\) twice, resulting in a 5-level feature pyramid (see the sketch after this list). While this simplification reduced accuracy on small objects, it significantly accelerated training.
- The basic anchor scale in Cascade RPN was reduced to 5.04 (\(4\times 2^{1/3}\)). This adjustment aligns with the changes in NAS-FPN and reduces missed detections of small objects.
- The model was trained for 720k iterations, with the learning rate decayed by a factor of 0.1 at 600k iterations.
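The first two simplifications can be made concrete with a short PyTorch-style sketch; the stride-2 max pooling used for the \(2\times \) downsampling and the function name are our own illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def build_five_level_pyramid(c3, c4, c5):
    """Assemble a 5-level feature pyramid from the C3-C5 side-outputs.

    C5 is downsampled by 2x twice to produce two extra coarse levels,
    replacing the default C2-C5 inputs of NAS-FPN. Stride-2 max pooling
    is an assumption here; the actual downsampling operator may differ.
    """
    p6 = F.max_pool2d(c5, kernel_size=1, stride=2)  # first 2x downsampling of C5
    p7 = F.max_pool2d(p6, kernel_size=1, stride=2)  # second 2x downsampling
    return [c3, c4, c5, p6, p7]

# Basic anchor scale in Cascade RPN, reduced to 4 * 2^(1/3) ≈ 5.04.
anchor_scale = 4 * 2 ** (1 / 3)

# Example: side-outputs at strides 8/16/32 for a 512x512 input.
c3, c4, c5 = (torch.randn(1, 256, s, s) for s in (64, 32, 16))
levels = build_five_level_pyramid(c3, c4, c5)
print([tuple(t.shape[-2:]) for t in levels])  # [(64, 64), ..., (4, 4)]
print(f"anchor scale = {anchor_scale:.2f}")   # anchor scale = 5.04
```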
During the dataset-agnostic inference procedure, Soft-NMS (Bodla et al., 2017) was performed with an IoU threshold of 0.6 and a score threshold of 0.001. Then, per dataset (see the configuration sketch after this list):
- for COCO, we limited the maximum number of predictions per image to 100, and the short edge of the input image was resized to 800 pixels;
- for MVD, we limited the maximum number of predictions per image to 300, and the short edge of the input image was resized to 2048 pixels;
- for OID, we limited the maximum number of predictions per image to 300, and the short edge of the input image was resized to 800 pixels.
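These per-dataset settings can be collected into a small configuration sketch; the dictionary layout and key names below are our own illustration rather than the exact structure of our mmdetection configs:

```python
# Dataset-agnostic Soft-NMS settings used at inference (values from the text
# above; the key names and layout are illustrative, not our exact config).
soft_nms = dict(iou_threshold=0.6, score_threshold=0.001)

# Per-dataset overrides: prediction cap and input short-edge size (pixels).
inference_cfgs = {
    "COCO": dict(max_dets_per_image=100, short_edge=800),
    "MVD":  dict(max_dets_per_image=300, short_edge=2048),
    "OID":  dict(max_dets_per_image=300, short_edge=800),
}
```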
We did not employ any advanced inference techniques, such as multi-scale test augmentation. The performance of our submission (IFFF_RVC) on the three datasets is summarized in Table 4. For comparison with the results presented in this paper, we evaluated our RVC submission model on the validation sets with at most 300 predictions per image, using standard Non-Maximum Suppression (NMS). All other testing configurations remained consistent with those used on the test sets.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Lin, F., Hu, W., Wang, Y. et al. Universal Object Detection with Large Vision Model. Int J Comput Vis 132, 1258–1276 (2024). https://doi.org/10.1007/s11263-023-01929-0