Video Instance Segmentation in an Open-World

Published in: International Journal of Computer Vision

Abstract

Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. The open-world formulation relaxes the closed-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories as well as labels an unknown object as ‘unknown’, and then (b) it incrementally learns the class of an unknown as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a lightweight auxiliary network, aims at accurate pixel-level (unknown) object delineation from the background as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6% AP on the YouTube-VIS 2019 val set. Lastly, we show the generalizability of our contributions in the open-world object detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at https://github.com/OmkarThawakar/OWVISFormer.

Notes

  1. Proposals are obtained from the instance features \({\varvec{Q}}^I\), and only those remaining after selecting the ground-truth class instances through Hungarian matching are considered; a toy sketch of this selection step is shown below.

References

  • Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., & Leibe, B. (2020). Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV.

  • Awais, M., Naseer, M., Khan, S., Anwer, R.M., Cholakkal, H., Shah, M., Yang, M.H., & Khan, F.S. (2023). Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721.

  • Bertasius, G., & Torresani, L. (2020). Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR.

  • Caelles, A., Meinhardt, T., Brasó, G., & Leal-Taixé, L. (2022). DeVIS: Making deformable transformers work for video instance segmentation. arXiv preprint arXiv:2207.11103.

  • Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., & Shao, L. (2020). Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In ICCV.

  • Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In CVPR, pp. 1290–1299.

  • Dudhane, A., Thawakar, O., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2024). Dynamic pre-training: Towards efficient and scalable all-in-one image restoration. arXiv preprint arXiv:2404.02154.

  • Dudhane, A., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2023). Burstormer: Burst image restoration and enhancement transformer. In CVPR, pp. 5703–5712. IEEE.

  • Fu, Y., Yang, L., Liu, D., Huang, T.S., & Shi, H. (2021). Compfeat: Comprehensive feature aggregation for video instance segmentation. In AAAI.

  • Geng, Z., Liang, L., Ding, T., & Zharkov, I. (2022). Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In CVPR, pp. 17441–17451.

  • Gu, X., Lin, T.Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.

  • Guo, P., Huang, T., He, P., Liu, X., Xiao, T., Chen, Z., & Zhang, W. (2023). Openvis: Open-vocabulary video instance segmentation. arXiv preprint arXiv:2305.16835.

  • Gupta, A., Narayan, S., Joseph, K., Khan, S., Khan, F.S., & Shah, M. (2022). Ow-detr: Open-world detection transformer. In CVPR.

  • Han, W., Jun, T., Xiaodong, L., Shanyan, G., Rong, X., & Li, S. (2022). Ptseformer: Progressive temporal-spatial enhanced transformer towards video object detection. In ECCV.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R.B. (2017). Mask r-cnn. In ICCV.

  • Heo, M., Hwang, S., Hyun, J., Kim, H., Oh, S.W., Lee, J.-Y., & Kim, S.J. (2023). A generalized framework for video instance segmentation. In CVPR, pp. 14623–14632.

  • Heo, M., Hwang, S., Oh, S. W., Lee, J. Y., & Kim, S. J. (2022). Vita: Video instance segmentation via object token association. NeurIPS, 35, 23109–23120.

  • Hwang, S., Heo, M., Oh, S. W., & Kim, S. J. (2021). Video instance segmentation using inter-frame communication transformers. NeurIPS, 34, 13352–13363.

  • Joseph, K., Khan, S., Khan, F.S., & Balasubramanian, V.N. (2021). Towards open world object detection. In CVPR.

  • Ke, L., Li, X., Danelljan, M., Tai, Y. W., Tang, C.K., & Yu, F. (2021). Prototypical cross-attention networks for multiple object tracking and segmentation. In NeurIPS.

  • Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97.

  • Saito, K., Hu, P., Darrell, T., & Saenko, K. (2022). Learning to detect every thing in an open world. In ECCV.

  • Li, X., Ding, H., Yuan, H., Zhang, W., Pang, J., Cheng, G., Chen, K., Liu, Z., & Loy, C.C. (2023). Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.

  • Li, X., Yuan, H., Zhang, W., Cheng, G., Pang, J., & Loy, C.C. (2023). Tube-link: A flexible cross tube baseline for universal video segmentation. arXiv preprint arXiv:2303.12782.

  • Lin, T., Goyal, P., Girshick, R.B., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In ICCV.

  • Lin, C., Hung, Y., Feris, R., & He, L. (2020). Video instance segmentation tracking with a modified vae architecture. In CVPR.

  • Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In ECCV.

  • Liu, D., Cui, Y., Tan, W., & Chen, Y. (2021). Sg-net: Spatial granularity network for one-stage video instance segmentation. In CVPR.

  • Liu, Y., Zulfikar, I.E., Luiten, J., Dave, A., Ramanan, D., Leibe, B., Ošep, A., & Leal-Taixé, L. (2021). Opening up open-world tracking. In CVPR.

  • Naseer, M., Ranasinghe, K., Khan, S., Khan, F.S., & Porikli, F. (2021). On improving adversarial transferability of vision transformers. arXiv preprint arXiv:2106.04169.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., & Garnett, R. (Eds.), NeurIPS, pp. 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

  • Ranasinghe, K., Naseer, M., Khan, S., Khan, F.S., & Ryoo, M.S. (2022). Self-supervised video transformer. In CVPR, pp. 2874–2884.

  • Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., & Khan, F.S. (2023). Fine-tuned clip models are efficient video learners. In CVPR, pp. 6545–6554.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., & Li, F. (2015). Imagenet large scale visual recognition challenge. IJCV.

  • Thawakar, O., Anwer, R.M., Laaksonen, J., Reiner, O., Shah, M., & Khan, F.S. (2023). 3d mitochondria instance segmentation with spatio-temporal transformers. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 613–623. Springer.

  • Thawakar, O., Narayan, S., Cao, J., Cholakkal, H., Anwer, R.M., Khan, M.H., Khan, S., Felsberg, M., & Khan, F.S. (2022). Video instance segmentation via multi-scale spatio-temporal split attention transformer. In ECCV, pp. 666–681. Springer

  • Wang, W., Feiszli, M., Wang, H., & Tran, D. (2021). Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, pp. 10776–10785.

  • Wang, W., Feiszli, M., Wang, H., Malik, J., & Tran, D. (2022). Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In CVPR, pp. 4422–4432.

  • Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., & Xia, H. (2021). End-to-end video instance segmentation with transformers. In CVPR.

  • Wu, J., Jiang, Y., Zhang, W., Bai, X., & Bai, S. (2022). Seqformer: A frustratingly simple model for video instance segmentation. In ECCV.

  • Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., & Jiang, X., et al. (2024). Towards open vocabulary learning: A survey. TPAMI.

  • Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., & Bai, X. (2022). In defense of online models for video instance segmentation. In ECCV, pp. 588–605. Springer.

  • Xu, N., Yang, L., Yang, J., Yue, D., Fan, Y., Liang, Y., & Huang, T.S. (2021). YouTube-VIS Dataset 2021 Version. https://youtube-vos.org/dataset/vis.

  • Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In ICCV.

  • Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., & Liu, W. (2021). Crossover learning for fast online video instance segmentation. In ICCV.

  • Zhang, T., Tian, X., Wu, Y., Ji, S., Wang, X., Zhang, Y., & Wan, P. (2023). Dvis: Decoupled video instance segmentation framework. In ICCV, pp. 1282–1291.

  • Zhou, Q., Li, X., He, L., Yang, Y., Cheng, G., Tong, Y., Ma, L., & Tao, D. (2022). Transvod: End-to-end video object detection with spatial-temporal transformers. TPAMI.

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In ICLR.

Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Alvis partially funded by the Swedish Research Council through grant agreement no. 2022-06725, the LUMI supercomputer hosted by CSC (Finland) and the LUMI consortium, and by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

Author information

Corresponding author

Correspondence to Omkar Thawakar.

Additional information

Communicated by Hong Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Thawakar, O., Narayan, S., Cholakkal, H. et al. Video Instance Segmentation in an Open-World. Int J Comput Vis 133, 398–409 (2025). https://doi.org/10.1007/s11263-024-02195-4
