Abstract
Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only instances of seen categories are identified and spatio-temporally segmented at inference. The open-world formulation relaxes this closed-world, static-learning assumption as follows: (a) it first distinguishes a set of known categories while labelling unknown objects as ‘unknown’, and then (b) it incrementally learns the classes of these unknowns as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, which introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a light-weight auxiliary network, aims at accurately delineating (unknown) objects from the background at the pixel level as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6% AP on the YouTube-VIS 2019 val. set. Lastly, we show the generalizability of our contributions to the open-world object detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at https://github.com/OmkarThawakar/OWVISFormer.
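To make the contrastive foreground-enhancement idea behind the STO module concrete, the following is a minimal PyTorch sketch rather than the authors' implementation: it pulls normalized foreground pixel features toward a foreground prototype while pushing them away from background pixels. All tensor names, shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def objectness_contrastive_loss(features, fg_mask, temperature=0.1):
    # features: (T, C, H, W) per-frame feature maps; fg_mask: (T, H, W) with 1 for foreground.
    T, C, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, C)            # (T*H*W, C) pixel features
    mask = fg_mask.reshape(-1).bool()
    fg = F.normalize(feats[mask], dim=-1)                          # foreground pixel features
    bg = F.normalize(feats[~mask], dim=-1)                         # background pixel features
    if fg.size(0) == 0 or bg.size(0) == 0:
        return features.sum() * 0.0                                # degenerate mask: no loss
    fg_proto = F.normalize(fg.mean(dim=0, keepdim=True), dim=-1)   # (1, C) foreground prototype
    pos = fg @ fg_proto.t() / temperature                          # similarity to the prototype
    neg = fg @ bg.t() / temperature                                # similarity to background pixels
    logits = torch.cat([pos, neg], dim=1)                          # positive sits at column 0
    labels = torch.zeros(fg.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random inputs.
feats = torch.randn(4, 64, 32, 32, requires_grad=True)
mask = (torch.rand(4, 32, 32) > 0.7).float()
objectness_contrastive_loss(feats, mask).backward()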
Notes
Proposals are obtained from the instance features \({\varvec{Q}}^I\); only those that remain after the ground-truth class instances are selected through Hungarian matching are considered.
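As a rough illustration of the proposal split described above, the sketch below (with assumed names and shapes, not the paper's code) uses Hungarian matching via SciPy's linear_sum_assignment to assign proposals to ground-truth classes and treats the unmatched proposals as the remaining candidates.

import torch
from scipy.optimize import linear_sum_assignment

def split_proposals(class_logits, gt_labels):
    # class_logits: (N, K) class scores for N proposals; gt_labels: (M,) ground-truth classes, M <= N.
    probs = class_logits.softmax(dim=-1)
    cost = -probs[:, gt_labels]                                   # (N, M) assignment cost
    row, _ = linear_sum_assignment(cost.detach().cpu().numpy())   # Hungarian matching
    matched = torch.as_tensor(row, dtype=torch.long)              # proposals assigned to ground truth
    keep = [i for i in range(class_logits.size(0)) if i not in set(row.tolist())]
    remaining = torch.tensor(keep, dtype=torch.long)              # candidates for pseudo-labelling
    return matched, remaining

matched, remaining = split_proposals(torch.randn(10, 5), torch.tensor([1, 3]))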
References
Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., & Leibe, B. (2020). Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV.
Awais, M., Naseer, M., Khan, S., Anwer, R.M., Cholakkal, H., Shah, M., Yang, M.H., & Khan, F.S. (2023). Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721.
Bertasius, G., & Torresani, L. (2020). Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR.
Caelles, A., Meinhardt, T., Brasó, G., & Leal-Taixé, L. (2022). DeVIS: Making deformable transformers work for video instance segmentation. arXiv:2207.11103.
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., & Shao, L. (2020). Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In ICCV.
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In CVPR, pp. 1290–1299.
Dudhane, A., Thawakar, O., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2024). Dynamic pre-training: Towards efficient and scalable all-in-one image restoration. arXiv preprint arXiv:2404.02154.
Dudhane, A., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2023). Burstormer: Burst image restoration and enhancement transformer. In CVPR, pp. 5703–5712. IEEE.
Fu, Y., Yang, L., Liu, D., Huang, T.S., & Shi, H. (2021). Compfeat: Comprehensive feature aggregation for video instance segmentation. AAAI.
Geng, Z., Liang, L., Ding, T., & Zharkov, I. (2022). Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In CVPR, pp. 17441–17451.
Gu, X., Lin, T.Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
Guo, P., Huang, T., He, P., Liu, X., Xiao, T., Chen, Z., & Zhang, W. (2023). Openvis: Open-vocabulary video instance segmentation. arXiv preprint arXiv:2305.16835.
Gupta, A., Narayan, S., Joseph, K., Khan, S., Khan, F.S., & Shah, M. (2022). Ow-detr: Open-world detection transformer. In CVPR.
Han, W., Jun, T., Xiaodong, L., Shanyan, G., Rong, X., & Li, S. (2022). Ptseformer: Progressive temporal-spatial enhanced transformer towards video object detection. ECCV.
He, K., Gkioxari, G., Dollár, P., & Girshick, R.B. (2017). Mask r-cnn. In ICCV.
Heo, M., Hwang, S., Hyun, J., Kim, H., Oh, S.W., Lee, J.-Y., & Kim, S.J. (2023). A generalized framework for video instance segmentation. In CVPR, pp. 14623–14632.
Heo, M., Hwang, S., Oh, S. W., Lee, J. Y., & Kim, S. J. (2022). Vita: Video instance segmentation via object token association. NeurIPS, 35, 23109–23120.
Hwang, S., Heo, M., Oh, S. W., & Kim, S. J. (2021). Video instance segmentation using inter-frame communication transformers. NeurIPS, 34, 13352–13363.
Joseph, K., Khan, S., Khan, F.S., & Balasubramanian, V.N. (2021). Towards open world object detection. In CVPR.
Ke, L., Li, X., Danelljan, M., Tai, Y. W., Tang, C.K., & Yu, F. (2021). Prototypical cross-attention networks for multiple object tracking and segmentation. In NeurIPS.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2), 83–97.
Saito, K., Hu, P., Darrell, T., & Saenko, K. (2022). Learning to detect every thing in an open world. ECCV.
Li, X., Ding, H., Yuan, H., Zhang, W., Pang, J., Cheng, G., Chen, K., Liu, Z., & Loy, C.C. (2023). Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.
Li, X., Yuan, H., Zhang, W., Cheng, G., Pang, J., & Loy, C.C. (2023). Tube-link: A flexible cross tube baseline for universal video segmentation. arXiv preprint arXiv:2303.12782.
Lin, T., Goyal, P., Girshick, R.B., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In ICCV.
Lin, C., Hung, Y., Feris, R., & He, L. (2020). Video instance segmentation tracking with a modified vae architecture. In CVPR.
Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In ECCV.
Liu, D., Cui, Y., Tan, W., & Chen, Y. (2021). Sg-net: Spatial granularity network for one-stage video instance segmentation. In CVPR.
Liu, Y., Zulfikar, I.E., Luiten, J., Dave, A., Ramanan, D., Leibe, B., Ošep, A., & Leal-Taixé, L. (2021). Opening up open-world tracking. In CVPR.
Naseer, M., Ranasinghe, K., Khan, S., Khan, F.S., & Porikli, F. (2021). On improving adversarial transferability of vision transformers. arXiv preprint arXiv:2106.04169.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Ranasinghe, K., Naseer, M., Khan, S., Khan, F.S., & Ryoo, M.S. (2022). Self-supervised video transformer. In CVPR, pp. 2874–2884.
Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., & Khan, F.S. (2023). Fine-tuned clip models are efficient video learners. In CVPR, pp. 6545–6554.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., & Li, F. (2015). ImageNet large scale visual recognition challenge. IJCV.
Thawakar, O., Anwer, R.M., Laaksonen, J., Reiner, O., Shah, M., & Khan, F.S. (2023). 3d mitochondria instance segmentation with spatio-temporal transformers. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 613–623. Springer.
Thawakar, O., Narayan, S., Cao, J., Cholakkal, H., Anwer, R.M., Khan, M.H., Khan, S., Felsberg, M., & Khan, F.S. (2022). Video instance segmentation via multi-scale spatio-temporal split attention transformer. In ECCV, pp. 666–681. Springer.
Wang, W., Feiszli, M., Wang, H., & Tran, D. (2021). Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, pp. 10776–10785.
Wang, W., Feiszli, M., Wang, H., Malik, J., & Tran, D. (2022). Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In CVPR, pp. 4422–4432.
Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., & Xia, H. (2021). End-to-end video instance segmentation with transformers. CVPR.
Wu, J., Jiang, Y., Zhang, W., Bai, X., & Bai, S. (2022). Seqformer: a frustratingly simple model for video instance segmentation. ECCV.
Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., & Jiang, X., et al. (2024). Towards open vocabulary learning: A survey. TPAMI.
Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., & Bai, X. (2022). In defense of online models for video instance segmentation. In ECCV, pp. 588–605. Springer.
Xu, N., Yang, L., Yang, J., Yue, D., Fan, Y., Liang, Y., & Huang, T.S. (2021). YouTube-VIS Dataset 2021 Version. https://youtube-vos.org/dataset/vis.
Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In ICCV.
Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., & Liu, W. (2021). Crossover learning for fast online video instance segmentation. In ICCV.
Zhang, T., Tian, X., Wu, Y., Ji, S., Wang, X., Zhang, Y., & Wan, P. (2023). Dvis: Decoupled video instance segmentation framework. In ICCV, pp. 1282–1291.
Zhou, Q., Li, X., He, L., Yang, Y., Cheng, G., Tong, Y., Ma, L., & Tao, D. (2022). Transvod: end-to-end video object detection with spatial-temporal transformers. TPAMI.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In ICLR.
Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Alvis partially funded by the Swedish Research Council through grant agreement no. 2022-06725, the LUMI supercomputer hosted by CSC (Finland) and the LUMI consortium, and by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.
Additional information
Communicated by Hong Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Thawakar, O., Narayan, S., Cholakkal, H. et al. Video Instance Segmentation in an Open-World. Int J Comput Vis 133, 398–409 (2025). https://doi.org/10.1007/s11263-024-02195-4