Abstract
Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only instances of seen categories are identified and spatio-temporally segmented at inference. The open-world formulation relaxes this closed-world, static-learning assumption as follows: (a) it first distinguishes a set of known categories while labelling unknown objects as ‘unknown’, and then (b) it incrementally learns the classes of these unknowns as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, which introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a light-weight auxiliary network, aims at accurately delineating (unknown) objects from the background at the pixel level as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6% AP on the YouTube-VIS 2019 val. set. Lastly, we show the generalizability of our contributions to the open-world object detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at https://github.com/OmkarThawakar/OWVISFormer.
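To make the contrastive foreground-enhancement idea behind the STO module concrete, the following is a minimal PyTorch sketch rather than the authors' implementation: it pulls normalized foreground pixel features toward a foreground prototype while pushing them away from background pixels. All tensor names, shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def objectness_contrastive_loss(features, fg_mask, temperature=0.1):
    # features: (T, C, H, W) per-frame feature maps; fg_mask: (T, H, W) with 1 for foreground.
    T, C, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, C)            # (T*H*W, C) pixel features
    mask = fg_mask.reshape(-1).bool()
    fg = F.normalize(feats[mask], dim=-1)                          # foreground pixel features
    bg = F.normalize(feats[~mask], dim=-1)                         # background pixel features
    if fg.size(0) == 0 or bg.size(0) == 0:
        return features.sum() * 0.0                                # degenerate mask: no loss
    fg_proto = F.normalize(fg.mean(dim=0, keepdim=True), dim=-1)   # (1, C) foreground prototype
    pos = fg @ fg_proto.t() / temperature                          # similarity to the prototype
    neg = fg @ bg.t() / temperature                                # similarity to background pixels
    logits = torch.cat([pos, neg], dim=1)                          # positive sits at column 0
    labels = torch.zeros(fg.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random inputs.
feats = torch.randn(4, 64, 32, 32, requires_grad=True)
mask = (torch.rand(4, 32, 32) > 0.7).float()
objectness_contrastive_loss(feats, mask).backward()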
Notes
Proposals are obtained from the instance features \({\varvec{Q}}^I\); only those that remain after the ground-truth class instances are selected through Hungarian matching are considered.
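As a rough illustration of the proposal split described above, the sketch below (with assumed names and shapes, not the paper's code) uses Hungarian matching via SciPy's linear_sum_assignment to assign proposals to ground-truth classes and treats the unmatched proposals as the remaining candidates.

import torch
from scipy.optimize import linear_sum_assignment

def split_proposals(class_logits, gt_labels):
    # class_logits: (N, K) class scores for N proposals; gt_labels: (M,) ground-truth classes, M <= N.
    probs = class_logits.softmax(dim=-1)
    cost = -probs[:, gt_labels]                                   # (N, M) assignment cost
    row, _ = linear_sum_assignment(cost.detach().cpu().numpy())   # Hungarian matching
    matched = torch.as_tensor(row, dtype=torch.long)              # proposals assigned to ground truth
    keep = [i for i in range(class_logits.size(0)) if i not in set(row.tolist())]
    remaining = torch.tensor(keep, dtype=torch.long)              # candidates for pseudo-labelling
    return matched, remaining

matched, remaining = split_proposals(torch.randn(10, 5), torch.tensor([1, 3]))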
References
Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., & Leibe, B. (2020). Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV.
Awais, M., Naseer, M., Khan, S., Anwer, R.M., Cholakkal, H., Shah, M., Yang, M.H., & Khan, F.S. (2023). Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721.
Bertasius, G., & Torresani, L. (2020). Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR.
Caelles, A., Meinhardt, T., Brasó, G., & Leal-Taixé, L. (2022). DeVIS: Making deformable transformers work for video instance segmentation. arXiv:2207.11103.
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., & Shao, L. (2020). Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In ICCV.
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In CVPR, pp. 1290–1299.
Dudhane, A., Thawakar, O., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2024). Dynamic pre-training: Towards efficient and scalable all-in-one image restoration. arXiv preprint arXiv:2404.02154.
Dudhane, A., Zamir, S.W., Khan, S., Khan, F.S., & Yang, M.-H. (2023). Burstormer: Burst image restoration and enhancement transformer. In CVPR, pp. 5703–5712. IEEE.
Fu, Y., Yang, L., Liu, D., Huang, T.S., & Shi, H. (2021). Compfeat: Comprehensive feature aggregation for video instance segmentation. AAAI.
Geng, Z., Liang, L., Ding, T., & Zharkov, I. (2022). Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In CVPR, pp. 17441–17451.
Gu, X., Lin, T.Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
Guo, P., Huang, T., He, P., Liu, X., Xiao, T., Chen, Z., & Zhang, W. (2023). Openvis: Open-vocabulary video instance segmentation. arXiv preprint arXiv:2305.16835.
Gupta, A., Narayan, S., Joseph, K., Khan, S., Khan, F.S., & Shah, M. (2022). Ow-detr: Open-world detection transformer. In CVPR.
Han, W., Jun, T., Xiaodong, L., Shanyan, G., Rong, X., & Li, S. (2022). Ptseformer: Progressive temporal-spatial enhanced transformer towards video object detection. ECCV.
He, K., Gkioxari, G., Dollár, P., & Girshick, R.B. (2017). Mask r-cnn. In ICCV.
Heo, M., Hwang, S., Hyun, J., Kim, H., Oh, S.W., Lee, J.-Y., & Kim, S.J. (2023). A generalized framework for video instance segmentation. In CVPR, pp. 14623–14632.
Heo, M., Hwang, S., Oh, S. W., Lee, J. Y., & Kim, S. J. (2022). Vita: Video instance segmentation via object token association. NeurIPS, 35, 23109–23120.
Hwang, S., Heo, M., Oh, S. W., & Kim, S. J. (2021). Video instance segmentation using inter-frame communication transformers. NeurIPS, 34, 13352–13363.
Joseph, K., Khan, S., Khan, F.S., & Balasubramanian, V.N. (2021). Towards open world object detection. In CVPR.
Ke, L., Li, X., Danelljan, M., Tai, Y. W., Tang, C.K., & Yu, F. (2021). Prototypical cross-attention networks for multiple object tracking and segmentation. In NeurIPS.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2), 83–97.
Saito, K., Hu, P., Darrell, T., & Saenko, K. (2022). Learning to detect every thing in an open world. ECCV.
Li, X., Ding, H., Yuan, H., Zhang, W., Pang, J., Cheng, G., Chen, K., Liu, Z., & Loy, C.C. (2023). Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.
Li, X., Yuan, H., Zhang, W., Cheng, G., Pang, J., & Loy, C.C. (2023). Tube-link: A flexible cross tube baseline for universal video segmentation. arXiv preprint arXiv:2303.12782.
Lin, T., Goyal, P., Girshick, R.B., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In ICCV.
Lin, C., Hung, Y., Feris, R., & He, L. (2020). Video instance segmentation tracking with a modified vae architecture. In CVPR.
Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In ECCV.
Liu, D., Cui, Y., Tan, W., & Chen, Y. (2021). Sg-net: Spatial granularity network for one-stage video instance segmentation. In CVPR.
Liu, Y., Zulfikar, I.E., Luiten, J., Dave, A., Ramanan, D., Leibe, B., Ošep, A., & Leal-Taixé, L. (2021). Opening up open-world tracking. In CVPR.
Naseer, M., Ranasinghe, K., Khan, S., Khan, F.S., & Porikli, F. (2021). On improving adversarial transferability of vision transformers. arXiv preprint arXiv:2106.04169.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Ranasinghe, K., Naseer, M., Khan, S., Khan, F.S., & Ryoo, M.S. (2022). Self-supervised video transformer. In CVPR, pp. 2874–2884.
Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., & Khan, F.S. (2023). Fine-tuned clip models are efficient video learners. In CVPR, pp. 6545–6554.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., & Li, F. (2015). ImageNet large scale visual recognition challenge. IJCV.
Thawakar, O., Anwer, R.M., Laaksonen, J., Reiner, O., Shah, M., & Khan, F.S. (2023). 3d mitochondria instance segmentation with spatio-temporal transformers. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 613–623. Springer.
Thawakar, O., Narayan, S., Cao, J., Cholakkal, H., Anwer, R.M., Khan, M.H., Khan, S., Felsberg, M., & Khan, F.S. (2022). Video instance segmentation via multi-scale spatio-temporal split attention transformer. In ECCV, pp. 666–681. Springer.
Wang, W., Feiszli, M., Wang, H., & Tran, D. (2021). Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, pp. 10776–10785.
Wang, W., Feiszli, M., Wang, H., Malik, J., & Tran, D. (2022). Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In CVPR, pp. 4422–4432.
Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., & Xia, H. (2021). End-to-end video instance segmentation with transformers. CVPR.
Wu, J., Jiang, Y., Zhang, W., Bai, X., & Bai, S. (2022). Seqformer: a frustratingly simple model for video instance segmentation. ECCV.
Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., & Jiang, X., et al. (2024). Towards open vocabulary learning: A survey. TPAMI.
Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., & Bai, X. (2022). In defense of online models for video instance segmentation. In ECCV, pp. 588–605. Springer.
Xu, N., Yang, L., Yang, J., Yue, D., Fan, Y., Liang, Y., & Huang, T.S. (2021). YouTube-VIS Dataset 2021 Version. https://youtube-vos.org/dataset/vis.
Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In ICCV.
Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., & Liu, W. (2021). Crossover learning for fast online video instance segmentation. In ICCV.
Zhang, T., Tian, X., Wu, Y., Ji, S., Wang, X., Zhang, Y., & Wan, P. (2023). Dvis: Decoupled video instance segmentation framework. In ICCV, pp. 1282–1291.
Zhou, Q., Li, X., He, L., Yang, Y., Cheng, G., Tong, Y., Ma, L., & Tao, D. (2022). Transvod: end-to-end video object detection with spatial-temporal transformers. TPAMI.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In ICLR.
Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Alvis partially funded by the Swedish Research Council through grant agreement no. 2022-06725, the LUMI supercomputer hosted by CSC (Finland) and the LUMI consortium, and by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.
Additional information
Communicated by Hong Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Thawakar, O., Narayan, S., Cholakkal, H. et al. Video Instance Segmentation in an Open-World. Int J Comput Vis 133, 398–409 (2025). https://doi.org/10.1007/s11263-024-02195-4