Abstract
Referring Video Object Segmentation (RVOS) aims to segment specific objects in videos based on provided natural language descriptions. As a supervised visual learning task, RVOS requires a substantial amount of annotated data for each new scene, yet in realistic scenarios only minimal annotations are usually available. Another practical problem is that, rather than a single object, multiple objects of the same category often coexist in the same scene. Both issues can significantly degrade the performance of existing RVOS methods in real-world applications. In this paper, we propose a simple yet effective model that addresses these issues by incorporating a newly designed cross-modal affinity (CMA) module built on a Transformer architecture. The CMA module establishes multi-modal affinity from only a limited number of samples, allowing the model to rapidly acquire new semantic information and adapt to diverse scenarios, thereby enabling few-shot RVOS (FS-RVOS). Furthermore, we extend FS-RVOS to multiple objects through a new instance sequence matching module on top of CMA, which retains all object trajectories whose similarity to the language features exceeds a matching threshold, thereby achieving few-shot referring multi-object segmentation (FS-RVMOS). To foster research in this field, we establish a new benchmark from currently available datasets that covers a wide range of single-object and multi-object scenarios, effectively simulating real-world scenes. Extensive experiments and comparative analyses demonstrate the strong performance of the proposed FS-RVOS and FS-RVMOS methods: our approach consistently outperforms existing related methods in performance and robustness evaluations, achieving the best results across diverse benchmarks.
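To make the two mechanisms above concrete, the following is a minimal, hedged PyTorch sketch: a cross-modal affinity block realized as standard multi-head cross-attention between visual and language features, and an instance sequence matching step that keeps trajectories whose similarity to the sentence embedding exceeds a threshold. The class and function names (CrossModalAffinity, match_instance_sequences), the dimensions, and the cosine-similarity criterion are illustrative assumptions, not the authors' released implementation (see the repository linked under Data Availability).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAffinity(nn.Module):
    """Illustrative cross-modal affinity block: visual tokens attend to
    language tokens via standard multi-head cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, HW, C) flattened frame features; text: (B, L, C) word features.
        fused, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + fused)  # residual fusion of the two modalities


def match_instance_sequences(traj_emb: torch.Tensor,
                             sent_emb: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Illustrative instance sequence matching: return indices of object
    trajectories whose cosine similarity to the sentence embedding exceeds
    the matching threshold (several may be kept in the multi-object case)."""
    # traj_emb: (N, C), one embedding per candidate trajectory; sent_emb: (C,).
    sim = F.cosine_similarity(traj_emb, sent_emb.unsqueeze(0), dim=-1)  # (N,)
    return torch.nonzero(sim > threshold, as_tuple=True)[0]


if __name__ == "__main__":
    cma = CrossModalAffinity()
    visual = torch.randn(2, 64, 256)  # 2 frames, 8x8 feature map, 256-d tokens
    text = torch.randn(2, 10, 256)    # 10 word tokens per referring expression
    print(cma(visual, text).shape)    # torch.Size([2, 64, 256])
    kept = match_instance_sequences(torch.randn(5, 256), torch.randn(256))
    print(kept)                       # indices of trajectories kept as matches
```

In the multi-object setting described above, each retained trajectory would correspond to one segmented instance, with the matching threshold trading recall against precision.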
Data Availability
All benchmark datasets supporting the findings of this study are available from the corresponding author on reasonable request. Additionally, the Mini-Ref-YouTube-VOS and Mini-Ref-SAIL-VOS datasets are publicly available at https://github.com/hengliusky/Few_shot_RVOS.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China under Grant Nos. 61971004, U21A20470, 62172136, 62122035, 62206006, and the Young and Middle-Aged Academic Leaders Cultivation Program of Anhui Province (No. DT2023014).
Additional information
Communicated by Bryan Allen Plummer.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Electronic supplementary material is available in the online version of this article.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, H., Li, G., Gao, M. et al. Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching. Int J Comput Vis 133, 5610–5628 (2025). https://doi.org/10.1007/s11263-025-02444-0