Abstract
Referring Video Object Segmentation (RVOS) aims to segment specific objects in videos based on provided natural language descriptions. As a supervised visual learning task, RVOS requires a substantial amount of annotated data for each new scene, yet in realistic scenarios only minimal annotations are usually available. Another practical problem is that, rather than a single object, multiple objects of the same category often coexist in the same scene. Both issues can significantly degrade the performance of existing RVOS methods in real-world applications. In this paper, we propose a simple yet effective model that addresses these issues by incorporating a newly designed cross-modal affinity (CMA) module built on a Transformer architecture. The CMA module establishes multi-modal affinity from only a limited number of samples, allowing the model to rapidly acquire new semantic information and adapt to diverse scenarios, thereby enabling few-shot RVOS (FS-RVOS). Furthermore, we extend FS-RVOS to multiple objects through a new instance sequence matching module on top of CMA, which retains all object trajectories whose similarity to the language features exceeds a matching threshold, thereby achieving few-shot referring multi-object segmentation (FS-RVMOS). To foster research in this field, we establish a new benchmark from currently available datasets that covers a wide range of single-object and multi-object scenarios, effectively simulating real-world scenes. Extensive experiments and comparative analyses demonstrate the strong performance of the proposed FS-RVOS and FS-RVMOS methods: our approach consistently outperforms existing related methods in performance and robustness evaluations, achieving the best results across diverse benchmarks.
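To make the two mechanisms above concrete, the following is a minimal, hedged PyTorch sketch: a cross-modal affinity block realized as standard multi-head cross-attention between visual and language features, and an instance sequence matching step that keeps trajectories whose similarity to the sentence embedding exceeds a threshold. The class and function names (CrossModalAffinity, match_instance_sequences), the dimensions, and the cosine-similarity criterion are illustrative assumptions, not the authors' released implementation (see the repository linked under Data Availability).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAffinity(nn.Module):
    """Illustrative cross-modal affinity block: visual tokens attend to
    language tokens via standard multi-head cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, HW, C) flattened frame features; text: (B, L, C) word features.
        fused, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + fused)  # residual fusion of the two modalities


def match_instance_sequences(traj_emb: torch.Tensor,
                             sent_emb: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Illustrative instance sequence matching: return indices of object
    trajectories whose cosine similarity to the sentence embedding exceeds
    the matching threshold (several may be kept in the multi-object case)."""
    # traj_emb: (N, C), one embedding per candidate trajectory; sent_emb: (C,).
    sim = F.cosine_similarity(traj_emb, sent_emb.unsqueeze(0), dim=-1)  # (N,)
    return torch.nonzero(sim > threshold, as_tuple=True)[0]


if __name__ == "__main__":
    cma = CrossModalAffinity()
    visual = torch.randn(2, 64, 256)  # 2 frames, 8x8 feature map, 256-d tokens
    text = torch.randn(2, 10, 256)    # 10 word tokens per referring expression
    print(cma(visual, text).shape)    # torch.Size([2, 64, 256])
    kept = match_instance_sequences(torch.randn(5, 256), torch.randn(256))
    print(kept)                       # indices of trajectories kept as matches
```

In the multi-object setting described above, each retained trajectory would correspond to one segmented instance, with the matching threshold trading recall against precision.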
Data Availability
All benchmark datasets supporting the findings of this study are available from the corresponding author on reasonable request. Additionally, the Mini-Ref-YouTube-VOS and Mini-Ref-SAIL-VOS datasets are publicly available at https://github.com/hengliusky/Few_shot_RVOS.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China under Grant Nos. 61971004, U21A20470, 62172136, 62122035, 62206006, and the Young and Middle-Aged Academic Leaders Cultivation Program of Anhui Province (No. DT2023014).
Additional information
Communicated by Bryan Allen Plummer.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Electronic supplementary material is available in the online version of this article.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, H., Li, G., Gao, M. et al. Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching. Int J Comput Vis 133, 5610–5628 (2025). https://doi.org/10.1007/s11263-025-02444-0