
Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching

  • Published in: International Journal of Computer Vision

Abstract

Referring Video Object Segmentation (RVOS) aims to segment specific objects in videos based on provided natural language descriptions. As a newly emerged supervised visual learning task, RVOS requires a substantial amount of annotated data for each given scene, yet in realistic scenarios only minimal annotations are usually available for new scenes. A second practical problem is that multiple objects of the same category, rather than a single object, often coexist in the same scene. Both issues can significantly degrade the performance of existing RVOS methods in real-world applications. In this paper, we propose a simple yet effective model that addresses these issues with a newly designed cross-modal affinity (CMA) module built on a Transformer architecture. The CMA module establishes multi-modal affinity from a limited number of samples, allowing the rapid acquisition of new semantic information while fostering the model’s adaptability to diverse scenarios. Furthermore, we extend our few-shot RVOS (FS-RVOS) approach to multiple objects through a new instance sequence matching module on top of CMA, which retains every object trajectory whose similarity to the language features exceeds a matching threshold, thereby achieving few-shot referring video multi-object segmentation (FS-RVMOS). To foster research in this field, we construct a new benchmark from currently available datasets that covers a wide range of single-object and multi-object scenarios and thus effectively simulates real-world scenes. Extensive experiments and comparative analyses underscore the strong performance of the proposed FS-RVOS and FS-RVMOS methods: they consistently outperform existing related approaches in practical performance evaluations and robustness studies, achieving the best results on all metrics across diverse benchmarks.
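To make the two mechanisms named in the abstract concrete, the following is a minimal sketch, assuming PyTorch; it is not the authors' implementation, and every name in it (`CrossModalAffinity`, `match_instance_sequences`, the feature shapes, the default threshold) is an illustrative assumption. The first block fuses visual and language features with multi-head cross-attention, in the spirit of the cross-modal affinity module; the second keeps every candidate trajectory whose cosine similarity to a pooled sentence embedding exceeds a matching threshold, in the spirit of instance sequence matching.

```python
# Minimal sketch (not the authors' code): names and shapes are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAffinity(nn.Module):
    """Toy cross-modal affinity block: flattened frame features attend
    to word-level language features via multi-head cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, H*W, C) flattened visual tokens; txt: (B, L, C) word features.
        fused, _ = self.attn(query=vis, key=txt, value=txt)
        return self.norm(vis + fused)  # residual fusion of the two modalities


def match_instance_sequences(traj_emb: torch.Tensor,
                             sent_emb: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Toy instance sequence matching: return indices of all object
    trajectories whose similarity to the sentence exceeds the threshold,
    so that several referred objects can be kept at once (multi-object)."""
    # traj_emb: (N, C), one pooled embedding per candidate trajectory;
    # sent_emb: (C,), pooled sentence-level language embedding.
    sim = F.cosine_similarity(traj_emb, sent_emb.unsqueeze(0), dim=-1)  # (N,)
    return torch.nonzero(sim > threshold).squeeze(-1)


if __name__ == "__main__":
    cma = CrossModalAffinity()
    vis = torch.randn(2, 16 * 16, 256)   # 2 clips, 16x16 feature map, C=256
    txt = torch.randn(2, 10, 256)        # 10-word referring expressions
    print(cma(vis, txt).shape)           # torch.Size([2, 256, 256])
    kept = match_instance_sequences(torch.randn(5, 256), torch.randn(256))
    print(kept)                          # indices of trajectories above 0.5
```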

Data Availability

All the benchmark datasets supporting the findings of this study are available from the corresponding author on reasonable request. Additionally, the Mini-Ref-YouTube-VOS and Mini-Ref-SAIL-VOS datasets are publicly available at https://github.com/hengliusky/Few_shot_RVOS.

Acknowledgements

This work is partly supported by the National Natural Science Foundation of China under Grant Nos. 61971004, U21A20470, 62172136, 62122035, 62206006, and the Young and Middle-Aged Academic Leaders Cultivation Program of Anhui Province (No. DT2023014).

Author information

Corresponding author

Correspondence to Yang Wang.

Additional information

Communicated by Bryan Allen Plummer.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 5303 KB)

Supplementary file 2 (pdf 7615 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, H., Li, G., Gao, M. et al. Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching. Int J Comput Vis 133, 5610–5628 (2025). https://doi.org/10.1007/s11263-025-02444-0
