Abstract
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information at different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning (\(\mathrm R^2\)-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight R\(^2\) Block containing only \(1.5\%\) of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, the R\(^2\) Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioned on the given query, resulting in a coarse-to-fine scheme. \(\mathrm R^2\)-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
Y. Liu—Work done at Harvard University.
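To make the coarse-to-fine recursion described in the abstract concrete, below is a minimal PyTorch sketch of the idea: a lightweight side block that pools the spatial tokens of each frozen CLIP layer into clip-level features and refines their temporal correlation conditioned on the text query, applied recurrently from the last layer back to earlier ones. This is an illustrative reading of the abstract only, not the authors' implementation (see the linked repository for that); the module name R2Block, the gating scheme, and all tensor shapes are assumptions.

```python
# Conceptual sketch of the reversed recurrent scheme described above -- an
# illustration only, not the authors' implementation. Names and shapes are
# assumptions made for this example.
import torch
import torch.nn as nn


class R2Block(nn.Module):  # hypothetical module name
    """Lightweight side block: pools spatial tokens from one frozen CLIP layer
    and refines clip-level temporal correlation conditioned on the text query."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_gate = nn.Linear(dim, dim)

    def forward(self, layer_tokens, query_emb, state):
        # layer_tokens: (T, N, C) spatial tokens of T frames from one CLIP layer
        # query_emb:    (1, C)    pooled text query embedding
        # state:        (T, C)    recurrent clip-level state carried from later layers
        # Pool the N spatial tokens of each frame, queried by the current state.
        pooled, _ = self.spatial_pool(state.unsqueeze(1), layer_tokens, layer_tokens)
        pooled = pooled.squeeze(1)                                  # (T, C)
        # Condition on the language query with a simple multiplicative gate.
        gated = pooled * torch.sigmoid(self.query_gate(query_emb))  # (T, C)
        # Refine temporal correlation across the T clips, then update the state.
        refined, _ = self.temporal_attn(gated.unsqueeze(0), gated.unsqueeze(0), gated.unsqueeze(0))
        return state + refined.squeeze(0)                           # residual recurrent update


def reversed_recurrent_tuning(clip_layer_feats, query_emb, block):
    """Apply the block from the last CLIP layer back to earlier ones (coarse to fine)."""
    state = clip_layer_feats[-1].mean(dim=1)            # (T, C) init from the last layer
    for layer_tokens in reversed(clip_layer_feats):     # last layer -> earlier layers
        state = block(layer_tokens, query_emb, state)
    return state                                        # clip-level features for the VTG heads


# Usage with dummy tensors: 12 CLIP layers, 32 frames, 50 tokens per frame, width 512.
feats = [torch.randn(32, 50, 512) for _ in range(12)]
out = reversed_recurrent_tuning(feats, torch.randn(1, 512), R2Block(512))
print(out.shape)  # torch.Size([32, 512])
```

Because only such a small side block would be trained while the CLIP backbone stays frozen, both the parameter and memory footprint of adaptation remain small, which is the efficiency argument the abstract makes.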
References
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV, pp. 5803–5812 (2017)
Apostolidis, E., Adamantidou, E., Metsai, A.I., Mezaris, V., Patras, I.: Video summarization using deep neural networks: a survey. Proc. IEEE 109(11), 1838–1863 (2021)
Badamdorj, T., Rochan, M., Wang, Y., Cheng, L.: Joint visual and audio learning for video highlight detection. In: ICCV, pp. 8127–8137 (2021)
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: A CLIP-hitchhiker's guide to long video retrieval. Tech. Rep. arXiv:2205.08508 (2022)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
Dosovitskiy, A., et al.: An image is worth 16\(\,\times \,\)16 words: transformers for image recognition at scale. In: ICLR (2021)
Escorcia, V., Soldan, M., Sivic, J., Ghanem, B., Russell, B.: Temporal localization of moments in video collections with natural language. Tech. Rep. arXiv:1907.12763 (2019)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: ICCV, pp. 6202–6211 (2019)
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: ICCV, pp. 5267–5275 (2017)
Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: CVPR, pp. 18995–19012 (2022)
Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: ECCV, pp. 505–520 (2014)
Gygli, M., Song, Y., Cao, L.: Video2GIF: automatic generation of animated gifs from video. In: CVPR, pp. 1001–1009 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Hong, F.T., Huang, X., Li, W.H., Zheng, W.S.: MINI-Net: multiple instance ranking network for video highlight detection. In: ECCV, pp. 345–360 (2020)
Huang, S., et al.: VoP: text-video co-operative prompt tuning for cross-modal retrieval. In: CVPR, pp. 6565–6574 (2023)
Jang, J., Park, J., Kim, J., Kwon, H., Sohn, K.: Knowing where to focus: event-aware transformer for video grounding. In: ICCV, pp. 13846–13856 (2023)
Jia, M., et al.: Visual prompt tuning. In: ECCV, pp. 709–727 (2022)
Jiang, R., Liu, L., Chen, C.: CLIP-Count: towards text-guided zero-shot object counting. In: ACM MM (2023)
Ju, C., Han, T., Zheng, K., Zhang, Y., Xie, W.: Prompting visual-language models for efficient video understanding. In: ECCV, pp. 105–124 (2022)
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2880–2894 (2020)
Larsson, G., Maire, M., Shakhnarovich, G.: FractalNet: ultra-deep neural networks without residuals. Tech. Rep. arXiv:1605.07648 (2016)
Lei, J., Berg, T.L., Bansal, M.: QVHighlights: detecting moments and highlights in videos via natural language queries. In: NeurIPS (2021)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVR: a large-scale dataset for video-subtitle moment retrieval. In: ECCV, pp. 447–463 (2020)
Li, P., et al.: MomentDiff: generative video moment retrieval from random to real. Tech. Rep. arXiv:2307.02869 (2023)
Li, S., et al.: Probing visual-audio representation for video highlight detection via hard-pairs guided contrastive learning. Tech. Rep. arXiv:2206.10157 (2022)
Lin, K.Q., et al.: UniVTG: towards unified video-language temporal grounding. In: ICCV, pp. 2794–2804 (2023)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV, pp. 2980–2988 (2017)
Lin, Z., et al.: Frozen CLIP models are efficient video learners. In: ECCV, pp. 388–404 (2022)
Liu, L., Yu, B.X., Chang, J., Tian, Q., Chen, C.W.: Prompt-matched semantic segmentation. Tech. Rep. arXiv:2208.10159 (2022)
Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multi-task deep visual-semantic embedding for video thumbnail selection. In: CVPR, pp. 3707–3715 (2015)
Liu, Y., Li, S., Wu, Y., Chen, C.W., Shan, Y., Qie, X.: UMT: unified multi-modal transformers for joint video moment retrieval and highlight detection. In: CVPR, pp. 3042–3051 (2022)
Luo, H., et al.: CLIP4Clip: an empirical study of CLIP for end-to-end video clip retrieval. Neurocomputing 508, 293–304 (2022)
Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised video summarization with adversarial LSTM networks. In: CVPR, pp. 202–211 (2017)
Moon, W., Hyun, S., Lee, S., Heo, J.P.: Correlation-guided query-dependency calibration in video representation learning for temporal grounding. Tech. Rep. arXiv:2311.08835 (2023)
Moon, W., Hyun, S., Park, S., Park, D., Heo, J.P.: Query-dependent video representation for moment retrieval and highlight detection. In: CVPR, pp. 23023–23033 (2023)
Nan, G., et al.: Interventional video grounding with dual contrastive learning. In: CVPR, pp. 2765–2775 (2021)
Ni, B., et al.: Expanding language-image pretrained models for general video recognition. In: ECCV, pp. 1–18 (2022)
Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. Tech. Rep. arXiv:1807.03748 (2018)
Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: ST-Adapter: parameter-efficient image-to-video transfer learning. In: NeurIPS, pp. 26462–26477 (2022)
Qing, Z., et al.: Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In: CVPR, pp. 13934–13944 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Fine-tuned CLIP models are efficient video learners. In: CVPR, pp. 6545–6554 (2023)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. 1, 25–36 (2013)
Song, Y., Redi, M., Vallmitjana, J., Jaimes, A.: To click or not to click: automatic selection of beautiful thumbnails from videos. In: CIKM, pp. 659–668 (2016)
Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: TVSum: summarizing web videos using titles. In: CVPR, pp. 5179–5187 (2015)
Sun, H., Zhou, M., Chen, W., Xie, W.: TR-DETR: task-reciprocal transformer for joint moment retrieval and highlight detection. In: AAAI (2024)
Sun, M., Farhadi, A., Seitz, S.: Ranking domain-specific highlights by analyzing edited videos. In: ECCV, pp. 787–802 (2014)
Sung, Y.L., Cho, J., Bansal, M.: LST: ladder side-tuning for parameter and memory efficient transfer learning. In: NeurIPS, pp. 12991–13005 (2022)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. Tech. Rep. arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
Wang, L., Liu, D., Puri, R., Metaxas, D.N.: Learning trailer moments in full-length movies. In: ECCV, pp. 300–316 (2020)
Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: a renaissance of metric learning for temporal grounding. In: AAAI, pp. 2613–2623 (2022)
Wei, F., Wang, B., Ge, T., Jiang, Y., Li, W., Duan, L.: Learning pixel-level distinctions for video highlight detection. In: CVPR, pp. 3073–3082 (2022)
Xiong, B., Kalantidis, Y., Ghadiyaram, D., Grauman, K.: Less is more: learning highlight detection from video duration. In: CVPR, pp. 1258–1267 (2019)
Xu, M., Wang, H., Ni, B., Zhu, R., Sun, Z., Wang, C.: Cross-category video highlight detection via set-based learning. In: ICCV, pp. 7970–7979 (2021)
Xu, Y., Sun, Y., Li, Y., Shi, Y., Zhu, X., Du, S.: MH-DETR: video moment and highlight detection with cross-modal transformer. Tech. Rep. arXiv:2305.00355 (2023)
Yan, S., et al.: UnLoc: a unified framework for video localization tasks. In: ICCV, pp. 13623–13633 (2023)
Yang, H., Wang, B., Lin, S., Wipf, D., Guo, M., Guo, B.: Unsupervised extraction of video highlights via robust recurrent auto-encoders. In: ICCV, pp. 4633–4641 (2015)
Ye, Q., Shen, X., Gao, Y., Wang, Z., Bi, Q., Li, P., Yang, G.: Temporal cue guided video highlight detection with low-rank audio-visual fusion. In: ICCV, pp. 7950–7959 (2021)
Yuan, T., Zhang, X., Liu, K., Liu, B., Jin, J., Jiao, Z.: UCF-crime annotation: a benchmark for surveillance video-and-language understanding. Tech. Rep. arXiv:2309.13925 (2023)
Yuan, Y., Ma, L., Wang, J., Liu, W., Zhu, W.: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In: NeurIPS (2019)
Zhang, H., Sun, A., Jing, W., Zhou, J.T.: Span-based localizing network for natural language video localization. Tech. Rep. arXiv:2004.13931 (2020)
Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: ECCV, pp. 766–782 (2016)
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: AAAI, pp. 12870–12877 (2020)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR, pp. 16816–16825 (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. 130(9), 2337–2348 (2022)
Acknowledgments
This work was supported in part by Hong Kong Research Grants Council GRF-15229423, US NIH grant R01HD104969, and NSF award IIS-2239688.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y. et al. (2025). \(\mathrm R^2\)-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_24
DOI: https://doi.org/10.1007/978-3-031-72940-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72939-3
Online ISBN: 978-3-031-72940-9
eBook Packages: Computer Science (R0)