Abstract
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information at different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning (\(\mathrm R^2\)-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight R\(^2\) Block containing only \(1.5\%\) of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, the R\(^2\) Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioned on the given query, resulting in a coarse-to-fine scheme. \(\mathrm R^2\)-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
Y. Liu—Work done at Harvard University.
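To make the coarse-to-fine recursion described in the abstract concrete, below is a minimal PyTorch sketch of the idea: a lightweight side block that pools the spatial tokens of each frozen CLIP layer into clip-level features and refines their temporal correlation conditioned on the text query, applied recurrently from the last layer back to earlier ones. This is an illustrative reading of the abstract only, not the authors' implementation (see the linked repository for that); the module name R2Block, the gating scheme, and all tensor shapes are assumptions.

```python
# Conceptual sketch of the reversed recurrent scheme described above -- an
# illustration only, not the authors' implementation. Names and shapes are
# assumptions made for this example.
import torch
import torch.nn as nn


class R2Block(nn.Module):  # hypothetical module name
    """Lightweight side block: pools spatial tokens from one frozen CLIP layer
    and refines clip-level temporal correlation conditioned on the text query."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_gate = nn.Linear(dim, dim)

    def forward(self, layer_tokens, query_emb, state):
        # layer_tokens: (T, N, C) spatial tokens of T frames from one CLIP layer
        # query_emb:    (1, C)    pooled text query embedding
        # state:        (T, C)    recurrent clip-level state carried from later layers
        # Pool the N spatial tokens of each frame, queried by the current state.
        pooled, _ = self.spatial_pool(state.unsqueeze(1), layer_tokens, layer_tokens)
        pooled = pooled.squeeze(1)                                  # (T, C)
        # Condition on the language query with a simple multiplicative gate.
        gated = pooled * torch.sigmoid(self.query_gate(query_emb))  # (T, C)
        # Refine temporal correlation across the T clips, then update the state.
        refined, _ = self.temporal_attn(gated.unsqueeze(0), gated.unsqueeze(0), gated.unsqueeze(0))
        return state + refined.squeeze(0)                           # residual recurrent update


def reversed_recurrent_tuning(clip_layer_feats, query_emb, block):
    """Apply the block from the last CLIP layer back to earlier ones (coarse to fine)."""
    state = clip_layer_feats[-1].mean(dim=1)            # (T, C) init from the last layer
    for layer_tokens in reversed(clip_layer_feats):     # last layer -> earlier layers
        state = block(layer_tokens, query_emb, state)
    return state                                        # clip-level features for the VTG heads


# Usage with dummy tensors: 12 CLIP layers, 32 frames, 50 tokens per frame, width 512.
feats = [torch.randn(32, 50, 512) for _ in range(12)]
out = reversed_recurrent_tuning(feats, torch.randn(1, 512), R2Block(512))
print(out.shape)  # torch.Size([32, 512])
```

Because only such a small side block would be trained while the CLIP backbone stays frozen, both the parameter and memory footprint of adaptation remain small, which is the efficiency argument the abstract makes.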
References
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV, pp. 5803–5812 (2017)
Apostolidis, E., Adamantidou, E., Metsai, A.I., Mezaris, V., Patras, I.: Video summarization using deep neural networks: a survey. Proc. IEEE 109(11), 1838–1863 (2021)
Badamdorj, T., Rochan, M., Wang, Y., Cheng, L.: Joint visual and audio learning for video highlight detection. In: ICCV, pp. 8127–8137 (2021)
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: A CLIP-hitchhiker's guide to long video retrieval. Tech. Rep. arXiv:2205.08508 (2022)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
Dosovitskiy, A., et al.: An image is worth 16\(\,\times \,\)16 words: transformers for image recognition at scale. In: ICLR (2021)
Escorcia, V., Soldan, M., Sivic, J., Ghanem, B., Russell, B.: Temporal localization of moments in video collections with natural language. Tech. Rep. arXiv:1907.12763 (2019)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: ICCV, pp. 6202–6211 (2019)
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: ICCV, pp. 5267–5275 (2017)
Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: CVPR, pp. 18995–19012 (2022)
Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: ECCV, pp. 505–520 (2014)
Gygli, M., Song, Y., Cao, L.: Video2GIF: automatic generation of animated gifs from video. In: CVPR, pp. 1001–1009 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Hong, F.T., Huang, X., Li, W.H., Zheng, W.S.: MINI-Net: multiple instance ranking network for video highlight detection. In: ECCV, pp. 345–360 (2020)
Huang, S., et al.: VoP: text-video co-operative prompt tuning for cross-modal retrieval. In: CVPR, pp. 6565–6574 (2023)
Jang, J., Park, J., Kim, J., Kwon, H., Sohn, K.: Knowing where to focus: event-aware transformer for video grounding. In: ICCV, pp. 13846–13856 (2023)
Jia, M., et al.: Visual prompt tuning. In: ECCV, pp. 709–727 (2022)
Jiang, R., Liu, L., Chen, C.: CLIP-Count: towards text-guided zero-shot object counting. In: ACM MM (2023)
Ju, C., Han, T., Zheng, K., Zhang, Y., Xie, W.: Prompting visual-language models for efficient video understanding. In: ECCV, pp. 105–124 (2022)
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2880–2894 (2020)
Larsson, G., Maire, M., Shakhnarovich, G.: FractalNet: ultra-deep neural networks without residuals. Tech. Rep. arXiv:1605.07648 (2016)
Lei, J., Berg, T.L., Bansal, M.: QVHighlights: detecting moments and highlights in videos via natural language queries. In: NeurIPS (2021)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVR: a large-scale dataset for video-subtitle moment retrieval. In: ECCV, pp. 447–463 (2020)
Li, P., et al.: MomentDiff: generative video moment retrieval from random to real. Tech. Rep. arXiv:2307.02869 (2023)
Li, S., et al.: Probing visual-audio representation for video highlight detection via hard-pairs guided contrastive learning. Tech. Rep. arXiv:2206.10157 (2022)
Lin, K.Q., et al.: UniVTG: towards unified video-language temporal grounding. In: ICCV, pp. 2794–2804 (2023)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV, pp. 2980–2988 (2017)
Lin, Z., et al.: Frozen CLIP models are efficient video learners. In: ECCV, pp. 388–404 (2022)
Liu, L., Yu, B.X., Chang, J., Tian, Q., Chen, C.W.: Prompt-matched semantic segmentation. Tech. Rep. arXiv:2208.10159 (2022)
Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multi-task deep visual-semantic embedding for video thumbnail selection. In: CVPR, pp. 3707–3715 (2015)
Liu, Y., Li, S., Wu, Y., Chen, C.W., Shan, Y., Qie, X.: UMT: unified multi-modal transformers for joint video moment retrieval and highlight detection. In: CVPR, pp. 3042–3051 (2022)
Luo, H., et al.: CLIP4Clip: an empirical study of CLIP for end-to-end video clip retrieval. Neurocomputing 508, 293–304 (2022)
Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised video summarization with adversarial LSTM networks. In: CVPR, pp. 202–211 (2017)
Moon, W., Hyun, S., Lee, S., Heo, J.P.: Correlation-guided query-dependency calibration in video representation learning for temporal grounding. Tech. Rep. arXiv:2311.08835 (2023)
Moon, W., Hyun, S., Park, S., Park, D., Heo, J.P.: Query-dependent video representation for moment retrieval and highlight detection. In: CVPR, pp. 23023–23033 (2023)
Nan, G., et al.: Interventional video grounding with dual contrastive learning. In: CVPR, pp. 2765–2775 (2021)
Ni, B., et al.: Expanding language-image pretrained models for general video recognition. In: ECCV, pp. 1–18 (2022)
Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. Tech. Rep. arXiv:1807.03748 (2018)
Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: ST-Adapter: parameter-efficient image-to-video transfer learning. In: NeurIPS, pp. 26462–26477 (2022)
Qing, Z., et al.: Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In: CVPR, pp. 13934–13944 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Fine-tuned CLIP models are efficient video learners. In: CVPR, pp. 6545–6554 (2023)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. 1, 25–36 (2013)
Song, Y., Redi, M., Vallmitjana, J., Jaimes, A.: To click or not to click: automatic selection of beautiful thumbnails from videos. In: CIKM, pp. 659–668 (2016)
Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: TVSum: summarizing web videos using titles. In: CVPR, pp. 5179–5187 (2015)
Sun, H., Zhou, M., Chen, W., Xie, W.: TR-DETR: task-reciprocal transformer for joint moment retrieval and highlight detection. In: AAAI (2024)
Sun, M., Farhadi, A., Seitz, S.: Ranking domain-specific highlights by analyzing edited videos. In: ECCV, pp. 787–802 (2014)
Sung, Y.L., Cho, J., Bansal, M.: LST: ladder side-tuning for parameter and memory efficient transfer learning. In: NeurIPS, pp. 12991–13005 (2022)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. Tech. Rep. arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
Wang, L., Liu, D., Puri, R., Metaxas, D.N.: Learning trailer moments in full-length movies. In: ECCV, pp. 300–316 (2020)
Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: a renaissance of metric learning for temporal grounding. In: AAAI, pp. 2613–2623 (2022)
Wei, F., Wang, B., Ge, T., Jiang, Y., Li, W., Duan, L.: Learning pixel-level distinctions for video highlight detection. In: CVPR, pp. 3073–3082 (2022)
Xiong, B., Kalantidis, Y., Ghadiyaram, D., Grauman, K.: Less is more: learning highlight detection from video duration. In: CVPR, pp. 1258–1267 (2019)
Xu, M., Wang, H., Ni, B., Zhu, R., Sun, Z., Wang, C.: Cross-category video highlight detection via set-based learning. In: ICCV, pp. 7970–7979 (2021)
Xu, Y., Sun, Y., Li, Y., Shi, Y., Zhu, X., Du, S.: MH-DETR: video moment and highlight detection with cross-modal transformer. Tech. Rep. arXiv:2305.00355 (2023)
Yan, S., et al.: UnLoc: a unified framework for video localization tasks. In: ICCV, pp. 13623–13633 (2023)
Yang, H., Wang, B., Lin, S., Wipf, D., Guo, M., Guo, B.: Unsupervised extraction of video highlights via robust recurrent auto-encoders. In: ICCV, pp. 4633–4641 (2015)
Ye, Q., Shen, X., Gao, Y., Wang, Z., Bi, Q., Li, P., Yang, G.: Temporal cue guided video highlight detection with low-rank audio-visual fusion. In: ICCV, pp. 7950–7959 (2021)
Yuan, T., Zhang, X., Liu, K., Liu, B., Jin, J., Jiao, Z.: UCF-crime annotation: a benchmark for surveillance video-and-language understanding. Tech. Rep. arXiv:2309.13925 (2023)
Yuan, Y., Ma, L., Wang, J., Liu, W., Zhu, W.: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In: NeurIPS (2019)
Zhang, H., Sun, A., Jing, W., Zhou, J.T.: Span-based localizing network for natural language video localization. Tech. Rep. arXiv:2004.13931 (2020)
Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: ECCV, pp. 766–782 (2016)
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: AAAI, pp. 12870–12877 (2020)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR, pp. 16816–16825 (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. 130(9), 2337–2348 (2022)
Acknowledgments
This work was supported in part by Hong Kong Research Grants Council GRF-15229423, US NIH grant R01HD104969, and NSF award IIS-2239688.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y. et al. (2025). \(\mathrm R^2\)-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_24
DOI: https://doi.org/10.1007/978-3-031-72940-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72939-3
Online ISBN: 978-3-031-72940-9
eBook Packages: Computer Science (R0)