
\(\mathrm R^2\)-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning (\(\mathrm R^2\)-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight \(\mathrm R^2\) Block containing only \(1.5\%\) of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, the \(\mathrm R^2\) Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioned on the given query, resulting in a coarse-to-fine scheme. \(\mathrm R^2\)-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.

Y. Liu—Work done at Harvard University.
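
As a concrete illustration of the reversed recurrent scheme described in the abstract, the sketch below shows, in plain PyTorch, how a single lightweight block could start from the last CLIP layer and work backwards, folding in the spatial features of each earlier layer, conditioning them on the language query, and refining temporal correlation at every step. Everything here is an illustrative assumption rather than the authors' implementation: the class name ReversedRecurrentBlock, the hidden size, and the choice of nn.MultiheadAttention plus nn.TransformerEncoderLayer are stand-ins for the actual \(\mathrm R^2\) Block, which is available in the linked repository.

# Minimal sketch of the reversed-recurrent aggregation idea (assumptions, not the authors' code).
import torch
import torch.nn as nn


class ReversedRecurrentBlock(nn.Module):
    """Aggregates frame-wise CLIP features from the last layer backwards, conditioned on a text query."""

    def __init__(self, clip_dim: int = 768, hidden_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # A single shared projection keeps the block lightweight relative to the frozen backbone.
        self.proj = nn.Linear(clip_dim, hidden_dim)
        self.query_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=4 * hidden_dim, batch_first=True
        )

    def forward(self, layer_feats: list[torch.Tensor], query: torch.Tensor) -> torch.Tensor:
        # layer_feats: per-CLIP-layer frame features, each (B, T, clip_dim), ordered shallow -> deep.
        # query:       text query token features, (B, L, clip_dim).
        q = self.proj(query)
        memory = torch.zeros_like(self.proj(layer_feats[-1]))   # recurrent state, (B, T, hidden_dim)
        for feats in reversed(layer_feats):                     # last CLIP layer first: coarse-to-fine
            memory = memory + self.proj(feats)                  # fold in this layer's spatial features
            cond, _ = self.query_attn(memory, q, q)             # condition on the language query
            memory = self.temporal(memory + cond)               # refine temporal correlation
        return memory                                           # (B, T, hidden_dim) clip-level features


# Toy usage with random tensors standing in for frozen CLIP outputs (shape check only).
if __name__ == "__main__":
    feats = [torch.randn(2, 32, 768) for _ in range(12)]        # 12 CLIP layers, 32 frames
    query = torch.randn(2, 10, 768)                             # 10 query tokens
    out = ReversedRecurrentBlock()(feats, query)
    print(out.shape)                                             # torch.Size([2, 32, 256])

In the actual method the clip-level outputs would feed task heads for moment retrieval, highlight detection, and video summarization; the toy usage above only verifies tensor shapes.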



Acknowledgments

This work was supported in part by Hong Kong Research Grants Council GRF-15229423, US NIH grant R01HD104969, and NSF award IIS-2239688.

Author information


Corresponding authors

Correspondence to Wanhua Li or Chang Wen Chen.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13069 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, Y. et al. (2025). \(\mathrm R^2\)-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15099. Springer, Cham. https://doi.org/10.1007/978-3-031-72940-9_24


  • DOI: https://doi.org/10.1007/978-3-031-72940-9_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72939-3

  • Online ISBN: 978-3-031-72940-9

  • eBook Packages: Computer Science, Computer Science (R0)

