Video LLMs for Temporal Reasoning in Long Videos

Fateh, Fawad Javed; Ahmed, Umer; Khan, Hamza; Zia, M. Zeeshan; Tran, Quoc-Huy

Computer Science > Computer Vision and Pattern Recognition

arXiv:2412.02930 (cs)

[Submitted on 4 Dec 2024 (v1), last revised 21 Jul 2025 (this version, v4)]

Abstract: This paper introduces TemporalVLM, a video large language model (video LLM) capable of effective temporal reasoning and fine-grained understanding in long videos. At its core, our approach includes a visual encoder that maps a long input video into time-aware features containing both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. The extracted time-aware and multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, which consists of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments on datasets of long videos, including TimeIT and IndustryASM, show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
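
The local stage described in the abstract (short-term clips, timestamps, overlapping temporal windows) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy two-number clip features, the window and stride values, and the mean-pooling fusion are all assumptions chosen for illustration, and the BiLSTM global aggregation step is left out.

```python
from typing import List, Tuple


def split_into_clips(num_frames: int, clip_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, for consecutive short clips."""
    return [(s, min(s + clip_len, num_frames)) for s in range(0, num_frames, clip_len)]


def overlapping_windows(num_clips: int, window: int, stride: int) -> List[List[int]]:
    """Group clip indices into windows; a stride smaller than the window makes them overlap."""
    if num_clips == 0:
        return []
    if num_clips <= window:
        return [list(range(num_clips))]
    starts = list(range(0, num_clips - window + 1, stride))
    if starts[-1] + window < num_clips:
        # Add one final window so the last clips are not dropped.
        starts.append(num_clips - window)
    return [list(range(s, s + window)) for s in starts]


def fuse_windows(clip_feats: List[List[float]], windows: List[List[int]]) -> List[List[float]]:
    """Mean-pool clip features inside each window (an assumed stand-in for learned fusion)."""
    fused = []
    for idxs in windows:
        dim = len(clip_feats[idxs[0]])
        fused.append([sum(clip_feats[i][d] for i in idxs) / len(idxs) for d in range(dim)])
    return fused


# Hypothetical toy input: a 10-frame video at 2 frames per second, 4-frame clips.
fps = 2.0
clips = split_into_clips(num_frames=10, clip_len=4)

# Toy "time-aware" feature per clip: [start time in seconds, clip length in frames].
clip_feats = [[start / fps, float(end - start)] for start, end in clips]

windows = overlapping_windows(len(clips), window=2, stride=1)
local_feats = fuse_windows(clip_feats, windows)

print(clips)        # frame ranges of the short-term clips
print(windows)      # clip indices grouped into overlapping windows
print(local_feats)  # one fused local feature per window
```

In a real model the per-clip features would come from a learned visual encoder conditioned on timestamps, and the sequence of fused local features would then be passed to a bidirectional LSTM for global aggregation.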

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.02930 [cs.CV]
	(or arXiv:2412.02930v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.02930

Submission history

From: Quoc-Huy Tran
[v1] Wed, 4 Dec 2024 00:50:33 UTC (4,311 KB)
[v2] Sun, 9 Mar 2025 07:25:51 UTC (4,312 KB)
[v3] Fri, 6 Jun 2025 20:24:54 UTC (13,304 KB)
[v4] Mon, 21 Jul 2025 04:32:58 UTC (13,310 KB)
