Abstract
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs’ behavior in VideoQA, aiming to elucidate their success and failure modes and to provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video content. However, the models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments. Moreover, the models behave unintuitively: they are unresponsive to adversarial video perturbations, yet sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs’ QA capability under standard conditions, yet highlight their severe deficiencies in robustness and interpretability, suggesting the urgent need for rationales in Video-LLM development.
Data Availability
We note that Video-LLM research is emerging, and there are other powerful models we may have missed testing. We thus release all our test data: https://github.com/doc-doc/VideoQA-LLMs.
Notes
In this paper, we denote Video-LLMs as video-language models that use LLMs with \(\ge 1\) billion parameters.
Who, What, When, Where, Why, Which, How.
Acknowledgements
We greatly thank OpenAI for offering us research-access API credits.
Additional information
Communicated by Gunhee Kim.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research is supported by the National Research Foundation, Singapore under its NRF Fellowship for AI (NRF-NRFFAI1-2019-0001). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
Appendices
Appendix 1: Temporal Probes
We adopt different methods to generate the Temporal Exchange and Temporal Description data described in Sect. 3.4. In both cases, generation is followed by human checking and correction (by the first four authors) to ensure quality.
Specifically, to obtain the Temporal Exchange data, we prompt GPT-4 with the original question and correct-answer pairs. In our implementation, we find it better to use different prompts for questions about “before/after” and questions about “when/while”. The specific prompts are presented in Table 6. We find the generated QAs are of good quality and require only minor corrections.
To obtain the Temporal Description data, we directly parse the syntactic structure of the questions according to the time-signal words “before/after/when/while”. Specifically, we keep the question part ahead of the time word. For instance, we derive the new question “what did the person do?” from the original question “what did the person do after she took the black item away?”. To curate the options, we use different time words to combine the two phrases “she took the black item away” and “pat animal”, ensuring that only one combination is correct with regard to the content order of the video. Also, we keep the position of the correct answer unchanged. Additionally, we find that some questions carry temporal location descriptions, such as “... in the middle of the video”. To handle this, we trim off such descriptions and append them to the new question. Note that grammatical issues such as “before pat animal” vs. “before patting animal” are ignored, since they do not affect the overall meaning of the questions and answers.
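For illustration, the following sketch shows one way such a Temporal Description item could be constructed programmatically. It is a minimal sketch under our own assumptions: the helper name build_td_item, the time-word list, and the input arguments are illustrative and do not reproduce our exact curation script.

```python
# Illustrative sketch of Temporal Description (TD) item construction.
# The helper name and its inputs are hypothetical, not the released script.
import re

TIME_WORDS = ["before", "after", "when", "while"]

def build_td_item(question: str, temporal_clause: str, answer_phrase: str,
                  correct_word: str):
    """Split `question` at the first time-signal word and rebuild the options."""
    match = re.search(r"\b(" + "|".join(TIME_WORDS) + r")\b",
                      question, flags=re.IGNORECASE)
    # Keep only the question part ahead of the time word, e.g.
    # "what did the person do after she took ..." -> "what did the person do?"
    if match:
        new_question = question[:match.start()].strip().rstrip("?") + "?"
    else:
        new_question = question
    # Combine the answer phrase and the temporal clause with each time word;
    # only the combination matching the video's content order is correct.
    options = [f"{answer_phrase} {w} {temporal_clause}" for w in TIME_WORDS]
    correct = f"{answer_phrase} {correct_word} {temporal_clause}"
    return new_question, options, correct
```

For the running example, build_td_item("what did the person do after she took the black item away?", "she took the black item away", "pat animal", "after") yields the new question “what did the person do?” with options such as “pat animal before she took the black item away” and “pat animal after she took the black item away”, the latter being the intended correct option.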
Appendix 2: Multi-Choice Short-Cuts
To test the short-cuts in candidate answers, we obtain the edited Question-Answer (QA) and Video-Answer (VA) data by prompting GPT-4, again followed by human checking and correction. Our specific prompts are listed in Table 7. For human checking, we specifically inspect the edited negative options to ensure that they are indeed wrong answers with regard to the question and video contents.
Appendix 3: Robustness
To obtain questions with spoken prefixes, we first prompt GPT-4 to generate a set of spoken phrases that humans typically use before asking questions, as shown in Table 4. Then, we randomly select from these spoken phrases and prepend them to the questions. To obtain the rephrased questions, we simply prompt GPT-4 using the instructions specified at the bottom of Table 7. We find the generated questions are of good quality and require only minor corrections.
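As a concrete illustration, the prepending step can be sketched as below; the phrases listed here are placeholders rather than the actual GPT-4-generated ones in Table 4.

```python
# Minimal sketch of adding spoken prefixes; the phrases are placeholders,
# not the GPT-4-generated ones in Table 4.
import random

SPOKEN_PREFIXES = [
    "Hmm, let me think,",
    "Just out of curiosity,",
    "Quick question here,",
]

def add_spoken_prefix(question: str, rng: random.Random) -> str:
    """Prepend a randomly selected spoken phrase to the question."""
    return f"{rng.choice(SPOKEN_PREFIXES)} {question}"

# Example: add_spoken_prefix("what did the person do?", random.Random(0))
```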
Appendix 4: GPT-4o for VideoQA
Following popular methods, we first decode each video at 3 frames per second (fps) and then uniformly sample 32 video frames. We then feed all 32 sampled frames, together with the question and answer, into GPT-4o and prompt it to answer the question based on the sequence of image contents. The specific prompts are shown in Table 8. In particular, for the PosVQA experiment described in Sect. 3.5, we sample all frames (after decoding at 3 fps) from the positive temporal moments, and uniformly sample 32 frames if a segment contains more than 32 frames. This is because we find that, for a non-trivial number of samples, only a single frame falls within the positive moment if we uniformly sample 32 frames from the whole video. Finally, in all experiments, we experiment with only 40% of the original data, considering the time and API expenses. Also, we choose not to run GPT-4o on open-ended QA due to answer-evaluation issues.
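The frame-sampling step can be sketched as follows, assuming OpenCV for decoding; the helper name and defaults mirror the setup described above (decode at roughly 3 fps, then uniformly keep at most 32 frames), though the exact implementation details may differ.

```python
# Minimal frame-sampling sketch (assumes OpenCV): decode at ~3 fps, then
# uniformly sample at most `num_frames` frames, as described above.
import cv2
import numpy as np

def sample_frames(video_path: str, decode_fps: float = 3.0, num_frames: int = 32):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or decode_fps
    step = max(int(round(native_fps / decode_fps)), 1)  # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    if len(frames) > num_frames:  # uniform selection over the decoded frames
        keep = np.linspace(0, len(frames) - 1, num_frames).astype(int)
        frames = [frames[i] for i in keep]
    return frames
```

In our pipeline, the sampled frames are then encoded as images and fed to GPT-4o together with the prompt in Table 8.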
Appendix 5: Effectiveness of Data Augmentation
For improvement, we take the Temporal Description (TD) probe (introduced in Sect. 3.4) as an example to explore the effects of data augmentation, given that all models largely fail on this probe. In our implementation, we curate new TD data based on the original temporal subset of the training data, mirroring how we curate the TD test data. We then add the curated TD data into the original training data of NExT-QA for training (\(\sim 4.5\)K). The results in Table 5 show that this simple data augmentation significantly boosts model performance on both the temporal description (TD) and temporal exchange (TE) probes, and slightly improves performance on the original test set (TO) for temporal understanding. The consistent improvements across all test settings demonstrate the effectiveness of such data augmentation towards more robust temporal understanding. Additionally, we observe particularly large improvements on the TD test probe. We speculate that the curated TD test data, which is designed for zero-shot testing, may leak statistical biases when the model is trained on data curated in the same form.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xiao, J., Huang, N., Qin, H. et al. VideoQA in the Era of LLMs: An Empirical Study. Int J Comput Vis 133, 3970–3993 (2025). https://doi.org/10.1007/s11263-025-02385-8
DOI: https://doi.org/10.1007/s11263-025-02385-8