CodeMind: Evaluating Large Language Models for Code Reasoning

Liu, Changshu; Chen, Yang; Jabbarvand, Reyhaneh

arXiv:2402.09664 (cs)

[Submitted on 15 Feb 2024 (v1), last revised 22 May 2025 (this version, v5)]

Abstract: Large Language Models (LLMs) have been widely used to automate programming tasks. Their capabilities have been evaluated by assessing the quality of the generated code through tests or proofs. The extent to which they can reason about code is a critical question that reveals important insights about their true capabilities. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs through the following explicit and implicit code reasoning tasks: Independent Execution Reasoning (IER), Specification Reasoning (SR), and Dynamic Semantics Reasoning (DSR). IER evaluates the ability of LLMs to simulate the execution of a piece of code on given inputs and predict the output. SR assesses the ability of LLMs to incorporate the simulation of test data in the specification into code generation. DSR evaluates the ability of LLMs to understand overall code semantics given only a specific input/output. Our extensive evaluation of ten LLMs across four widely used benchmarks using CodeMind shows that LLMs, depending on their size and training strategy, can reason about some dynamic aspects of code. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. We show that these reasoning tasks evaluate LLMs differently, and that a comprehensive evaluation of code reasoning requires all of them. Finally, we show that the performance of LLMs in bug repair is not correlated with any of the code reasoning tasks, and that, except for advanced frontier models, LLMs do not incorporate code reasoning when performing bug repair.
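
To make the IER task described in the abstract concrete, the following is a minimal sketch of how an output prediction could be checked against real execution. It is not CodeMind's implementation: the helper names `ier_correct` and `ier_rate`, the toy program `running_max`, and the hard-coded "predictions" are all hypothetical and stand in for model output.

```python
from typing import Any, Callable, Sequence, Tuple


def ier_correct(func: Callable[..., Any], args: tuple, predicted: Any) -> bool:
    """Return True when the predicted output equals the real execution result."""
    actual = func(*args)
    return actual == predicted


def ier_rate(func: Callable[..., Any], cases: Sequence[Tuple[tuple, Any]]) -> float:
    """Fraction of (args, predicted) pairs whose prediction matches execution."""
    if not cases:
        return 0.0
    hits = sum(ier_correct(func, args, predicted) for args, predicted in cases)
    return hits / len(cases)


def running_max(values):
    """Toy program under test (hypothetical): prefix maxima of a list."""
    best, out = None, []
    for v in values:
        best = v if best is None or v > best else best
        out.append(best)
    return out


# Hypothetical model predictions; the second one is wrong on purpose.
cases = [
    (([3, 1, 4],), [3, 3, 4]),
    (([2, 5, 1],), [2, 5, 1]),
]
print(ier_rate(running_max, cases))
```

Exact equality is the simplest possible comparison; a real harness would presumably also need to parse the model's free-text answer and sandbox execution, which this sketch leaves out.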

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Cite as: arXiv:2402.09664 [cs.SE]
  (or arXiv:2402.09664v5 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2402.09664

Submission history

From: Changshu Liu
[v1] Thu, 15 Feb 2024 02:24:46 UTC (2,360 KB)
[v2] Fri, 16 Feb 2024 18:35:22 UTC (2,360 KB)
[v3] Wed, 21 Feb 2024 20:23:08 UTC (2,354 KB)
[v4] Wed, 3 Apr 2024 06:23:48 UTC (9,000 KB)
[v5] Thu, 22 May 2025 05:34:22 UTC (27,273 KB)
