Introduction

Large Language Models (LLMs) have emerged as transformative tools across various industries, including healthcare. The rapid development and deployment of these models present a stark contrast to the lengthy timelines required for clinical studies, necessitating the development of automated evaluation methods. Traditionally, these evaluations rely on multiple-choice questions encompassing a range of topics, from biochemistry to clinical decision-making, and are benchmarked using standardized tests such as MultiMedQA1. While these methods allow for the swift assessment of model performance, they are primarily limited to evaluating pattern recognition and information recall2.

Recent efforts have included the use of official board examinations to evaluate LLM performance across different medical specialties such as pediatrics3, oncology4, ophthalmology5, radiology6, or plastic surgery7, often demonstrating that these models can perform at a level comparable to medical professionals8. However, such testing methodologies are inherently limited. They focus predominantly on accuracy in answering specific questions, without adequately addressing the critical aspects of model safety and the potential for generating erroneous or misleading information. For instance, studies examining specific tasks, such as ICD (International Classification of Diseases) coding9, have revealed significant performance deficiencies, underscoring the need for more comprehensive evaluation frameworks10,11.

In addition, LLMs’ integration into high-stakes environments has been met with resistance and skepticism due to hallucinations and the difficulty of reducing or detecting them12,13,14. For instance, a lawyer used ChatGPT to assist in a case, but the model hallucinated citations that did not exist15. These intrinsic limitations of transformer-based LLMs raise the question of whether other, more subtle limitations carry similar safety consequences.

The challenges posed by LLMs are emblematic of broader issues in the application of Artificial Intelligence (AI) in healthcare. AI offers immense potential to address the shortage of healthcare workers16, promises to reduce clerical work17,18, and has already demonstrated uses in precision diagnostics and therapeutics19. However, AI also introduces significant challenges due to its nature as a probabilistic black box that often lacks transparency, explicability, and interpretability20. This opacity engenders trust issues among healthcare providers and creates substantial barriers to meeting regulatory requirements for clinical deployment21. Consequently, these challenges slow down or postpone the adoption of AI technologies that could otherwise dramatically improve patient outcomes and optimize clinical workflows.

Regulatory frameworks from governing bodies such as the European Union provide more clarity on the expectations of such systems. For instance, the European Union’s approach aims to balance innovation with safety and ethical considerations, requiring AI systems in healthcare to be transparent, accountable, and subject to human oversight22. To address these concerns and improve interpretability and transparency, we propose investigating a crucial but underexplored area: the assessment of LLMs’ metacognition.

Metacognition in AI systems can be split into two categories23: knowledge of cognition and regulation of cognition. Knowledge of cognition encompasses awareness of one’s own cognitive processes, such as identifying biases. Regulation of cognition refers to skills for managing one’s learning process, including self-evaluation and monitoring. In healthcare, these abilities are crucial for professionals to handle complex, uncertain situations and continuously improve their practice. Understanding whether LLMs can gauge their knowledge and handle uncertainty is essential for their safe integration into clinical environments.

To assess metacognition, we introduce MetaMedQA24, an extension and modification of the MedQA-USMLE benchmark25, designed to evaluate LLMs’ metacognition on medical problems. Our enhanced benchmark employs techniques such as confidence scoring and uncertainty quantification to assess not only the accuracy of LLMs but also their capacity for self-assessment and identification of knowledge gaps. This approach aims to provide a holistic evaluation framework that aligns more closely with the practical demands of clinical settings, ensuring that the deployment of LLMs in healthcare can be both safe and effective. Moreover, the implications of this research extend beyond healthcare, potentially informing the development and evaluation of AI systems in other high-stakes domains where self-awareness and accurate self-assessment are critical.

In this work, we show that current LLMs demonstrate significant limitations in metacognitive abilities crucial for clinical decision-making. Our results reveal that while larger and newer models generally outperform their smaller and older counterparts in accuracy, most models exhibit poor performance in recognizing unanswerable questions and managing uncertainty. Notably, only three models, with GPT-4o standing out, effectively vary their confidence levels. We find that LLMs’ tendency towards overconfidence and inability to recognize knowledge gaps pose potential risks in clinical applications. These findings underscore the need for developing more sophisticated mechanisms within LLMs to handle uncertainty and ambiguity, as well as the importance of evolving benchmarks and evaluation metrics that capture the complexities of clinical reasoning.

Results

Benchmark creation and preprocessing

To evaluate the metacognitive abilities of LLMs in medical contexts, we based our assessment on MedQA-USMLE, a subset of MedQA, because the other benchmarks included in MultiMedQA lack both quality and clinical relevance. This benchmark is composed of clinical vignettes accompanied by four answer choices, with only one correct answer26.

We modified the MedQA-USMLE benchmark in three steps to create MetaMedQA as shown in Fig. 1:

  1. Inclusion of Fictional Questions: To test the models’ capabilities in recognizing their knowledge gaps, we included 100 questions from the Glianorex benchmark27, which is constructed in the format of MedQA-USMLE but pertains to a fictional organ. Examples of these questions are presented in Table 1.

    Table 1 Examples of fictional questions from Glianorex English and malformed questions from MedQA-USMLE that cannot be answered due to missing information
  2. Identification of Malformed Questions: Following Google’s observation that a small percentage of questions may be malformed28, we manually audited the benchmark and identified 55 questions that either relied on missing media or lacked necessary information. Examples of such questions are provided in Table 1.

  3. Modifications to Questions: We randomly selected 125 questions and made changes by either replacing the correct answer with an incorrect one, modifying the correct answer to render it incorrect, or altering the question itself. Examples of these modifications are presented in Table 2.

    Table 2 Examples of questions after modification; the original content is shown with strikethrough text, and the replacement is bolded
Fig. 1: Flow chart description of the MetaMedQA dataset construction.

Starting from the original MedQA and Glianorex English benchmarks, we obtain a benchmark with 1096 questions retaining their original answers, 115 questions whose correct answer is “None of the above”, and 162 questions whose correct answer is “I don’t know or cannot answer”.

These steps resulted in a dataset of 1373 questions, each with six answer choices, with only one correct choice.
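A minimal sketch of how a MetaMedQA-style item could be assembled is shown below; the helper and field names are illustrative assumptions, not the released preprocessing code.

```python
# Illustrative helper (not the released preprocessing code) showing how a
# MedQA-USMLE or Glianorex item could be extended to six answer choices.
EXTRA_CHOICES = ["None of the above", "I don't know or cannot answer"]

def to_metamedqa_item(question, choices, gold, relabel=None):
    """Append the two extra options; `relabel` overrides the gold answer,
    e.g. "I don't know or cannot answer" for fictional or malformed questions,
    or "None of the above" when the correct choice was removed or altered."""
    return {
        "question": question,
        "options": list(choices) + EXTRA_CHOICES,
        "answer": relabel if relabel is not None else gold,
    }

# Hypothetical usage: a question relying on missing media becomes unanswerable.
item = to_metamedqa_item(
    "Interpret the attached ECG strip ...",  # the media is absent from the text
    ["Atrial fibrillation", "Sinus tachycardia", "Complete heart block", "Normal sinus rhythm"],
    gold="Atrial fibrillation",
    relabel="I don't know or cannot answer",
)
```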

Overall accuracy

The results obtained by the different models correlate with their size and release date; larger and more recent models achieved higher accuracy than their smaller and older counterparts, as shown in Fig. 2. For example, Qwen2 72B (M = 64.3%, SEM = 1.3%) is significantly more accurate (p < 0.0001) than Qwen2 7B (M = 43.9%, SEM = 1.3%) with a moderate effect size (Cohen’s d = 0.417). GPT-4o-2024-05-13 (M = 73.3%, SEM = 1.2%) is significantly more accurate than all other models (p < 0.0001), while Yi 1.5 9B (M = 29.6%, SEM = 1.2%) is significantly less accurate than all other models (p < 0.0001). The notably low performance of Yi 1.5 9B compared to similar-sized models stands out as an outlier.

Fig. 2: Accuracy of models on the MetaMedQA benchmark.

Results are presented as mean values +/− 95% CI (n = 1373). Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Yi 1.5 34b vs Meerkat 7b, p = 0.0142). Models of the same family share the same color. Source data are provided as a Source Data file.

Impact of confidence

The original MedQA-USMLE benchmark primarily focuses on accuracy to compare models. Given the additional complexities introduced by our enhanced benchmark, we introduced three new metrics that assess model accuracy as a function of the model-generated confidence score, which ranges from 1 to 5. Each metric computes the percentage of correct answers within its confidence range. This system enabled a nuanced evaluation of the model’s performance, from its most certain predictions to those where it expressed doubt, ultimately enhancing safety and decision-making in healthcare applications. The three metrics use the following rules (a minimal computation sketch follows the list):

  • High Confidence Accuracy: For responses with a confidence score of 5.

  • Medium Confidence Accuracy: For responses with scores between 3 and 4.

  • Low Confidence Accuracy: For responses with scores below 3.
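The sketch below illustrates this stratification; the per-question record format (predicted answer, gold answer, 1–5 confidence score) is an assumed illustration, not the benchmark harness itself.

```python
# Minimal sketch of the three confidence-stratified accuracy metrics.
def confidence_accuracies(records):
    buckets = {"high": [], "medium": [], "low": []}
    for r in records:
        correct = r["predicted"] == r["gold"]
        if r["confidence"] == 5:
            buckets["high"].append(correct)       # High Confidence Accuracy
        elif r["confidence"] in (3, 4):
            buckets["medium"].append(correct)     # Medium Confidence Accuracy
        else:
            buckets["low"].append(correct)        # Low Confidence Accuracy (scores 1-2)
    # Accuracy is undefined (None) for an empty bucket, e.g. for models that
    # never express low confidence.
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}
```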

We observed that most models consistently assigned a maximum confidence level of 5, rendering them unsuitable for the confidence analysis. Only GPT-3.5-turbo-0125, GPT-4o-2024-05-13, and Qwen2-72B exhibited varying confidence levels, as shown in Table 3. For these models, higher confidence levels were correlated with higher accuracy, with GPT-4o demonstrating the best ability to assess its answers accurately. The other two models, however, only provided high or medium confidence scores, never utilizing low confidence ratings.

Table 3 Analysis of the impact of confidence on the accuracy of three models on MetaMedQA including the 95% confidence interval. Source data are provided as a Source Data file

Missing answer analysis

The “Missing answer recall” metric evaluates the model’s capability to recognize when none of the provided options are correct, which is essential for ensuring accuracy in ambiguous or incomplete questions. It is calculated by dividing the number of correctly identified “None of the above” answers by the total number of questions where this was the correct answer.
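A minimal sketch of this computation, under the same assumed record format as the confidence sketch above, is:

```python
# Recall over questions whose gold answer is a given target label.
def label_recall(records, label="None of the above"):
    relevant = [r for r in records if r["gold"] == label]
    hits = sum(r["predicted"] == label for r in relevant)
    return hits / len(relevant) if relevant else None  # undefined if no such questions
```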

The recall of missing answers when the correct response is “None of the above,” as shown in Fig. 3, indicates that models struggle more with this option compared to others. The Yi 1.5 9B model, which had the lowest overall accuracy, achieved the highest score on this specific metric. This can be attributed to the model selecting “None of the above” 520 times, or 37.9% of the questions, leading to an inflated score in this area but poor performance on other metrics. A similar but less pronounced trend was observed with the Meerkat 7B model, which chose “None of the above” 295 times. Conversely, the Llama 3 8B model almost never selected this option, while the Mistral 7B and Internist 7B models never did. When examining other models, we found that larger and more recent models generally outperformed their smaller and older counterparts, mirroring the overall accuracy pattern. For instance, GPT-4o-2024-05-13 (M = 46.1%, SEM = 4.7%) is significantly more accurate (p < 0.0001) than GPT-3.5-turbo-0125 (M = 11.3%, SEM = 2.9%) with a large effect size (d = 0.826).

Fig. 3: Recall of “None of the above” of models on the MetaMedQA benchmark, including the 95% confidence interval.

Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Meerkat 7b vs Qwen2 7b, p = 0.012). Results are presented as mean values +/− 95% CI (n = 115). Models of the same family share the same color. Source data are provided as a Source Data file.

We conducted additional analyses to explore the relationship between overall accuracy and missing answer recall. After excluding the outliers Yi 1.5 9B and Meerkat 7B, which selected “None of the above” for 37.9% and 21.5% of questions respectively (greatly inflating their recall in this category), we found a strong positive correlation between these two metrics using the Pearson correlation coefficient with a two-tailed p-value (Pearson r = 0.947, p < 0.0001). This indicates that models with higher overall accuracy generally performed better at identifying missing answers. To quantify this relationship more precisely, we performed a regression analysis. The regression yielded a statistically significant positive slope of 1.319 (95% CI: 1.136–1.502, p < 0.0001), as shown in Fig. 4.
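The analyses can be sketched with SciPy as follows; the per-model values below are hypothetical placeholders for illustration, not the reported results, and treating recall as the dependent variable is an assumption.

```python
from scipy import stats

# Hypothetical per-model summaries (n = 10 after excluding the two outliers).
accuracy = [0.35, 0.42, 0.48, 0.55, 0.60, 0.64, 0.66, 0.70, 0.72, 0.73]
recall   = [0.02, 0.08, 0.12, 0.20, 0.25, 0.30, 0.33, 0.40, 0.43, 0.46]

r, p = stats.pearsonr(accuracy, recall)      # Pearson r with a two-tailed p-value
fit = stats.linregress(accuracy, recall)     # recall regressed on accuracy
ci_half_width = 1.96 * fit.stderr            # approximate 95% CI on the slope
print(f"r = {r:.3f}, p = {p:.2g}, slope = {fit.slope:.3f} ± {ci_half_width:.3f}")
```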

Fig. 4: Linear regression between missing answer recall and overall accuracy of language models (n = 10) on the MetaMedQA benchmark.

The plot shows models excluding the outliers Yi 1.5 9B and Meerkat 7B. The solid line represents the linear regression fit, with the shaded area indicating the 95% confidence interval. Labeled points represent various models, while the unlabeled points from left to right are Mistral 7B, Internist 7B, and Llama 3 8B, respectively. Results are presented as mean values +/− 95% CI. Source data are provided as a Source Data file.

Unknown analysis

We assessed the models’ ability to identify questions they could not answer, either because missing content made the question undecidable or because the question concerned fictional content not included in their training data. This metric is essential for evaluating the model’s self-awareness and its ability to avoid making potentially harmful guesses. It is calculated by dividing the number of times the model correctly identifies a question as unanswerable or outside its knowledge base by the total number of such questions. This proved to be the most challenging task for the models, with most scoring 0%. The exceptions were GPT-4o-2024-05-13, which achieved 3.7%, Yi 1.5 34B, which scored 0.6%, and Meerkat 7B, with 1.2%. The models either never used this answer choice or used it fewer than 10 times over the 1373 questions.
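Under the same assumptions as the recall sketch in the missing answer analysis, this metric is the same computation with the target label swapped:

```python
# Hypothetical reuse of the label_recall sketch defined above.
unknown_recall = label_recall(records, label="I don't know or cannot answer")
```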

For this metric, regression and correlation analyses were limited because 9 out of 12 models scored 0. Although the regression analysis yielded a statistically significant slope of 0.05232 (95% CI: 0.0231–0.0815, p < 0.001), there was no statistically significant correlation (Pearson r = 0.574, p = 0.051). The predominance of zero scores severely limits the interpretability and practical significance of these statistical findings.

Prompt engineering analysis

To evaluate the impact of prompt engineering on metacognition, we evaluated OpenAI’s GPT-4o-2024-05-13 with a set of system prompts using the same benchmarking procedure. We started with a simple prompt describing the model’s role as a medical assistant29 and iteratively added more information about the benchmark, including that some questions can be malformed, incomplete, misleading, or beyond the model’s knowledge, ultimately arriving at a prompt that describes all the tricks found in the benchmark.
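The progression can be pictured as follows; these strings are a hypothetical illustration only, and the exact prompts used are listed in Table 5.

```python
# Hypothetical illustration of the iterative prompt construction (see Table 5
# for the actual prompts): each step adds one explicit warning to the baseline.
BASELINE = "You are a medical assistant answering USMLE-style multiple-choice questions."
WITH_MISSING = BASELINE + " The correct answer may not be among the listed choices."
WITH_UNKNOWN = (
    WITH_MISSING
    + " Some questions may be malformed, incomplete, or beyond your knowledge;"
    + " acknowledge when you cannot answer."
)
```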

A significant improvement in accuracy, high confidence accuracy, and unknown recall appeared (p < 0.0001) once the prompt explicitly informed the model that it might not be able to answer some questions, as shown in Table 4. Missing answer recall improved when the prompt explicitly informed the model that the correct answer might not be present among the choices, but the improvement was not statistically significant (p = 0.07). Interestingly, providing the complete benchmark design instructions did not improve performance compared to baseline except for unknown recall, and underperformed compared to the explicit prompts. We also observed that the model failed to use mid and low confidence appropriately when given additional instructions in the system prompt, although high confidence accuracy was either similar to or higher than baseline.

Table 4 Benchmark results of GPT-4o-2024-05-13 on MetaMedQA with variations of system prompts described in Table 5

Discussion

The accuracy results highlighted a clear association between performance and both model size and release date. Larger and newer models, such as GPT-4o and Qwen2-72B, consistently outperformed their smaller and older counterparts. This trend suggests that advancements in model architecture and training techniques contribute significantly to improved accuracy. However, the notably poor performance of certain models like Yi 1.5-9B, despite being relatively recent, indicates that model optimization and specific training datasets also play crucial roles. Additional medical training improved accuracy for both medical models and improved the ability to detect missing answers for Meerkat 7B, which could be explained by the inclusion of five-choice questions and a wider range of questions in its training dataset.

In terms of high confidence accuracy, only three models demonstrated the ability to vary their confidence levels effectively. GPT-4o stood out in this regard, showing a robust capacity to provide higher accuracy when highly confident compared to answers with lower confidence scores. This capability is crucial in clinical settings, where high-confidence decisions need to be reliable to ensure patient safety. The limited use of low confidence scores by most models suggests a tendency toward overconfidence, which could pose risks if models are used in clinical practice without appropriate checks. These findings reinforce previous research recommendations on mitigating healthcare data biases in machine learning30, identifying a probable training data bias that predisposes models to provide confident answers in most scenarios, even when a more cautious response is warranted.

The recall of “None of the above” answers revealed significant differences in how models handle uncertainty. Models like Yi 1.5-9B frequently selected this option, inflating their recall scores at the expense of accuracy. Conversely, models that rarely chose this option might be overly confident, missing opportunities to acknowledge when none of the given answers are correct. This behavior underscores the need for more sophisticated mechanisms within models to handle uncertainty and ambiguity. The “unknown recall” metric, assessing the ability to recognize unanswerable questions, showed poor performance across all models, highlighting a fundamental limitation in current LLMs’ metacognitive abilities. This inability to reliably indicate when they lack sufficient information or knowledge suggests a risk of generating misleading or incorrect information, which could have serious implications in clinical applications.

While the ability of models such as GPT-4o to reliably indicate high-confidence answers suggests potential for clinical decision support, the tendency toward overconfidence among many models underscores the need for enhancements in expressing and managing uncertainty. Closing this significant gap in current LLMs’ ability to recognize and acknowledge their knowledge limitations is critical for preventing the dissemination of incorrect or potentially harmful information in clinical contexts and for ensuring that LLMs do not overstep their capabilities.

The absence of metacognitive capabilities in LLMs raises questions about whether such capabilities should be expected. Comprehensive models of cognition, such as the transtheoretical model31, incorporate external factors, including interactions with team members or databases, which could be implemented for LLMs with Retrieval-Augmented Generation31. While external factors might partially compensate for the lack of internal metacognition, this approach presents limitations. Although it aligns with human oversight requirements in healthcare, it may not fully address the complexity required in LLM-based systems for critical decision-making. For instance, a summarization agent retrieving patient record information might fail to recognize incomplete contextual data, potentially generating inaccurate summaries. The limited access to external tools, along with their imperfections, raises concerns about relying solely on such tools to prevent errors stemming from metacognitive deficits.

Effective diagnostic reasoning necessitates a synergistic application of both pattern recognition (System 1) and deliberate analytical processes (System 2), particularly when experience alone proves insufficient. Clinicians adeptly employ these cognitive strategies concurrently, selecting the most appropriate approach based on their expertise32. Crucially, the ability to recognize knowledge limitations enables clinicians to dynamically shift between these cognitive strategies33. Beyond internal processes, clinicians may also leverage external resources, such as clinical guidelines or second opinions, to inform their decision-making34. The clinical decision-making process is inherently complex, demanding not only medical competence but also a profound understanding of one’s own reasoning to strike a delicate balance between caution and confidence. Cognitive errors are an important source of diagnostic error35, and methods such as reflective medical practice36, which helps clinicians enhance their ability to navigate complex cases37, or debiasing through feedback to identify and correct cognitive biases38, can help in reducing the number of cognitive errors. Current LLMs, despite their capabilities, exhibit overconfidence and deficiencies in recognizing their limitations, making them unlikely to appropriately employ these nuanced strategies. Moreover, their fixed nature and the challenges associated with providing meaningful feedback leave minimal room for improvement. While we observed some enhancements in metacognitive task performance through prompt engineering with GPT-4o, these improvements remain constrained. Prompts had to explicitly inform the LLM of potential biases and dangers, necessitating an exhaustive—and impractical—list of all potential pitfalls for real-world applications. Consequently, we argue that metacognition should be considered a fundamental capability for LLMs, particularly in critical domains such as healthcare. This emphasis on metacognitive abilities would enable AI systems to more closely emulate the sophisticated reasoning processes employed by human clinicians, potentially leading to more reliable and trustworthy AI-assisted diagnostic tools.

Potential improvements in terms of metacognitive abilities could be made through the generation of synthetic data using the prompt engineering techniques demonstrated. By creating diverse scenarios that explicitly require metacognitive skills—such as recognizing knowledge limitations and assessing confidence levels—LLMs could be fine-tuned to better align with expected metacognitive behaviors. This approach could involve synthesizing clinical scenarios that reflect the multifaceted nature of decision-making, incorporating elements from comprehensive cognitive models. While this presents a promising direction for future work, it remains crucial to consider the challenges of ensuring data quality, avoiding new biases, and validating that improvements translate effectively to real-world clinical scenarios.

Regarding benchmark and methodology limitations, the MedQA benchmark, even with our modifications, may not fully capture the complexity and variability of real-world clinical scenarios. While we aimed to enhance the benchmark by including questions designed to test metacognitive capabilities, the controlled nature of multiple-choice questions cannot replicate the nuanced decision-making processes required in clinical practice. Nevertheless, our benchmark modifications are a significant step toward assessing metacognitive abilities, providing a foundational evaluation that can be built upon in future studies with more complex and realistic scenarios. In addition, the manual modifications and audits we performed, although thorough, are subject to human error and interpretation biases. The selection and modification of questions, as well as the auditing process, could have introduced subjective biases affecting the outcomes of our evaluations. Despite this, the systematic approach and open access to our modifications ensure that our findings remain reliable and reproducible, providing a clear methodology for subsequent studies to enhance and validate further.

The reliance on multiple-choice questions for LLM evaluation presents limitations in assessing cognitive capabilities, particularly in reasoning tasks. Recent studies of GPT-4V’s performance on medical multiple-choice questions demonstrated that, despite impressive scores, the rationale behind correct answers is flawed in a significant percentage of cases39. Another analysis of GPT-4’s errors on the USMLE demonstrated that most errors are caused by either an anchoring bias or incorrect conclusions40. These findings emphasize the limits of multiple-choice formats for assessing cognitive capabilities, especially in reasoning tasks. To address these shortcomings, future research should explore alternative assessment methods, such as key-feature questions. Unlike conventional multiple-choice questions, key-feature assessments target critical problem-solving steps, thereby evaluating the ability to apply knowledge in practical scenarios. Validated across all levels of medical training and practice41, key-feature questions could offer a promising approach for more accurately assessing the decision-making processes of LLMs in clinical tasks. This method may provide valuable insights into LLMs’ cognitive abilities that are not captured by traditional multiple-choice assessments.

In terms of metrics and evaluation limitations, while we implemented a confidence scoring system to capture models’ confidence levels on a scale from 1 to 5, this may not fully represent the nuanced levels of certainty a model might have. In addition, the tendency of models to avoid low confidence scores suggests a potential bias towards overconfidence. Despite these limitations, the confidence scoring system provides an essential dimension of evaluation, highlighting areas where models exhibit confidence misalignment, which is crucial for understanding and improving their deployment in clinical settings. The final metrics, including confidence accuracy, missing answer recall, and unknown recall, are designed to provide a comprehensive assessment but may not capture all aspects of model performance and safety. These metrics serve as proxies for complex behaviors that might manifest differently in real-world applications. Nonetheless, they offer a structured approach to evaluating critical aspects of LLM performance, forming a robust basis for future refinement and development of more sophisticated metrics.

Considering model selection and access limitations, this work focused on a limited set of LLMs available and popular as of June 2024. This temporal limitation means the findings may not be fully generalizable to future models or those trained with different objectives and datasets. However, the trends and correlations observed, such as the impact of model size and recency, are likely to remain relevant as guiding principles for future LLM development and evaluation. The proprietary nature of some models, such as OpenAI’s GPT-4o, limits our insight into their training data and methodologies. This constraint could influence their performance and the interpretation of our results. Yet, the inclusion of both proprietary and open-weight models allows for a broader assessment, demonstrating that our findings are not confined to a single type of model but rather indicative of general trends in LLM performance and metacognitive abilities.

Lastly, regarding theoretical framework limitations, the reliance on the Dual Process Theory (DPT)42 may not accurately represent the cognitive processes involved in clinical decision-making. More comprehensive theories of cognition, such as the transtheoretical model, while including the DPT, also incorporate additional layers such as embodied cognition through sensory input or situated cognition representing the interactions between individuals and their environment. These additional layers provide a more holistic view of clinical reasoning, acknowledging the complex interplay between internal cognitive processes and external factors. When applying these theories to LLMs, we encounter significant limitations. Considering LLMs have restricted access to external cognitive processes, we argue that internal cognitive processes from the DPT must compensate. This compensation, however, may not fully replicate the richness of human clinical reasoning. Our work investigates System 1 thinking exclusively, which involves rapid, intuitive decision-making. While additional experiments involving System 2 should be conducted to improve our understanding of LLM cognition, it is important to note that studies have shown that switching to System 2 may not always reduce reasoning errors in humans43. LLMs also appear to suffer from a similar shortcoming and fail to self-correct when their reasoning is faulty44. Therefore, investigating System 1 exclusively appears to be an important initial step towards understanding the limitations in LLMs’ cognitive capabilities for clinical decision-making. Future research could explore ways to incorporate aspects of System 2 thinking and elements of the transtheoretical model into LLM-based clinical decision support systems, potentially bridging the gap between current LLM capabilities and the complex, multifaceted nature of human clinical reasoning.

In conclusion, these results suggest that current LLMs, despite high accuracy on certain tasks, lack essential capabilities for safe deployment in clinical settings. The discrepancy between performance on standard questions and metacognitive tasks highlights a critical area for improvement in LLM development. This gap raises concerns about a form of deceptive expertise, where systems appear knowledgeable but fail to recognize their own limitations. Future research should focus on enhancing LLMs’ ability to recognize uncertainty and knowledge gaps, as well as developing robust evaluation metrics that better reflect the complexities of clinical reasoning.

Methods

Benchmark procedure

We used Python 3.12 and Guidance, a Python library designed to enforce model adherence to specific instructions through constrained decoding45, ensuring the models selected only from the allowed choices (A/B/C/D/E/F) and provided confidence scores (1/2/3/4/5)46.
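A minimal sketch of this constrained decoding setup, assuming Guidance’s select-based API (v0.1+) and an open-weight model, is shown below; the prompt text and model choice are illustrative and not the exact benchmark harness.

```python
from guidance import models, select

# Load an open-weight model supported by Guidance (illustrative choice).
lm = models.Transformers("mistralai/Mistral-7B-v0.1")

prompt = "<clinical vignette and six answer choices>\nAnswer with a letter and a confidence score.\n"

# The model can only emit one of the allowed letters and confidence scores.
out = (
    lm
    + prompt
    + "Answer: " + select(["A", "B", "C", "D", "E", "F"], name="answer") + "\n"
    + "Confidence: " + select(["1", "2", "3", "4", "5"], name="confidence")
)
print(out["answer"], out["confidence"])
```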

We included both proprietary and open-weight models in our evaluation. For proprietary models, we tested OpenAI’s GPT-4o-2024-05-1347 and GPT-3.5-turbo-012548. For open-weight models, we selected the most popular foundational models from the HuggingFace trending text generation model list as of June 2024, including Mixtral-8x7B-v0.149, Mistral-7B-v0.150, Yi-1.5-9B, Yi-1.5-34B51, Meta-Llama-3-8B, Meta-Llama-3-70B52, Qwen2-7B, and Qwen2-72B53. In addition, we evaluated two medical models based on Mistral-7B-v0.1, namely meerkat-7b-v1.054 and internistai/base-7b-v0.255, to determine if additional medical training enhances metacognitive abilities. All models were evaluated with a temperature setting of 0 to ensure reliability and reproducibility of results56. The open-weight model evaluations were performed on a Microsoft Azure Virtual Machine with 4 NVIDIA A100 80GB GPUs and required a total runtime of 3 hours, including setup time.

The 95% confidence interval is derived from the standard error of the mean multiplied by 1.96. Model accuracy differences were evaluated for statistical significance using p-values calculated with a one-way ANOVA in GraphPad Prism 10.1 followed by a Tukey test57,58.
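As a minimal numerical sketch (with hypothetical per-question results, not the study data), the interval follows directly from the per-question correctness values:

```python
import numpy as np

# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
correct = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
mean = correct.mean()
sem = correct.std(ddof=1) / np.sqrt(len(correct))  # standard error of the mean
ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)      # 95% confidence interval
print(f"accuracy = {mean:.3f}, 95% CI = [{ci95[0]:.3f}, {ci95[1]:.3f}]")
```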

Prompt engineering

The iterative process was designed to reveal information progressively, first implicitly and finally explicitly. The complete list of prompts is shown in Table 5. The statistical significance of differences between the prompts and the baseline was assessed using a one-way ANOVA in GraphPad Prism 10.1, followed by Fisher’s least significant difference test59 for post-hoc comparisons.

Table 5 Exhaustive list of system prompts used to evaluate the impact of prompt engineering on GPT-4o-2024-05-13’s performance on MetaMedQA

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.