Abstract
Large Language Models have demonstrated expert-level accuracy on medical board examinations, suggesting potential for clinical decision support systems. However, their metacognitive abilities, crucial for medical decision-making, remain largely unexplored. To address this gap, we developed MetaMedQA, a benchmark incorporating confidence scores and metacognitive tasks into multiple-choice medical questions. We evaluated twelve models on dimensions including confidence-based accuracy, missing answer recall, and unknown recall. Despite high accuracy on multiple-choice questions, our study revealed significant metacognitive deficiencies across all tested models. Models consistently failed to recognize their knowledge limitations and provided confident answers even when correct options were absent. In this work, we show that current models exhibit a critical disconnect between perceived and actual capabilities in medical reasoning, posing significant risks in clinical settings. Our findings emphasize the need for more robust evaluation frameworks that incorporate metacognitive abilities, essential for developing reliable Large Language Model enhanced clinical decision support systems.
Introduction
Large Language Models (LLMs) have emerged as transformative tools across various industries, including healthcare. The rapid development and deployment of these models present a stark contrast to the lengthy timelines required for clinical studies, necessitating the development of automated evaluation methods. Traditionally, these evaluations rely on multiple-choice questions encompassing a range of topics, from biochemistry to clinical decision-making, and are benchmarked using standardized tests such as MultiMedQA1. While these methods allow for the swift assessment of model performance, they are primarily limited to evaluating pattern recognition and information recall2.
Recent efforts have included the use of official board examinations to evaluate LLM performance across different medical specialties such as pediatrics3, oncology4, ophthalmology5, radiology6, or plastic surgery7, often demonstrating that these models can perform at a level comparable to medical professionals8. However, such testing methodologies are inherently limited. They focus predominantly on accuracy in answering specific questions, without adequately addressing the critical aspects of model safety and the potential for generating erroneous or misleading information. For instance, studies examining specific tasks, such as ICD (International Classification of Diseases) coding9, have revealed significant performance deficiencies, underscoring the need for more comprehensive evaluation frameworks10,11.
In addition, LLMs’ integration in high-stakes environments has been met with resistance and skepticism due to hallucinations and the difficulty of reducing or detecting them12,13,14. For instance, a lawyer used ChatGPT to assist in a case, but the model hallucinated citations that did not exist15. The existence of these intrinsic limitations of transformer-based LLMs raises questions regarding other limitations that may be more subtle but have similar safety consequences.
The challenges posed by LLMs are emblematic of broader issues in the application of Artificial Intelligence (AI) in healthcare. AI offers immense potential to address the shortage of healthcare workers16, promises to reduce clerical work17,18, and has already demonstrated uses in precision diagnostics and therapeutics19. However, AI also introduces significant challenges because it is a probabilistic black box that often lacks transparency, explicability, and interpretability20. This opacity engenders trust issues among healthcare providers and creates substantial barriers to meeting regulatory requirements for clinical deployment21. Consequently, these challenges slow down or postpone the adoption of AI technologies that could otherwise dramatically improve patient outcomes and optimize clinical workflows.
Regulatory frameworks from governing bodies such as the European Union provide more clarity on the expectations of such systems. For instance, the European Union’s approach aims to balance innovation with safety and ethical considerations, requiring AI systems in healthcare to be transparent, accountable, and subject to human oversight22. To address these concerns and improve interpretability and transparency, we propose investigating a crucial but underexplored area: the assessment of LLMs’ metacognition.
Metacognition in AI systems can be split into two categories23: knowledge of cognition and regulation of cognition. Knowledge of cognition encompasses awareness of one’s own cognitive processes, such as identifying biases. Regulation of cognition refers to skills for managing one’s learning process, including self-evaluation and monitoring. In healthcare, these abilities are crucial for professionals to handle complex, uncertain situations and continuously improve their practice. Understanding whether LLMs can gauge their knowledge and handle uncertainty is essential for their safe integration into clinical environments.
To assess metacognition, we introduce MetaMedQA24, an extension and modification of the MedQA-USMLE benchmark25, designed to evaluate LLMs’ metacognition on medical problems. Our enhanced benchmark employs techniques such as confidence scoring and uncertainty quantification to assess not only the accuracy of LLMs but also their capacity for self-assessment and identification of knowledge gaps. This approach aims to provide a holistic evaluation framework that aligns more closely with the practical demands of clinical settings, ensuring that the deployment of LLMs in healthcare can be both safe and effective. Moreover, the implications of this research extend beyond healthcare, potentially informing the development and evaluation of AI systems in other high-stakes domains where self-awareness and accurate self-assessment are critical.
In this work, we show that current LLMs demonstrate significant limitations in metacognitive abilities crucial for clinical decision-making. Our results reveal that while larger and newer models generally outperform their smaller and older counterparts in accuracy, most models exhibit poor performance in recognizing unanswerable questions and managing uncertainty. Notably, only three models, with GPT-4o standing out, effectively vary their confidence levels. We find that LLMs’ tendency towards overconfidence and inability to recognize knowledge gaps pose potential risks in clinical applications. These findings underscore the need for developing more sophisticated mechanisms within LLMs to handle uncertainty and ambiguity, as well as the importance of evolving benchmarks and evaluation metrics that capture the complexities of clinical reasoning.
Results
Benchmark creation and preprocessing
To evaluate the metacognitive abilities of LLMs in medical contexts, we based our assessment on MedQA-USMLE, a subset of MedQA, because the other benchmarks included in MultiMedQA lack both quality and clinical relevance. This benchmark is composed of clinical vignettes accompanied by four answer choices, with only one correct answer26.
We modified the MedQA-USMLE benchmark in three steps to create MetaMedQA as shown in Fig. 1:
1. Inclusion of Fictional Questions: To test the models’ capabilities in recognizing their knowledge gaps, we included 100 questions from the Glianorex benchmark27, which is constructed in the format of MedQA-USMLE but pertains to a fictional organ. Examples of these questions are presented in Table 1.

Table 1 Examples of fictional questions from Glianorex English and malformed questions from MedQA-USMLE that cannot be answered due to missing information

2. Identification of Malformed Questions: Following Google’s observation that a small percentage of questions may be malformed28, we manually audited the benchmark and identified 55 questions that either relied on missing media or lacked necessary information. Examples of such questions are provided in Table 1.

3. Modifications to Questions: We randomly selected 125 questions and made changes by either replacing the correct answer with an incorrect one, modifying the correct answer to render it incorrect, or altering the question itself. Examples of these modifications are presented in Table 2.

Table 2 Example of questions after modification; the original content is shown with strikethrough text, and the replacement is bolded
These steps resulted in a dataset of 1373 questions, each with six answer choices, with only one correct choice.
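For illustration, the resulting dataset, deposited on Hugging Face (see Data availability), could be loaded as follows. This is a minimal sketch rather than the authors’ code; the split name and the field names (`question`, `options`, `answer_idx`) are assumptions based on the MedQA-USMLE convention and may differ in the actual release.

```python
# Minimal sketch (not the authors' code) for loading and inspecting MetaMedQA.
# The split name and field names below are assumptions and may differ in the
# actual Hugging Face release.
from datasets import load_dataset

dataset = load_dataset("maximegmd/MetaMedQA", split="test")  # assumed split name

print(len(dataset))           # expected: 1373 questions
example = dataset[0]
print(example["question"])    # clinical vignette (assumed field name)
print(example["options"])     # six answer choices, e.g. including "None of the above"
print(example["answer_idx"])  # the single correct choice (assumed field name)
```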
Overall accuracy
The results obtained by the different models correlate with their size and release date; larger and more recent models achieved higher accuracy than their smaller and older counterparts, as shown in Fig. 2. For example, Qwen2 72B (M = 64.3%, SEM = 1.3%) is significantly more accurate (p < 0.0001) than Qwen2 7B (M = 43.9%, SEM = 1.3%) with a moderate effect size (Cohen’s d = 0.417). GPT-4o-2024-05-13 (M = 73.3%, SEM = 1.2%) is significantly more accurate than all other models (p < 0.0001), while Yi 1.5 9B (M = 29.6%, SEM = 1.2%) is significantly less accurate than all other models (p < 0.0001). The notably low performance of Yi 1.5 9B compared with similar-sized models stands out as an outlier.
Results are presented as mean values +/− 95% CI (n = 1373). Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Yi 1.5 34b vs Meerkat 7b, p = 0.0142). Models of the same family share the same color. Source data are provided as a Source Data file.
Impact of confidence
The original MedQA-USMLE benchmark primarily focuses on accuracy to compare models. Given the additional complexities introduced by our enhanced benchmark, we introduced three new metrics that assess model accuracy stratified by the confidence scores generated by the models, which range from 1 to 5. Each metric computes the percentage of correct answers within its confidence range. This system enabled a nuanced evaluation of each model’s performance, from its most certain predictions to those where it expressed doubt, ultimately enhancing safety and decision-making in healthcare applications. The three metrics use the following rules:
- High Confidence Accuracy: for responses with a confidence score of 5.
- Medium Confidence Accuracy: for responses with a score of 3 or 4.
- Low Confidence Accuracy: for responses with a score below 3.
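To make the stratification concrete, the following is a minimal sketch (not the study’s evaluation code) of how these three metrics could be computed; the `(is_correct, confidence)` record layout is hypothetical.

```python
# Sketch of the three confidence-stratified accuracy metrics described above.
# Each record is a hypothetical (is_correct, confidence) pair, where confidence
# is the model-reported score from 1 to 5.

def confidence_accuracy(records, low, high):
    """Accuracy over answers whose confidence falls within [low, high]."""
    bucket = [correct for correct, conf in records if low <= conf <= high]
    return sum(bucket) / len(bucket) if bucket else float("nan")

def confidence_report(records):
    return {
        "high_confidence_accuracy": confidence_accuracy(records, 5, 5),    # score of 5
        "medium_confidence_accuracy": confidence_accuracy(records, 3, 4),  # scores 3-4
        "low_confidence_accuracy": confidence_accuracy(records, 1, 2),     # scores below 3
    }

# Toy example (not real benchmark results):
records = [(True, 5), (False, 5), (True, 4), (False, 3), (True, 2)]
print(confidence_report(records))
```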
We observed that most models consistently assigned a maximum confidence level of 5, rendering them unsuitable for the confidence analysis. Only GPT-3.5-turbo-0125, GPT-4o-2024-05-13, and Qwen2-72B exhibited varying confidence levels, as shown in Table 3. For these models, higher confidence levels were correlated with higher accuracy, with GPT-4o demonstrating the best ability to assess its answers accurately. The other two models, however, only provided high or medium confidence scores, never utilizing low confidence ratings.
Missing answer analysis
The “Missing answer recall” metric evaluated the model’s capability to recognize when none of the provided options are correct, which is essential for ensuring accuracy in ambiguous or incomplete questions. It is calculated by dividing the number of correctly identified “None of the above” answers by the total number of questions where this was the correct answer.
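As an illustration, a minimal sketch of this recall computation is shown below; the record layout and exact option label are assumptions, not the study’s evaluation code. The unknown recall metric reported later follows the same structure with a different target option.

```python
# Sketch of the "missing answer recall" metric defined above: among questions
# whose gold answer is "None of the above", the fraction for which the model
# actually selected that option. The record layout is illustrative.

def missing_answer_recall(predictions, gold_labels, target="None of the above"):
    indices = [i for i, gold in enumerate(gold_labels) if gold == target]
    if not indices:
        return float("nan")
    hits = sum(1 for i in indices if predictions[i] == target)
    return hits / len(indices)

# Toy example (not real benchmark data): 2 of 3 eligible questions are correct.
gold = ["A", "None of the above", "None of the above", "C", "None of the above"]
pred = ["A", "None of the above", "B", "C", "None of the above"]
print(missing_answer_recall(pred, gold))  # 0.666...
```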
The recall of missing answers when the correct response is “None of the above,” as shown in Fig. 3, indicates that models struggle more with this option compared to others. The Yi 1.5 9B model, which had the lowest overall accuracy, achieved the highest score on this specific metric. This can be attributed to the model selecting “None of the above” 520 times, or 37.9% of the questions, leading to an inflated score in this area but poor performance on other metrics. A similar but less pronounced trend was observed with the Meerkat 7B model, which chose “None of the above” 295 times. Conversely, the Llama 3 8B model almost never selected this option, while the Mistral 7B and Internist 7B models never did. When examining other models, we found that larger and more recent models generally outperformed their smaller and older counterparts, mirroring the overall accuracy pattern. For instance, GPT-4o-2024-05-13 (M = 46.1%, SEM = 4.7%) is significantly more accurate (p < 0.0001) than GPT-3.5-turbo-0125 (M = 11.3%, SEM = 2.9%) with a large effect size (d = 0.826).
Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Meerkat 7b vs Qwen2 7b, p = 0.012). Results are presented as mean values +/− 95% CI (n = 115). Models of the same family share the same color. Source data are provided as a Source Data file.
We conducted additional analyses to explore the relationship between overall accuracy and missing answer recall. After excluding the outliers Yi 1.5 9B and Meerkat 7B, which selected “None of the above” for 37.9% and 21.5% of questions respectively (greatly overestimating their performance in this category), we found a strong positive correlation between these two metrics by calculating the Pearson correlation coefficient with a two-tailed p-value (Pearson r = 0.947, p < 0.0001). This indicates that models with higher overall accuracy generally performed better at identifying missing answers. To quantify this relationship more precisely, we performed a regression analysis. The regression yielded a statistically significant positive slope of 1.319 (95% CI: 1.136 - 1.502, p < 0.0001) as shown in Fig. 4.
The plot shows models excluding the outliers Yi 1.5 9B and Meerkat 7B. The solid line represents the linear regression fit, with the shaded area indicating the 95% confidence interval. Labeled points represent various models, while the unlabeled points from left to right are Mistral 7B, Internist 7B, and Llama 3 8B, respectively. Results are presented as mean values +/− 95% CI. Source data are provided as a Source Data file.
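This type of analysis can be reproduced in principle with standard tooling; the sketch below uses SciPy with placeholder per-model values, purely to illustrate the procedure rather than the study’s results (the significance testing in the paper was performed in GraphPad Prism).

```python
# Sketch of the correlation/regression analysis between overall accuracy and
# missing answer recall. The arrays below are placeholders, not the per-model
# results reported in the paper.
import numpy as np
from scipy import stats

accuracy = np.array([0.43, 0.50, 0.55, 0.61, 0.64, 0.73])        # placeholder per-model accuracy
missing_recall = np.array([0.05, 0.10, 0.17, 0.25, 0.30, 0.46])  # placeholder missing answer recall

r, p_two_tailed = stats.pearsonr(accuracy, missing_recall)  # two-tailed p-value by default
regression = stats.linregress(accuracy, missing_recall)     # slope, intercept, stderr, ...

print(f"Pearson r = {r:.3f}, p = {p_two_tailed:.4f}")
# Normal-approximation 95% CI on the slope (the paper reports an exact CI).
print(f"slope = {regression.slope:.3f} +/- {1.96 * regression.stderr:.3f}")
```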
Unknown analysis
We assessed the models’ ability to identify questions they could not answer, either because missing content made the question undecidable or because the question concerned fictional content not included in their training data. This metric is essential for evaluating the model’s self-awareness and its ability to avoid making potentially harmful guesses. It is calculated by dividing the number of times the model correctly identifies a question as unanswerable or outside its knowledge base by the total number of such questions. This proved to be the most challenging task for the models, with most scoring 0%. Exceptions were GPT-4o-2024-05-13, which achieved 3.7%, Yi 1.5 34B, which scored 0.6%, and Meerkat 7B, with 1.2%. The models either never used this answer choice or used it fewer than 10 times over the 1373 questions.
For this metric, regression and correlation analyses were limited because 9 out of 12 models scored 0. Although the regression analysis yielded a statistically significant slope of 0.05232 (95% CI: 0.0231 - 0.0815, p < 0.001), there was no statistically significant correlation (Pearson r = 0.574, p = 0.051). The predominance of zero scores severely limits the interpretability and practical significance of these statistical findings.
Prompt engineering analysis
To evaluate the impact of prompt engineering on metacognition, we evaluated OpenAI’s GPT-4o-2024-05-13 with a set of system prompts using the same benchmarking procedure. We started with a simple prompt describing the model’s role as a medical assistant29 and iteratively added more information about the benchmark, including that some questions can be malformed, incomplete, misleading, or beyond the model’s knowledge, ultimately arriving at a prompt that describes all of the tricks present in the benchmark.
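The sketch below illustrates the kind of progressive disclosure used in this iterative process. These strings are hypothetical and are not the prompts used in the study, which are listed in Table 5.

```python
# Illustrative (hypothetical) progression of system prompts, from a plain
# medical-assistant role to one that explicitly discloses the benchmark's
# traps. These are NOT the actual prompts from Table 5; they only sketch the
# iterative disclosure described above.
PROMPTS = [
    "You are a medical assistant answering multiple-choice questions.",
    "You are a medical assistant. Some questions may be ambiguous; answer carefully.",
    ("You are a medical assistant. Some questions may be malformed or incomplete; "
     "if none of the options is correct, choose the option indicating that."),
    ("You are a medical assistant. Some questions may be malformed, incomplete, "
     "misleading, or beyond your knowledge; if you cannot answer, say so and "
     "report a low confidence score."),
]
```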
A significant improvement in accuracy, high confidence accuracy, and unknown recall appeared (p < 0.0001) once the prompt explicitly informed the model that it may not be able to answer some questions, as shown in Table 4. Missing answer recall improved when the prompt explicitly informed the model that the correct answer might not be present in the choices, but this improvement was not statistically significant (p = 0.07). Interestingly, providing the complete benchmark design instructions did not improve performance compared to baseline, except for unknown recall, and underperformed compared to the explicit prompts. We also observed that the model failed to use medium and low confidence appropriately when given additional instructions in the system prompt, but high confidence accuracy was either similar to or higher than baseline.
Discussion
The accuracy results highlighted a clear correlation of model size and release date with performance. Larger and newer models, such as GPT-4o and Qwen2-72B, consistently outperformed their smaller and older counterparts. This trend suggests that advancements in model architecture and training techniques contribute significantly to improved accuracy. However, the notably poor performance of certain models like Yi 1.5-9B, despite being relatively recent, indicates that model optimization and specific training datasets also play crucial roles. Additional medical training improved accuracy for both medical models and improved the ability to detect missing answers for Meerkat 7B, which could be explained by the inclusion of questions with five choices and a wider range of questions in its training dataset.
In terms of high confidence accuracy, only three models demonstrated the ability to vary their confidence levels effectively. GPT-4o stood out in this regard, showing a robust capacity to provide higher accuracy when highly confident compared to answers with lower confidence scores. This capability is crucial in clinical settings, where high-confidence decisions need to be reliable to ensure patient safety. The limited use of low confidence scores by most models suggests a tendency toward overconfidence, which could pose risks if models are used in clinical practice without appropriate checks. These findings reinforce previous research recommendations on mitigating healthcare data biases in machine learning30, identifying a probable training data bias that predisposes models to provide confident answers in most scenarios, even when a more cautious response is warranted.
The recall of “None of the above” answers revealed significant differences in how models handle uncertainty. Models like Yi 1.5-9B frequently selected this option, inflating their recall scores at the expense of accuracy. Conversely, models that rarely chose this option might be overly confident, missing opportunities to acknowledge when none of the given answers are correct. This behavior underscores the need for more sophisticated mechanisms within models to handle uncertainty and ambiguity. The “unknown recall” metric, assessing the ability to recognize unanswerable questions, showed poor performance across all models, highlighting a fundamental limitation in current LLMs’ metacognitive abilities. This inability to reliably indicate when they lack sufficient information or knowledge suggests a risk of generating misleading or incorrect information, which could have serious implications in clinical applications.
While the ability of models such as GPT-4o to reliably indicate high-confidence answers suggests potential for clinical decision support, the tendency toward overconfidence among many models underscores the need for enhancements in expressing and managing uncertainty. This significant gap in current LLMs’ ability to recognize and acknowledge their knowledge limitations is critical for preventing the dissemination of incorrect or potentially harmful information in clinical contexts, ensuring that LLMs do not overstep their capabilities.
The absence of metacognitive capabilities in LLMs raises questions about whether such capabilities should be expected. Comprehensive models of cognition, such as the transtheoretical model31, incorporate external factors, including interactions with team members or databases, which could be implemented for LLMs with Retrieval Augmented Generation31. While external factors might partially compensate for the lack of internal metacognition, this approach presents limitations. Although it aligns with human oversight requirements in healthcare, it may not fully address the complexity required in LLM-based systems for critical decision-making. For instance, a summarization agent retrieving patient record information might fail to recognize incomplete contextual data, potentially generating inaccurate summaries. The limited access to external tools, along with their imperfections, raises concerns about relying solely on such tools to prevent errors stemming from metacognitive deficits.
Effective diagnostic reasoning necessitates a synergistic application of both pattern recognition (System 1) and deliberate analytical processes (System 2), particularly when experience alone proves insufficient. Clinicians adeptly employ these cognitive strategies concurrently, selecting the most appropriate approach based on their expertise32. Crucially, the ability to recognize knowledge limitations enables clinicians to dynamically shift between these cognitive strategies33. Beyond internal processes, clinicians may also leverage external resources, such as clinical guidelines or second opinions, to inform their decision-making34. The clinical decision-making process is inherently complex, demanding not only medical competence but also a profound understanding of one’s own reasoning to strike a delicate balance between caution and confidence. Cognitive errors are an important source of diagnostic error35, and methods such as reflective medical practice36, which helps clinicians enhance their ability to navigate complex cases37, or debiasing through feedback to identify and correct cognitive biases38 can help in reducing the number of cognitive errors. Current LLMs, despite their capabilities, exhibit overconfidence and deficiencies in recognizing their limitations, making them unlikely to appropriately employ these nuanced strategies. Moreover, their fixed nature and the challenges associated with providing meaningful feedback leave minimal room for improvement. While we observed some enhancements in metacognitive task performance through prompt engineering with GPT-4o, these improvements remain constrained. Prompts had to explicitly inform the LLM of potential biases and dangers, necessitating an exhaustive, and impractical, list of all potential pitfalls for real-world applications. Consequently, we argue that metacognition should be considered a fundamental capability for LLMs, particularly in critical domains such as healthcare. This emphasis on metacognitive abilities would enable AI systems to more closely emulate the sophisticated reasoning processes employed by human clinicians, potentially leading to more reliable and trustworthy AI-assisted diagnostic tools.
Potential improvements in terms of metacognitive abilities could be made through the generation of synthetic data using the prompt engineering techniques demonstrated. By creating diverse scenarios that explicitly require metacognitive skills—such as recognizing knowledge limitations and assessing confidence levels—LLMs could be fine-tuned to better align with expected metacognitive behaviors. This approach could involve synthesizing clinical scenarios that reflect the multifaceted nature of decision-making, incorporating elements from comprehensive cognitive models. While this presents a promising direction for future work, it remains crucial to consider the challenges of ensuring data quality, avoiding new biases, and validating that improvements translate effectively to real-world clinical scenarios.
Regarding benchmark and methodology limitations, the MedQA benchmark, even with our modifications, may not fully capture the complexity and variability of real-world clinical scenarios. While we aimed to enhance the benchmark by including questions designed to test metacognitive capabilities, the controlled nature of multiple-choice questions cannot replicate the nuanced decision-making processes required in clinical practice. Nevertheless, our benchmark modifications are a significant step toward assessing metacognitive abilities, providing a foundational evaluation that can be built upon in future studies with more complex and realistic scenarios. In addition, the manual modifications and audits we performed, although thorough, are subject to human error and interpretation biases. The selection and modification of questions, as well as the auditing process, could have introduced subjective biases affecting the outcomes of our evaluations. Despite this, the systematic approach and open access to our modifications ensure that our findings remain reliable and reproducible, providing a clear methodology for subsequent studies to enhance and validate further.
The reliance on multiple-choice questions for LLM evaluation presents limitations in assessing cognitive capabilities, particularly in reasoning tasks. Recent studies on GPT-4V’s performance on medical multiple-choice questions demonstrated that despite the impressive results of models on multiple-choice questions, the rationale behind correct answers is flawed in a significant percentage of cases39. Another analysis of GPT-4’s errors on the USMLE demonstrated that most errors are either caused by an anchoring bias or incorrect conclusions40. These findings emphasize the limits of multiple-choice to assess cognitive capabilities, especially in reasoning tasks. To address these shortcomings, future research should explore alternative assessment methods, such as key-feature questions. Unlike conventional multiple-choice questions, key feature assessments target critical problem-solving steps, thereby evaluating the ability to apply knowledge in practical scenarios. Validated across all levels of medical training and practice41, key features could offer a promising approach for more accurately assessing the decision-making processes of LLMs in clinical tasks. This method may provide valuable insights into LLMs’ cognitive abilities that are not captured by traditional multiple-choice assessments.
In terms of metrics and evaluation limitations, while we implemented a confidence scoring system to capture models’ confidence levels on a scale from 1 to 5, this may not fully represent the nuanced levels of certainty a model might have. In addition, the tendency of models to avoid low confidence scores suggests a potential bias towards overconfidence. Despite these limitations, the confidence scoring system provides an essential dimension of evaluation, highlighting areas where models exhibit confidence misalignment, which is crucial for understanding and improving their deployment in clinical settings. The final metrics, including confidence accuracy, missing answer recall, and unknown recall, are designed to provide a comprehensive assessment but may not capture all aspects of model performance and safety. These metrics serve as proxies for complex behaviors that might manifest differently in real-world applications. Nonetheless, they offer a structured approach to evaluating critical aspects of LLM performance, forming a robust basis for future refinement and development of more sophisticated metrics.
Considering model selection and access limitations, this work focused on a limited set of LLMs available and popular as of June 2024. This temporal limitation means the findings may not be fully generalizable to future models or those trained with different objectives and datasets. However, the trends and correlations observed, such as the impact of model size and recentness, are likely to remain relevant as guiding principles for future LLM development and evaluation. The proprietary nature of some models, such as OpenAI’s GPT-4o, limits our insight into their training data and methodologies. This constraint could influence their performance and the interpretation of our results. Yet, the inclusion of both proprietary and open-weight models allows for a broader assessment, demonstrating that our findings are not confined to a single type of model but rather indicative of general trends in LLM performance and metacognitive abilities.
Lastly, regarding theoretical framework limitations, the reliance on the Dual Process Theory (DPT)42 may not accurately represent the cognition processes involved in clinical decision-making. More comprehensive theories of cognition, such as the transtheoretical model, while including the DPT, also incorporate additional layers such as embodied cognition through sensory input or situated cognition representing the interactions between individuals and their environment. These additional layers provide a more holistic view of clinical reasoning, acknowledging the complex interplay between internal cognitive processes and external factors. When applying these theories to LLMs, we encounter significant limitations. Considering LLMs have restricted access to external cognitive processes, we argue that internal cognition processes from the DPT must compensate. This compensation, however, may not fully replicate the richness of human clinical reasoning. Our work investigates System 1 thinking exclusively, which involves rapid, intuitive decision-making. While additional experiments involving System 2 should be conducted to improve our understanding of LLM cognition, it’s important to note that studies have shown that switching to System 2 may not always reduce reasoning errors in humans43. LLMs also appear to suffer from a similar shortcoming and fail to self-correct when their reasoning is faulty44. Therefore, investigating System 1 exclusively appears to be an important initial step towards understanding the limitations in cognitive capabilities for clinical decision-making of LLMs. Future research could explore ways to incorporate aspects of System 2 thinking and elements of the transtheoretical model into LLM-based clinical decision support systems, potentially bridging the gap between current LLM capabilities and the complex, multifaceted nature of human clinical reasoning.
In conclusion, these results suggest that current LLMs, despite high accuracy on certain tasks, lack essential capabilities for safe deployment in clinical settings. The discrepancy between performance on standard questions and metacognitive tasks highlights a critical area for improvement in LLM development. This gap raises concerns about a form of deceptive expertise, where systems appear knowledgeable but fail to recognize their own limitations. Future research should focus on enhancing LLMs’ ability to recognize uncertainty and knowledge gaps, as well as developing robust evaluation metrics that better reflect the complexities of clinical reasoning.
Methods
Benchmark procedure
We used Python 3.12 and Guidance, a Python library designed to enforce model adherence to specific instructions through constrained decoding45, ensuring the models selected only from the allowed choices (A/B/C/D/E/F) and provided confidence scores (1/2/3/4/5)46.
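The sketch below illustrates constrained decoding of an answer letter and a confidence score with Guidance. It is a minimal sketch assuming the post-2023 Guidance API (`guidance.models`, `select`); the prompt wording and model identifier are illustrative and are not the authors’ exact benchmarking code.

```python
# Minimal sketch of constrained decoding with the Guidance library, assuming
# its post-2023 API (guidance.models, guidance.select). The prompt wording and
# model identifier are illustrative; the study's actual prompts are in Table 5.
from guidance import models, select

lm = models.Transformers("mistralai/Mistral-7B-v0.1")  # one of the open-weight models evaluated

question = "A fictional clinical vignette goes here..."
choices = "A) ...\nB) ...\nC) ...\nD) ...\nE) ...\nF) ..."

lm += f"{question}\n{choices}\nAnswer: "
lm += select(["A", "B", "C", "D", "E", "F"], name="answer")       # restrict to valid letters
lm += "\nConfidence (1-5): "
lm += select(["1", "2", "3", "4", "5"], name="confidence")        # restrict to valid scores

print(lm["answer"], lm["confidence"])
```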
We included both proprietary and open-weight models in our evaluation. For proprietary models, we tested OpenAI’s GPT-4o-2024-05-1347 and GPT-3.5-turbo-012548. For open-weight models, we selected the most popular foundational models from the HuggingFace trending text generation model list as of June 2024, including Mixtral-8x7B-v0.149, Mistral-7B-v0.150, Yi-1.5-9B, Yi-1.5-34B51, Meta-Llama-3-8B, Meta-Llama-3-70B52, Qwen2-7B, and Qwen2-72B53. In addition, we evaluated two medical models based on Mistral-7B-v0.1, namely meerkat-7b-v1.054 and internistai/base-7b-v0.255, to determine if additional medical training enhances metacognitive abilities. All models were evaluated with a temperature setting of 0 to ensure reliability and reproducibility of results56. The open-weight model evaluations were performed on a Microsoft Azure Virtual Machine with 4 NVIDIA A100 80GB GPUs and required a total runtime of 3 hours, including setup time.
The 95% confidence interval is derived from the standard error of the mean multiplied by 1.96. Model accuracy differences were evaluated for statistical significance using p-values calculated with a one-way ANOVA in GraphPad Prism 10.1 followed by a Tukey test57,58.
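The study used GraphPad Prism for these tests; the following is a rough Python equivalent with placeholder per-question scores, shown only to illustrate the SEM-based 95% CI, the one-way ANOVA, and Tukey’s post-hoc comparison.

```python
# Rough Python equivalent of the statistical workflow (the paper used GraphPad
# Prism). Placeholder per-question scores (1 = correct, 0 = incorrect) are
# generated for three hypothetical models.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
scores = {name: rng.binomial(1, p, size=1373)  # placeholder accuracy levels
          for name, p in [("model_a", 0.45), ("model_b", 0.64), ("model_c", 0.73)]}

for name, s in scores.items():
    sem = s.std(ddof=1) / np.sqrt(len(s))
    print(f"{name}: mean={s.mean():.3f}, 95% CI ~ +/-{1.96 * sem:.3f}")  # SEM x 1.96

f_stat, p_value = stats.f_oneway(*scores.values())  # one-way ANOVA across models
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3g}")

values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), 1373)
print(pairwise_tukeyhsd(values, groups))             # Tukey multiple-comparison test
```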
Prompt engineering
The iterative process was designed to reveal information progressively, first implicitly and finally explicitly. The complete list of prompts is shown in Table 5. The statistical significance of differences between the prompts and the baseline were assessed using a one-way ANOVA in GraphPad Prism 10.1, followed by Fisher’s least significant difference test59 for post-hoc comparisons.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The MetaMedQA data generated in this study have been deposited in the HuggingFace database under accession code datasets/maximegmd/MetaMedQA https://doi.org/10.57967/hf/3547. The modified MedQA-USMLE data generated in this study have been deposited in the HuggingFace database under accession code datasets/maximegmd/MedQA-USMLE-4-options-clean https://doi.org/10.57967/hf/3546. The evaluation results data generated in this study are provided in the Source Data file. The MedQA-USMLE data used in this study are available in the HuggingFace database under accession code datasets/GBaker/MedQA-USMLE-4-options https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options. Source data are provided in this paper.
Code availability
The benchmarking code is available on GitHub alongside instructions to run the benchmark https://doi.org/10.5281/zenodo.14177940.
References
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Freiwald, T., Salimi, M., Khaljani, E. & Harendza, S. Pattern recognition as a concept for multiple-choice questions in a national licensing exam. BMC Med. Educ. 14, 232 (2014).
Barile, J. et al. Diagnostic accuracy of a large language model in pediatric case studies. JAMA Pediatr. 178, 313–315 (2024).
Rydzewski, N. R. et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI 1, AIoa2300151 (2024).
Mihalache, A., Popovic, M. M. & Muni, R. H. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 141, 589–597 (2023).
Bhayana, R., Bleakney, R. R. & Krishna, S. GPT-4 in Radiology: Improvements in advanced reasoning. Radiology 307, e230987 (2023).
Humar, P., Asaad, M., Bengur, F. B. & Nguyen, V. ChatGPT is equivalent to first-year plastic surgery residents: Evaluation of chatGPT on the plastic surgery in-service examination. Aesthet. Surg. J. 43, NP1085–NP1089 (2023).
Katz, U. et al. GPT versus resident physicians — A benchmark based on official board scores. NEJM AI 1, AIdbp2300192 (2024).
Soroush, A. et al. Large language models are poor medical coders — benchmarking of medical code querying. NEJM AI 1, AIdbp2300040 (2024).
Giuffrè, M., You, K. & Shung, D. L. Evaluating chatGPT in medical contexts: The imperative to guard against hallucinations and partial accuracies. Clin. Gastroenterol. Hepatol. 22, 1145–1146 (2024).
Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 29, 2396–2398 (2023).
Ahmad M. A., Yaramis I., Roy T. D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. Preprint at https://doi.org/10.48550/arXiv.2311.01463 (2023).
Busch, F. et al. Systematic review of large language models for patient care: Current applications and challenges. Preprint at https://doi.org/10.1101/2024.03.04.24303733 (2024).
Adatrao N. S. K., Gadireddy G. R., Noh J. A Survey on Conversational Search and Applications in Biomedicine. In Proceedings of the 2023 ACM Southeast Conference. (2023).
Weiser B., Schweber N. The ChatGPT Lawyer Explains Himself. The New York Times. https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html (2024).
OECD. Declaration on Building Better Policies for More Resilient Health Systems. (2024).
Schoonbeek, R. et al. Completeness, correctness and conciseness of physician-written versus large language model generated patient summaries integrated in electronic health records. Preprint at https://doi.org/10.2139/ssrn.4835935 (2024).
Preiksaitis, C. et al. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med. Inf. https://doi.org/10.2196/53787 (2024).
Bajwa, J., Munir, U., Nori, A. & Williams, B. Artificial intelligence in healthcare: transforming the practice of medicine. Future Health. J. 8, e188–e194 (2021).
Pagallo, U. et al. The underuse of AI in the health sector: Opportunity costs, success stories, risks and recommendations. Health Technol. 14, 1–14 (2024).
American Medical Association. Physician sentiments around the use of AI in health care: motivations, opportunities, risks, and use cases. https://www.ama-assn.org/system/files/physician-ai-sentiment-report.pdf (2023).
Stöger, K., Schneeberger, D. & Holzinger, A. Medical artificial intelligence: the European legal perspective. Commun. ACM 64, 34–36 (2021).
Gonullu, I. & Artar, M. Metacognition in medical education. Educ. Health 27, 225 (2014).
Griot, M., Hemptinne, C., Vanderdonckt, J., Yuksel, D. MetaMedQA. https://doi.org/10.57967/HF/3547 (2024).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
National Board of Medical Examiners. United States Medical Licensing Examination. https://www.usmle.org/ (2023).
Griot, M., Vanderdonckt, J., Yuksel, D. & Hemptinne, C. Multiple choice questions and large languages models: A case study with fictional medical data. Preprint at https://doi.org/10.48550/arXiv.2406.02394 (2024).
Saab, K. et al. Capabilities of gemini models in medicine. Preprint at https://doi.org/10.48550/arXiv.2404.18416 (2024).
OpenAI. Prompt Engineering. https://platform.openai.com/docs/guides/prompt-engineering (2024).
Ghassemi, M., Nsoesie, E. O. In medicine, how do we machine learn anything real? Patterns https://doi.org/10.1016/j.patter.2021.100392 (2024).
Parsons, A. S. et al. Beyond thinking fast and slow: Implications of a transtheoretical model of clinical reasoning and error on teaching, assessment, and research. Med. Teach. 1–12 https://doi.org/10.1080/0142159x.2024.2359963 (2024).
Bowen, J. L. Educational strategies to promote clinical diagnostic reasoning. N. Engl. J. Med. 355, 2217–2225 (2006).
Zwaan, L., Hautz, W. E. Bridging the gap between uncertainty, confidence and diagnostic accuracy: calibration is key. BMJ Qual. Saf. 28, 352–355 (2019).
Elstein, A. S. Thinking about diagnostic thinking: a 30-year perspective. Adv. Health Sci. Educ. 14, 7–18 (2009).
Croskerry, P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad. Med. 78, 775 (2003).
Mamede, S. & Schmidt, H. G. The structure of reflective practice in medicine. Med. Educ. 38, 1302–1308 (2004).
Mamede, S., Schmidt, H. G. & Penaforte, J. C. Effects of reflective practice on the accuracy of medical diagnoses. Med. Educ. 42, 468–475 (2008).
Diagnosing Diagnosis Errors: Lessons from a Multi-institutional Collaborative Project. in Advances in Patient Safety: From Research to Implementation (Volume 2: Concepts and Methodology) - NCBI Bookshelf. (2024).
Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. Npj Digit. Med. 7, 1–6 (2024).
Roy, S. et al. Beyond accuracy: Investigating error types in GPT-4 responses to USMLE questions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. (2024).
Bordage, G. & Page, G. The key-features approach to assess clinical decisions: validity evidence to date. Adv. Health Sci. Educ. 23, 1005–1036 (2018).
Bellini-Leite, S. C. Dual Process Theory: Embodied and Predictive; Symbolic and Classical. Front. Psychol. 13, https://doi.org/10.3389/fpsyg.2022.805386 (2022).
Norman, G. R. et al. The causes of errors in clinical reasoning: cognitive biases, knowledge deficits, and dual process thinking. Acad. Med. 92, 23–30 (2017).
Huang, J. et al. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations. (2024).
guidance-ai/guidance: A guidance language for controlling large language models. https://github.com/guidance-ai/guidance (2023).
Griot, M., Hemptinne, C., Vanderdonckt, J., Yuksel, D. MetaMedQA benchmark code. https://doi.org/10.5281/zenodo.14177940 (2024).
OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/ (2024).
OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt (2023).
Jiang, A. Q. et al. Mixtral of experts. Preprint at https://doi.org/10.48550/arXiv.2401.04088 (2024).
Jiang, A. Q. et al. Mistral 7B. Preprint at https://doi.org/10.48550/arXiv.2310.06825 (2023).
AI, 01. et al. Yi: Open foundation models by 01.AI. Preprint at https://doi.org/10.48550/arXiv.2403.04652 (2024).
Dubey, A. et al. The Llama 3 herd of models. Preprint at https://doi.org/10.48550/arXiv.2407.21783 (2024).
Yang, A. et al. Qwen2 Technical report. Preprint at https://doi.org/10.48550/arXiv.2407.10671 (2024).
Kim, H. et al. Small language models learn enhanced reasoning skills from medical textbooks. Preprint at https://doi.org/10.48550/arXiv.2404.00376 (2024).
Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Impact of high-quality, mixed-domain data on the performance of medical language models. J. Am. Med. Inform. Assoc. 31, 1875–1883 (2024).
Ronanki, K., Cabrero-Daniel, B., Horkoff, J. & Berger, C. Requirements engineering using generative AI: prompts and prompting patterns. In Generative AI for Effective Software Development. (Springer, Cham, 2024).
Home - GraphPad. https://www.graphpad.com/ (2023).
GraphPad Prism 10 Statistics Guide - Tukey and Dunnett methods. https://www.graphpad.com/guides/prism/latest/statistics/stat_the_methods_of_tukey_and_dunne.htm (2024).
Williams, L. J., Abdi, H. in Encyclopedia of Research Design. (2010).
Acknowledgements
This work was supported by the Fondation Saint-Luc grant number 467E and the Fédération Wallonie-Bruxelles through the Fond Spécial de Recherche of Université Catholique de Louvain.
Author information
Authors and Affiliations
Contributions
M.G. had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. M.G. was involved in the concept and design of the study, drafted the manuscript, and performed the statistical analysis. C.H. assisted in the acquisition, analysis, and interpretation of data, played a key role in administrative, technical, or material support for the study, and participated in the critical revision of the manuscript. J.V. contributed to the acquisition, analysis, and interpretation of data, provided critical input during the revision of the manuscript for important intellectual content, contributed to the statistical analysis, and provided supervision. D.Y. obtained funding for the study, was instrumental in providing administrative, technical, or material support, supervised various aspects of the project, and contributed to the critical revision of the manuscript for important intellectual content. Each author has reviewed the manuscript, provided critical feedback, and approved the final version to be published. They agree to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Leo Anthony Celi, Stephen Gilbert, and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Griot, M., Hemptinne, C., Vanderdonckt, J. et al. Large Language Models lack essential metacognition for reliable medical reasoning. Nat Commun 16, 642 (2025). https://doi.org/10.1038/s41467-024-55628-6
DOI: https://doi.org/10.1038/s41467-024-55628-6