Introduction

Large Language Models (LLMs) have emerged as transformative tools across various industries, including healthcare. The rapid development and deployment of these models present a stark contrast to the lengthy timelines required for clinical studies, necessitating the development of automated evaluation methods. Traditionally, these evaluations rely on multiple-choice questions encompassing a range of topics, from biochemistry to clinical decision-making, and are benchmarked using standardized tests such as MultiMedQA1. While these methods allow for the swift assessment of model performance, they are primarily limited to evaluating pattern recognition and information recall2.

Recent efforts have included the use of official board examinations to evaluate LLM performance across different medical specialties such as pediatrics3, oncology4, ophthalmology5, radiology6, or plastic surgery7, often demonstrating that these models can perform at a level comparable to medical professionals8. However, such testing methodologies are inherently limited. They focus predominantly on accuracy in answering specific questions, without adequately addressing the critical aspects of model safety and the potential for generating erroneous or misleading information. For instance, studies examining specific tasks, such as ICD (International Classification of Diseases) coding9, have revealed significant performance deficiencies, underscoring the need for more comprehensive evaluation frameworks10,11.

In addition, LLMs’ integration into high-stakes environments has been met with resistance and skepticism due to hallucinations and the difficulty of reducing or detecting them12,13,14. For instance, a lawyer used ChatGPT to assist in a case, but the model hallucinated citations that did not exist15. These intrinsic limitations of transformer-based LLMs raise the question of whether other, more subtle limitations carry similar safety consequences.

The challenges posed by LLMs are emblematic of broader issues in the application of Artificial Intelligence (AI) in healthcare. AI offers immense potential to address the shortage of healthcare workers16, promises to reduce clerical work17,18, and has already demonstrated uses in precision diagnostics and therapeutics19. However, AI also introduces significant challenges due to its nature as a probabilistic black box that often lacks transparency, explicability, and interpretability20. This opacity engenders trust issues among healthcare providers and creates substantial barriers to meeting regulatory requirements for clinical deployment21. Consequently, these challenges slow down or postpone the adoption of AI technologies that could otherwise dramatically improve patient outcomes and optimize clinical workflows.

Regulatory frameworks from governing bodies such as the European Union provide more clarity on the expectations of such systems. For instance, the European Union’s approach aims to balance innovation with safety and ethical considerations, requiring AI systems in healthcare to be transparent, accountable, and subject to human oversight22. To address these concerns and improve interpretability and transparency, we propose investigating a crucial but underexplored area: the assessment of LLMs’ metacognition.

Metacognition in AI systems can be split into two categories23: knowledge of cognition and regulation of cognition. Knowledge of cognition encompasses awareness of one’s own cognitive processes, such as identifying biases. Regulation of cognition refers to skills for managing one’s learning process, including self-evaluation and monitoring. In healthcare, these abilities are crucial for professionals to handle complex, uncertain situations and continuously improve their practice. Understanding whether LLMs can gauge their knowledge and handle uncertainty is essential for their safe integration into clinical environments.

To assess metacognition, we introduce MetaMedQA24, an extension and modification of the MedQA-USMLE benchmark25, designed to evaluate LLMs’ metacognition on medical problems. Our enhanced benchmark employs techniques such as confidence scoring and uncertainty quantification to assess not only the accuracy of LLMs but also their capacity for self-assessment and identification of knowledge gaps. This approach aims to provide a holistic evaluation framework that aligns more closely with the practical demands of clinical settings, ensuring that the deployment of LLMs in healthcare can be both safe and effective. Moreover, the implications of this research extend beyond healthcare, potentially informing the development and evaluation of AI systems in other high-stakes domains where self-awareness and accurate self-assessment are critical.

In this work, we show that current LLMs demonstrate significant limitations in metacognitive abilities crucial for clinical decision-making. Our results reveal that while larger and newer models generally outperform their smaller and older counterparts in accuracy, most models exhibit poor performance in recognizing unanswerable questions and managing uncertainty. Notably, only three models, with GPT-4o standing out, effectively vary their confidence levels. We find that LLMs’ tendency towards overconfidence and inability to recognize knowledge gaps pose potential risks in clinical applications. These findings underscore the need for developing more sophisticated mechanisms within LLMs to handle uncertainty and ambiguity, as well as the importance of evolving benchmarks and evaluation metrics that capture the complexities of clinical reasoning.

Results

Benchmark creation and preprocessing

To evaluate the metacognitive abilities of LLMs in medical contexts, we based our assessment on MedQA-USMLE, a subset of MedQA, because the other benchmarks included in MultiMedQA lack both quality and clinical relevance. This benchmark is composed of clinical vignettes accompanied by four answer choices, with only one correct answer26.

We modified the MedQA-USMLE benchmark in three steps to create MetaMedQA as shown in Fig. 1:

  1. Inclusion of Fictional Questions: To test the models’ capabilities in recognizing their knowledge gaps, we included 100 questions from the Glianorex benchmark27, which is constructed in the format of MedQA-USMLE but pertains to a fictional organ. Examples of these questions are presented in Table 1.

    Table 1 Examples of fictional questions from Glianorex English and malformed questions from MedQA-USMLE that cannot be answered due to missing information
  2. Identification of Malformed Questions: Following Google’s observation that a small percentage of questions may be malformed28, we manually audited the benchmark and identified 55 questions that either relied on missing media or lacked necessary information. Examples of such questions are provided in Table 1.

  3. Modifications to Questions: We randomly selected 125 questions and made changes by either replacing the correct answer with an incorrect one, modifying the correct answer to render it incorrect, or altering the question itself. Examples of these modifications are presented in Table 2.

    Table 2 Examples of questions after modification; the original content is shown with strikethrough text, and the replacement is bolded
Fig. 1: Flow chart description of the MetaMedQA dataset construction.

Starting from the original MedQA and Glianorex English benchmarks, we obtain a benchmark with 1096 questions retaining their original answers, 115 questions whose correct answer is “None of the above”, and 162 questions whose correct answer is “I don’t know or cannot answer”.

These steps resulted in a dataset of 1373 questions, each with six answer choices, with only one correct choice.
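A minimal sketch of how a MetaMedQA-style item could be assembled is shown below; the helper and field names are illustrative assumptions, not the released preprocessing code.

```python
# Illustrative helper (not the released preprocessing code) showing how a
# MedQA-USMLE or Glianorex item could be extended to six answer choices.
EXTRA_CHOICES = ["None of the above", "I don't know or cannot answer"]

def to_metamedqa_item(question, choices, gold, relabel=None):
    """Append the two extra options; `relabel` overrides the gold answer,
    e.g. "I don't know or cannot answer" for fictional or malformed questions,
    or "None of the above" when the correct choice was removed or altered."""
    return {
        "question": question,
        "options": list(choices) + EXTRA_CHOICES,
        "answer": relabel if relabel is not None else gold,
    }

# Hypothetical usage: a question relying on missing media becomes unanswerable.
item = to_metamedqa_item(
    "Interpret the attached ECG strip ...",  # the media is absent from the text
    ["Atrial fibrillation", "Sinus tachycardia", "Complete heart block", "Normal sinus rhythm"],
    gold="Atrial fibrillation",
    relabel="I don't know or cannot answer",
)
```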

Overall accuracy

The results obtained by the different models correlate with their size and release date; larger and more recent models achieved higher accuracy than their smaller and older counterparts, as shown in Fig. 2. For example, Qwen2 72B (M = 64.3%, SEM = 1.3%) is significantly more accurate (p < 0.0001) than Qwen2 7B (M = 43.9%, SEM = 1.3%) with a moderate effect size (Cohen’s d = 0.417). GPT-4o-2024-05-13 (M = 73.3%, SEM = 1.2%) is significantly more accurate than all other models (p < 0.0001), while Yi 1.5 9B (M = 29.6%, SEM = 1.2%) is significantly less accurate than all other models (p < 0.0001). The notably low performance of Yi 1.5 9B compared to similar-sized models stands out as an outlier.

Fig. 2: Accuracy of models on the MetaMedQA benchmark.

Results are presented as mean values +/− 95% CI (n = 1373). Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Yi 1.5 34b vs Meerkat 7b, p = 0.0142). Models of the same family share the same color. Source data are provided as a Source Data file.

Impact of confidence

The original MedQA-USMLE benchmark primarily focuses on accuracy to compare models. Given the additional complexities introduced by our enhanced benchmark, we introduced three new metrics that assess model accuracy as a function of the model-generated confidence score, which ranges from 1 to 5. Each metric computes the percentage of correct answers within its confidence range. This system enabled a nuanced evaluation of the model’s performance, from its most certain predictions to those where it expressed doubt, ultimately enhancing safety and decision-making in healthcare applications. The three metrics use the following rules (a minimal computation sketch follows the list):

  • High Confidence Accuracy: For responses with a confidence score of 5.

  • Medium Confidence Accuracy: For responses with scores between 3 and 4.

  • Low Confidence Accuracy: For responses with scores below 3.
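The sketch below illustrates this stratification; the per-question record format (predicted answer, gold answer, 1–5 confidence score) is an assumed illustration, not the benchmark harness itself.

```python
# Minimal sketch of the three confidence-stratified accuracy metrics.
def confidence_accuracies(records):
    buckets = {"high": [], "medium": [], "low": []}
    for r in records:
        correct = r["predicted"] == r["gold"]
        if r["confidence"] == 5:
            buckets["high"].append(correct)       # High Confidence Accuracy
        elif r["confidence"] in (3, 4):
            buckets["medium"].append(correct)     # Medium Confidence Accuracy
        else:
            buckets["low"].append(correct)        # Low Confidence Accuracy (scores 1-2)
    # Accuracy is undefined (None) for an empty bucket, e.g. for models that
    # never express low confidence.
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}
```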

We observed that most models consistently assigned a maximum confidence level of 5, rendering them unsuitable for the confidence analysis. Only GPT-3.5-turbo-0125, GPT-4o-2024-05-13, and Qwen2-72B exhibited varying confidence levels, as shown in Table 3. For these models, higher confidence levels were correlated with higher accuracy, with GPT-4o demonstrating the best ability to assess its answers accurately. The other two models, however, only provided high or medium confidence scores, never utilizing low confidence ratings.

Table 3 Analysis of the impact of confidence on the accuracy of three models on MetaMedQA including the 95% confidence interval. Source data are provided as a Source Data file

Missing answer analysis

The “Missing answer recall” metric evaluates the model’s capability to recognize when none of the provided options are correct, which is essential for ensuring accuracy in ambiguous or incomplete questions. It is calculated by dividing the number of correctly identified “None of the above” answers by the total number of questions where this was the correct answer.
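A minimal sketch of this computation, under the same assumed record format as the confidence sketch above, is:

```python
# Recall over questions whose gold answer is a given target label.
def label_recall(records, label="None of the above"):
    relevant = [r for r in records if r["gold"] == label]
    hits = sum(r["predicted"] == label for r in relevant)
    return hits / len(relevant) if relevant else None  # undefined if no such questions
```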

The recall of missing answers when the correct response is “None of the above,” as shown in Fig. 3, indicates that models struggle more with this option compared to others. The Yi 1.5 9B model, which had the lowest overall accuracy, achieved the highest score on this specific metric. This can be attributed to the model selecting “None of the above” 520 times, or 37.9% of the questions, leading to an inflated score in this area but poor performance on other metrics. A similar but less pronounced trend was observed with the Meerkat 7B model, which chose “None of the above” 295 times. Conversely, the Llama 3 8B model almost never selected this option, while the Mistral 7B and Internist 7B models never did. When examining other models, we found that larger and more recent models generally outperformed their smaller and older counterparts, mirroring the overall accuracy pattern. For instance, GPT-4o-2024-05-13 (M = 46.1%, SEM = 4.7%) is significantly more accurate (p < 0.0001) than GPT-3.5-turbo-0125 (M = 11.3%, SEM = 2.9%) with a large effect size (d = 0.826).

Fig. 3: Recall of “None of the above” of models on the MetaMedQA benchmark, including the 95% confidence interval.

Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Meerkat 7b vs Qwen2 7b, p = 0.012). Results are presented as mean values +/− 95% CI (n = 115). Models of the same family share the same color. Source data are provided as a Source Data file.

We conducted additional analyses to explore the relationship between overall accuracy and missing answer recall. After excluding the outliers Yi 1.5 9B and Meerkat 7B, which selected “None of the above” for 37.9% and 21.5% of questions respectively (greatly inflating their recall in this category), we found a strong positive correlation between these two metrics using the Pearson correlation coefficient with a two-tailed p-value (Pearson r = 0.947, p < 0.0001). This indicates that models with higher overall accuracy generally performed better at identifying missing answers. To quantify this relationship more precisely, we performed a regression analysis. The regression yielded a statistically significant positive slope of 1.319 (95% CI: 1.136–1.502, p < 0.0001), as shown in Fig. 4.
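The analyses can be sketched with SciPy as follows; the per-model values below are hypothetical placeholders for illustration, not the reported results, and treating recall as the dependent variable is an assumption.

```python
from scipy import stats

# Hypothetical per-model summaries (n = 10 after excluding the two outliers).
accuracy = [0.35, 0.42, 0.48, 0.55, 0.60, 0.64, 0.66, 0.70, 0.72, 0.73]
recall   = [0.02, 0.08, 0.12, 0.20, 0.25, 0.30, 0.33, 0.40, 0.43, 0.46]

r, p = stats.pearsonr(accuracy, recall)      # Pearson r with a two-tailed p-value
fit = stats.linregress(accuracy, recall)     # recall regressed on accuracy
ci_half_width = 1.96 * fit.stderr            # approximate 95% CI on the slope
print(f"r = {r:.3f}, p = {p:.2g}, slope = {fit.slope:.3f} ± {ci_half_width:.3f}")
```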

Fig. 4: Linear regression between missing answer recall and overall accuracy of language models (n = 10) on the MetaMedQA benchmark.

The plot shows models excluding the outliers Yi 1.5 9B and Meerkat 7B. The solid line represents the linear regression fit, with the shaded area indicating the 95% confidence interval. Labeled points represent various models, while the unlabeled points from left to right are Mistral 7B, Internist 7B, and Llama 3 8B, respectively. Results are presented as mean values +/− 95% CI. Source data are provided as a Source Data file.

Unknown analysis

We assessed the models’ ability to identify questions they could not answer, either because missing content made the question undecidable or because the question concerned fictional content not included in their training data. This metric is essential for evaluating the model’s self-awareness and its ability to avoid making potentially harmful guesses. It is calculated by dividing the number of times the model correctly identifies a question as unanswerable or outside its knowledge base by the total number of such questions. This proved to be the most challenging task for the models, with most scoring 0%. The exceptions were GPT-4o-2024-05-13, which achieved 3.7%, Yi 1.5 34B, which scored 0.6%, and Meerkat 7B, with 1.2%. The models either never used this answer choice or used it fewer than 10 times over the 1373 questions.
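Under the same assumptions as the recall sketch in the missing answer analysis, this metric is the same computation with the target label swapped:

```python
# Hypothetical reuse of the label_recall sketch defined above.
unknown_recall = label_recall(records, label="I don't know or cannot answer")
```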

For this metric, regression and correlation analyses were limited because 9 out of 12 models scored 0. Although the regression analysis yielded a statistically significant slope of 0.05232 (95% CI: 0.0231–0.0815, p < 0.001), there was no statistically significant correlation (Pearson r = 0.574, p = 0.051). The predominance of zero scores severely limits the interpretability and practical significance of these statistical findings.

Prompt engineering analysis

To evaluate the impact of prompt engineering on metacognition, we evaluated OpenAI’s GPT-4o-2024-05-13 with a set of system prompts using the same benchmarking procedure. We started with a simple prompt describing the model’s role as a medical assistant29 and iteratively added more information about the benchmark, including that some questions can be malformed, incomplete, misleading, or beyond the model’s knowledge, ultimately arriving at a prompt that describes all the tricks found in the benchmark.
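The progression can be pictured as follows; these strings are a hypothetical illustration only, and the exact prompts used are listed in Table 5.

```python
# Hypothetical illustration of the iterative prompt construction (see Table 5
# for the actual prompts): each step adds one explicit warning to the baseline.
BASELINE = "You are a medical assistant answering USMLE-style multiple-choice questions."
WITH_MISSING = BASELINE + " The correct answer may not be among the listed choices."
WITH_UNKNOWN = (
    WITH_MISSING
    + " Some questions may be malformed, incomplete, or beyond your knowledge;"
    + " acknowledge when you cannot answer."
)
```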

A significant improvement in accuracy, high confidence accuracy, and unknown recall appeared (p < 0.0001) once the prompt explicitly informed the model that it might not be able to answer some questions, as shown in Table 4. Missing answer recall improved when the prompt explicitly informed the model that the correct answer might not be present among the choices, but the improvement was not statistically significant (p = 0.07). Interestingly, providing the complete benchmark design instructions did not improve performance compared to baseline except for unknown recall, and underperformed compared to the explicit prompts. We also observed that the model failed to use mid and low confidence appropriately when given additional instructions in the system prompt, although high confidence accuracy was either similar to or higher than baseline.

Table 4 Benchmark results of GPT-4o-2024-05-13 on MetaMedQA with variations of system prompts described in Table 5

Discussion

The accuracy results highlighted a clear association between performance and both model size and release date. Larger and newer models, such as GPT-4o and Qwen2-72B, consistently outperformed their smaller and older counterparts. This trend suggests that advancements in model architecture and training techniques contribute significantly to improved accuracy. However, the notably poor performance of certain models like Yi 1.5-9B, despite being relatively recent, indicates that model optimization and specific training datasets also play crucial roles. Additional medical training improved accuracy for both medical models and improved the ability to detect missing answers for Meerkat 7B, which could be explained by the inclusion of five-choice questions and a wider range of questions in its training dataset.

In terms of high confidence accuracy, only three models demonstrated the ability to vary their confidence levels effectively. GPT-4o stood out in this regard, showing a robust capacity to provide higher accuracy when highly confident compared to answers with lower confidence scores. This capability is crucial in clinical settings, where high-confidence decisions need to be reliable to ensure patient safety. The limited use of low confidence scores by most models suggests a tendency toward overconfidence, which could pose risks if models are used in clinical practice without appropriate checks. These findings reinforce previous research recommendations on mitigating healthcare data biases in machine learning30, identifying a probable training data bias that predisposes models to provide confident answers in most scenarios, even when a more cautious response is warranted.

The recall of “None of the above” answers revealed significant differences in how models handle uncertainty. Models like Yi 1.5-9B frequently selected this option, inflating their recall scores at the expense of accuracy. Conversely, models that rarely chose this option might be overly confident, missing opportunities to acknowledge when none of the given answers are correct. This behavior underscores the need for more sophisticated mechanisms within models to handle uncertainty and ambiguity. The “unknown recall” metric, assessing the ability to recognize unanswerable questions, showed poor performance across all models, highlighting a fundamental limitation in current LLMs’ metacognitive abilities. This inability to reliably indicate when they lack sufficient information or knowledge suggests a risk of generating misleading or incorrect information, which could have serious implications in clinical applications.

While the ability of models such as GPT-4o to reliably indicate high-confidence answers suggests potential for clinical decision support, the tendency toward overconfidence among many models underscores the need for enhancements in expressing and managing uncertainty. Closing this significant gap in current LLMs’ ability to recognize and acknowledge their knowledge limitations is critical for preventing the dissemination of incorrect or potentially harmful information in clinical contexts and for ensuring that LLMs do not overstep their capabilities.

The absence of metacognitive capabilities in LLMs raises questions about whether such capabilities should be expected. Comprehensive models of cognition, such as the transtheoretical model31, incorporate external factors, including interactions with team members or databases, which could be implemented for LLMs with Retrieval-Augmented Generation31. While external factors might partially compensate for the lack of internal metacognition, this approach presents limitations. Although it aligns with human oversight requirements in healthcare, it may not fully address the complexity required in LLM-based systems for critical decision-making. For instance, a summarization agent retrieving patient record information might fail to recognize incomplete contextual data, potentially generating inaccurate summaries. The limited access to external tools, along with their imperfections, raises concerns about relying solely on such tools to prevent errors stemming from metacognitive deficits.

Effective diagnostic reasoning necessitates a synergistic application of both pattern recognition (System 1) and deliberate analytical processes (System 2), particularly when experience alone proves insufficient. Clinicians adeptly employ these cognitive strategies concurrently, selecting the most appropriate approach based on their expertise32. Crucially, the ability to recognize knowledge limitations enables clinicians to dynamically shift between these cognitive strategies33. Beyond internal processes, clinicians may also leverage external resources, such as clinical guidelines or second opinions, to inform their decision-making34. The clinical decision-making process is inherently complex, demanding not only medical competence but also a profound understanding of one’s own reasoning to strike a delicate balance between caution and confidence. Cognitive errors are an important source of diagnostic error35, and methods such as reflective medical practice36, which helps clinicians enhance their ability to navigate complex cases37, or debiasing through feedback to identify and correct cognitive biases38, can help in reducing the number of cognitive errors. Current LLMs, despite their capabilities, exhibit overconfidence and deficiencies in recognizing their limitations, making them unlikely to appropriately employ these nuanced strategies. Moreover, their fixed nature and the challenges associated with providing meaningful feedback leave minimal room for improvement. While we observed some enhancements in metacognitive task performance through prompt engineering with GPT-4o, these improvements remain constrained. Prompts had to explicitly inform the LLM of potential biases and dangers, necessitating an exhaustive—and impractical—list of all potential pitfalls for real-world applications. Consequently, we argue that metacognition should be considered a fundamental capability for LLMs, particularly in critical domains such as healthcare. This emphasis on metacognitive abilities would enable AI systems to more closely emulate the sophisticated reasoning processes employed by human clinicians, potentially leading to more reliable and trustworthy AI-assisted diagnostic tools.

Potential improvements in terms of metacognitive abilities could be made through the generation of synthetic data using the prompt engineering techniques demonstrated. By creating diverse scenarios that explicitly require metacognitive skills—such as recognizing knowledge limitations and assessing confidence levels—LLMs could be fine-tuned to better align with expected metacognitive behaviors. This approach could involve synthesizing clinical scenarios that reflect the multifaceted nature of decision-making, incorporating elements from comprehensive cognitive models. While this presents a promising direction for future work, it remains crucial to consider the challenges of ensuring data quality, avoiding new biases, and validating that improvements translate effectively to real-world clinical scenarios.

Regarding benchmark and methodology limitations, the MedQA benchmark, even with our modifications, may not fully capture the complexity and variability of real-world clinical scenarios. While we aimed to enhance the benchmark by including questions designed to test metacognitive capabilities, the controlled nature of multiple-choice questions cannot replicate the nuanced decision-making processes required in clinical practice. Nevertheless, our benchmark modifications are a significant step toward assessing metacognitive abilities, providing a foundational evaluation that can be built upon in future studies with more complex and realistic scenarios. In addition, the manual modifications and audits we performed, although thorough, are subject to human error and interpretation biases. The selection and modification of questions, as well as the auditing process, could have introduced subjective biases affecting the outcomes of our evaluations. Despite this, the systematic approach and open access to our modifications ensure that our findings remain reliable and reproducible, providing a clear methodology for subsequent studies to enhance and validate further.

The reliance on multiple-choice questions for LLM evaluation presents limitations in assessing cognitive capabilities, particularly in reasoning tasks. Recent studies of GPT-4V’s performance on medical multiple-choice questions demonstrated that, despite impressive scores, the rationale behind correct answers is flawed in a significant percentage of cases39. Another analysis of GPT-4’s errors on the USMLE demonstrated that most errors are caused by either an anchoring bias or incorrect conclusions40. These findings emphasize the limits of multiple-choice formats for assessing cognitive capabilities, especially in reasoning tasks. To address these shortcomings, future research should explore alternative assessment methods, such as key-feature questions. Unlike conventional multiple-choice questions, key-feature assessments target critical problem-solving steps, thereby evaluating the ability to apply knowledge in practical scenarios. Validated across all levels of medical training and practice41, key-feature questions could offer a promising approach for more accurately assessing the decision-making processes of LLMs in clinical tasks. This method may provide valuable insights into LLMs’ cognitive abilities that are not captured by traditional multiple-choice assessments.

In terms of metrics and evaluation limitations, while we implemented a confidence scoring system to capture models’ confidence levels on a scale from 1 to 5, this may not fully represent the nuanced levels of certainty a model might have. In addition, the tendency of models to avoid low confidence scores suggests a potential bias towards overconfidence. Despite these limitations, the confidence scoring system provides an essential dimension of evaluation, highlighting areas where models exhibit confidence misalignment, which is crucial for understanding and improving their deployment in clinical settings. The final metrics, including confidence accuracy, missing answer recall, and unknown recall, are designed to provide a comprehensive assessment but may not capture all aspects of model performance and safety. These metrics serve as proxies for complex behaviors that might manifest differently in real-world applications. Nonetheless, they offer a structured approach to evaluating critical aspects of LLM performance, forming a robust basis for future refinement and development of more sophisticated metrics.

Considering model selection and access limitations, this work focused on a limited set of LLMs available and popular as of June 2024. This temporal limitation means the findings may not be fully generalizable to future models or those trained with different objectives and datasets. However, the trends and correlations observed, such as the impact of model size and recency, are likely to remain relevant as guiding principles for future LLM development and evaluation. The proprietary nature of some models, such as OpenAI’s GPT-4o, limits our insight into their training data and methodologies. This constraint could influence their performance and the interpretation of our results. Yet, the inclusion of both proprietary and open-weight models allows for a broader assessment, demonstrating that our findings are not confined to a single type of model but rather indicative of general trends in LLM performance and metacognitive abilities.

Lastly, regarding theoretical framework limitations, the reliance on the Dual Process Theory (DPT)42 may not accurately represent the cognitive processes involved in clinical decision-making. More comprehensive theories of cognition, such as the transtheoretical model, while including the DPT, also incorporate additional layers such as embodied cognition through sensory input or situated cognition representing the interactions between individuals and their environment. These additional layers provide a more holistic view of clinical reasoning, acknowledging the complex interplay between internal cognitive processes and external factors. When applying these theories to LLMs, we encounter significant limitations. Considering LLMs have restricted access to external cognitive processes, we argue that internal cognitive processes from the DPT must compensate. This compensation, however, may not fully replicate the richness of human clinical reasoning. Our work investigates System 1 thinking exclusively, which involves rapid, intuitive decision-making. While additional experiments involving System 2 should be conducted to improve our understanding of LLM cognition, it is important to note that studies have shown that switching to System 2 may not always reduce reasoning errors in humans43. LLMs also appear to suffer from a similar shortcoming and fail to self-correct when their reasoning is faulty44. Therefore, investigating System 1 exclusively appears to be an important initial step towards understanding the limitations in LLMs’ cognitive capabilities for clinical decision-making. Future research could explore ways to incorporate aspects of System 2 thinking and elements of the transtheoretical model into LLM-based clinical decision support systems, potentially bridging the gap between current LLM capabilities and the complex, multifaceted nature of human clinical reasoning.

In conclusion, these results suggest that current LLMs, despite high accuracy on certain tasks, lack essential capabilities for safe deployment in clinical settings. The discrepancy between performance on standard questions and metacognitive tasks highlights a critical area for improvement in LLM development. This gap raises concerns about a form of deceptive expertise, where systems appear knowledgeable but fail to recognize their own limitations. Future research should focus on enhancing LLMs’ ability to recognize uncertainty and knowledge gaps, as well as developing robust evaluation metrics that better reflect the complexities of clinical reasoning.

Methods

Benchmark procedure

We used Python 3.12 and Guidance, a Python library designed to enforce model adherence to specific instructions through constrained decoding45, ensuring the models selected only from the allowed choices (A/B/C/D/E/F) and provided confidence scores (1/2/3/4/5)46.
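A minimal sketch of this constrained decoding setup, assuming Guidance’s select-based API (v0.1+) and an open-weight model, is shown below; the prompt text and model choice are illustrative and not the exact benchmark harness.

```python
from guidance import models, select

# Load an open-weight model supported by Guidance (illustrative choice).
lm = models.Transformers("mistralai/Mistral-7B-v0.1")

prompt = "<clinical vignette and six answer choices>\nAnswer with a letter and a confidence score.\n"

# The model can only emit one of the allowed letters and confidence scores.
out = (
    lm
    + prompt
    + "Answer: " + select(["A", "B", "C", "D", "E", "F"], name="answer") + "\n"
    + "Confidence: " + select(["1", "2", "3", "4", "5"], name="confidence")
)
print(out["answer"], out["confidence"])
```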

We included both proprietary and open-weight models in our evaluation. For proprietary models, we tested OpenAI’s GPT-4o-2024-05-1347 and GPT-3.5-turbo-012548. For open-weight models, we selected the most popular foundational models from the HuggingFace trending text generation model list as of June 2024, including Mixtral-8x7B-v0.149, Mistral-7B-v0.150, Yi-1.5-9B, Yi-1.5-34B51, Meta-Llama-3-8B, Meta-Llama-3-70B52, Qwen2-7B, and Qwen2-72B53. In addition, we evaluated two medical models based on Mistral-7B-v0.1, namely meerkat-7b-v1.054 and internistai/base-7b-v0.255, to determine if additional medical training enhances metacognitive abilities. All models were evaluated with a temperature setting of 0 to ensure reliability and reproducibility of results56. The open-weight model evaluations were performed on a Microsoft Azure Virtual Machine with 4 NVIDIA A100 80GB GPUs and required a total runtime of 3 hours, including setup time.

The 95% confidence interval is derived from the standard error of the mean multiplied by 1.96. Model accuracy differences were evaluated for statistical significance using p-values calculated with a one-way ANOVA in GraphPad Prism 10.1 followed by a Tukey test57,58.
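As a minimal numerical sketch (with hypothetical per-question results, not the study data), the interval follows directly from the per-question correctness values:

```python
import numpy as np

# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
correct = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
mean = correct.mean()
sem = correct.std(ddof=1) / np.sqrt(len(correct))  # standard error of the mean
ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)      # 95% confidence interval
print(f"accuracy = {mean:.3f}, 95% CI = [{ci95[0]:.3f}, {ci95[1]:.3f}]")
```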

Prompt engineering

The iterative process was designed to reveal information progressively, first implicitly and finally explicitly. The complete list of prompts is shown in Table 5. The statistical significance of differences between the prompts and the baseline was assessed using a one-way ANOVA in GraphPad Prism 10.1, followed by Fisher’s least significant difference test59 for post-hoc comparisons.

Table 5 Exhaustive list of system prompts used to evaluate the impact of prompt engineering on GPT-4o-2024-05-13’s performance on MetaMedQA

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.