Which curriculum components do medical students find most helpful for evaluating AI outputs?
BMC Medical Education volume 25, Article number: 195 (2025)
Abstract
Introduction
The risk and opportunity of Large Language Models (LLMs) in medical education both rest in their imitation of human communication. Future doctors working with generative artificial intelligence (AI) need to judge the value of any outputs from LLMs to safely direct the management of patients. We set out to investigate medical students’ ability to evaluate LLM responses to clinical vignettes, identify which prior learning they utilised to scrutinise the LLM answers, and assess their awareness of ‘clinical prompt engineering’.
Methods
Final year medical students were asked in a survey to assess the accuracy of the answers provided by generative pre-trained transformer (GPT) 3.5 in response to ten clinical scenarios, five of which GPT-3.5 had answered incorrectly, and to identify which prior training enabled them to evaluate the GPT-3.5 output. A content analysis was conducted amongst 148 consenting medical students.
Results
The median percentage of students who correctly evaluated the LLM output was 56%. Students reported interactive case-based and pathology teaching using questions to be the most helpful training provided by the medical school for evaluating AI outputs. Only 5% were familiar with the concept of ‘clinical prompt engineering’.
Conclusion
Pathology and interactive case-based teaching using questions were the self-reported best training for medical students to safely interact with the outputs of LLMs. This study can inform the design of medical training for future doctors graduating into AI-enhanced health services.
Practice points
Students reported pathology and interactive case-based teaching using questions as the best training for evaluating LLM outputs.
Students need to be adequately equipped to engage with clinical prompt engineering to optimise the value of LLM outputs they interact with in their future practice.
Introduction
Large language models (LLMs) are advanced probabilistic models of language [1], a form of artificial intelligence (AI) that can produce text output resembling human syntax. Generative pre-trained transformer (GPT) [2] is an LLM that has demonstrated human-level performance on various professional and academic benchmarks [3]. Large language models appear to show promise in automating and delivering clinical processes, including passing accreditation exams [4]. This new capability prompts society to reflect on the added value of a doctor and how doctors should interact with LLMs when caring for patients. The supervisory role of doctors in evaluating the value of LLMs is critical as these models are developed for a range of administrative, diagnostic and management tasks. The risk and opportunity of LLMs in medical education both rest in their imitation of human communication, such as textbook paragraphs, and their replication of narrative knowledge, such as a curriculum. The breadth of LLM applications includes clinical communication [5] and medical examination [6], including clinical assessment item writing [7].
There is great potential for LLMs to streamline the training of future doctors and the staffing of healthcare systems [8]. An LLM developed by Google called Med-PaLM 2 [9] has drawn particular attention after achieving a convincing pass (85% accuracy) on the United States Medical Licensing Examination (USMLE), raising questions about the assessment of doctors and about what LLMs contribute beyond the retention of medical knowledge. Question-answer pairs for medical examinations are validated by experts and perceived as a legitimate substitute for real-world training data written in narrative form, as per Google's use of medical questions to develop Med-PaLM 2 [9]. The collection of established, standard-set medical finals exam questions is therefore perceived to be a benchmark for evaluating LLM capabilities. However, large language models trained on written exams may lack the nontechnical, nonverbal and contextual awareness needed to holistically evaluate real clinical scenarios [10].
In a cross-sectional multicentre study, most medical students perceived AI as an assistive technology that could facilitate physicians' access to information and patients' access to healthcare, and reduce errors [11]. Their stated educational requirements included knowledge and skills related to AI applications, the use of AI to reduce medical errors, and training to prevent and resolve ethical problems that might arise from using AI [11]. We do not know exactly how AI will impact clinical practice, but we can aim to prepare students to be equipped to engage with LLMs. By improving AI literacy amongst future doctors, innovation may be accelerated for the benefit of patients and practitioners [12].
Clinical prompt engineering is the practice of composing carefully structured commands in the chat interface of an LLM to optimise the clinical utility of the output. This approach is essential for formulating inputs to an LLM and represents an important concept when assessing the safety and reliability of LLM-generated outputs. However, 'hallucinations' [13], the variable performance of LLMs and the difficulty of reliable 'clinical prompt engineering' highlight the need for ongoing clinical supervision. 'Hallucination' describes LLM outputs which are incorrect or disconnected from the inputs. This leads to biased and convincing medical misinformation [14] with potential adverse consequences for patient care. Moreover, the variability of LLMs when applied to electronic health records highlights the possibility of different responses to the same questions [15]. Large language models are constrained by their initial pre-training text corpora, which provide the underlying 'learnt' knowledge. When LLMs were evaluated for their ability to give therapeutic advice, distinct error patterns emerged, including ambiguity and dangerous omissions [16]. Future medical graduates need to be able to interact with the outputs of LLMs safely, possibly by using 'clinical prompt engineering' as the method of instructing an LLM to perform a clinically relevant task [17]. 'Prompt engineering' is a possible way to apply theoretical LLM knowledge to clinical scenarios, although the discipline currently remains unstable, with clinical prompts shown to have inconsistent effects across different LLMs [18]. If we are to become more reliant on AI, we need to equip students with the confidence to challenge decisions made by AI.
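To make this concrete, the sketch below contrasts an unstructured query with a more deliberately engineered clinical prompt. It is an illustration only: the clinical scenario and prompt wording are hypothetical examples of ours, not prompts used or recommended in this study or the cited literature.

```python
# A minimal sketch of 'clinical prompt engineering'. Both prompts and the
# clinical scenario are hypothetical illustrations, not study material.

# Unstructured prompt: leaves the role, scope and output format to chance.
naive_prompt = "What should I do about chest pain?"

# Engineered prompt: states the clinical role, the task, the constraints
# and the required output format explicitly.
engineered_prompt = """\
Role: you are assisting a junior doctor in a UK emergency department.
Task: list the three most likely diagnoses for the scenario below, each
with one supporting clinical feature.
Constraints: do not recommend treatment; flag any red-flag features that
require senior review.
Output format: a numbered list.

Scenario: a 58-year-old man with central crushing chest pain radiating to
the left arm, onset 40 minutes ago, sweating profusely."""
```

Each labelled element gives the clinician a fixed frame against which the output can be checked, which is precisely the 'risk evaluation' skill examined in this study.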
This study aims to understand the learning activities that final year medical students perceive to have informed their reviews of GPT-3.5 responses to medical school final exam questions. 'Risk evaluation' is defined in this context as the student's ability to judge the value of the GPT answer for the safe care of the patient. Large language model clinical 'risk evaluation' is a multifaceted skill that incorporates clinical reasoning ability, critical thinking, and awareness of the strengths and limitations of LLMs [19]. These challenges have highlighted the need for guidelines for the evaluation of decision support systems driven by AI [20] and for tools for the quality assessment of diagnostic accuracy studies using AI (QUADAS-AI) [21]. The General Medical Council's (GMC's) 'Outcomes for Graduates' [22] defines the target outcomes for doctors graduating from UK medical schools and therefore directs the curation of medical curricula. Our study is of direct relevance to the GMC Outcomes 'patient safety and quality', 'dealing with complexity and uncertainty', and 'using information effectively and safely'. We sought to explore medical students' ability to validate LLM responses to clinical questions and to understand which prior learning may equip them with the skills to make safe clinical decisions when AI is a potential source of information.
Methods
The study received ethical approval from Imperial College London Education Ethics Review Process (EERP2324-018) and was carried out in accordance with the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines [23].
Educational context and data collection
In preparation for summative assessments, Imperial College School of Medicine (ICSM) final year medical students received revision lectures in January 2024. As part of one revision lecture, we investigated how students evaluated the risk of GPT-3.5 responses to previous final year exam questions, using a five-question survey.
We asked GPT-3.5 to answer a set of single best answer (SBA) questions, each with five options. Final year students undertook a set of ten SBA questions, five of which had been answered correctly by GPT-3.5 (Table 1).
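The paper does not report the exact interface or prompt wording used to collect GPT-3.5's answers, so the sketch below is only a rough reconstruction: it assumes the OpenAI Python SDK and the gpt-3.5-turbo chat model, and the SBA shown is a hypothetical example rather than one of the study's exam questions.

```python
# Sketch of posing a five-option SBA question to GPT-3.5, assuming the
# OpenAI Python SDK (openai>=1.0); reads OPENAI_API_KEY from the environment.
# The SBA below is a hypothetical example, not a study question.
from openai import OpenAI

client = OpenAI()

sba = {
    "stem": ("A 70-year-old woman presents with sudden painless loss of "
             "vision in one eye. Fundoscopy shows a pale retina with a "
             "cherry-red spot. What is the most likely diagnosis?"),
    "options": {
        "A": "Central retinal vein occlusion",
        "B": "Central retinal artery occlusion",
        "C": "Acute angle-closure glaucoma",
        "D": "Retinal detachment",
        "E": "Optic neuritis",
    },
    "correct": "B",
}

options_text = "\n".join(f"{k}. {v}" for k, v in sba["options"].items())
prompt = (f"{sba['stem']}\n{options_text}\n"
          "Answer with the single best option letter, then a one-line "
          "justification.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # reduce run-to-run variability
)
answer = response.choices[0].message.content
print(answer)
print("Matches answer key:", answer.strip().upper().startswith(sba["correct"]))
```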
For each SBA question, students were asked the following questions:
- Did GPT-3.5 answer this question correctly?
- What information was required to determine the correctness of GPT-3.5's response?
The questionnaire also included the following questions:
- Which areas of prior learning did you draw upon (e.g. experiential learning, specific medical school modules)?
- Which aspect of your medical school training was instrumental in analysing these responses?
- How might this knowledge inform a skill called 'Clinical Prompt Engineering'?
Patient safety relates to both the outputs and the inputs of LLMs. The first two questions simulated a clinician interacting with the outputs of an LLM, assessing students' ability to respond safely to an LLM's clinical recommendations. The last question assessed a student's awareness of clinical prompt engineering, the skill required to safely shape the input of an LLM. This question is a necessary component of an investigation into LLM clinical safety, as prompt engineering is the primary method by which a clinician will engage with LLMs. It is therefore an important part of assessing the educational needs of those who will interact with LLMs when caring for patients.
Students were prompted to answer each question individually and anonymously using Mentimeter, a cloud-based interactive presentation software [24]. They were then provided with immediate feedback on their responses as part of the revision lecture.
Data analysis
The main corpus for analysis comprised responses to the above questions, four of which were open-ended. To analyse the textual data, we conducted a content analysis [25]. Because the students answered via Mentimeter, the responses were brief and 'objective': students mostly wrote keywords and/or short expressions rather than full sentences, and these formed the unit of analysis. We followed an inductive, conventional content analysis [26], as no previous framework had been established. Each keyword or expression was assigned to a single code. Some codes were then broken down into more specific categories to elicit more meaning from the text produced by the students.
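As an illustration of this coding step, the sketch below tallies short Mentimeter-style responses against an inductively built codebook. The responses and the keyword-to-code mapping are hypothetical; the study's actual codebook is not reproduced here.

```python
# Sketch of coding brief free-text responses in a conventional content
# analysis. Responses and codebook are hypothetical illustrations.
from collections import Counter

responses = ["pathology module", "histology lectures", "question bank",
             "ward round teaching", "online notes"]

# Inductively derived codebook: each keyword/expression maps to one code.
codebook = {
    "pathology module": "medical school taught modules",
    "histology lectures": "medical school taught modules",
    "question bank": "external resources",
    "online notes": "external resources",
    "ward round teaching": "experiential learning",
}

counts = Counter(codebook.get(r.lower().strip(), "uncoded") for r in responses)
total = sum(counts.values())
for code, n in counts.most_common():
    print(f"{code}: {n}/{total} ({100 * n / total:.0f}%)")
```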
Results
One hundred and forty-eight students gave consent for their responses to be part of this research project. The first question asked students whether the GPT-3.5 response to each SBA question was correct (Table 1). Students' justifications for their judgement were categorised as 'accurate', 'vague' or 'inaccurate'. The median percentage of students correctly determining the accuracy of the LLM output was 61%. When those who provided vague or inaccurate justifications were excluded, the median percentage of students who correctly evaluated the LLM output dropped to 56%. Further analysis of the accurate justifications revealed that 62% included a specific pathology test, 24% referred to a blood marker and 14% pointed to a specific symptom.
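For clarity about how this summary statistic is formed: a percentage of students correctly judging the GPT-3.5 answer is computed per SBA question, and the median is taken across the ten questions. The per-question figures below are placeholders chosen only to show the arithmetic, not the study's data.

```python
# Sketch of the 'median percentage correct' calculation across ten SBA
# questions. The per-question percentages are hypothetical placeholders.
from statistics import median

pct_correct_per_question = [48, 52, 55, 57, 61, 61, 64, 70, 73, 80]
print(f"Median across questions: {median(pct_correct_per_question)}%")
```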
Sixty-three percent of students responded to the question regarding the areas of prior learning they drew upon to judge the LLM output. Students' responses referred to medical school taught modules (46%), medical school online learning resources (21%), and external resources including online notes and question banks (33%).
Fifty-seven percent of students answered the question about the aspects of their medical school training that they thought were instrumental in analysing the LLM outputs. Of the comments about the most valuable official medical school teaching for evaluating GPT-3.5-generated answers, 72% related to the pathology module or to interactive case-based teaching sessions using questions.
Fifty percent of students answered the question about 'clinical prompt engineering', with only 5% familiar with the concept. Those aware of 'clinical prompt engineering' were evenly split between designing prompts for specific clinical tasks and specifying GPT output requirements.
Discussion
We identified a dominant theme of pathology-related teaching using questions in student perceptions of which prior learning activities equip them to engage safely with LLM outputs. Pathology has been described as the 'science underpinning medicine' [27], and learning pathology remains important in training across all specialties and in postgraduate assessments [28]. Pathology informs both diagnostic reasoning and therapeutic justification and is therefore integral to both the investigation and treatment of patients. Case-based learning using questions was reported by medical students as another aspect of their training that helped them to assess the LLM outputs. Our data show that both formal and informal learning opportunities contribute to students' 'risk evaluation' competency. Whilst former medical pedagogy was founded on textbooks as the main source of information for medical students and residents [29], the speed of change in AI presents profound challenges to the design of modern medical training. Of particular importance is the evolution of competency frameworks as a 'series of propositions and relationships that collectively define an ideal' [30], which must now be informed by the realities of AI-enhanced clinical practice [31]. The Canadian Medical Education Directives for Specialists (CanMEDS) Physician Competency Framework [32] describes the roles of a doctor as communicator, collaborator, leader, health advocate, scholar, professional, and medical expert. Changes brought about by AI affect all these roles [33].
Our results have corroborated the value of question-based collaborative learning [34], the importance of question banks as a learning resource [35], and the value of questions in team-based learning [36], not least given the higher enjoyment and engagement seen with interactive case-based learning amongst students [37]. The holistic integration of technical knowledge and sociocultural factors [38] is yet to be achieved by AI, but the patient-doctor relationship "has been altered into a triadic relationship by introducing the computer into the examination room" [39]. A proportion of students acknowledged the value of online question banks. An example of the application of questions to facilitate learning is 'Question-Based Collaborative Learning', in which students achieve their own constructive alignment of the curriculum and assessment by writing assessment items based on the patients they have seen [34].
AI literacy, of which LLM literacy is a subcomponent, has emerged in recent years as an essential skill within multiple disciplines and industries [40]. It is essential for developing AI-driven transformations that respect the interests of patients. Students need to learn how to judge reliably whether patient safety is threatened by the output of AI and be able to practise amidst this specific form of 'complexity and uncertainty'. We need to develop robust pedagogy for scrutinising the factuality and bias of any outputs upon which clinical management is based. It is important to note that underlying bias from the socio-cultural context of the original text corpus will be retained in the LLM output. For example, Black and Hispanic patients who are not documented properly in electronic health records will not be represented appropriately in the development of these LLMs, although natural language processing may be able to address this [41]. Therefore, medical education applications of LLMs must train doctors to supervise, evaluate and reflect on the value of any LLM output in light of the wider clinical context. This is essential for patient-centred, evidence-based medicine that engages with both the risks and benefits of LLMs. We need to incorporate AI literacy into clinical training, both at medical school and at postgraduate level, with a particular focus on when patient safety concerns are triggered. A clinician should be able to flag when an LLM output does not fit the wider clinical picture and reject dangerous outputs. Collaboration between researchers, educators and practitioners is essential for developing transparent AI models that encourage the ethical and responsible use of AI in medical education [42].
A scoping review of literature from 2022 onwards on generative AI in medical education identified three key areas for investigation: developing learners' skills to evaluate AI critically, studying human-AI interactions, and rethinking assessment methodology [43]. Our study also highlighted the shortage of medical student awareness of the essential skill of clinical prompt engineering. Students self-reported unfamiliarity with the concept and therefore cannot yet apply it when evaluating the clinical value of an LLM output. A previous study showed that the majority of students and faculty learned about AI from the media [44], highlighting a vacuum of reliable alternative training. However, even when additional teaching programmes such as a five-week 'Introduction to Medical AI' workshop were implemented for medical students, four challenges were identified: prior knowledge heterogeneity, attendance attrition, curricular design and knowledge retention [45].
Our study has several limitations. The response rate diminished as the questions progressed. Not all areas of clinical practice were covered by the SBA questions, and vague answers from students compromised their representation in the analysis. Furthermore, the survey did not allow us to explore the points raised by students in more detail. Future studies with student and staff focus groups could provide better insight into which aspects of the medical school curriculum equip students with the appropriate skills to evaluate the risks associated with AI-generated answers.
It is challenging to assess the baseline AI literacy of students, especially given the rapid advancements in LLM capabilities. AI literacy varies across institutions and is influenced by factors such as access to technology, prior exposure to AI tools and the integration of AI-related content into curricula. These variations make it difficult to generalise our findings to broader student populations. Furthermore, as LLMs continue to evolve, the skills required to interact with and critically evaluate their outputs could also change, potentially altering the relevance of our current findings.
To effectively appraise the answers provided by LLMs, physicians must first have adequate baseline knowledge of the subject being assessed. In our study, 56% of the 148 participating students demonstrated this requisite knowledge. If students do not have the necessary prerequisite knowledge of a given topic, they are not ready to engage in the highly sophisticated process of appraising LLM outputs in a safe and effective manner. Our study was conducted approximately eight weeks before the summative assessment window, so participants might have performed better after completing their revision for the summative assessments. Consequently, our findings may underrepresent students' ultimate potential to engage effectively with LLM outputs. Future studies should also examine how interactive case-based learning using questions might specifically help students to appraise LLM outputs.
Conclusion
AI is a sociotechnical reality, and we need to validate the pedagogical requirements for the next generation of doctors. Our study highlights the importance of pathology and interactive case-based teaching using questions when developing doctors who can practise safely amongst LLMs and other clinical decision support systems.
Data availability
All data relevant to the study are included in the manuscript.
Abbreviations
- LLM: Large Language Model
- GPT: Generative Pre-trained Transformer
- USMLE: United States Medical Licensing Examination
- AI: Artificial Intelligence
- QUADAS-AI: Quality Assessment of Diagnostic Accuracy Studies using AI
- GMC: General Medical Council
- ICSM: Imperial College School of Medicine
- SBA: Single Best Answer
- CanMEDS: Canadian Medical Education Directives for Specialists
References
Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Online manuscript released August 20, 2024. https://web.stanford.edu/~jurafsky/slp3
Radford A, Narasimhan K, Salimans T, Sutskever I. Improving Language Understanding by Generative Pre-Training. [Accessed 14 November 2024] https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
OpenAI. GPT-4 Technical Report. 2023. arXiv preprint arXiv:2303.08774.
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198.
Ayers JW, Poliak A, Dredze M, et al. Comparing physician and Artificial Intelligence Chatbot responses to patient questions posted to a Public Social Media Forum. JAMA Intern Med. 2023;183(6):589–96. https://doi.org/10.1001/jamainternmed.2023.1838.
Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354. https://doi.org/10.1186/s12909-024-05239-y.
Lam G, Shammoon Y, Coulson A, Lalloo F, Maini A, Amin A, Brown C, Sam AH. Utility of large language models for creating clinical assessment items. Med Teach. 2024;26:1–5. https://doi.org/10.1080/0142159X.2024.2382860.
Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel). 2023;11(6):887. https://doi.org/10.3390/healthcare11060887.
Google Health. 2024. [Accessed 14 November 2024] https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model
Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large Language models in Medical Education: applications and implications. JMIR Med Educ. 2023;9:e50945. https://doi.org/10.2196/50945.
Civaner MM, Uncu Y, Bulut F, Chalil EG, Tatli A. Artificial intelligence in medical education: a cross-sectional needs assessment. BMC Med Educ. 2022;22(1):772. https://doi.org/10.1186/s12909-022-03852-3.
Ng FYC, Thirunavukarasu AJ, Cheng H, Tan TF, Gutierrez L, Lan Y, Ong JCL, Chong YS, Ngiam KY, Ho D, Wong TY, Kwek K, Doshi-Velez F, Lucey C, Coffman T, Ting DSW. Artificial intelligence education: an evidence-based medicine approach for consumers, translators, and developers. Cell Rep Med. 2023;4(10):101230. https://doi.org/10.1016/j.xcrm.2023.101230.
Azamfirei R, Kudchadkar SR, Fackler J. Large language models and the perils of their hallucinations. Crit Care. 2023;27(1):120. https://doi.org/10.1186/s13054-023-04393-x.
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80. https://doi.org/10.1038/s41586-023-06291-2.
Schmiedmayer P, Rao A, Zagar P, Ravi V, Zahedivash A, Fereydooni A, Aalami O. LLM on FHIR—Demystifying Health Records. 2024; arXiv preprint arXiv:2402.01711.
Wilhelm TI, Roos J, Kaczmarczyk R. Large Language models for Therapy recommendations across 3 clinical specialties: comparative study. J Med Internet Res. 2023;25:e49324. https://doi.org/10.2196/49324.
Meskó B. Prompt Engineering as an important emerging skill for medical professionals: Tutorial. J Med Internet Res. 2023;25:e50638. https://doi.org/10.2196/50638.
Wang L, Chen X, Deng X, Wen H, You M, Liu W, Li Q, Li J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med. 2024;7(1):41. https://doi.org/10.1038/s41746-024-01029-4
Magrabi F, Ammenwerth E, McNair JB, De Keizer NF, Hyppönen H, Nykänen P, Rigby M, Scott PJ, Vehko T, Wong ZS, Georgiou A. Artificial Intelligence in clinical decision support: challenges for evaluating AI and practical implications. Yearb Med Inf. 2019;28(1):128–34. https://doi.org/10.1055/s-0039-1677903.
Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, Denaxas S, Denniston AK, Faes L, Geerts B, Ibrahim M, Liu X, Mateen BA, Mathur P, McCradden MD, Morgan L, Ordish J, Rogers C, Saria S, Ting DSW, Watkinson P, Weber W, Wheatstone P, McCulloch P. DECIDE-AI expert group. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med. 2022;28(5):924–33. https://doi.org/10.1038/s41591-022-01772-9.
Guni A, Sounderajah V, Whiting P, Bossuyt P, Darzi A, Ashrafian H. Revised Tool for the Quality Assessment of Diagnostic Accuracy studies using AI (QUADAS-AI): protocol for a qualitative study. JMIR Res Protocols. 2024;18:13. https://doi.org/10.2196/58202.
General Medical Council. 2024. [Accessed 14 November 2024] https://www.gmc-uk.org/education/standards-guidance-and-curricula/standards-and-outcomes/outcomes-for-graduates
Equator Network. 2024. [Accessed 14 November 2024] https://www.equator-network.org/reporting-guidelines/strobe/
Mentimeter. 2024. [Accessed 14 November 2024] https://www.mentimeter.com/
Bardin L. Content Analysis. Sao Paulo: Edicoes 70; 2011.
Hsieh H-F, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277–88. https://doi.org/10.1177/1049732305276687.
Sam AH, Peleva E, Fung CY, Cohen N, Benbow EW, Meeran K. Very short answer questions: a Novel Approach to summative assessments in Pathology. Adv Med Educ Pract. 2019;10:943–8. https://doi.org/10.2147/AMEP.S197977.
Marsdin E, Biswas S. Are we learning enough pathology in medical school to prepare us for postgraduate training and examinations? J Biomed Educ. 2013. https://doi.org/10.1155/2013/165691.
Tez M, Yildiz B. How Reliable Are Medical textbooks? J Grad Med Educ. 2017;9(4):550. https://doi.org/10.4300/JGME-D-17-00209.1.
Ellaway R. CanMEDS is a theory. Adv Health Sci Educ Theory Pract. 2016;21(5):915–7. https://doi.org/10.1007/s10459-016-9724-3.
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. https://doi.org/10.1038/s41591-018-0300-7.
The Royal College of Physicians and Surgeons of Canada. 2024. [Accessed 14 November 2024] https://www.royalcollege.ca/en/standards-and-accreditation/canmeds.html
Rampton V, Mittelman M, Goldhahn J. Implications of artificial intelligence for medical education. Lancet Digit Health. 2020;2(3):e111–2. https://doi.org/10.1016/S2589-7500(20)30023-6.
Wynn-Lawrence LS, Bala L, Fletcher RJ, Wilson RK, Sam AH. Question-based collaborative learning for constructive curricular alignment. Adv Med Educ Pract. 2021;11:1047–53. https://doi.org/10.2147/AMEP.S280972.
Karunaratne D, Karunaratne N, Wilmot J, Vincent T, Wright J, Mahmood N, Tang A, Sam AH, Reed M, Howlett D. An online teaching resource to support UK Medical Student Education during the COVID-19 pandemic: a descriptive account. Adv Med Educ Pract. 2021;12:1317–27. https://doi.org/10.2147/AMEP.S337544.
Millar KR, Reid MD, Rajalingam P, Canning CA, Halse O, Low-Beer N, Sam AH. Exploring the feasibility of using very short answer questions (VSAQs) in team-based learning (TBL). Clin Teach. 2021;18(4):404–8. https://doi.org/10.1111/tct.13347.
Stackhouse AA, Rafi D, Walls R, Dodd RV, Badger K, Davies DJ, Brown CA, Cowell A, Meeran K, Halse O, Kinross J, Lupton M, Hughes EA, Sam AH. Knowledge Attainment and Engagement among Medical students: a comparison of three forms of Online Learning. Adv Med Educ Pract. 2023;14:373–80. https://doi.org/10.2147/AMEP.S391816.
Kuper A, Veinot P, Leavitt J, Levitt S, Li A, Goguen J, Schreiber M, Richardson L, Whitehead CR. Epistemology, culture, justice and power: non-bioscientific knowledge for medical training. Med Educ. 2017;51(2):158–73. https://doi.org/10.1111/medu.13115.
Assis-Hassid S, Reychav I, Heart T, Pliskin JS, Reis S. Enhancing patient-doctor-computer communication in primary care: towards measurement construction. Isr J Health Policy Res. 2015;4:4. https://doi.org/10.1186/2045-4015-4-4.
Ng DTK, Leung JKL, Chu SKW, Qiao MS. Conceptualizing AI literacy: an exploratory review. Computers Education: Artif Intell. 2021;2:100041. https://doi.org/10.1016/j.caeai.2021.100041.
Sholle ET, et al. Underserved populations with missing race ethnicity data differ significantly from those with structured race/ethnicity documentation. J Am Med Inf Assoc. 2019;26(8–9):722–9. https://doi.org/10.1093/jamia/ocz040.
Karabacak M, Ozkara B, Margetis K, Wintermark M, Bisdas S. The Advent of Generative Language models in Medical Education. JMIR Med Educ. 2023;9:e48163. https://doi.org/10.2196/48163.
Preiksaitis C, Rose C. Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: scoping review. JMIR Med Educ. 2023;9:e48785. https://doi.org/10.2196/48785.
Wood EA, Ange BL, Miller DD. Are we ready to Integrate Artificial Intelligence Literacy into Medical School Curriculum: students and Faculty Survey. J Med Educ Curric Dev. 2021;8:23821205211024078. https://doi.org/10.1177/23821205211024078.
Hu R, Fan KY, Pandey P, Hu Z, Yau O, Teng M, Wang P, Li T, Ashraf M, Singla R. Insights from teaching artificial intelligence to medical students in Canada. Commun Med (Lond). 2022;2(1):63. https://doi.org/10.1038/s43856-022-00125-4
Funding
This study was funded by Imperial College London.
Author information
Contributions
All authors were involved in the design of the study. The data were collected by AHS and WJW. WJW, GL and AHS analysed the data. WJW wrote the initial draft of the manuscript. All other authors read, revised and commented on the manuscript.
Ethics declarations
Ethics approval and consent to participate
The Imperial College London Education Ethics Review Process approved this research project (EERP2324-018). We obtained informed consent from all the participants.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Waldock, W.J., Lam, G., Baptista, A. et al. Which curriculum components do medical students find most helpful for evaluating AI outputs? BMC Med Educ 25, 195 (2025). https://doi.org/10.1186/s12909-025-06735-5