- Research
- Open access
Feasibility study of using GPT for history-taking training in medical education: a randomized clinical trial
BMC Medical Education volume 25, Article number: 1030 (2025)
Abstract
Background
Traditional methods of teaching history-taking in medical education are limited by scalability and resource intensity. This study aims to assess the effectiveness of simulated patient interactions based on a custom-designed Generative Pre-trained Transformer (GPT) model, developed using OpenAI’s ChatGPT GPTs platform, in enhancing medical students’ history-taking skills compared to traditional role-playing methods.
Methods
A total of 56 medical students were randomly assigned to two groups: a GPT group using GPT-simulated patients and a control group using traditional role-playing. Pre- and post-training assessments were conducted using a structured clinical examination to measure students’ abilities in history collection, clinical reasoning, communication skills, and professional behavior. Additionally, students’ evaluations of the educational effectiveness, satisfaction, and recommendation likelihood were assessed.
Results
The GPT-simulation group showed significantly higher post-training scores in the structured clinical examination compared to the control group (86.79 ± 5.46 vs. 73.64 ± 4.76, P < 0.001). Students in the GPT group exhibited higher enthusiasm for learning, greater self-directed learning motivation, and better communication feedback abilities compared to the control group (P < 0.05). Additionally, the student satisfaction survey revealed that the GPT group rated higher on the diversity of diseases encountered, ease of use, and likelihood of recommending the training compared to the control group (P < 0.05).
Conclusions
GPT-based history-taking training effectively enhances medical students’ history-taking skills, providing a solid foundation for the application of artificial intelligence (AI) in medical education.
Clinical trial number
NCT06766383.
Background
As medical education evolves, traditional teaching methods face growing challenges—particularly in advancing students’ clinical reasoning and history-taking skills, defined as the structured process of eliciting a patient’s medical background, including the chief complaint, present illness, past history, and relevant social and family information. Although emerging technologies such as virtual reality and machine learning offer valuable supplements, they do not fully resolve the limitations of conventional approaches, which still require critical reform to meet the demands of modern clinical training [1, 2]. While traditional face-to-face history-taking training has proven effective, it is often resource-intensive, costly, and difficult to standardize on a large scale. Furthermore, before entering clinical rotations, students rarely have the opportunity to practice direct patient interviews. Additionally, due to the inherent strain in doctor-patient relationships [3], it is impractical to find a sufficient number of patients for each student to gain adequate interviewing experience.
Recent studies have increasingly explored the integration of large language models, particularly ChatGPT, into various domains of medical education [4, 5]. For example, Holderried et al. developed a GPT-powered simulated patient platform that provided automated feedback to support history-taking practice [6]. In a randomized controlled trial, Brügge et al. found that AI-based patient simulations significantly improved medical students’ performance in clinical decision-making and interview tasks [7]. Moreover, several reviews have highlighted the broader potential of generative AI in enhancing learner engagement, facilitating self-directed learning, and supporting personalized feedback in medical training [5, 8, 9]. These studies underscore the promise of large language models as scalable, interactive, and adaptive tools for clinical education.
However, most existing research remains conceptual or exploratory, with limited empirical evidence on the practical implementation of GPT-based systems in specific skill domains. In particular, structured history-taking training—a foundational component of clinical competence—has not been rigorously evaluated using GPT-driven simulations. Given the critical role of history-taking in clinical encounters and the persistent challenges faced by students in mastering this skill, it is necessary to assess whether such AI-based systems can provide meaningful improvements in training outcomes compared to traditional methods.
Therefore, this study aims to evaluate the effectiveness of a GPT-based simulated patient system in improving medical students’ history-taking skills. Specifically, we compare its impact on four key competencies—history collection, clinical reasoning, communication skills, and professional behavior—with that of traditional instructor-led role-playing methods. In addition to these primary outcomes, we also explored students’ perceptions of the training experience, including feedback on the effectiveness of the method and overall satisfaction. We hypothesize that students trained using the GPT system will demonstrate greater improvements across these core domains and report higher levels of engagement and satisfaction compared to those receiving traditional training.
Methods
This single-center, randomized controlled trial was approved by the Ethics Committee of the Second Affiliated Hospital of Anhui Medical University. All participants provided written informed consent before enrollment. The study adhered to the Consolidated Standards of Reporting Trials (CONSORT) guidelines for randomized clinical trials.
Participants
Fifty-six fifth-year medical students from Anhui Medical University were invited to participate in this study (Fig. 1). At Anhui Medical University, students complete all basic science courses in Years 1–2 and a compulsory Diagnostics course in Year 3 (72 contact hours, including ≥ 12 h of structured workshops on history-taking and communication). They also undertake brief ward-shadowing sessions in Years 3–4, but do not begin the intensive ≥ 48-week clerkship block until Year 5. Consequently, although all participants in the present study had received classroom-based instruction and limited simulated practice, none had participated in sustained real-patient interviews or any formal simulated-patient programs before enrolment. Recruitment was conducted through university networks, academic groups and interest-based student forums. Eligible participants were enrolled in the medical programme; exclusion criteria included any previous exposure to simulated-patient-based training or structured history-taking courses.
Randomization
Participants were randomly assigned to the intervention or control group using a computer-generated simple randomization sequence. The randomization list was generated by an independent researcher using SPSS version 26.0, with a 1:1 allocation ratio. Allocation concealment was ensured through sequentially numbered, opaque, sealed envelopes prepared in advance. Participants were informed of their group assignment 24 h before the training sessions by a coordinator who was not involved in outcome assessments.
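The allocation sequence itself was produced in SPSS by an independent researcher. For readers who want a concrete picture of the procedure, a minimal, hypothetical Python analogue of a 1:1 allocation list for 56 participants is sketched below; it is not the authors' actual script, and fixing the split at exactly 28/28 by shuffling is an illustrative simplification of simple randomization.

```python
# Hypothetical sketch only: the study generated its allocation sequence in SPSS 26.0.
# This analogue produces a 1:1 allocation list for 56 participants.
import random

random.seed(2024)  # fixed seed so the independent researcher could reproduce the list

participants = list(range(1, 57))               # anonymized participant IDs 1-56
allocations = ["GPT"] * 28 + ["Control"] * 28   # 1:1 allocation ratio (28 per arm)
random.shuffle(allocations)

# Each assignment would then be sealed in a sequentially numbered, opaque envelope.
for pid, arm in zip(participants, allocations):
    print(f"Envelope {pid:02d}: {arm}")
```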
Sample size and power analysis
A priori power analysis was conducted using G*Power version 3.1 to determine the minimum required sample size. Assuming a moderate to large effect size (Cohen’s d = 0.8), a significance level of α = 0.05, and power (1 − β) = 0.8 for a two-tailed independent-samples t-test, a minimum of 26 participants per group was required.
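As a cross-check of the reported calculation, the same parameters can be entered into an open-source power routine. The sketch below uses statsmodels' TTestIndPower as an assumed stand-in for G*Power and should return roughly the same minimum of 26 participants per group.

```python
# Illustrative re-computation of the a priori sample-size estimate
# (the study used G*Power 3.1; statsmodels serves here only as an open-source stand-in).
from math import ceil
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.8,         # assumed Cohen's d (moderate to large effect)
    alpha=0.05,              # two-tailed significance level
    power=0.80,              # 1 - beta
    ratio=1.0,               # equal allocation between groups
    alternative="two-sided",
)
print(ceil(n_per_group))     # ~26 participants per group, matching the reported minimum
```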
Development and testing of the GPT model
We developed a GPT model specifically for medical student history-taking training using the ChatGPT GPTs platform (OpenAI) in October 2024. This platform allows users to create a private GPT by uploading reference materials and defining behavioral instructions via natural-language prompts. We uploaded selected chapters from the 10th edition of Internal Medicine, the 9th edition of Diagnostic Medicine of China, and the national medical record writing guidelines through the platform’s Knowledge tool. The model was further guided by system-level instructions that defined the patient role, communication style, and response boundaries. No model fine-tuning was performed. Instead, the GPT’s behavior was shaped entirely through prompt engineering—a form of in-context learning in which the model interprets real-time user input based on background instructions and embedded materials. This approach enabled the GPT to simulate realistic clinical encounters without requiring additional model training. According to OpenAI’s documentation, custom GPTs use the most recent GPT-4 family model available at the time; specific sub-models (e.g., GPT-4o) are managed internally by the platform and are not disclosed to end users. All interactions ran under platform-default inference settings, with no manual adjustment of model parameters (e.g., temperature or top-p).

To test the model’s output, the system was tasked with generating comprehensive clinical cases covering common and representative diseases. These simulations included realistic patient responses, such as vague statements, off-topic replies, and frequent shifts in narrative—mimicking communication challenges that are often encountered in clinical practice. To better mimic the dynamics of real clinical history-taking, the GPT system’s ability to provide overly direct or exhaustive information was intentionally restricted, requiring students to actively inquire and synthesize information. To mitigate potential biases and inaccuracies inherent in generative models, we implemented the following systematic validation measures: (1) Structured Clinical Review: all GPT-generated patient dialogues and associated feedback were manually reviewed by two senior clinicians. They specifically screened for clinical inaccuracies, inappropriate terminology usage, logical inconsistencies, or unrealistic patient behaviors. Any identified biases or inaccuracies—such as overly suggestive prompts, incorrect clinical terminology, or deviations from clinical logic—were documented and corrected through iterative prompt revisions. (2) Pilot Student Testing and Feedback: we conducted pilot testing sessions involving volunteer medical students. Students interacted with the GPT-generated cases and provided structured feedback regarding realism, clarity, clinical consistency, and educational utility. This feedback loop helped identify issues from the learner’s perspective, enabling further refinement of the prompts to enhance realism and educational relevance.

This prompt engineering approach, while highly effective and controllable for educational purposes, does not allow the GPT model itself to adapt dynamically or “learn” from new interactions. Consequently, while it provides reliable and reproducible training scenarios, it limits scalability and adaptive learning capabilities. Recognizing this limitation, we ensured meticulous manual validation and iterative refinement to maintain clinical accuracy and educational effectiveness.
A feedback mechanism was also integrated into the system: after each simulated interview, students could compare their performance against pre-set reference histories and receive automated feedback on the completeness and logic of their questioning process. Detailed examples illustrating the GPT simulation process—including the system prompt used to define GPT behaviour, case generation, interaction flow, and feedback structure—are available in Supplement 1.
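For readers unfamiliar with the GPTs platform, the overall pattern (a system prompt that constrains the simulated patient, plus a post-encounter comparison against a reference history) can be approximated in code. The sketch below is a hypothetical analogue built on the OpenAI Chat Completions API; the model name, prompt wording, and helper functions are illustrative assumptions, not the study's actual configuration.

```python
# Hypothetical sketch of the prompt-engineering pattern described above, rebuilt with the
# OpenAI Chat Completions API. The study itself used the no-code GPTs platform; the model
# name, prompts, and feedback wording here are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PATIENT_SYSTEM_PROMPT = (
    "You are a simulated patient for history-taking practice. Stay in character, "
    "answer only what the student explicitly asks, keep replies brief and occasionally "
    "vague or off-topic, and never volunteer the full history or a diagnosis."
)

def patient_reply(case_summary: str, dialogue: list[dict]) -> str:
    """Return the simulated patient's next utterance for the current case."""
    messages = [
        {"role": "system", "content": PATIENT_SYSTEM_PROMPT + "\nCase background: " + case_summary},
        *dialogue,  # alternating student ("user") and patient ("assistant") turns
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def feedback_report(reference_history: str, transcript: str) -> str:
    """Compare the student's interview with a pre-set reference history and return
    feedback structured by the four-dimension rubric described in the Methods."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are a clinical tutor. Compare the transcript with the reference history "
                "and give structured feedback under four headings: history completeness, "
                "clinical reasoning/logical flow, communication, and professional behavior."
            )},
            {"role": "user", "content": f"Reference history:\n{reference_history}\n\nTranscript:\n{transcript}"},
        ],
    )
    return resp.choices[0].message.content
```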
Study procedure
After providing written informed consent, all participants completed a background questionnaire covering demographic data and prior medical-training exposure. They were then randomly assigned to either the intervention group (GPT simulation) or the control group (traditional role-playing) using a computer-generated sequence.

GPT group: Students interacted with AI-generated patients powered by the GPT-4 API. Each 30-minute session presented a simulated clinical encounter drawn from a library of scenarios representing common medical and surgical conditions (e.g., chest pain, dyspnea, abdominal pain, thyroid dysfunction). Students could either speak or type their questions. GPT generated a response that was played back as synthetic speech and simultaneously displayed as on-screen text. After each encounter the system produced a written feedback report that followed a fixed four-dimension rubric—history completeness, clinical reasoning/logical flow, communication and professional behavior—while dynamically tailoring comments to the individual student’s performance. All outputs were pre-validated by two clinicians for content accuracy.

Control group (traditional role-playing): Simulated patients were portrayed by experienced clinical-skills instructors who regularly teach history-taking at our institution. Before the trial, these instructors attended a calibration workshop to rehearse identical case briefs, key inquiry cues and emotional tone, ensuring standardized portrayal and minimizing instructor-related bias. Session length (30 min), session frequency (three per week for four weeks), group size and range of clinical conditions were matched to the GPT arm. After each encounter, the instructor provided structured feedback using the same four-dimension rubric applied in the GPT group.

Each 30-minute session was self-paced in both groups. Most participants completed 2–3 focused history encounters per session; learners who demonstrated proficiency on core items by week 3 were permitted to omit repetitive screening questions (e.g., smoking/alcohol history) and typically practiced 3–5 cases within the same period. The same pacing rule and case library applied to both groups to maintain parity. Both groups completed three sessions per week for four consecutive weeks—a duration previously shown to be sufficient for improving interview skills [10].

A structured clinical examination (OSCE) was administered one week before and one week after the training period. Two experienced clinical instructors, blinded to group allocation, independently scored each OSCE; discrepancies were resolved by discussion or, where necessary, a third adjudicator. Before data collection, the same two raters completed a brief on-site pilot calibration: six senior students (chosen to represent high, average and low performance) performed live history-taking exercises, which the raters scored independently. They then compared ratings and refined anchor interpretations until consensus was reached on all checklist items. No formal percentage-agreement or ICC statistics were calculated; this calibration was intended solely to align the raters’ understanding of the scoring criteria prior to the main study.
Outcome measures
The primary outcome was performance on a structured clinical examination, which all participants completed before and after the training. The structured clinical examination used in this study was designed with reference to the Undergraduate Medical Education Standards—Clinical Medicine (2022 Edition), jointly issued by the Ministry of Education and the National Health Commission of China. The assessment framework was adapted from the Objective Structured Clinical Examination model, which has been widely implemented in Chinese medical schools and has demonstrated good reliability and validity in evaluating clinical competencies. The examination consisted of four components (total score = 100 points): (a) History Collection (30 points): Including the level of detail in the chief complaint and symptoms (10 points), the ability to understand and identify patient information (10 points), and the appropriateness and logic of follow-up questions (10 points); (b) Clinical Reasoning (30 points): Including the thoroughness of diagnostic thinking (15 points) and the efficiency in processing information and forming clinical judgments (15 points); (c) Communication Skills (20 points): Including interaction with patients (10 points) and clarity of information delivery (10 points); (d) Professional Behavior (20 points): Adherence to clinical procedural norms (10 points) and professional attitude towards patients (10 points) (eAppendix 1 in Supplement 2).
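For clarity, the mark scheme above can be written out as a weighted checklist. The snippet below is only an illustrative encoding of the published point allocations, with a check that the components sum to the 100-point total; item names follow the text, but the data structure itself is not part of the study materials.

```python
# Hypothetical encoding of the structured clinical examination mark scheme described above.
OSCE_RUBRIC = {
    "History Collection": {                                             # 30 points
        "Detail of chief complaint and symptoms": 10,
        "Understanding and identifying patient information": 10,
        "Appropriateness and logic of follow-up questions": 10,
    },
    "Clinical Reasoning": {                                             # 30 points
        "Thoroughness of diagnostic thinking": 15,
        "Efficiency in processing information and forming judgments": 15,
    },
    "Communication Skills": {                                           # 20 points
        "Interaction with patients": 10,
        "Clarity of information delivery": 10,
    },
    "Professional Behavior": {                                          # 20 points
        "Adherence to clinical procedural norms": 10,
        "Professional attitude towards patients": 10,
    },
}

total = sum(sum(items.values()) for items in OSCE_RUBRIC.values())
assert total == 100, "component weights must sum to the 100-point total"
```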
The secondary outcomes were feedback on the effectiveness of the training methods and student satisfaction (eAppendix 2, 3 in Supplement 2). Feedback on the effectiveness of the training methods was collected from both groups through a questionnaire on a 5-point Likert scale, assessing factors such as post-class self-directed learning, enhanced enthusiasm for learning, improvements in communication and feedback skills, increased expressiveness, improvement in interview logical reasoning ability—defined here as the ability to formulate, adapt, and sequence clinical questions logically during history-taking—and changes in anxiety levels during the interview process. The item on post-class self-directed learning was included as an exploratory indicator of whether the training method stimulated ongoing engagement—a key component of long-term educational impact in medical training. Student satisfaction was evaluated regarding the naturalness of patient interaction, the usability of the training methods, and the likelihood of recommending the training to peers. In addition, we included an item on the perceived diversity of diseases encountered during training to capture students’ subjective experience of case variety. Given that GPT-based simulations can generate a broader range of clinical scenarios based on learner input, this dimension was considered relevant to overall satisfaction with the training experience. All participants provided qualitative feedback on their training experiences. Responses were analyzed using basic thematic analysis, with one researcher identifying key themes (usefulness, convenience, engagement) and a second researcher verifying the categorizations.
Statistical analysis
Data analysis was conducted using SPSS and Prism software. Quantitative variables that were normally distributed are presented as mean ± standard deviation, while variables that were not normally distributed are reported using median and interquartile range. Between-group comparisons were performed using independent samples t-tests or rank-sum tests. Categorical variables were presented as frequencies, and between-group differences were assessed using Chi-square tests. All statistical tests were two-sided, and a P-value of < 0.05 was considered statistically significant. Effect sizes for between-group differences were calculated using Cohen’s d, with values of 0.2, 0.5, and 0.8 representing small, medium, and large effects, respectively.
Results
A total of 56 medical students from Anhui Medical University were enrolled in this study and randomly assigned to two groups, with 28 participants in the GPT group and 28 in the control group. Baseline characteristics were evenly distributed between the groups (Table 1). All participants completed the training, and there were no dropouts during the study.
Comparison of structured clinical examination scores
Before the training, there was no statistically significant difference in the average structured clinical examination scores between the intervention and control groups (57.39 ± 11.14, 54.68 ± 10.33, respectively, P = 0.349). After training, the GPT group scored significantly higher on the structured clinical examination than the control group (86.79 ± 5.46 vs. 73.64 ± 4.76, P < 0.001). This large difference (Cohen’s d = 2.57) suggests a strong educational effect. This between-group difference highlights the potential value of GPT-based training in enhancing clinical competencies. Details are presented in Table 2; Fig. 2.
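As a transparency check, the reported effect size can be recovered from the published summary statistics alone. The sketch below re-derives the between-group t-test and Cohen's d in SciPy from the post-training group means and SDs; it is an illustrative re-computation from summary data, not the original SPSS/Prism analysis of raw scores.

```python
# Illustrative re-computation of the primary between-group comparison from the reported
# post-training summary statistics (mean ± SD, n = 28 per group).
from math import sqrt
from scipy import stats

m1, sd1, n1 = 86.79, 5.46, 28   # GPT group, post-training OSCE score
m2, sd2, n2 = 73.64, 4.76, 28   # control group, post-training OSCE score

# Independent-samples t-test computed from summary statistics
t_stat, p_value = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)

# Cohen's d using the pooled standard deviation
pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
cohens_d = (m1 - m2) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.1e}, d = {cohens_d:.2f}")  # d ≈ 2.57, as reported
```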
Feedback on educational effectiveness
Students in the GPT group reported significantly greater post-training gains in self-directed learning, enthusiasm, communication feedback, and logical reasoning during interviews than those in the control group (P = 0.004, P < 0.001, P = 0.006, P = 0.036, respectively). Additionally, students in the GPT group experienced less anxiety during the interviews (P < 0.001). Details are presented in Table 2; Fig. 3.
Fig. 3 Comparison of students’ evaluation of the educational effectiveness of the training methods. Legend: Students’ evaluation of educational effectiveness on a 5-point Likert scale. The GPT group reported significantly higher self-directed learning, enthusiasm, communication skills, and reasoning ability, and lower anxiety (P < 0.05).
Student satisfaction survey
Students in the GPT group rated the diversity of diseases encountered, the ease of use of the training methods, and their likelihood of recommending the training methods significantly higher than the control group (P = 0.004, P < 0.001, P < 0.001, respectively). Details are presented in Table 2; Fig. 4.
Qualitative feedback
Thematic analysis of student feedback revealed three key themes: perceived usefulness, convenience, and engagement. Students in the GPT group commonly reported that the system enhanced their efficiency in history-taking and clinical reasoning. They appreciated the system’s ease of use, flexible scheduling, and wide range of clinical cases. One student noted, “I could practice with many different diseases whenever I had time, which helped me understand textbook content more deeply.” In contrast, students in the control group expressed concerns about the limitations of traditional instructor-led role-playing. They highlighted the logistical burden on teachers and restricted access to practice opportunities. As one student commented, “If the teacher wasn’t available, we couldn’t continue practicing. It was hard to stay consistent.” Overall, feedback suggested that while both methods were valued, the GPT-based training provided greater autonomy and engagement, especially in reinforcing knowledge through repeated and varied practice.
Discussion
This study explores the feasibility and effectiveness of using a GPT-based simulation system for medical student history-taking training. The results show that the GPT group performed significantly better than the control group. This finding is consistent with recent research trends in the application of technology-driven educational tools in the field of medicine [11].
Our findings support the use of GPT-based simulations for structured history-taking training, showing clear improvements in clinical skills compared to traditional role-playing. A key contribution of this study is the measurable improvement in clinical reasoning and communication—areas that are often challenging for medical students [1, 2]. The notably higher performance in these domains by GPT-trained students suggests that the immediate, structured feedback provided by the GPT system—absent in many traditional methods—may have significantly facilitated active reflection and rapid iteration of questioning techniques. This supports Shanahan et al.’s view that GPT agents can simulate a wide range of characters through ongoing, adaptive conversation [12].

Additionally, our results revealed substantial enhancements in student motivation and self-directed learning behaviors post-training, extending beyond the scope of typical competency assessments. These improvements likely stem from the GPT system’s flexibility, allowing learners autonomy in pacing and scenario selection, thus promoting intrinsic motivation—a critical factor for sustained educational impact [13, 14]. This contrasts sharply with traditional role-playing exercises, where practical constraints limit personalized learning opportunities and may inadvertently reduce learner motivation through rigid structure and instructor dependency.

A further significant finding was the reduced anxiety reported by students engaging with GPT simulations compared to traditional methods. Reduced anxiety in clinical skill training environments can markedly improve educational outcomes, as anxiety has been consistently shown to impair cognitive processing and skill acquisition [15]. This anxiety reduction in the GPT group is likely attributed to the psychological safety provided by an AI environment, free from evaluative pressures typically experienced in instructor-led sessions. Consequently, GPT-based platforms may better prepare students to transition confidently into clinical practice, as indicated by students’ qualitative feedback.

By using empirical data and direct comparisons, this study clarifies how GPT works educationally and offers practical guidance for integrating AI into medical training. In addition, the GPT-based training system offers several key advantages: (1) Diverse Clinical Case Handling [16, 17]: GPT can process large volumes of clinical data, providing students with a wide range of cases—from simple to complex. This helps learners gain a more comprehensive understanding of various clinical scenarios. (2) Continuous Learning and Adaptation [18]: GPT is constantly updated with the latest medical research, guidelines, and case studies, ensuring that the educational content remains current and accurate. By integrating real-world data from multiple medical disciplines, it creates an interdisciplinary learning environment. This not only enhances students’ clinical reasoning and interview skills but also makes it a powerful supplement to traditional teaching. As technology advances and more data become available, GPT’s role in medical education is expected to grow even further.
Interpretation of results
While both groups received equal training time and completed the same number of scheduled sessions, the GPT group demonstrated significantly better outcomes. Several factors may explain this difference. First, the GPT system allowed students to adjust the difficulty level and case types based on their needs. Once they became familiar with common questioning patterns, they could efficiently skip redundant parts and proceed to new cases, increasing the number of interactions within each session. Second, GPT sessions eliminated delays typically seen in traditional role-playing, such as waiting for instructors to switch roles or prepare feedback. Students were able to complete multiple high-quality simulations in a row without interruption. Third, students in the GPT group reported lower levels of anxiety during training, likely because they did not feel judged by an instructor. This reduced pressure helped students transition more quickly into a clinical mindset, remain focused throughout the history-taking session, and engage more effectively in the learning process. Another potential reason for the GPT group’s superior performance lies in the system’s highly interactive and adaptive nature [16]. The GPT model provided immediate, tailored written feedback within seconds after each simulation, allowing students to reflect and improve while their memory of the interaction was still fresh. This rapid feedback loop may have facilitated deeper learning and knowledge integration. Students’ qualitative feedback supported this, with many emphasizing the benefit of being able to “practice anytime and receive useful suggestions right away.” These combined advantages may account for the superior performance observed in the GPT group.
Comparison with previous research
With the rapid development of AI technology in recent years, its application in medical education has been steadily increasing [19]. Studies show that AI-based teaching tools can enhance students’ motivation and teaching outcomes [15, 20]. While previous studies have explored the potential of large language models in medical education, most of them have focused on either conceptual frameworks or narrowly defined tasks, such as short-term clinical reasoning or simple patient interactions. For instance, Holderried et al. [6] demonstrated that GPT-powered simulated patients could offer automated feedback in isolated history-taking tasks, and Brügge et al. [7] reported improved decision-making through AI-facilitated simulations. However, these studies often lacked structured, multi-dimensional evaluations that encompass both objective performance and learner-centered outcomes such as satisfaction, motivation, and anxiety. In contrast, our study advances the field by implementing a comprehensive, controlled comparison between GPT-based simulations and traditional role-play training, using a structured clinical examination that evaluates four critical competencies—history collection, clinical reasoning, communication skills, and professional behavior. Moreover, we integrated learner feedback across both quantitative and qualitative dimensions, including psychological and behavioral responses to training. This multi-level evaluation provides more robust evidence of the educational value of GPT, moving beyond preliminary proof-of-concept work. The combination of enhanced clinical performance and higher student-reported motivation, reduced anxiety, and greater engagement underscores GPT’s role not just as a technological novelty, but as a pedagogically sound tool that addresses multiple limitations of traditional methods. Therefore, our study contributes not only empirical validation but also pedagogical insight into how AI can be meaningfully integrated into competency-based medical education. By addressing gaps in prior research design and outcome measurement, we demonstrate that GPT-based training may go beyond enhancing task performance to shaping learner behavior and experience in a clinically relevant and sustainable way.
External validity and curricular comparability
Because all participants were fifth-year Chinese medical students who had previously received only classroom-based lectures and laboratory practicals in Diagnostics and related disciplines, they were just beginning their first sustained clinical rotations and therefore had no real-patient history-taking experience at study entry. This developmental stage is broadly comparable to the start of third-year core clerkships in a typical 4-year U.S. MD programme, when students first undertake 6- to 8-week rotations in internal medicine, surgery, pediatrics, psychiatry and related disciplines. Providing this curricular context helps readers judge the transferability of our findings to other international settings where learners possess comparable pre-clerkship knowledge yet limited real-patient interview experience.
Disadvantages of GPT teaching methods
Despite its advantages in accessibility, standardization and scalability, the GPT teaching method still has several noteworthy limitations. First, the system is highly data-dependent and requires a large volume of high-quality medical-education content to generate clinically realistic scenarios.
Second, although our platform incorporates automatic speech recognition and text-to-speech synthesis—thus allowing real-time voice conversations that convey prosody, pausing and basic tone—it cannot display visual non-verbal behaviors such as facial expression, eye contact or hand gestures. Non-verbal communication plays a vital role in physician–patient encounters, particularly in building rapport, conveying empathy, and identifying inconsistencies between verbal reports and observed behavior. The absence of these cues may therefore have reduced the communicative complexity confronting learners and may partially inflate the observed advantage of the GPT group over the traditional role-play condition, in which teachers acting as patients provided both verbal and visual feedback. For example, physicians often judge whether a patient’s reported pain matches observable behavior—such as facial grimacing or protective movements—to distinguish organic from non-organic symptoms. Likewise, anxious or hyperactive patients whose restless gestures or over-talkativeness signal distress usually require a different interview strategy with more structured guidance and emotional reassurance. These nuanced adjustments are difficult to cultivate in a purely text-based or voice-only simulation environment. To mitigate the missing-cue limitation, several hybrid solutions could be explored: (i) coupling the LLM dialogue engine to an embodied conversational agent or 3-D avatar capable of rendering facial micro-expressions and gaze behavior; (ii) adopting sequential training pathways in which students first practice questioning strategies with GPT, then consolidate non-verbal competencies in high-fidelity manikin or human standardized-patient sessions; and (iii) embedding AI-driven voice interactions within existing OSCE circuits, where an examiner or peer supplies real-time visual cues while the LLM controls the clinical narrative. Such blended modalities may preserve the scalability and immediate feedback of GPT while exposing learners to the full spectrum of communication signals required in authentic clinical encounters.
Study limitations
The main limitations of this study include the relatively small sample size and the fact that it was conducted at a single institution, which may affect the generalizability and external validity of the results. Furthermore, the study had a short follow-up period and did not evaluate the long-term educational effects. Future studies can overcome these limitations by expanding the sample size, collaborating across multiple centers, and extending the follow-up period. Meanwhile, although our structured clinical examination employed the nationally standardized checklist and the two raters underwent an initial calibration exercise, we did not compute formal inter-rater reliability indices (e.g., percentage agreement, ICC, or G-coefficients) across the study sample. Consequently, some degree of scorer variability cannot be ruled out. Future studies should include duplicate ratings on a larger proportion of encounters and report full reliability statistics to strengthen the psychometric evidence base.
Conclusion
The GPT-based history-taking simulation demonstrated promising efficacy in improving medical students’ history-taking skills in this single-center, randomized controlled trial. Although these findings support the potential utility of GPT-driven simulations in medical education, further studies involving larger sample sizes, multi-center collaborations, and assessments of long-term impacts are necessary before broader generalizations or wide-scale implementation can be recommended.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- GPT: Generative Pre-trained Transformer
- AI: Artificial intelligence
- GPTs: Generative Pre-trained Transformers
- CONSORT: Consolidated Standards of Reporting Trials
References
Lee J, Kim H, Kim KH, Jung D, Jowsey T, Webster CS. Effective virtual patient simulators for medical communication training: A systematic review. Med Educ. 2020;54(9):786–95.
Jani KH, Jones KA, Jones GW, Amiel J, Barron B, Elhadad N. Machine learning to extract communication and history-taking skills in OSCE transcripts. Med Educ. 2020;54(12):1159–70.
Liang Z, Xu M, Liu G, Zhou Y, Howard P. Patient-centred care and patient autonomy: doctors’ views in Chinese hospitals. BMC Med Ethics. 2022;23(1):38.
Stretton B, Kovoor J, Arnold M, Bacchi S. ChatGPT-Based learning: generative artificial intelligence in medical education. Med Sci Educ. 2024;34(1):215–7.
Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. 2024;17(5):926–31.
Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, et al. A Language Model-Powered simulated patient with automated feedback for history taking: prospective study. JMIR Med Educ. 2024;10:e59213.
Brügge E, Ricchizzi S, Arenbeck M, Keller MN, Schur L, Stummer W, et al. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC Med Educ. 2024;24(1):1391.
Preiksaitis C, Rose C. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: scoping review. JMIR Med Educ. 2023;9:e48785.
Boscardin CK, Gin B, Golde PB, Hauer KE. ChatGPT and generative artificial intelligence for medical education: potential impact and opportunity. Acad Med. 2024;99(1):22–7.
Keifenheim KE, Teufel M, Ip J, Speiser N, Leehr EJ, Zipfel S, Herzog W, Becker G. Teaching history taking to medical students: a systematic review. BMC Med Educ. 2015;15:159.
Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. 2024;58(11):1276–85.
Shanahan M, McDonell K, Reynolds L. Role play with large language models. Nature. 2023;623(7987):493–8.
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.
Wu Y, Zheng Y, Feng B, Yang Y, Kang K, Zhao A. Embracing ChatGPT for medical education: exploring its impact on doctors and medical students. JMIR Med Educ. 2024;10:e52483.
Hamid H, Zulkifli K, Naimat F, Che Yaacob NL, Ng KW. Exploratory study on student perception on the use of chat AI in process-driven problem-based learning. Curr Pharm Teach Learn. 2023;15(12):1017–25.
Thomae AV, Witt CM, Barth J. Integration of ChatGPT into a course for medical students: explorative study on teaching scenarios, students’ perception, and applications. JMIR Med Educ. 2024;10:e50545.
Xu X, Chen Y, Miao J. Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review. J Educ Eval Health Prof. 2024;21:6.
Holderried F, Stegemann-Philipps C, Herschbach L, Moldt JA, Nevins A, Griewatz J, et al. A generative pretrained transformer (GPT)-powered chatbot as a simulated patient to practice history taking: prospective, mixed methods study. JMIR Med Educ. 2024;10:e53961.
Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291.
Wu SY, Yang KK. The effectiveness of teacher support for students’ learning of artificial intelligence popular science activities. Front Psychol. 2022;13:868623.
Acknowledgements
None.
Funding
2023 Anhui Province Quality Engineering Project (2023jyxm1150), 2024 Anhui Province Quality Engineering Project (2024jyxm0779).
Author information
Contributions
ZW and XCW contributed to conception and design of the study, and acquisition, analysis, and interpretation of data; MLL, NJZ, TTF contributed to acquisition and interpretation of data. All authors contributed to revision of the manuscript, approved the final version, and had final responsibility for the decision to submit for publication.
Ethics declarations
Ethics approval and consent to participate
This study was conducted in accordance with the ethical principles of the Declaration of Helsinki (https://www.wma.net/policies-post/wma-declaration-of-helsinki/). The study protocol was approved by the Ethics Committee of the Second Affiliated Hospital of Anhui Medical University. Written informed consent was obtained from all participants prior to their inclusion in the study.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Z., Fan, TT., Li, ML. et al. Feasibility study of using GPT for history-taking training in medical education: a randomized clinical trial. BMC Med Educ 25, 1030 (2025). https://doi.org/10.1186/s12909-025-07614-9