Abstract
There is growing recognition of the need for clinical trials to safely and effectively deploy artificial intelligence (AI) in clinical settings. We introduce dynamic deployment as a framework for AI clinical trials tailored to the dynamic nature of large language models. It makes possible complex medical AI systems that continuously learn and adapt in situ from new data and from interactions with users, while enabling continuous real-time monitoring and clinical validation.
Introduction
Artificial intelligence (AI) in medicine has been riding atop a wave of inflated expectations and hype for at least a decade1, with the pace of research continuing to accelerate in recent years driven by the advent of large language models (LLMs)2,3. Scientists and clinicians have sought to take advantage of these technological advancements by applying LLMs to a wide array of areas in the healthcare system, ranging from estimating causal treatment effects of medications from online forum posts4, to aiding in writing research articles5, and automating administrative tasks such as insurance prior authorization paperwork6. Among many in the healthcare field, there is a general consensus that AI in healthcare is “the future”7.
The lifecycle of an AI model can be conceptualized in stages, beginning with initial problem identification, then proceeding to a design phase, followed by model development in the research setting, silent deployment, then deployment to “production” setting in the real-world, and finally a post-deployment phase of monitoring and making changes or removing from production as needed8,9,10,11. This has similarities to models of the software development lifecycle12.
In the medical context, there is a growing recognition of the need for deployment to happen through clinical trials, so as to protect participants and rigorously ensure the safety and efficacy of models13. Yet while there have been some notable examples of prospective clinical trials of LLM tools, such as for aiding nurses with receptionist tasks14 and drafting responses to patient messages15, overall only a very small fraction of models ever makes it out of the research phase to be deployed in the real-world setting. A systematic review in 2022 found only 41 randomized trials of machine learning interventions worldwide16; by 2024, this number had increased to a total of only 8617. A 2023 analysis of insurance claims found a total of only 16 medical AI procedures with billing codes18. Overall, the medical system has failed to keep up with the pace of recent developments in AI – this disconnect, known by various terms such as “implementation gap”19 and “AI chasm”20, means that the vast majority of research advances in medical AI never actually directly benefit patients or clinicians. The causes of the implementation gap are multifactorial and include not only technical and logistical barriers, but also sociocultural, ethical, economic, and regulatory factors21,22,23,24,25,26. Bridging the implementation gap is one of the largest challenges currently facing the field of medical AI.
In this piece, we describe the currently predominant approach to medical AI deployment, which is based on a linear, model-centric understanding of AI. We then identify several shortcomings of this paradigm when applied to LLM-based systems and propose an alternative way of conceptualizing AI in medicine based on continual processes of model updating, real-world evidence generation, and safety monitoring, which we call dynamic deployment. We chart a path towards making such dynamic deployments a reality, drawing on well-established methods of adaptive clinical trials as well as more recent technical advances and developments in regulatory science of medical AI.
Linear model of AI deployment
Where AI models have been successfully deployed in healthcare, they typically follow a pattern which we refer to as the linear model of AI deployment (Fig. 1a). First, a model is developed in the research domain, most often by training on retrospective data. The model is then assessed, and its performance characteristics evaluated. When the decision is made to move the model from research into deployment, it is frozen: all the model’s parameters are locked and remain static for as long as it is deployed. Although it could be updated periodically in response to new data or performance degradations identified through post-deployment monitoring and auditing, there are few examples of this happening in practice.
Fig. 1: a In the linear model of AI deployment, a model is first trained in the research/development setting, then deployed to the real-world setting with its parameters frozen. Model weights may be updated periodically following post-deployment monitoring and auditing. b In the dynamic framework, models are first pre-trained in the research/development setting. They remain dynamic when deployed, and mechanisms are in place to enable continuous updating in response to feedback signals from their deployment environments (arrows). Multiple AI models may be simultaneously deployed and interacting. All elements inside the blue box are considered part of the complex AI system, including the AI models, the users, the workflow integrations and interfaces by which they interact, and the feedback and update mechanisms.
In the linear framework, the focus is on a particular AI model. More specifically, it is a particular instance of the given model defined by its set of parameters. The linear model is intuitive and closely mirrors the process by which other technologies are brought into clinical practice. However, the linear model of AI deployment is a poor fit for modern LLM systems, for three principal reasons which we outline below.
1. AI is an adaptive technology
AI systems have an important difference from other technologies in medicine: they are adaptive. Indeed, one of the most important attributes of modern LLMs, with their billions of parameters, is their flexibility. Model weights need not remain fixed throughout the lifespan of a deployment; they can be periodically finetuned or updated as batches of new data come in. Methods such as reinforcement learning from human feedback (RLHF)27 and direct preference optimization (DPO)28 also allow LLMs to learn directly from their users in order to better align with user preferences, and recent work has extended these approaches to the “online learning” setting, allowing continuous updating of deployed models29,30. The behavior of LLMs can also be substantially changed during deployment through interactions with users, without updating any model parameters: for example, in-context learning allows LLMs to learn from new training data presented in their prompts31, and chain-of-thought prompting enables LLMs to reason more effectively through complex problems32. For all these reasons, the line between model development and model deployment is becoming increasingly blurred. The linear model of AI deployment, which assumes that learning occurs only in discrete phases, struggles to accommodate these interactive and dynamic features of emerging AI systems and thus excludes many of the most promising avenues of modern AI.
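To illustrate how such preference-based updating could operate during deployment, the sketch below implements the core DPO objective in PyTorch. The scoring helpers (`policy_logprob`, `ref_logprob`), the preference stream, and the optimizer wiring are hypothetical placeholders rather than any particular library's API; this is a minimal sketch of the technique described by Rafailov et al.28, not a prescribed implementation.

```python
# Minimal sketch of a DPO-style preference update, illustrating how a deployed
# LLM could keep learning from user feedback collected in situ.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: push the policy to prefer the response the user chose,
    while staying close to the frozen pre-deployment reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Schematic online-learning loop: each new preference pair gathered during
# deployment triggers a small gradient step on the deployed model.
# (policy_logprob, ref_logprob, preference_stream, and optimizer are placeholders.)
# for prompt, chosen, rejected in preference_stream:
#     loss = dpo_loss(policy_logprob(prompt, chosen),
#                     policy_logprob(prompt, rejected),
#                     ref_logprob(prompt, chosen),
#                     ref_logprob(prompt, rejected))
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```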
2. AI functions as part of a complex system
Secondly, the linear model does not sufficiently account for the complex systems in which AI models are employed. The outputs of the model itself are of course crucial, but are only one part of the system. Other factors beyond the model parameters also drive outcomes. For example, choices related to user interface design can shape interactions between humans and AI models, introducing new cognitive biases into clinical decision-making33,34,35,36. Even when clinicians are given access to LLM systems with super-human abilities, human users will not necessarily be able to effectively take advantage of the full potential of these tools without specialized training37. The behaviors of interactive AI systems, such as chatbots, also depend integrally on the behavioral patterns and values of the particular population of users38. Thus, even when model weights are frozen, the system is not static. By adopting a model-centric, parameter-centric view of AI, the linear model fails to adequately account for the numerous other factors contributing to meaningful outcomes in the real world.
3. Health systems of the future will have many AI models operating at once
Finally, the linear model of AI relies on the premise of isolating a single model for testing. This is reasonable today, when relatively few AI models are deployed in the wild. However, it poses a major challenge for scaling up AI integration. In the near future, there may be orders of magnitude more models deployed in various contexts throughout the medical system. Users may interact with many different models during their routine workflows, and models could interact with each other and be interdependent in complex ways. This is exemplified by the emerging paradigm of multi-agent AI systems, whereby tasks are completed by a cohort of individual LLM-based agents orchestrated by other “supervisory” models39,40,41. In such scenarios, AI clinical trial designs that seek to evaluate the behavior of a specific model in isolation would be impractical.
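To make the multi-agent pattern concrete, the sketch below shows one simple way a supervisory model could route a task to specialist agents. The `call_llm` function, the agent names, and their system prompts are hypothetical stand-ins for whatever chat-completion interface and roles a given deployment uses; a real orchestration framework would add error handling, tool use, and audit logging.

```python
# Schematic of a supervisory ("orchestrator") model routing work to specialist
# LLM agents. `call_llm(system_prompt, user_message) -> str` is a placeholder
# for any chat-completion client, not a specific vendor API.
from typing import Callable, Dict

def triage_and_dispatch(task: str,
                        call_llm: Callable[[str, str], str],
                        agents: Dict[str, str]) -> str:
    """Ask a supervisor model which specialist should handle the task,
    then forward the task to that agent's system prompt."""
    menu = ", ".join(agents)
    route = call_llm(
        "You are a supervisor. Reply with exactly one agent name.",
        f"Available agents: {menu}. Task: {task}",
    ).strip()
    # Fall back to a general-purpose agent if the supervisor's answer is unrecognized.
    system_prompt = agents.get(route, agents["generalist"])
    return call_llm(system_prompt, task)

# Example wiring with purely illustrative roles:
# agents = {
#     "generalist": "You are a general clinical assistant.",
#     "pharmacist": "You check medication doses and interactions.",
#     "coder": "You suggest billing codes from clinical notes.",
# }
# answer = triage_and_dispatch("Check this discharge medication list...", call_llm, agents)
```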
Dynamic systems model of AI deployment
To overcome these challenges, we propose an alternative framework for clinical trials and deployment of LLMs, which we call dynamic deployment (Fig. 1b). In a nutshell, the dynamic deployment model is distinguished from the linear deployment model in two key ways: 1) by embracing a systems-level understanding of medical AI, and 2) by explicitly accounting for the fact that such systems are dynamic and constantly changing. In this section we describe the framework and discuss how it can be applied in the real world through adaptive clinical trials.
The first principle is a systems-level approach to medical AI. In this model, the AI system is conceptualized as a complex system with multiple interconnected moving parts. The AI model itself sits at the core and functions the same as in the linear model: taking input data and producing outputs according to its internal parameters. What sets this approach apart, however, is that other elements of the AI system are also explicitly included as parts of the intervention. These include the population of users, each guided by their own values and behavioral patterns; the workflow integration and user interface by which users interact with models; and other automated elements, such as data generation or processing pipelines and the update mechanisms for online learning. Each individual component contributes to the overall behavior of the system, although disentangling the exact contribution of each element may not be feasible. The systems-level view holds that it is not actually necessary to measure these complex intra-system relationships. What matters is the behavior of the system as a whole, as measured by metrics that are meaningful in the real world, such as patient outcomes42. For example, gradual degradation of performance metrics over time is a clear indicator that the system as a whole is not functioning well, even though it may be difficult or impossible to isolate the effects of AI model degradation from other sources of variation, such as natural fluctuation in patient or user populations. A systems-level approach aims to use feedback loops to learn from these performance changes over time, regardless of their root causes. By shifting focus to a systems-level conceptualization of medical AI, we can better measure the things that actually matter.
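As one concrete illustration of system-level monitoring, the sketch below tracks a single real-world outcome metric over a rolling window and flags degradation relative to a baseline captured at go-live. The metric, window size, and tolerance are illustrative assumptions; in practice they would be pre-specified by the clinical and governance teams.

```python
# Sketch of continuous system-level monitoring: track a real-world outcome
# metric (e.g., clinician acceptance of AI-drafted messages, or 30-day
# readmission) over a rolling window and flag drift relative to a baseline.
from collections import deque

class OutcomeMonitor:
    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline             # metric value accepted at deployment
        self.recent = deque(maxlen=window)   # most recent per-case outcomes (0/1)
        self.tolerance = tolerance           # allowed absolute drop before alerting

    def record(self, outcome: float) -> None:
        self.recent.append(outcome)

    def degraded(self) -> bool:
        """True if the rolling metric has fallen below baseline by more than the
        tolerance; a system-level signal, agnostic to the root cause of the drop."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough post-deployment data yet
        current = sum(self.recent) / len(self.recent)
        return (self.baseline - current) > self.tolerance
```

In a dynamic deployment, such a signal could simultaneously alert the oversight team and trigger one of the adaptation mechanisms described below.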
The second principle informing the design of dynamic medical AI deployments is the recognition that these are systems which change over time. AI models still undergo an initial research and development phase before being deployed; however, this phase is understood as “pretraining”, i.e., the start of training rather than the end. Instead of being frozen, models are allowed to continue evolving in response to feedback signals during deployment. This can occur through mechanisms such as online learning or finetuning with new data, alignment with user preferences via RLHF or DPO, or more subtle routes such as drift in user populations altering system behavior through changing usage patterns. We list concrete examples of feedback signals in Table 1 and of adaptation mechanisms that respond to those signals in Table 2. Rather than trying to freeze the system and measure its performance at discrete snapshots in time, the dynamic approach relies on feedback loops that allow for both continuous iteration and continuous evaluation. Discrete post-deployment updates and audits are augmented by their continuous analogs, allowing AI systems to continually update in response to new data.
In this view, deployment itself can be thought of as another phase of the model-generation process, whereby the model learns directly from its intended users and from new data as it arrives. The linear notion of “train → deploy → monitor” is thus replaced by a system in which all three processes happen at once. Treating medical AI systems as dynamic is more faithful to their real-world behavior and allows for intelligent systems that take maximal advantage of all available data and learn from every participant.
We note that if all the feedback flows (i.e., online learning, alignment, prompting, steering, etc.) are removed from the dynamic model, the result is a linear model. Therefore, the linear model is a special case of the dynamic model. The dynamic model simply formalizes and makes explicit the routes of information flow and system evolution which are implicitly present in all linear AI deployment systems.
Adaptive clinical trials for medical LLM deployment
Deployment and clinical validation
One of the most urgent challenges for medical AI is clinical validation. Deep learning models, especially LLMs, are largely empirical with few theoretical performance guarantees, meaning that our ability to characterize their real-world behavior in the research setting is limited. Retrospective analyses are often used to estimate the likely behavior and impact of AI models when deployed, but these are imperfect proxy measures and reliance on them can ultimately make AI systems more risky and potentially lead to unforeseen behavior43. Recent work has stressed the importance of real-world deployment for evaluating real-world model effectiveness44 and highlighted how model performance metrics assessed during training and development may change when deployed in the real world45,46,47.
However, a recent study of the 521 medical AI devices approved by the FDA found that more than 40% lacked any such clinical validation data48. Generative AI tools available to the general public are also being widely used in clinical settings, despite the fact that presumably none has been validated or officially approved for medical use: a recent survey of 1000 doctors in the United Kingdom found that 20% of respondents reported using generative AI tools in their practice49. Using AI tools without clinical validation increases the risk of unforeseen consequences, negative outcomes, and eroded trust among patients, clinicians, and the public.
Dynamic deployments help address this problem because continual performance monitoring is baked into the system design. Deployment is not only the only way to deliver on the promise of AI in medicine by making a tangible impact on real patients and clinicians; it is also the only way to directly study the behavior of AI models in situ. In addition to providing a supervisory signal for online learning and other feedback mechanisms, these performance metrics can be used for real-time monitoring and oversight. By including performance assessment as a core principle of AI system design, each deployment can be viewed as a kind of local clinical trial; such recurring local validations may in fact be better suited to modern AI systems than the external-validation paradigm on which multi-site clinical trials are based50.
Existing precedent
At first blush, the proposed shift towards dynamic AI systems may seem to make clinical deployment even more difficult than it already is, possibly even widening the implementation gap. However, this need not be the case. In this section, we chart a path towards making dynamic medical AI a reality.
While AI is a new technology, forms of dynamic deployment have long been used in early-stage clinical trials to navigate the high degree of uncertainty in the benefit/harm profile often seen in phase I studies. For example, the continual reassessment method uses a Bayesian framework to learn from new data as it arrives and continually update the algorithm responsible for assigning patients to trial arms51. First developed more than 30 years ago, such adaptive trial designs are still in use today52. Not only does this approach address an ethical concern by ensuring that no patient is given a treatment known to be inferior, it also improves statistical efficiency by utilizing all data gleaned from previous trial participants51. Conceptually, this can be viewed as a form of dynamic deployment in which the AI model is a Bayesian model rather than an LLM, and online learning is used to continuously optimize the model parameters in response to patient outcomes. Guidelines for protocol design and reporting of clinical trials involving AI considered such continuously learning trial designs “of interest” but intentionally excluded them as still too “early in development”53,54. However, because adaptive trial designs are in fact well established and are already relied upon for making policy and treatment decisions, they represent a promising blueprint for pursuing dynamic deployments of medical AI systems without the need to invent entirely new regulatory mechanisms.
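For readers unfamiliar with the continual reassessment method51, the sketch below shows its core Bayesian update in a few lines of Python. The dose-toxicity skeleton, prior, and target toxicity are illustrative values, and real designs add safeguards such as no dose-skipping and formal stopping rules; the point is simply that the dose-assignment model is re-fit after every participant, which is the kind of feedback loop that dynamic deployment generalizes to LLM-based systems.

```python
# Minimal sketch of the continual reassessment method: a one-parameter "power"
# dose-toxicity model is updated after each observed outcome, and the next
# participant is assigned to the dose whose estimated toxicity probability is
# closest to the target. All numerical choices here are illustrative.
import numpy as np

skeleton = np.array([0.05, 0.10, 0.20, 0.35, 0.50])  # prior toxicity guesses per dose
target = 0.25                                         # target toxicity probability
a_grid = np.linspace(-3, 3, 601)                      # grid over the model parameter a
prior = np.exp(-0.5 * a_grid**2 / 1.34**2)            # N(0, 1.34^2) prior (unnormalized)

def posterior_tox(doses, outcomes):
    """Posterior-mean toxicity probability at each dose level, given the dose
    indices assigned so far and the corresponding binary toxicity outcomes."""
    # Model: P(toxicity at dose d | a) = skeleton[d] ** exp(a)
    probs = skeleton[np.asarray(doses)][:, None] ** np.exp(a_grid)[None, :]
    y = np.asarray(outcomes)[:, None]
    likelihood = np.prod(np.where(y == 1, probs, 1.0 - probs), axis=0)
    post = prior * likelihood
    post /= post.sum()
    return (skeleton[:, None] ** np.exp(a_grid)[None, :]) @ post

def next_dose(doses, outcomes):
    """Assign the next participant to the dose closest to the target toxicity."""
    est = posterior_tox(doses, outcomes)
    return int(np.argmin(np.abs(est - target)))

# Example: three participants at dose 0 with no toxicity, one toxicity at dose 1:
# next_dose(doses=[0, 0, 0, 1], outcomes=[0, 0, 0, 1])
```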
Challenges
Practical challenges remain that must be addressed to enable widespread deployment of dynamic medical AI systems. First, building and maintaining infrastructure for feedback loops will require investment on the part of hospitals and health systems. Patient outcome metrics, although the most important, may also be the most difficult to collect, requiring patient follow-up as well as data integration and automated abstraction from health records; moreover, such real-world evidence has known limitations55. As AI usage expands, costs for computational infrastructure and AI services could also grow quickly; care must be taken to ensure that dynamic LLM deployments are cost-effective for institutions56,57. Further, because many of the leading LLMs are currently proprietary, closed-source models accessed through vendor services that do not allow modification of the underlying parameters, the options for finetuning models may be constrained. Data privacy and cybersecurity concerns, while not unique to AI, will remain of critical importance. Finally, institutions may be hesitant to adopt new approaches such as dynamic deployment given the rapidly evolving regulatory and medicolegal landscape of medical AI. Developing sustainable models of AI governance and quality oversight will be an essential task for regulatory bodies and local leadership, enabling medical AI integration while striking an appropriate balance between oversight and innovation. Recent FDA guidance on predetermined change control plans58 is an important step in this direction.
Conclusion: looking forward
The current regime of linear AI deployment has largely failed to keep pace with technological development and is a poor fit for the emerging paradigm of interactive, adaptive, multi-agent AI systems. We propose dynamic deployment as an alternative framework for medical AI systems that continually learn and adapt in response to new data, shifting focus beyond individual AI models towards a systems-level perspective. Dynamic deployments can be used as intervention arms in AI clinical trials, facilitating comparison with control groups and estimation of the causal effect of implementing an AI system. They can also be used in the absence of control groups to deploy AI systems that learn and adapt over time.
Not all use cases will be amenable to such dynamic systems. Tasks with a highly predictable structure, such as image-based diagnostics, are less likely to benefit than unstructured tasks such as note writing. Additionally, in high-risk applications such as surgical robotics, or for fully autonomous systems with no humans in the loop, the benefits of continual learning might be outweighed by the risks. Careful oversight is necessary to govern appropriate use of dynamic AI systems, and these decisions will be highly context-dependent. For those cases where dynamic deployment is a good fit, continually learning AI systems present a promising path towards maximizing positive impact.
In the future of medicine, AI will likely be integrated in innumerable ways throughout the healthcare system. Hospitals will take advantage of intelligent, adaptive workflows, and healthcare will expand its reach to be more accessible than ever before. The current state of AI in medicine is analogous to the early days of the internet in the late 1990s: the core technologies are ready, but the field has not yet developed a mature, robust ecosystem to make it broadly useful beyond a core group of early adopters and enthusiasts. The next generation of medical AI will similarly be ushered in when we step back from individual models and instead focus on the larger picture of adaptive systems and networks, building upon the core principles of safety, real-world evidence, and regulatory oversight.
Data availability
No datasets were generated or analysed during the current study.
References
Chen, J. H. & Asch, S. M. Machine Learning and Prediction in Medicine — Beyond the Peak of Inflated Expectations. N. Engl. J. Med. 376, 2507–2509 (2017).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Lee, P., Bubeck, S. & Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
Dhawan, N., Cotta, L., Ullrich, K., Krishnan, R. G. & Maddison, C. J. End-To-End Causal Effect Estimation from Unstructured Natural Language Data. In: Advances in Neural Information Processing Systems vol. 37, 77165–77199 (Curran Associates, Inc., 2024).
Koller, D. et al. Why We Support and Encourage the Use of Large Language Models in NEJM AI Submissions. NEJM AI 1, AIe2300128 (2024).
Rosenbluth, T. In Constant Battle With Insurers, Doctors Reach for a Cudgel: A.I. (The New York Times, 2024).
Goldberg, C. B. et al. To Do No Harm — and the Most Good — with AI in Health Care. NEJM AI 1, AIp2400036 (2024).
Lu, C. et al. An Overview and Case Study of the Clinical AI Model Development Life Cycle for Healthcare Systems. Preprint at http://arxiv.org/abs/2003.07678 (2020).
Wang, F. & Beecy, A. Implementing AI models in clinical workflows: a roadmap. BMJ Evid. Based Med. https://doi.org/10.1136/bmjebm-2023-112727 (2024).
De Silva, D. & Alahakoon, D. An artificial intelligence life cycle: From conception to production. Patterns 3, 100489 (2022).
Kim, J. Y. et al. Organizational Governance of Emerging Technologies: AI Adoption in Healthcare. In: 2023 ACM Conference on Fairness, Accountability, and Transparency 1396–1417 (ACM, 2023). https://doi.org/10.1145/3593013.3594089.
Ruparelia, N. B. Software development lifecycle models. ACM SIGSOFT Softw. Eng. Notes 35, 8–13 (2010).
Ouyang, D. & Hogan, J. We Need More Randomized Clinical Trials of AI. NEJM AI 1, AIe2400881 (2024).
Wan, P. et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat. Med. 1–8, https://doi.org/10.1038/s41591-024-03148-7 (2024).
Tai-Seale, M. et al. AI-Generated Draft Replies Integrated Into Health Records and Physicians’ Electronic Communication. JAMA Netw. Open 7, e246565 (2024).
Plana, D. et al. Randomized Clinical Trials of Machine Learning Interventions in Health Care: A Systematic Review. JAMA Netw. Open 5, e2233946 (2022).
Han, R. et al. Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review. Lancet Digit. Health 6, e367–e373 (2024).
Wu, K. et al. Characterizing the Clinical Adoption of Medical AI Devices through U.S. Insurance Claims. NEJM AI https://doi.org/10.1056/AIoa2300030 (2023).
Seneviratne, M. G., Shah, N. H. & Chu, L. Bridging the implementation gap of machine learning in healthcare. BMJ Innov. 6, 45–47 (2020).
Reyna, M. A., Nsoesie, E. O. & Clifford, G. D. Rethinking Algorithm Performance Metrics for Artificial Intelligence in Diagnostic Medicine. JAMA https://doi.org/10.1001/jama.2022.10561 (2022).
Li, L. T., Haley, L. C., Boyd, A. K. & Bernstam, E. V. Technical/Algorithm, Stakeholder, and Society (TASS) barriers to the application of artificial intelligence in medicine: A systematic review. J. Biomed. Inform. 147, 104531 (2023).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Sahni, N. R. & Carrus, B. Artificial Intelligence in U.S. Health Care Delivery. N. Engl. J. Med. 389, 348–358 (2023).
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36 (2019).
Lennerz, J. K., Green, U., Williamson, D. F. K. & Mahmood, F. A unifying force for the realization of medical AI. Npj Digit. Med. 5, 172 (2022).
Sendak, M. P. et al. Strengthening the use of artificial intelligence within healthcare delivery organizations: balancing regulatory compliance and patient safety. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocae119 (2024).
Christiano, P. F. et al. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems (ed. Guyon, I.) vol. 30 (Curran Associates, Inc., 2017).
Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (eds. Oh, A. et al.) vol. 36, 53728–53741 (Curran Associates, Inc., 2023).
Dong, H. et al. RLHF Workflow: From Reward Modeling to Online RLHF. Transactions on Machine Learning Research (2024).
Guo, S. et al. Direct Language Model Alignment from Online AI Feedback. Preprint at https://doi.org/10.48550/arXiv.2402.04792 (2024).
Dong, Q. et al. A Survey on In-context Learning. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 1107–1128 (Association for Computational Linguistics, Miami, Florida, USA, 2024). https://doi.org/10.18653/v1/2024.emnlp-main.64.
Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (eds. Koyejo, S. et al.) vol. 35, 24824–24837 (Curran Associates, Inc., 2022).
Agarwal, N., Moehring, A., Rajpurkar, P. & Salz, T. Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology. NBER Working Paper w31422. http://www.nber.org/papers/w31422.pdf (2023).
Sanchez, M. et al. AI-clinician collaboration via disagreement prediction: A decision pipeline and retrospective analysis of real-world radiologist-AI interactions. Cell Rep. Med. 4, 101207 (2023).
Gaube, S. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. Npj Digit. Med. 4, 1–8 (2021).
Reis, M., Reis, F. & Kunde, W. Influence of believed AI involvement on the perception of digital medical advice. Nat. Med. 1–3, https://doi.org/10.1038/s41591-024-03180-7 (2024).
Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
Yu, K.-H., Healey, E., Leong, T.-Y., Kohane, I. S. & Manrai, A. K. Medical Artificial Intelligence and Human Values. N. Engl. J. Med. 390, 1895–1904 (2024).
Yue, L., Xing, S., Chen, J. & Fu, T. ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 1–10 (Association for Computing Machinery, New York, NY, USA, 2024).
Guo, T. et al. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 (ed. Larson, K.) 8048–8057 (2024).
Qiu, J. et al. LLM-based agentic systems in medicine and healthcare. Nat. Mach. Intell. 1–3, https://doi.org/10.1038/s42256-024-00944-1 (2024).
Ayers, J. W., Desai, N. & Smith, D. M. Regulate Artificial Intelligence in Health Care by Prioritizing Patient Outcomes. JAMA 331, 639–640 (2024).
Coalition for Health AI. Blueprint for Trustworthy AI: Implementation Guidance and Assurance for Healthcare. Version 1.0. (Coalition for Health AI, 2023).
Longhurst, C. A., Singh, K., Chopra, A., Atreja, A. & Brownstein, J. S. A Call for Artificial Intelligence Implementation Science Centers to Evaluate Clinical Effectiveness. NEJM AI 0, AIp2400223 (2024).
Vaid, A. et al. Implications of the Use of Artificial Intelligence Predictive Models in Health Care Settings: A Simulation Study. Ann. Intern. Med. 176, 1358–1369 (2023).
Zhou, A. X., Aczon, M. D., Laksana, E., Ledbetter, D. R. & Wetzel, R. C. Narrowing the gap: expected versus deployment performance. J. Am. Med. Inform. Assoc. 30, 1474–1485 (2023).
National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1 (2023).
Chouffani El Fassi, S. et al. Not all AI health tools with regulatory authorization are clinically validated. Nat. Med. https://doi.org/10.1038/s41591-024-03203-3 (2024).
Blease, C. R., Locher, C., Gaab, J., Hägglund, M. & Mandl, K. D. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform. 31, e101102 (2024).
Youssef, A. et al. External validation of AI models in health should be replaced with recurring local validation. Nat. Med. https://doi.org/10.1038/s41591-023-02540-z (2023).
O’Quigley, J., Pepe, M. & Fisher, L. Continual Reassessment Method: A Practical Design for Phase 1 Clinical Trials in Cancer. Biometrics 46, 33 (1990).
Jones, L. W. et al. Neoadjuvant Exercise Therapy in Prostate Cancer: A Phase 1, Decentralized Nonrandomized Controlled Trial. JAMA Oncol. https://doi.org/10.1001/jamaoncol.2024.2156 (2024).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health 2, e537–e548 (2020).
Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit. Health 2, e549–e560 (2020).
Sherman, R. E. et al. Real-World Evidence — What Is It and What Can It Tell Us? N. Engl. J. Med. 375, 2293–2297 (2016).
Klang, E. et al. A strategy for cost-effective large language model use at health system-scale. Npj Digit. Med. 7, 1–12 (2024).
Abramoff, M. D., Dai, T. & Zou, J. Scaling Adoption of Medical AI — Reimbursement from Value-Based Care and Fee-for-Service Perspectives. NEJM AI 1, AIpc2400083 (2024).
Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions - Guidance for Industry and Food and Drug Administration Staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial-intelligence (2024).
Zhou, Y. et al. Large Language Models Are Human-Level Prompt Engineers. In: The Eleventh International Conference on Learning Representations (2023).
Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).
Acknowledgements
J.T.R. was supported by a Medical Scientist Training Program grant from the National Institute of General Medical Sciences of the NIH under award no. T32GM152349 to the Weill Cornell/Rockefeller/Sloan Kettering Tri-Institutional MD-PhD Program. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Contributions
J.T.R. conceptualized the paper and wrote the manuscript. A.B. and M.R.S. contributed technical input and substantive feedback during manuscript revision. All authors read and approved the final version of this manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rosenthal, J.T., Beecy, A. & Sabuncu, M.R. Rethinking clinical trials for medical AI with dynamic deployments of adaptive systems. npj Digit. Med. 8, 252 (2025). https://doi.org/10.1038/s41746-025-01674-3