Introduction

Artificial intelligence (AI) in medicine has been riding atop a wave of inflated expectations and hype for at least a decade1, with the pace of research continuing to accelerate in recent years driven by the advent of large language models (LLMs)2,3. Scientists and clinicians have sought to take advantage of these technological advancements by applying LLMs to a wide array of areas in the healthcare system, ranging from estimating causal treatment effects of medications from online forum posts4 to aiding in the writing of research articles5 and automating administrative tasks such as insurance prior authorization paperwork6. Among many in the healthcare field, there is a general consensus that AI in healthcare is “the future”7.

The lifecycle of an AI model can be conceptualized in stages, beginning with initial problem identification, then proceeding to a design phase, followed by model development in the research setting, silent deployment, then deployment to a “production” setting in the real world, and finally a post-deployment phase of monitoring and making changes, or removing the model from production, as needed8,9,10,11. This progression has similarities to models of the software development lifecycle12.

In the medical context, there is a growing recognition of the need for deployment to happen through clinical trials, so as to protect participants and rigorously ensure the safety and efficacy of models13. Yet while there have been some notable examples of prospective clinical trials of LLM tools, such as for aiding nurses with receptionist tasks14 and drafting responses to patient messages15, overall only a very small fraction of models ever makes it out of the research phase into real-world deployment. A systematic review in 2022 found only 41 randomized trials of machine learning interventions worldwide16; by 2024, this number had increased to only 8617. A 2023 analysis of insurance claims found only 16 medical AI procedures with billing codes18. Overall, the medical system has failed to keep up with the pace of recent developments in AI. This disconnect, known by various terms such as the “implementation gap”19 and the “AI chasm”20, means that the vast majority of research advances in medical AI never actually directly benefit patients or clinicians. The causes of the implementation gap are multifactorial and include not only technical and logistical barriers, but also sociocultural, ethical, economic, and regulatory factors21,22,23,24,25,26. Bridging the implementation gap is one of the largest challenges currently facing the field of medical AI.

In this piece, we describe the currently predominant approach to medical AI deployment, which is based on a linear, model-centric understanding of AI. We then identify several shortcomings of this paradigm when applied to LLM-based systems and propose an alternative way of conceptualizing AI in medicine based on continual processes of model updating, real-world evidence generation, and safety monitoring, which we call dynamic deployment. We chart a path towards making such dynamic deployments a reality, drawing on well-established methods of adaptive clinical trials as well as more recent technical advances and developments in regulatory science of medical AI.

Linear model of AI deployment

Where AI models have been successfully deployed in healthcare, they typically follow a pattern we refer to as the linear model of AI deployment (Fig. 1a). First, a model is developed in the research domain, most often by training on retrospective data. The model is then assessed and its performance characteristics evaluated. When the decision is made to move the model from research into deployment, it is frozen: all of the model’s parameters are locked and remain static for as long as it is deployed. Although the model could be updated periodically in response to new data or to performance degradations identified through post-deployment monitoring and auditing, there are few examples of this happening in practice.

Fig. 1: Depiction of the linear and dynamic models of AI deployment.

a In the linear model of AI deployment, a model is first trained in the research/development setting, then deployed to the real-world setting with its parameters frozen. Model weights may be updated periodically following post-deployment monitoring and auditing. b In the dynamic framework, models are first pre-trained in the research/development setting. They remain dynamic when deployed, and mechanisms are in place to enable continuous updating in response to feedback signals from their deployment environments (arrows). Multiple AI models may be simultaneously deployed and interacting. All elements inside the blue box are considered part of the complex AI system, including the AI models, the users, the workflow integrations and interfaces by which they interact, and the feedback and update mechanisms.

In the linear framework, the focus is on a particular AI model. More specifically, it is a particular instance of the given model defined by its set of parameters. The linear model is intuitive and closely mirrors the process by which other technologies are brought into clinical practice. However, the linear model of AI deployment is a poor fit for modern LLM systems, for three principal reasons which we outline below.

  1. AI is an adaptive technology

    AI systems have an important difference from other technologies in medicine: they are adaptive. Indeed, one of the most important attributes of modern LLMs with billions of parameters is their flexibility. Model weights need not remain fixed throughout the lifespan of a deployment; they can be periodically finetuned or updated as batches of new data arrive. Methods such as reinforcement learning from human feedback (RLHF)27 and direct preference optimization (DPO)28 also allow LLMs to learn directly from their users in order to better align with user preferences, and recent work has extended these approaches to the “online learning” setting, allowing for continuous updating of deployed models29,30 (a minimal sketch of such a preference-based update appears after this list). The behavior of LLMs can also be substantially changed during deployment through interactions with users, without updating any of the model parameters: for example, in-context learning allows LLMs to learn from new training data presented in their prompts31, and chain-of-thought prompting enables LLMs to reason more effectively through complex problems32. For all these reasons, the line between model development and model deployment is becoming increasingly blurred. By relying on the assumption that learning occurs only in discrete phases, the linear model struggles to encompass the interactive and dynamic features of emerging AI systems, including many of the most promising avenues of modern AI research.

  2. AI functions as part of a complex system

    Secondly, the linear model does not sufficiently account for the complex systems in which AI models are employed. The outputs of the model itself are of course crucial, but they are only one part of the system; factors beyond the model parameters also drive outcomes. For example, choices in user interface design can shape interactions between humans and AI models, introducing new cognitive biases into clinical decision-making33,34,35,36. Even when clinicians are given access to LLM systems with superhuman abilities, they will not necessarily be able to take full advantage of these tools without specialized training37. The behaviors of interactive AI systems, such as chatbots, also depend integrally on the behavioral patterns and values of the particular population of users38. Thus, even when model weights are frozen, the system is not static. By adopting a model-centric, parameter-centric view of AI, the linear model fails to adequately account for the numerous other factors contributing to meaningful outcomes in the real world.

  3. Health systems of the future will have many AI models operating at once

    Finally, the linear model of AI relies on the premise of isolating a single model for testing. This is reasonable today, as relatively few AI models are deployed in the wild, but it poses a major challenge for scaling up AI integration. In the near future, there may be orders of magnitude more models deployed in various contexts throughout the medical system. Users may interact with many different models during their routine workflows, and models could interact with each other and be interdependent in complex ways. This is exemplified by the emerging paradigm of multi-agent AI systems, whereby tasks are completed by a cohort of individual LLM-based agents orchestrated by other “supervisory” models39,40,41. In such scenarios, AI clinical trial designs which seek to evaluate the behavior of a specific model in isolation would be impractical.
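
To make the adaptive mechanisms in point 1 concrete, the following is a minimal sketch of a DPO-style preference update, assuming per-response log-probabilities have already been computed for the deployed policy model and a frozen reference model. The function name, the log-probability values, and the beta hyperparameter are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a direct preference optimization (DPO) loss, assuming
# per-sequence log-probabilities are already available for the deployed
# policy and a frozen reference model. All values here are fabricated.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. frozen reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # A larger margin means the policy prefers the user-chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with fabricated log-probabilities for two preference pairs
# gathered from user feedback (e.g., a clinician picking one draft).
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.3, -8.1]),
    policy_rejected_logp=torch.tensor([-11.9, -9.4]),
    ref_chosen_logp=torch.tensor([-12.5, -8.0]),
    ref_rejected_logp=torch.tensor([-11.5, -9.0]),
)
print(f"DPO loss on this batch: {loss.item():.4f}")
```

In a deployed system, such preference pairs would accumulate from routine use, and periodic updates of this kind are one route by which a model continues to learn after deployment.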

Dynamic systems model of AI deployment

To overcome these challenges, we propose an alternative framework for clinical trials and deployment of LLMs, which we call dynamic deployment (Fig. 1b). In a nutshell, the dynamic deployment model is distinguished from the linear deployment model in two key ways: 1) by embracing a systems-level understanding of medical AI, and 2) by explicitly accounting for the fact that such systems are dynamic and constantly changing. In this section we describe the framework and discuss how it can be applied in the real world through adaptive clinical trials.

The first principle is a systems-level approach to medical AI. In this model, the AI system is conceptualized as a complex system with multiple interconnected moving parts. The AI model itself is at the core and functions the same as in the linear model: taking input data and producing outputs according to its internal parameters. What sets this approach apart, however, is that other elements in the AI system are also explicitly included as parts of the intervention. These include the population of users, each guided by their own set of values and behavioral patterns; the workflow integration and user interface by which users interact with models; and other automated elements, such as the data generation or processing pipelines and the update mechanisms for online learning. Each individual component contributes to the overall behavior of the system, although disentangling the exact contribution from each element might not be feasible. However, the systems-level view holds that it is not actually necessary to measure these complex intra-system relationships. What matters is the behavior of the system as a whole, as measured by metrics that are meaningful in the real world, such as patient outcomes42. For example, gradual degradation of performance metrics over time is a clear indicator that the system as a whole is not functioning well, even though it may be difficult or impossible to isolate the effects of AI model degradation from other sources of variation such as natural fluctuation in patient or user populations. A systems-level approach aims to use feedback loops to learn from these performance changes over time, regardless of their root causes. By shifting to a systems-level conceptualization of medical AI, we can focus measurement on the outcomes that actually matter.
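
As an illustration of such system-level monitoring, below is a minimal sketch of a degradation detector that compares a recent window of a single real-world metric (here, a hypothetical daily rate at which clinicians accept AI-drafted messages) against a longer baseline window. The window sizes, tolerance threshold, and simulated data are illustrative assumptions.

```python
# Minimal sketch: flag gradual system-level degradation by comparing a
# recent window of an outcome metric against a longer baseline window.
import random
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_days=90, recent_days=14, tolerance=0.05):
        self.baseline = deque(maxlen=baseline_days)
        self.recent = deque(maxlen=recent_days)
        self.tolerance = tolerance

    def observe(self, daily_value):
        self.baseline.append(daily_value)
        self.recent.append(daily_value)
        if len(self.baseline) < self.baseline.maxlen:
            return False  # still accumulating baseline history
        # The flag is deliberately agnostic to root cause: model drift,
        # shifts in patient/user populations, and workflow changes all
        # surface the same way at the system level.
        baseline_mean = sum(self.baseline) / len(self.baseline)
        recent_mean = sum(self.recent) / len(self.recent)
        return recent_mean < baseline_mean - self.tolerance

# Simulate 120 days: stable around 0.80, then drifting down after day 90.
random.seed(0)
rates = [0.80 + random.gauss(0, 0.02) - max(0, day - 90) * 0.01
         for day in range(120)]

monitor = DriftMonitor()
for day, rate in enumerate(rates):
    if monitor.observe(rate):
        print(f"Day {day}: possible system-level degradation")
        break
```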

The second principle informing the design of dynamic medical AI deployments is the recognition that they are systems which change over time. AI models still undergo an initial research and development phase before being deployed; however, this phase is understood to be “pretraining,” i.e., the start of training rather than the end. Instead of being frozen, models are allowed to continue evolving in response to feedback signals during deployment. This evolution can occur through mechanisms such as online learning or finetuning with new data and alignment with user preferences via RLHF or DPO, or through more subtle causes such as drift in user populations altering system behavior via changing usage patterns. We list several concrete examples of feedback signals in Table 1 and of mechanisms of adaptation in response to these signals in Table 2; a schematic of how such a feedback loop might be wired together is sketched after Table 2. Rather than trying to freeze the system and measure its performance at discrete snapshots in time, the dynamic approach relies on feedback loops that allow for both continuous iteration and continuous evaluation. Discrete post-deployment updates and audits are augmented by their continuous analogs, allowing AI systems to continually update in response to new data.

Table 1 Selected examples of sources of AI system performance feedback which can be monitored and used to improve performance via feedback loops
Table 2 Selected examples of mechanisms by which AI systems behavior can be modified in response to feedback signals
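
The following is a hypothetical skeleton showing how monitored feedback signals (in the spirit of Table 1) might be routed to adaptation mechanisms (in the spirit of Table 2). All signal names, thresholds, and actions are invented for illustration and would need to be defined per deployment.

```python
# Hypothetical skeleton of a dynamic-deployment feedback loop: monitored
# signals are dispatched to adaptation mechanisms. Names and thresholds
# are illustrative assumptions, not a reference design.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FeedbackRule:
    signal: str                       # a monitored feedback signal
    trigger: Callable[[float], bool]  # condition for acting on the signal
    action: Callable[[], None]        # an adaptation mechanism

def queue_preference_update():
    print("Queued a finetuning/DPO update from accumulated clinician edits")

def alert_governance():
    print("Paused updates and alerted the oversight committee for review")

RULES: List[FeedbackRule] = [
    FeedbackRule("clinician_edit_rate", lambda v: v > 0.40,
                 queue_preference_update),
    FeedbackRule("adverse_event_rate", lambda v: v > 0.01,
                 alert_governance),
]

def process_signals(latest: Dict[str, float]) -> None:
    """Route each monitored signal to its adaptation mechanism."""
    for rule in RULES:
        if rule.signal in latest and rule.trigger(latest[rule.signal]):
            rule.action()

# Hypothetical daily snapshot of monitored signals.
process_signals({"clinician_edit_rate": 0.55, "adverse_event_rate": 0.002})
```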

In this light, deployment itself can be thought of as another phase of the model-generation process, whereby the model learns directly from its intended users and from new data as it arrives. The linear notion of “train → deploy → monitor” is thus replaced by a system in which all three processes happen at once. Treating medical AI systems as dynamic is more faithful to their real-world behavior and allows for intelligent systems that take maximal advantage of all available data and learn from every participant.

We note that if all the feedback flows (online learning, alignment, prompting, steering, and so on) are removed from the dynamic model, the result is the linear model. Therefore, the linear model is a special case of the dynamic model. The dynamic model simply formalizes and makes explicit the routes of information flow and system evolution which are implicitly present in all linear AI deployment systems.

Adaptive clinical trials for medical LLM deployment

Deployment and clinical validation

One of the most urgent challenges for medical AI is clinical validation. Deep learning models, especially LLMs, are largely empirical with few theoretical performance guarantees, meaning that our ability to characterize their real-world behavior in the research setting is limited. Retrospective analyses are often used to estimate the likely behavior and impact of AI models when deployed, but these are imperfect proxy measures and reliance on them can ultimately make AI systems more risky and potentially lead to unforeseen behavior43. Recent work has stressed the importance of real-world deployment for evaluating real-world model effectiveness44 and highlighted how model performance metrics assessed during training and development may change when deployed in the real world45,46,47.

However, a recent study of the 521 medical AI devices approved by the FDA found that more than 40% lacked any such clinical validation data48. Generative AI tools available to the general public are also being widely used in clinical settings, even though presumably none has been validated or officially approved for medical use: a recent survey of 1000 doctors in the United Kingdom revealed that 20% of respondents had used generative AI tools in their practice49. Using AI tools without clinical validation increases the risk of unforeseen consequences, negative outcomes, and decreased trust among patients, clinicians, and the public.

Dynamic deployments help address this problem because continual performance monitoring is baked into the system design. Deployment is not only the way for AI in medicine to make a tangible impact on real patients and clinicians; it is also the only way to directly study the behavior of AI models in situ. In addition to providing a supervisory signal for online learning and other feedback mechanisms, these performance metrics can be used for real-time monitoring and oversight. By including performance assessment as a core design principle, each deployment can be viewed as a sort of local clinical trial; such recurring local validations may actually be better suited to modern AI systems than the alternative paradigm of external validation on which multi-site clinical trials are based50.

Existing precedent

At first blush, the proposed shift towards dynamic AI systems may seem to make clinical deployment even more difficult than it already is, possibly even widening the implementation gap. However, this need not be the case. In this section, we chart the path towards making dynamic medical AI a reality.

While AI is a new technology, forms of dynamic deployment have long been used in early-stage clinical trials to navigate the high degree of uncertainty in the benefit/harm profile often seen in phase I studies. For example, the continual reassessment method uses a Bayesian framework to learn from new data as it arrives and continually update the algorithm responsible for assigning patients to trial arms51. First developed more than 30 years ago, such adaptive trial designs are still in use today52. Not only does this approach address an ethical concern by ensuring that no patient is given a treatment known to be inferior, it also improves statistical efficiency by utilizing all available data from previous trial participants51. Conceptually, this can be viewed as a form of dynamic deployment in which the AI model is a Bayesian model rather than an LLM, and online learning is used to continuously optimize the model parameters in response to patient outcomes. Guidelines for protocol design and reporting of clinical trials involving AI considered such continuously learning trial designs “of interest” but intentionally excluded them as still too “early in development”53,54. However, because such adaptive trial designs are in fact already well-established and are relied upon for making policy and treatment decisions, they represent a promising blueprint for pursuing dynamic deployments of medical AI systems without the need to invent entirely new regulatory mechanisms.
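
As a concrete illustration, below is a minimal sketch of a continual-reassessment-style Bayesian update for a phase I dose-finding trial, using a one-parameter power model and a grid approximation to the posterior. The skeleton probabilities, prior variance, target toxicity rate, and observed outcomes are illustrative assumptions rather than a validated trial design.

```python
# Minimal sketch of a CRM-style Bayesian dose-finding update. After each
# patient outcome, the posterior over the model parameter is updated and
# the next dose is chosen to match a target toxicity rate.
import numpy as np

# Prior guesses of toxicity probability at five dose levels (the
# "skeleton") and the target toxicity rate; values are illustrative.
skeleton = np.array([0.05, 0.10, 0.20, 0.35, 0.50])
target = 0.20

# One-parameter power model: p_tox(d) = skeleton[d] ** exp(a), with a
# Gaussian prior on a (variance 1.34 is a conventional choice).
a_grid = np.linspace(-4.0, 4.0, 801)
prior = np.exp(-a_grid**2 / (2 * 1.34))

def recommend_next_dose(doses, outcomes):
    """Update the posterior over `a` from observed (dose, toxicity) pairs,
    then recommend the dose with posterior mean toxicity nearest target."""
    log_lik = np.zeros_like(a_grid)
    for d, y in zip(doses, outcomes):
        p = skeleton[d] ** np.exp(a_grid)
        log_lik += np.log(p) if y else np.log(1.0 - p)
    post = prior * np.exp(log_lik - log_lik.max())
    post /= post.sum()
    post_tox = np.array([(skeleton[d] ** np.exp(a_grid) * post).sum()
                         for d in range(len(skeleton))])
    return int(np.argmin(np.abs(post_tox - target)))

# After three patients: no toxicity at the two lowest doses, then one
# toxicity at dose level 1. Recommend the next dose assignment.
print(recommend_next_dose(doses=[0, 1, 1], outcomes=[0, 0, 1]))
```

Each new patient outcome tightens the posterior and the dose-assignment rule updates accordingly; this is conceptually the same feedback loop that dynamic deployment proposes for LLM-based systems.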

Challenges

Practical challenges remain which must be addressed to enable widespread deployment of dynamic medical AI systems. First, building and maintaining infrastructure for feedback loops will require investment on the part of hospitals and health systems. Patient outcome metrics, although the most important, may also be the most difficult to collect, necessitating patient follow-up as well as data integration and automated abstraction from health records; moreover, such real-world evidence has known limitations55. As AI usage expands, costs for computational infrastructure and AI services could also grow quickly; care must be taken to ensure that dynamic LLM deployments are cost-effective for institutions56,57. Further, because many of the leading LLMs are currently proprietary, closed-source models accessed through vendor services that do not allow modifications to the underlying parameters, the options for finetuning models may be constrained. Data privacy and cybersecurity concerns, while not unique to AI, will continue to be of critical importance. Finally, institutions may be hesitant to adopt new deployment paradigms such as dynamic deployment given the rapidly evolving regulatory and medicolegal landscape of medical AI. Developing sustainable models of AI governance and quality oversight will be an essential task for regulatory bodies and local leadership, enabling medical AI integration while striking an appropriate balance between oversight and innovation. Recent FDA guidance on predetermined change control plans58 is an important step in this direction.

Conclusion: looking forward

The current regime of linear AI deployment has largely failed to keep up with the pace of technological development and is a poor fit for the emerging paradigm of interactive, adaptive, multi-agent AI systems. We propose dynamic deployment as an alternative framework in which medical AI systems continually learn and adapt in response to new data, shifting focus from individual AI models towards a systems-level perspective. Dynamic deployments can be used in the intervention arms of AI clinical trials to facilitate comparison with control groups and estimation of the causal effect of AI system implementation. They could also be used in the absence of control groups to deploy AI systems which learn and adapt over time.

Not all use cases will be amenable to such dynamic systems. Tasks with a highly predictable structure, such as image-based diagnostics, are less likely to benefit than unstructured tasks such as note writing. Additionally, in high-risk applications such as surgical robotics, or for fully autonomous systems with no humans in the loop, the benefits of continual learning might be outweighed by the risks. Careful oversight is necessary to govern appropriate use of dynamic AI systems, and these decisions will be highly context-dependent. For those cases where dynamic deployment is a good fit, continually learning AI systems present a promising path towards maximizing positive impact.

In the future of medicine, AI will likely be integrated in innumerable ways throughout the healthcare system. Hospitals will take advantage of intelligent, adaptive workflows, and healthcare will expand its reach to be more accessible than ever before. The current state of AI in medicine is analogous to the early days of the internet in the late 1990s: the core technologies are ready, but the field has not yet developed a mature, robust ecosystem to make it broadly useful beyond a core group of early adopters and enthusiasts. The next generation of medical AI will similarly be ushered in when we step back from individual models and instead focus on the larger picture of adaptive systems and networks, building upon the core principles of safety, real-world evidence, and regulatory oversight.