Foundation models for radiology—the position of the AI for Health Imaging (AI4HI) network
Insights into Imaging volume 16, Article number: 168 (2025)
Abstract
Foundation models are large models trained on vast amounts of data that can be adapted to a wide range of downstream tasks. In radiology, these models can potentially address several gaps in fairness and generalization, as they can be trained on massive datasets without labelled data and adapted to specific tasks using only a small number of annotated examples. This reduces one of the limiting bottlenecks in clinical model construction, data annotation, as these models can be trained through a variety of techniques that require little more than radiological images, with or without their corresponding radiological reports. However, foundation models may be insufficient on their own, as they are affected, albeit to a lesser extent than traditional supervised learning approaches, by the same issues that lead to underperforming models, such as a lack of transparency/explainability, and biases. To address these issues, we advocate that the development of foundation models should not only be pursued but also accompanied by the development of a decentralized clinical validation and continuous training framework. This does not guarantee the resolution of the problems associated with foundation models, but it enables developers, clinicians and patients to know when, how and why models should be updated, creating a clinical AI ecosystem that is better capable of serving all stakeholders.
Critical relevance statement
Foundation models may mitigate issues like bias and poor generalization in radiology AI, but challenges persist. We propose a decentralized, cross-institutional framework for continuous validation and training to enhance model reliability, safety, and clinical utility.
Key Points
- Foundation models trained on large datasets reduce annotation burdens and improve fairness and generalization in radiology.
- Despite improvements, they still face challenges like limited transparency, explainability, and residual biases.
- A decentralized, cross-institutional framework for clinical validation and continuous training can strengthen reliability and inclusivity in clinical AI.
Introduction
Medical images offer a wealth of relevant information that clinicians use to better diagnose and treat patients. Due to their centrality in diagnosis, the number of acquired scans has steadily increased over the years [1, 2], creating great demand for trained professionals to acquire, interpret and integrate results; this workload is presently hard to adequately fulfill [3,4,5] and leads to increasing burnout rates among radiologists [6, 7]. To make matters worse, radiographers and medical physicists are also facing a global shortage [8, 9], a problem that therefore spans image scheduling, acquisition, quality control, interpretation, and reporting.
Medical imaging AI systems have recently emerged as potential aids in not only reducing the burden placed on radiologists but also improving the patient journey and healthcare. For example, AI can boost sensitivity in breast cancer detection by capturing lesions missed by radiologists [10,11,12,13,14,15], in chest X-ray nodule interpretation [16, 17], and in intracranial hemorrhage detection on CT scans [18]. Performance improvements are also observed in the detection of clinically significant prostate cancer on MRI, where deep learning tools also reduce the likelihood of false positives [19, 20], in predicting pathological complete response in rectal cancer using MRI radiomics [21], and in the detection and delineation of lung nodules on chest CT with convolutional neural networks [22, 23]. In radiotherapy, AI can assist in contouring (i.e., through semi-automatic or automatic segmentation of organs at risk or clinical target volumes) with consistent time reductions [24, 25]. AI can also generate synthetic CT data from MRI for MRI-only radiotherapy treatment planning [26].
The potential of AI in radiology, including foundation and generative models, has been discussed among experts and stakeholders from the AI for Health Imaging (AI4HI) network. The AI4HI network connects five large-scale EU-funded consortia developing, validating and deploying AI for health imaging (EuCanImage, CHAIMELEON, INCISIVE, ProCAncer-I, and PRIMAGE), and incorporates professionals from both technical and clinical backgrounds [27,28,29,30]. Based on the experiences and results from these projects, the consensus was that while emerging AI solutions hold promise, careful attention is required regarding their robustness, generalizability, long-term performance and ethical compliance. AI-driven predictions can sometimes lead to incorrect diagnoses, which may prompt clinicians to make erroneous treatment decisions [31, 32]. These issues largely stem from clinical AI models’ failure to generalize to external datasets or under-represented populations [33, 34].
In chest X-ray classification using clinical AI models, Rudolph et al highlighted how patient positioning could impact the performance of AI systems on external datasets [35], while Kim et al showed that shifts in disease prevalence between deployment and training cohorts could cause significant performance reductions [33, 36]. Differences in performance are particularly concerning when they disproportionately affect specific demographic groups, such as certain countries, races, genders, or age groups [37, 38]. Biased models that underperform for specific groups can result in overtreatment or underdiagnosis, depending on the direction of the bias.
Furthermore, the large data requirements of traditional AI systems may cause them to underperform when diagnosing relatively rare conditions [39]. For example, a model trained to diagnose rare syndromes based on facial recognition underperformed in non-white populations [40]. Hidden stratification—when there are underlying but unobservable subsets of data with different levels of performance—can also lead to underperforming models in unforeseen circumstances [41]. These disparities not only affect patients but also have implications for healthcare providers, as declining performance may expose institutions and clinicians to legal liability for inaccurate predictions [42].
Finally, systematic and local assessments have indicated that AI systems may struggle to maintain consistent performance, both at launch and over time, hindering their viability as clinical tools [43, 44]. Additionally, differences between datasets in terms of center, acquisition protocol and scanner manufacturer can hinder model performance [33, 36]. These phenomena, which may involve shifts in the spectrum of data, acquisition software, and clinical targets, cause AI models to degrade over time if not regularly monitored and updated [45, 46]. Continuous evaluation and recalibration are essential to ensure that AI systems remain well-tuned and effective in clinical practice, particularly as medical conditions, imaging technologies, and patient demographics evolve. Rigorously validating AI models and accounting for biases requires large, demographically balanced, annotated datasets; assembling these is often unrealistic due to economic, legal and ethical constraints.
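As a minimal illustration of what continuous evaluation can look like in practice, the hedged sketch below (assuming scikit-learn; the metric, windowing and tolerance are illustrative choices rather than a recommended protocol) compares performance on successive batches of deployment data against a validation baseline:

```python
from sklearn.metrics import roc_auc_score

# Hedged sketch of post-deployment performance monitoring: flag windows
# where AUC drops below the validation baseline by more than a tolerance.
def monitor_drift(batches, baseline_auc, tolerance=0.05):
    alerts = []
    for window, (y_true, y_score) in enumerate(batches):
        auc = roc_auc_score(y_true, y_score)
        if auc < baseline_auc - tolerance:
            alerts.append((window, auc))  # degraded window and its AUC
    return alerts
```

In a real deployment, such checks would be complemented by calibration and subgroup analyses, since aggregate metrics can hide demographically localized degradation.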
Foundation models: potential and limitations
Foundation models are large-scale AI models pre-trained on diverse, non-specific tasks (such as language generation, image captioning, or image reconstruction) that serve as adaptable starting points for specialized applications [47]. Some foundation models fall under the umbrella of generative AI, a subset of AI that focuses on synthesizing new content, be it text, images, video, voice, or other information modalities. Generative AI leverages these foundational architectures to produce novel outputs, while foundation models more broadly serve as a base for both generative and non-generative applications. For example, in healthcare, generative AI powered by foundation models enables tasks like synthetic data generation for rare disease research or personalized treatment simulations [48], whereas non-generative applications encompass clinical decision support systems, diagnostic assistance, and workflow optimization, which aid in improving diagnostic accuracy and clinical efficiency [49]. To assist the reader, we provide a glossary of helpful definitions relating to foundation models used throughout this piece (Table 1).
Foundation models improve the performance of AI in specific tasks by leveraging very large amounts of data that have usually not been labeled or otherwise curated. Such training datasets have advantages: not only do they better represent the variations in quality observed in real-world data, but they are also more varied in terms of populations and acquisition conditions or parameters. Importantly, such data are easier to obtain, as they require less effort and time to collect and make available for AI model training. By leveraging these large, readily available and uncurated collections of data, powerful foundation models can be trained.
Many foundation models are used in the form of chat interfaces based on large language models (LLMs), which are among the earliest forms of foundation models [50]. LLMs are immensely complex models (typically more than 1 billion parameters) that are usually based on the transformer architecture [51] and capable of generating human-like text. Training an LLM is conceptually straightforward: the process focuses on predicting the next word or token (a word or sub-word unit) in a sequence [51]. By significantly scaling up this process, both computationally and in terms of data, these models have been shown to generate text that closely resembles that of a human. Foundation models are developed not only for text-based tasks but also for other modalities, such as images, or combinations of information sources, such as vision language models (VLMs), which incorporate both images and text into their processing capabilities.
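As a minimal illustration of this objective, the sketch below (assuming PyTorch; `model` and all other names are illustrative placeholders rather than any specific LLM's API) shows how next-token prediction reduces to a standard classification loss over shifted token sequences:

```python
import torch.nn.functional as F

# Hedged sketch of next-token prediction: the "label" for each position
# is simply the token that follows it, so no manual annotation is needed.
def next_token_loss(model, tokens):                  # tokens: (batch, seq) IDs
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position
    logits = model(inputs)                           # (batch, seq - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

This is the sense in which LLM training requires "little more than" raw text: the supervision signal is constructed automatically from the data itself.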
Building a foundation model typically involves a “pre-training” process, i.e., a process during which models are trained on vast amounts of data that may have minimal or no annotations (Table 2). This core aspect of foundation models, the existence of a pre-training stage, is crucial for their downstream performance when foundation models are adapted to perform specific tasks (Fig. 1). For instance, in chatbots, pre-training is typically followed by some form of optimization that makes models more conversational and reduces the possibility of producing harmful output [52]. LLMs in the medical domain may be pre-trained on very large medical text datasets such as electronic health records (EHRs), clinical notes, and scientific literature. In other cases, fine-tuning general-purpose (non-medical) LLMs with high-quality, curated medical data is preferred [53] to avoid the high computational expense of pre-training. In medical image analysis, pre-training is typically followed by prompting or fine-tuning general models to perform specific tasks on specific anatomies and diseases, such as medical image segmentation, cancer diagnosis, or disease prediction. When pre-training uses no labels (as is the case with LLMs), the process is also known as self-supervised learning, since the “supervision” (which in supervised learning takes the form of segmentation or classification labels) is derived from the input data themselves (in the case of LLMs, the next token or word in a sentence provides the supervision). Some more recent approaches focus on what are known as “AI agents” (the term used for specialized LLMs, VLMs, and other ML models capable of interacting with one another and with computational tools, functions, or software programs), which show potential applications in oncological diagnosis and research [54, 55]. Multi-agent frameworks, which coordinate multiple AI agents, have further been posited as an essential step in advancing the collaboration of clinicians with AI systems by triggering specialist AI agents in an automated or semi-automated fashion [56, 57].
Typical foundation model workflow. Foundation models for radiological images are trained using different sources of information (radiological studies, series or images and, optionally, reports or clinical data). After training, foundation models can then be further applied or fine-tuned to a wide variety of clinically relevant downstream tasks
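To make the adaptation step of this workflow concrete, the following hedged sketch (assuming PyTorch/torchvision; an ImageNet-pretrained ResNet-50 stands in for any foundation encoder, and all names are illustrative) freezes a pre-trained backbone and trains only a small task-specific head:

```python
import torch
import torchvision

# Hedged sketch: adapting a generic pre-trained encoder to a downstream
# binary classification task with a small labelled dataset.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
for param in backbone.parameters():
    param.requires_grad = False                  # freeze pre-trained weights
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 2)  # new task head

optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
# for images, labels in dataloader:              # small annotated dataset
#     optimizer.zero_grad()
#     loss = loss_fn(backbone(images), labels)
#     loss.backward()
#     optimizer.step()
```

Freezing the backbone is only one of several adaptation strategies; full fine-tuning, parameter-efficient updates, or prompting can be substituted depending on the size of the downstream dataset.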
In the field of radiology, pre-training with no annotations has been performed extensively. These approaches include report or caption generation for medical images (allowing models to learn which parts of the image are relevant for radiologists; Fig. 2) [58, 59], masked autoencoders (parts of the image are removed and a model is trained to predict the missing parts; Fig. 3) [60], and contrastive learning (a model learns how to numerically characterize an image consistently despite alterations to its content; Fig. 4) [61,62,63,64]. Other approaches make use of annotations to derive foundation models. MedSAM [65], for instance, is a well-performing model for generic assisted medical image segmentation. Based on the Segment Anything Model (SAM) [66], MedSAM was trained to generate segmentation masks for different anatomically relevant regions using bounding boxes (rectangles or cuboids that enclose a given anatomical region, which can thus be semi-automatically segmented with a high level of accuracy).
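As an illustration of this prompt-based interaction, the hedged sketch below uses the interface of the open-source `segment-anything` package that MedSAM builds on; the checkpoint path, input image, and box coordinates are placeholders, and MedSAM's own inference code differs in its details:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Hedged sketch of bounding-box-prompted segmentation with a SAM-style
# model; the paths and coordinates below are placeholders only.
sam = sam_model_registry["vit_b"](checkpoint="path/to/checkpoint.pth")
predictor = SamPredictor(sam)

image_rgb = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a 2D slice
predictor.set_image(image_rgb)                       # HxWx3 uint8 RGB array

box = np.array([100, 100, 300, 300])                 # x0, y0, x1, y1 prompt
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
```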
Pre-training with report generation. Input images must have paired reports. Each image is converted to a numerical embedding, which is used as input for a text generation model. The generated text is compared with the original report, and the model is trained to generate reports that closely resemble those provided as ground truths. This leads models to learn how to generate embeddings that closely relate to the reports, thus capturing relevant diagnostic information
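A hedged sketch of this training objective follows (assuming PyTorch; `image_encoder`, `text_decoder` and the `context` argument are illustrative placeholders, not a specific published implementation):

```python
import torch.nn.functional as F

# Hedged sketch of image-to-report pre-training: the image embedding
# conditions a text decoder trained to reproduce the ground-truth report.
def report_generation_loss(image_encoder, text_decoder, images, report_tokens):
    img_emb = image_encoder(images)                    # (batch, dim)
    inputs, targets = report_tokens[:, :-1], report_tokens[:, 1:]
    logits = text_decoder(inputs, context=img_emb)     # (batch, len-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```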
Pre-training with contrastive learning. Input images may have paired text data (i.e., reports), but this is not required. Each image or report is converted to a numerical embedding, and these numerical vectors are pulled together if they have common characteristics (same condition, patient or modality) and pushed apart if they conflict (different patients or conditions, for example). In the end, the embeddings characterize both images and reports in a semantically meaningful way, clustering in this high-dimensional space
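The following hedged sketch shows one common formulation of this idea, a CLIP-style symmetric contrastive loss over paired image and report embeddings (assuming PyTorch; the names and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of an image-text contrastive objective: matching pairs
# (the diagonal of the similarity matrix) are pulled together, all other
# combinations are pushed apart.
def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (batch, batch)
    labels = torch.arange(len(img_emb), device=img_emb.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```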
A particular advantage of vision foundation models is that they typically perform well when applied to external datasets [58, 65, 67]. They also require less annotated data to achieve better performance than models trained using supervised learning alone [61, 64, 67]. For text-based tasks, some pre-trained LLM-based applications in impression and finding generation from radiology reports have shown potential [68,69,70]. In particular, through few-shot learning, where image classification is performed using a pre-trained model and only a few examples, generic foundation models can be adapted to medical image classification tasks [71]. Zero-shot learning, where no task-specific examples are provided in the training data, has also been shown to be more data-efficient (i.e., to require fewer training examples) than supervised alternatives [72, 73].
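As an illustration of zero-shot classification with a vision-language model, consider the hedged sketch below (the `encode_image`/`encode_text` methods and the prompts are placeholders for any CLIP-like model, not a specific radiology system):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: zero-shot classification as a similarity comparison
# between one image embedding and a set of text-prompt embeddings.
@torch.no_grad()
def zero_shot_classify(model, image, prompts):
    img = F.normalize(model.encode_image(image), dim=-1)   # (1, dim)
    txt = F.normalize(model.encode_text(prompts), dim=-1)  # (classes, dim)
    return (img @ txt.t()).softmax(dim=-1)                 # class probabilities

# e.g., prompts = ["chest X-ray with pleural effusion",
#                  "chest X-ray with no finding"]
```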
However, as is the case with any AI tool, the performance of foundation models is bounded by the errors, limitations and biases of both their human creators and the data on which they have been trained. Known limitations of foundation models include:
1. Lack of transparency and explainability: due to their extremely large number of parameters and the hard-to-interpret probabilistic associations learned from the training data, these models are difficult to explain.
2. Confabulations/hallucinations: model outputs may appear realistic (i.e., follow the typical structure of real data) but be non-factual. This includes the generation of false information [74, 75] and leads to high rates of inaccuracies in several different applications [76].
3. Catastrophic forgetting [77]: when models are fine-tuned on small datasets, their performance or alignment with human values and safety principles may drop when tested on samples or datasets where the model used to perform well [78,79,80]. In other words, adapting foundation models to specific tasks may cause them to lose performance on tasks where they previously performed well.
4. Bias: foundation models may be biased against specific categories or characteristics. For instance, foundation chest X-ray models can exhibit biases in terms of gender and race, making their generic application problematic; indicatively, performance decreased when identifying chest radiographs with no findings and with pleural effusion for female and black individuals, respectively [81]. A systematic study of foundation models for medical images showed that sex and race biases are pervasive across foundation model pre-training approaches, despite the use of large amounts of data, and that increasing the amount of pre-training data or fine-tuning on balanced datasets leads only to partial mitigation of biases [82]. As an illustrative example, we refer to RETFound, a foundation model for retinal images [53]. While remarkable in its performance, a later analysis of whether RETFound, trained on a diverse Western population, could generalize to an Asian population showed that this foundation model did not provide an advantage compared to foundation models trained on natural image (i.e., non-medical imaging) data [83]. A minimal sketch of a subgroup bias audit follows this list.
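Auditing for the biases described above can start with something as simple as comparing a performance metric across demographic groups; a minimal hedged sketch (assuming scikit-learn and NumPy arrays; all variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hedged sketch of a subgroup bias audit: per-group AUC and the gap
# between the best- and worst-served groups.
def subgroup_auc_gap(y_true, y_score, groups):
    aucs = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}
    return aucs, max(aucs.values()) - min(aucs.values())
```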
Additionally, while natural and clinical images are similar in many ways, key differences exist between them that make training clinical image foundation models more difficult:
1. Dataset size: ImageNet, a widely available natural image dataset, contains over 14 million images [84], while private datasets like those owned by Meta reach into the billions [85]. In contrast, the largest chest radiograph dataset, MIMIC-CXR [86], has approximately 350,000 annotated X-rays. To close this gap, datasets like SA-Med2D-20M have compiled 20 million masks across 4.6 million 2D medical images (58.4% from clinical imaging) [87,88,89]. These datasets are built by pooling publicly available sources: over 100 for SA-Med2D-20M and IMed-361M, and 20 for UMIE. Despite these advances, acquiring and annotating new clinical image data remains the primary challenge.
2. Dimensionality: as noted, large datasets are typically achieved by treating clinical images as 2D. However, many clinical images are inherently 3D, and this dimensionality carries crucial contextual information essential for training clinical imaging foundation models. Some large 3D datasets exist: CT-RATE includes over 25,000 CT studies [90], and the UK Biobank is collecting cardiac, abdominal, and brain MRIs from up to 100,000 individuals [91]. For segmentation, datasets from BraTS challenges offer a few thousand annotated brain tumor studies [92], while the datasets used to train TotalSegmentator (over 1200 CT studies) and TotalSegmentator MRI (over 500 CT and 600 MRI studies) feature high-quality annotations for over 100 anatomical structures [93, 94].
3. Signal distribution: unlike natural images, which vary widely (e.g., different animals are easily distinguishable), medical images tend to be highly standardized and similar across individuals due to years of clinical protocol development. Moreover, diagnostically relevant features, such as hepatocellular carcinoma in abdominal CT, prostate lesions in multiparametric MRI, or intracranial hemorrhages in head CT, typically occupy only a small portion of each image. Although not yet well studied, this subtlety and limited variability may complicate model optimization.
4. Inter-rater variability in annotation and annotation quality: different people may annotate objects and images differently. While this is seldom discussed for natural images, inter-rater variability in annotations is of paramount importance in clinical imaging. Shwartzman et al recently showed that training a model on annotations from a single individual amplified the inter-rater variability of model outputs in brain MRI [95]. Similarly, decreasing the inter-rater variability in cell segmentation in histopathology led to improved performance despite smaller datasets [96]. A simple way to quantify such variability is shown in the sketch after this list.
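Inter-rater variability of the kind described in the last point can be quantified with simple overlap metrics; the hedged sketch below computes pairwise Dice scores between binary masks drawn by different raters (NumPy only; names are illustrative):

```python
import itertools
import numpy as np

# Hedged sketch: pairwise Dice overlap between raters' binary masks of
# the same image; low values indicate high inter-rater variability.
def dice(a, b):
    intersection = np.logical_and(a, b).sum()
    return 2 * intersection / (a.sum() + b.sum())

def pairwise_dice(masks):  # masks: list of boolean arrays, one per rater
    return [dice(a, b) for a, b in itertools.combinations(masks, 2)]
```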
Finally, when researchers assessed LLMs and, more recently, VLMs on medical licensing examinations, their performance was remarkable [97,98,99]. This naturally created a flurry of research and publications in the field of radiology using commercial LLMs for a wide array of tasks (such as diagnosis, report summarization or impression generation). However, results have been mixed [76, 100, 101]: LLMs produce biased responses in medical [102, 103] and non-medical contexts [104, 105]. Additionally, certain medical LLM applications have proven to be worse than their human expert equivalents at impression generation [106, 107] and medical evidence summarization [108].
Foundation models as part of the future of AI in clinical practice
If the question is how to obtain models that perform well on present-day data, foundation models may very well be the answer. However, if we ask how a robust and consistent clinical AI-supported ecosystem can thrive and practically serve patients and clinicians in the years to come, foundation models are only part of the solution.
Foundation models can indeed bridge, to some extent, an existing gap in the generalization of machine-learning applications in the clinic by making use of the medical knowledge contained in vast datasets. However, they may still be affected by some of the same limitations (outlined above) that have kept non-foundation models largely unadopted in clinical practice. While the development of well-performing models is important, holistic validation in terms of trustworthiness and continuous learning is equally essential to ensure practical clinical utility and patient safety in an ever-evolving healthcare environment.
Modern decentralized approaches to medical data curation and federated storage, such as those applied in the Cancer Image Europe federated network, whose aim is to provide large amounts of data and federated learning/federated data processing approaches for medical research and experimentation [109], can act as groundbreaking foundations for the consistent training, validation, monitoring and continuous improvement of foundation models. Such frameworks may act as enablers for state-of-the-art AI modeling approaches, including foundation models and generative AI tools, by providing the data volume, variety, multimodality, and quality required for their extensive validation and retraining while preserving patient privacy.
Recent literature can also provide important insights into what a consistent, multi-centric, and continuous clinical model validation and training framework can look like. VAI-B, a Swedish national project focusing on the external validation of models, collects data from multiple institutions and, through careful orchestration of different models and their input/output requirements, is capable of delivering accurate estimates for the external performance of multiple models [110]. Similarly, RACOON, a German network of medical centers focusing on data collection for federated learning [111], shows promising results in coordinating between model training in institutions with good computational resources and model validation in institutions with fewer computational resources. Such an arrangement ensures that every institution can be involved in model training and validation. Similar approaches of continuous data collection and decentralized model validation and/or training could be deployed to not only train but also validate promising foundation models.
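The core mechanism underlying such decentralized training is that model parameters, rather than patient data, leave each institution. A minimal hedged sketch of the federated averaging (FedAvg) idea follows (assuming PyTorch state dictionaries; weighting by local dataset size is one common choice, and production frameworks add secure aggregation and scheduling on top):

```python
import torch

# Hedged sketch of federated averaging: each site trains locally and a
# coordinator averages the resulting weights; no images are exchanged.
def federated_average(state_dicts, sample_counts):
    total = sum(sample_counts)
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = sum(n * sd[key].float() for n, sd in
                            zip(sample_counts, state_dicts)) / total
    return averaged
```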
Integrating human-in-the-loop approaches and learning from clinical expert users’ feedback [52, 112] to continually improve AI tools, rendering them dynamically adaptable and generalizable to the ever-changing operational conditions of clinical practice, is especially important when using foundation models to address specific tasks [113]. Of equal importance is building “self-awareness” into foundation models by integrating awareness of the model’s limitations and uncertainties, e.g., by deploying uncertainty estimation techniques [114, 115] and by providing mechanisms that enable AI to ask for human intervention or feedback when uncertainty is high, e.g., through clarification questions [116].
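One widely used uncertainty-estimation technique is Monte Carlo dropout; the hedged sketch below (assuming PyTorch and a model containing dropout layers; the sample count and threshold are illustrative placeholders) flags high-disagreement cases for human review:

```python
import torch

# Hedged sketch of Monte Carlo dropout: dropout stays active at inference,
# and the spread of repeated predictions serves as an uncertainty signal.
# Note: model.train() also affects batch-norm layers; a real implementation
# would enable only the dropout modules.
@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20, threshold=0.1):
    model.train()                                  # keep dropout active
    preds = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    mean, std = preds.mean(dim=0), preds.std(dim=0)
    needs_review = bool(std.max() > threshold)     # defer to a human if high
    return mean, needs_review
```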
Holistic AI trustworthiness frameworks such as FUTURE-AI (which provides recommendations and guidelines for adherence to six main principles of trustworthiness: fairness, universality, traceability, usability, robustness and explainability [30]) can also act as critical guidance in the development, validation and deployment of foundation models that are trustworthy and have a higher chance of being used in clinical practice. An important obstacle to trustworthiness lies in LLMs developed by large companies, such as ChatGPT by OpenAI or Gemini by Google. This is because documentation of the data used to pre-train these models is limited or altogether non-existent, while datasets can be proprietary and inscrutable when models are further optimized using reinforcement learning from human feedback [52] or similar approaches. This is in direct conflict with the traceability principle of the FUTURE-AI framework, which suggests that the whole lifecycle of the model, including its training process and data, be adequately documented and monitored. It is also in direct conflict with the European Union’s AI Act, which requires high-risk AI systems (including clinical AI models) to be fully transparent about their training process and data [117].
Addressing the lack of transparency in proprietary models may involve the use of open LLMs, which openly document their data: the first LLaMA models accurately documented the data sources used during training [118], while projects such as Pythia go further by providing the code necessary for full replication [119]. Technical developments in model explainability can also render model outputs easier to understand and increase trust in them; as highlighted in a recent review, LLM explainability can be achieved at several different levels, some of which mimic easily understandable explanations [120].
Conclusion
Here, we outline issues surrounding modern approaches to clinical machine-learning models using medical imaging, and consider how the development of a larger landscape of foundation models could partially address them. However, foundation models on their own are not sufficient to solve inherent biases or subpar generalization, particularly if there is a tendency to assume that these problems are entirely solvable without appropriate computational and data infrastructure. We thus recommend that efforts focus on building diverse, well-documented datasets that involve clinical experts while enabling collaborative and decentralized training. Finally, the clinical and research communities should strive to ensure that foundation models are transparent, clinically relevant, and broadly applicable. We expand on these efforts in Table 3.
We posit that these recommendations can make foundation models more transparent, robust and performant while also increasing the trust from both medical professionals and patients alike.
References
Fernandez M (2021) High-end global computed tomography purchases to propel the high-end CT segment revenue. In: Frost & Sullivan. Available via https://www.frost.com/news/press-releases/high-end-global-computed-tomography-purchases-to-propel-the-high-end-ct-segment-revenue/. Accessed 24 Dec 2024
Mahesh M, Ansari AJ, Mettler Jr FA (2023) Patient exposure from radiologic and nuclear medicine procedures in the United States and worldwide: 2009–2018. Radiology 307:e221263
Henderson M (2022) Radiology facing a global shortage. Available via https://www.rsna.org/news/2022/may/global-radiologist-shortage Accessed 22 Oct 2024
Goh CXY, Ho FCH (2023) The growing problem of radiologist shortages: perspectives from Singapore. Korean J Radiol 24:1176–1178
European Society of Radiology (ESR) (2022) Attracting the next generation of radiologists: a statement by the European Society of Radiology (ESR). Insights Imaging 13:84
Bailey CR, Bailey AM, McKenney AS, Weiss CR (2022) Understanding and appreciating burnout in radiologists. Radiographics 42:E137–E139
Fawzy NA, Tahir MJ, Saeed A et al (2023) Incidence and factors associated with burnout in radiologists: a systematic review. Eur J Radiol Open 11:100530
Konstantinidis K (2023) The shortage of radiographers: a global crisis in healthcare. J Med Imaging Radiat Sci 55:101333
Kramer D (2023) Alarm sounded over declining US radiation professional workforce. Phys Today 76:18–21
Lång K, Dustler M, Dahlblom V et al (2021) Identifying normal mammograms in a large screening population using artificial intelligence. Eur Radiol 31:1687–1692
Dahlblom V, Andersson I, Lång K et al (2021) Artificial intelligence detection of missed cancers at digital mammography that were detected at digital breast tomosynthesis. Radiol Artif Intell 3:e200299
Houssami N, Hofvind S, Soerensen AL et al (2021) Interval breast cancer rates for digital breast tomosynthesis versus digital mammography population screening: an individual participant data meta-analysis. EClinicalMedicine 34:100804
Çelik L, Aribal E (2024) The efficacy of artificial intelligence (AI) in detecting interval cancers in the national screening program of a middle-income country. Clin Radiol 79:e885–e891
Nanaa M, Gupta VO, Hickman SE et al (2024) Accuracy of an artificial intelligence system for interval breast cancer detection at screening mammography. Radiology 312:e232303
Anderson AW, Marinovich ML, Houssami N et al (2022) Independent external validation of artificial intelligence algorithms for automated interpretation of screening mammography: a systematic review. J Am Coll Radiol 19:259–273
Bennani S, Regnard N-E, Ventre J et al (2023) Using AI to improve radiologist performance in detection of abnormalities on chest radiographs. Radiology 309:e230860
Farouk S, Osman AM, Awadallah SM, Abdelrahman AS (2023) The added value of using artificial intelligence in adult chest X-rays for nodules and masses detection in daily radiology practice. Egypt J Radiol Nucl Med. https://doi.org/10.1186/s43055-023-01093-y
Arbabshirani MR, Fornwalt BK, Mongelluzzo GJ et al (2018) Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. NPJ Digit Med 1:9
Cai JC, Nakai H, Kuanar S et al (2024) Fully automated deep learning model to detect clinically significant prostate cancer at MRI. Radiology 312:e232635
Saha A, Bosma JS, Twilt JJ et al (2024) Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study. Lancet Oncol 25:879–887
Horvat N, Veeraraghavan H, Nahas CSR et al (2022) Combined artificial intelligence and radiologist model for predicting rectal cancer treatment response from magnetic resonance imaging: an external validation study. Abdom Radiol (NY) 47:2770–2782
Baldwin DR, Gustafson J, Pickup L et al (2020) External validation of a convolutional neural network artificial intelligence tool to predict malignancy in pulmonary nodules. Thorax 75:306–312
Baldwin D, Gustafson J, Pickup L et al (2020) Development and external validation of a new convolutional neural networks algorithm derived artificial intelligence tool to predict malignancy in pulmonary nodules. Lung Cancer 139:S7–S8
Ginn JS, Gay HA, Hilliard J et al (2023) A clinical and time savings evaluation of a deep learning automatic contouring algorithm. Med Dosim 48:55–60
Palazzo G, Mangili P, Deantoni C et al (2023) Real-world validation of artificial intelligence-based computed tomography auto-contouring for prostate cancer radiotherapy planning. Phys Imaging Radiat Oncol 28:100501
Bird D, Speight R, Andersson S et al (2024) Deep learning MRI-only synthetic-CT generation for pelvis, brain and head and neck cancers. Radiother Oncol 191:110052
Marti-Bonmati L, Koh D-M, Riklund K et al (2022) Considerations for artificial intelligence clinical impact in oncologic imaging: an AI4HI position paper. Insights Imaging 13:89
Kondylakis H, Ciarrocchi E, Cerda-Alberich L et al (2022) Position of the AI for Health Imaging (AI4HI) network on metadata models for imaging biobanks. Eur Radiol Exp 6:29
Kondylakis H, Kalokyri V, Sfakianakis S et al (2023) Data infrastructures for AI in medical imaging: a report on the experiences of five EU projects. Eur Radiol Exp 7:20
Lekadir K, Feragen A, Fofanah AJ et al (2023) FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 388:e081554
Bernstein MH, Atalay MK, Dibble EH et al (2023) Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography. Eur Radiol 33:8263–8269
Dratsch T, Chen X, Rezazade Mehrizi M et al (2023) Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307:e222176
Rodrigues NM, de Almeida JG, Verde ASC et al (2024) Analysis of domain shift in whole prostate gland, zonal and lesions segmentation and detection, using multicentric retrospective data. Comput Biol Med 171:108216
Ong Ly C, Unnikrishnan B, Tadic T et al (2024) Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data. NPJ Digit Med 7:124
Rudolph J, Schachtner B, Fink N et al (2022) Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis. Sci Rep 12:12764
Almeida JG de, Rodrigues NM, Castro Verde AS et al (2025) Impact of scanner manufacturer, endorectal coil use, and clinical variables on deep learning-assisted prostate cancer classification using multiparametric MRI. Radiol Artif Intell 7:e230555
Goetz L, Seedat N, Vandersluis R, van der Schaar M (2024) Generalization-a key challenge for responsible AI in patient-facing clinical applications. NPJ Digit Med 7:126
Yang Y, Zhang H, Gichoya JW et al (2024) The limits of fair medical imaging AI in real-world generalization. Nat Med 30:2838–2848
He D, Wang R, Xu Z et al (2024) The use of artificial intelligence in the treatment of rare diseases: a scoping review. Intractable Rare Dis Res 13:12–22
Echeverry-Quiceno LM, Candelo E, Gómez E et al (2023) Population-specific facial traits and diagnosis accuracy of genetic and rare diseases in an admixed Colombian population. Sci Rep 13:6869
Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C (2020) Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. Proc ACM Conf Health Inference Learn 2020:151–159
van Kolfschooten H, van Oirschot J (2024) The EU Artificial Intelligence Act (2024): implications for healthcare. Health Policy 149:105152
Lind Plesner L, Müller FC, Brejnebøl MW et al (2023) Commercially available chest radiograph AI tools for detecting airspace disease, pneumothorax, and pleural effusion. Radiology 308:e231236
Niehoff JH, Kalaitzidis J, Kroeger JR et al (2023) Evaluation of the clinical performance of an AI-based application for the automated analysis of chest X-rays. Sci Rep 13:3680
Roschewitz M, Khara G, Yearsley J et al (2023) Automatic correction of performance drift under acquisition shift in medical image classification. Nat Commun 14:6608
Sahiner B, Chen W, Samala RK, Petrick N (2023) Data drift in medical machine learning: implications and potential remedies. Br J Radiol 96:20220878
Moor M, Banerjee O, Abad ZSH et al (2023) Foundation models for generalist medical artificial intelligence. Nature 616:259–265
Wang J, Wang K, Yu Y et al (2024) Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat Med. https://doi.org/10.1038/s41591-024-03359-y
Alowais SA, Alghamdi SS, Alsuhebany N et al (2023) Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 23:689
Etchemendy J (2021) Introducing the Center for Research on Foundation Models (CRFM). In: Stanford HAI. Available via https://hai.stanford.edu/news/introducing-center-research-foundation-models-crfm. Accessed 24 Dec 2024
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, pp 6000–6010
Bai Y, Jones A, Ndousse K et al (2022) Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2204.05862
Zhou Y, Chia MA, Wagner SK et al (2023) A foundation model for generalizable disease detection from retinal images. Nature 622:156–163
Lee Y, Ferber D, Rood JE et al (2024) How AI agents will change cancer research and oncology. Nat Cancer 5:1765–1767
Gao S, Fang A, Huang Y et al (2024) Empowering biomedical discovery with AI agents. Cell 187:6125–6151
Moritz M, Topol E, Rajpurkar P (2025) Coordinated AI agents for advancing healthcare. Nat Biomed Eng 9:432–438
Zou J, Topol EJ (2025) The rise of agentic AI teammates in medicine. Lancet 405:457
Tiu E, Talius E, Patel P et al (2022) Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng 6:1399–1406
Wu C, Zhang X, Zhang Y et al (2023) Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Preprint at https://doi.org/10.48550/arXiv.2308.02463
Zhou L, Liu H, Bae J et al (2023) Self pre-training with masked autoencoders for medical image classification and segmentation. In: 2023 IEEE 20th international symposium on biomedical imaging (ISBI). IEEE, pp 1–6
Huang S-C, Pareek A, Jensen M et al (2023) Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digit Med 6:74
Wolf D, Payer T, Lisson CS et al (2023) Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging. Sci Rep 13:20260
Lin M, Li T, Sun Z et al (2024) Improving fairness of automated chest radiograph diagnosis by contrastive learning. Radiol Artif Intell 6:e230342
Almeida J, Castro Verde AS, Gaivão A et al (2024) Self-supervised learning for volumetric imaging: a prostate cancer biparametric magnetic resonance imaging case study. Social Science Research Network
Ma J, He Y, Li F et al (2024) Segment anything in medical images. Nat Commun 15:654
Kirillov A, Mintun E, Ravi N et al (2023) Segment anything. Preprint at https://doi.org/10.48550/arXiv.2304.02643
Azizi S, Culp L, Freyberg J et al (2023) Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat Biomed Eng 7:756–779
Zhang L, Liu M, Wang L et al (2024) Constructing a large language model to generate impressions from findings in radiology reports. Radiology 312:e240885
Wu W, Li M, Wu J et al (2023) Learning to generate radiology findings from impressions based on large language model. In: 2023 IEEE international conference on big data (BigData). IEEE, pp 2550–2554
Serapio A, Chaudhari G, Savage C et al (2024) An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study. BMC Med Imaging 24:254
Ayed B (2024) Few-shot adaptation of medical vision-language models. In: MICCAI 2024—open access. Available via https://papers.miccai.org/miccai-2024/328-Paper2320.html. Accessed 13 May 2025
Mahapatra D, Bozorgtabar B, Ge Z (2021) Medical image classification using generalized zero shot learning. In: 2021 IEEE/CVF international conference on computer vision workshops (ICCVW). IEEE, pp 3337–3346
Jang J, Kyung D, Kim SH et al (2024) Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders. Sci Rep 14:23199
Giuffrè M, You K, Shung DL (2024) Evaluating ChatGPT in medical contexts: the imperative to guard against hallucinations and partial accuracies. Clin Gastroenterol Hepatol 22:1145–1146
Gilbert S, Harvey H, Melvin T et al (2023) Large language model AI chatbots require approval as medical devices. Nat Med 29:2396–2398
Temperley HC, O’Sullivan NJ, Mac Curtain BM et al (2024) Current applications and future potential of ChatGPT in radiology: a systematic review. J Med Imaging Radiat Oncol 68:257–264
French RM (1999) Catastrophic forgetting in connectionist networks. Trends Cogn Sci 3:128–135
He L, Xia M, Henderson P (2024) What is in your safe data? Identifying benign data that breaks safety. In: First Conference on Language Modeling. https://openreview.net/forum?id=Hi8jKh4HE9
Qi X, Zeng Y, Xie T et al (2024) Fine-tuning aligned language models compromises safety, even when users do not intend to! In: The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria. https://openreview.net/forum?id=hTEGyKf0dZ
Soutif A, Magistri S, van de Weijer J, Bagdanov AD (2025) An empirical analysis of forgetting in pre-trained models with incremental low-rank updates. In: Conference on Lifelong Learning Agents. PMLR, pp 996–1012
Glocker B, Jones C, Roschewitz M, Winzeck S (2023) Risk of bias in chest radiography deep learning foundation models. Radiol Artif Intell 5:e230060
Khan MO, Afzal MM, Mirza S, Fang Y (2023) How fair are medical imaging foundation models? PMLR 225:217–231
Xiong Z, Wang X, Zhou Y et al (2025) How generalizable are foundation models when applied to different demographic groups and settings? NEJM AI. https://doi.org/10.1056/aics2400497
ImageNet. Available via https://www.image-net.org/. Accessed 13 May 2025
Simonite T (2018) Your Instagram #dogs and #cats are training Facebook’s AI. In: WIRED. https://www.wired.com/story/your-instagram-dogs-and-cats-are-training-facebooks-ai/
Johnson AEW, Pollard TJ, Berkowitz SJ et al (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. https://doi.org/10.1038/s41597-019-0322-0
Ye J, Cheng J, Chen J et al (2023) SA-Med2D-20M dataset: segment anything in 2D medical imaging with 20 million masks. Preprint at https://doi.org/10.48550/arXiv.2311.11969
Cheng J, Fu B, Ye J et al (2025) Interactive medical image segmentation: a benchmark dataset and baseline. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 20841–20851
Obuchowski A (2024) Universal medical image encoder. TheLionAI V2 (blog). https://www.thelion.ai/post/universal-medical-image-encoder
Hamamci IE, Er S, Wang C et al (2024) Developing generalist foundation models from a multimodal dataset for 3D computed tomography. Preprint at https://doi.org/10.48550/arXiv.2403.17834
Littlejohns TJ, Holliday J, Gibson LM et al (2020) The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat Commun 11:2624
LaBella D, Schumacher K, Mix M et al (2024) Brain tumor segmentation (BraTS) challenge 2024: meningioma radiotherapy planning automated segmentation. Preprint at https://doi.org/10.48550/arXiv.2405.18383
Wasserthal J, Breit H-C, Meyer MT et al (2023) TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiol Artif Intell 5:e230024
Akinci D’Antonoli T, Berger LK, Indrakanti AK et al (2025) TotalSegmentator MRI: robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 314:e241613
Shwartzman O, Gazit H, Ben-Aryeh G et al (2025) The worrisome impact of an inter-rater bias on neural network training. In: Lecture notes in electrical engineering. Springer Nature Singapore, Singapore, pp 463–473
Kang C, Lee C, Song H et al (2023) Variability matters: evaluating inter-rater variability in histopathology for robust cell detection. In: Lecture notes in computer science. Springer Nature Switzerland, Cham, pp 552–565
Gilson A, Safranek CW, Huang T et al (2023) How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 9:e45312
Shieh A, Tran B, He G et al (2024) Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep 14:9330
Newton PM, Summers CJ, Zaheer U et al (2025) Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions. Med Sci Educ 35:721–729
Sonoda Y, Kurokawa R, Nakamura Y et al (2024) Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn J Radiol 42:1231–1235
Chen Z, Chambara N, Wu C et al (2024) Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine. https://doi.org/10.1007/s12020-024-04066-x
Ayoub NF, Balakrishnan K, Ayoub MS et al (2024) Inherent bias in large language models: a random sampling analysis. Mayo Clinic Proc Digit Health 2:186–191
Shah SV (2024) Accuracy, consistency, and hallucination of large language models when analyzing unstructured clinical notes in electronic medical records. JAMA Netw Open 7:e2425953
Kotek H, Dockum R, Sun D (2023) Gender bias and stereotypes in large language models. In: Proceedings of the ACM Collective Intelligence Conference (CI '23). Association for Computing Machinery, New York, NY, USA, pp 12–24. https://doi.org/10.1145/3582269.3615599
Tjuatja L, Chen V, Wu T et al (2024) Do LLMs exhibit human-like response biases? A case study in survey design. Trans Assoc Comput Linguist 12:1011–1026
Sun Z, Ong H, Kennedy P et al (2023) Evaluating GPT4 on impressions generation in radiology reports. Radiology 307:e231259
Ziegelmayer S, Marka AW, Lenhart N et al (2023) Evaluation of GPT-4’s chest X-ray impression generation: a reader study on performance and perception. J Med Internet Res 25:e50865
Tang L, Sun Z, Idnay B et al (2023) Evaluating large language models on medical evidence summarization. NPJ Digit Med 6:158
EUCAIM (2023) Home. In: Cancer Image Europe. Available via https://cancerimage.eu/. Accessed 24 Oct 2024
Cossío F, Schurz H, Engström M et al (2023) VAI-B: a multicenter platform for the external validation of artificial intelligence algorithms in breast imaging. J Med Imaging 10:061404
Bujotzek MR, Akünal Ü, Denner S et al (2025) Real-world federated learning in radiology: hurdles to overcome and benefits to gain. J Am Med Inform Assoc 32:193–205
Ouyang L, Wu J, Jiang X et al (2022) Training language models to follow instructions with human feedback. Preprint at https://doi.org/10.48550/arXiv.2203.02155
Zhu F, Ma S, Cheng Z et al (2024) Open-world machine learning: a review and new outlooks. Preprint at https://doi.org/10.48550/arXiv.2403.01759
Turner M, Ive J, Velupillai S (2021) Linguistic uncertainty in clinical NLP: a taxonomy, dataset and approach. In: Lecture notes in computer science. Springer, Cham, pp 129–141
Ulmer D, Gubri M, Lee H, Yun S, Oh S (2024) Calibrating large language models using their generations only. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, pp 15440–15459
Testoni A, Fernández R (2024) Asking the right question at the right time: human and model uncertainty guidance to ask clarification questions. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, St. Julian’s, Malta, pp 258–275
Edwards L (2022) Expert explainer: the EU AI Act proposal. Available via https://www.adalovelaceinstitute.org/resource/eu-ai-act-explainer/. Accessed 9 Apr 2024
Touvron H, Lavril T, Izacard G et al (2023) LLaMA: open and efficient foundation language models. Preprint at https://doi.org/10.48550/arXiv.2302.13971
Biderman S, Schoelkopf H, Anthony QG et al (2023) Pythia: a suite for analyzing large language models across training and scaling. In: International Conference on Machine Learning, PMLR, pp 2397–2430
Zhao H, Chen H, Yang F et al (2024) Explainability for large language models: a survey. ACM Trans Intell Syst Technol 15:1–38
Isensee F, Rokuss M, Krämer L et al (2025) nnInteractive: redefining 3D promptable segmentation. Preprint at https://doi.org/10.48550/arXiv.2503.08373
Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Zixuan G, Hu X, Tang H, Liu Y (2025) Towards auto-regressive next-token prediction: in-context learning emerges from generalization. In: The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore. https://openreview.net/forum?id=gK1rl98VRp
DeepSeek-AI, Bi X, Chen D et al (2024) DeepSeek LLM: scaling open-source language models with longtermism. Preprint at https://doi.org/10.48550/arXiv.2401.02954
Gemma Team, Kamath A, Ferret J et al (2025) Gemma 3 technical report. Preprint at https://doi.org/10.48550/arXiv.2503.19786
Llama 3.2: revolutionizing edge AI and vision with open, customizable models. In: Meta AI. Available via https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed 13 May 2025
Ollama. Available via https://ollama.com/. Accessed 13 May 2025
Gerganov G, ggml-org community (2023) llama.cpp [Software]. GitHub. https://github.com/ggml-org/llama.cpp
Wolf T, Debut L, Sanh V et al (2019) HuggingFace's Transformers: state-of-the-art natural language processing. Preprint at https://doi.org/10.48550/arXiv.1910.03771
Deng J, Dong W, Socher R et al (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
Nguyen HQ, Lam K, Le LT et al (2022) VinDr-CXR: an open dataset of chest X-rays with radiologist’s annotations. Sci Data 9:429
Pham HH, Nguyen NH, Tran TT et al (2023) PediCXR: an open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children. Sci Data 10:240
Koitka S, Baldini G, Kroll L et al (2024) SAROS: a dataset for whole-body region and organ segmentation in CT imaging. Sci Data 11:483
Radiological Society of North America (RSNA) (2022) RadLex radiology lexicon (version 1.0.2). Available via http://radlex.org
Funding
J.G.d.A. is funded by the Horizon Health grant (grant ID: 952159).
Author information
Contributions
J.G.d.A. wrote the main manuscript. The remaining authors participated in discussions about the content and provided edits to the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
de Almeida, J.G., Alberich, L.C., Tsakou, G. et al. Foundation models for radiology—the position of the AI for Health Imaging (AI4HI) network. Insights Imaging 16, 168 (2025). https://doi.org/10.1186/s13244-025-02056-9