
Foundation models for radiology—the position of the AI for Health Imaging (AI4HI) network

Abstract

Foundation models are large models trained on big data that can be adapted to a wide range of downstream tasks. In radiology, these models can potentially address several gaps in fairness and generalization, as they can be trained on massive datasets without labelled data and then adapted to downstream tasks using only small amounts of annotated data. This eases one of the limiting bottlenecks in clinical model development—data annotation—since these models can be trained through a variety of techniques that require little more than radiological images, with or without their corresponding radiological reports. However, foundation models may be insufficient on their own, as they are affected—albeit to a smaller extent than traditional supervised learning approaches—by the same issues that lead to underperforming models, such as a lack of transparency/explainability and biases. To address these issues, we advocate that the development of foundation models should not only be pursued but also be accompanied by the development of a decentralized clinical validation and continuous training framework. This does not guarantee the resolution of the problems associated with foundation models, but it enables developers, clinicians and patients to know when, how and why models should be updated, creating a clinical AI ecosystem that is better able to serve all stakeholders.

Critical relevance statement

Foundation models may mitigate issues like bias and poor generalization in radiology AI, but challenges persist. We propose a decentralized, cross-institutional framework for continuous validation and training to enhance model reliability, safety, and clinical utility.

Key Points

  • Foundation models trained on large datasets reduce annotation burdens and improve fairness and generalization in radiology.

  • Despite improvements, they still face challenges like limited transparency, explainability, and residual biases.

  • A decentralized, cross-institutional framework for clinical validation and continuous training can strengthen reliability and inclusivity in clinical AI.

Introduction

Medical images offer a wealth of relevant information that clinicians use to better diagnose and treat patients. Due to their centrality in diagnosis, the number of acquired scans has steadily increased over the years [1, 2], creating great demand for trained professionals to acquire, interpret and integrate results—a workload that is presently hard to fulfill adequately [3,4,5] and that contributes to rising burnout rates among radiologists [6, 7]. To make matters worse, radiographers and medical physicists are also facing a global shortage [8, 9], making this problem transversal to image scheduling, acquisition, quality control, interpretation, and reporting.

Medical imaging AI systems have recently emerged as potential aids, not only reducing the burden placed on radiologists but also improving the patient journey and healthcare. For example, AI can boost sensitivity in breast cancer detection by capturing lesions missed by radiologists [10,11,12,13,14,15], in chest X-ray reading by assisting nodule interpretation [16, 17], and in head CT by detecting intracranial hemorrhage [18]. Deep learning tools also improve the detection of clinically significant cancer in prostate MRI while reducing the likelihood of false positives [19, 20], the prediction of pathological complete response in rectal cancer using MRI radiomics [21], and the detection and delineation of lung nodules in chest CT with convolutional neural networks [22, 23]. In radiotherapy, AI can assist in contouring (i.e., through semi-automatic or automatic segmentation of organs at risk or clinical target volumes) with consistent time savings [24, 25]. AI can also generate synthetic CT from MRI for MRI-only radiotherapy treatment planning, using models trained on real CT images [26].

The potential of AI in radiology, including foundation and generative models, has been discussed among experts and stakeholders from the AI for Health Imaging (AI4HI) network. The AI4HI network connects five large-scale EU-funded consortia developing, validating and deploying AI for health imaging (EuCanImage, CHAIMELEON, INCISIVE, ProCAncer-I, and PRIMAGE), incorporating professionals from both technical and clinical backgrounds [27,28,29,30]. Based on the experiences and results of these projects, the consensus was that while emerging AI solutions hold promise, careful attention is required regarding their robustness, generalizability, long-term performance and ethical compliance. AI-driven predictions can sometimes lead to incorrect diagnoses, which may prompt clinicians to make erroneous treatment decisions [31, 32]. These issues largely stem from clinical AI models’ failure to generalize to external datasets or under-represented populations [33, 34].

In chest X-ray classification, Rudolph et al highlighted how patient positioning could impact the performance of AI systems on external datasets [35], while Kim et al showed that shifts in disease prevalence between deployment and training cohorts could cause significant performance reductions [33, 36]. Differences in performance are particularly concerning when they disproportionately affect specific demographic groups, such as certain countries, races, genders, or age groups [37, 38]. Biased models that underperform for specific groups can result in overtreatment or underdiagnosis, depending on the direction of the bias.

Furthermore, the large data requirements of traditional AI systems may cause them to underperform when diagnosing relatively rare conditions [39]. For example, a model trained to diagnose rare syndromes based on facial recognition underperformed in non-white populations [40]. Hidden stratification—when there are underlying but unobservable subsets of data with different levels of performance—can also lead to underperforming models in unforeseen circumstances [41]. These disparities not only affect patients but also have implications for healthcare providers, as declining performance may expose institutions and clinicians to legal liability for inaccurate predictions [42].

Finally, systematic and local assessments have indicated that AI systems may struggle to maintain consistent performance, both at launch and over time, hindering their viability as clinical tools [43, 44]. Additionally, differences between datasets in terms of center, acquisition protocol and scanner manufacturer can hinder model performance [33, 36]. These phenomena, which may involve shifts in the spectrum of data, acquisition software, and clinical targets, cause AI models to degrade over time if not regularly monitored and updated [45, 46]. Continuous evaluation and recalibration are essential to ensure that AI systems remain tuned and effective in clinical practice, particularly as medical conditions, imaging technologies, and patient demographics evolve. Rigorously validating AI models and accounting for biases requires large, demographically balanced, annotated datasets; assembling these is often unrealistic due to economic, legal and ethical constraints.

Foundation models: potential and limitations

Foundation models are large-scale AI models pre-trained on diverse, non-specific tasks (such as language generation, image captioning, or image reconstruction) that serve as adaptable starting points for specialized applications [47]. Some foundation models fall under the umbrella of generative AI, a subset of AI that focuses on synthesizing new content, be it text, images, video, voice, or other information modalities. Generative AI leverages these foundational architectures to produce novel outputs, while foundation models more broadly serve as a base for both generative and non-generative applications. For example, in healthcare, generative AI powered by foundation models enables tasks like synthetic data generation for rare disease research or personalized treatment simulations [48], whereas non-generative applications encompass clinical decision support systems, diagnostic assistance, and workflow optimization, which aid in improving diagnostic accuracy and clinical efficiency [49]. To assist the reader, we provide a glossary containing helpful definitions relating to foundation models, used throughout this piece (Table 1).

Table 1 A short glossary containing helpful concepts when reading about and discussing foundation models

Foundation models improve the performance of AI in specific tasks by leveraging very large amounts of data that are usually neither labelled nor otherwise curated. Such training datasets have advantages: not only do they better represent the variations in quality observed in real-world data, but they are also more varied in terms of populations and acquisition conditions or parameters. Importantly, such data are easier to obtain, as they require less effort and time to collect and make available for AI model training. By leveraging these large, readily available and uncurated collections of data, powerful foundation models can be trained.

Many foundation models are used in the form of chat interfaces based on large language models (LLMs), which are among the earlier forms of foundation models [50]. LLMs are immensely complex models (typically more than 1 billion parameters) that are usually based on the transformer architecture [51] and capable of generating human-like text. Training an LLM is conceptually straightforward: the process focuses on predicting the next word or token (a word or part of a word) in a sequence [51]. By significantly scaling up this process, both computationally and in terms of data, these models learn to produce remarkably fluent and coherent text. Foundation models are developed not only for text-based tasks but also for other modalities, such as images, or combinations of information sources, such as vision language models (VLMs), which incorporate images and text into their processing capabilities.
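To make the next-token objective concrete, the sketch below trains a toy causal transformer to predict token t + 1 from the tokens up to t. It is a minimal illustration only: the vocabulary, model sizes and random token batch are assumptions for demonstration, orders of magnitude smaller than any real LLM.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.encoder(self.embed(tokens), mask=mask))

model = TinyCausalLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
tokens = torch.randint(0, vocab_size, (8, seq_len))  # random toy "text" batch

logits = model(tokens[:, :-1])  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

Scaled up by many orders of magnitude in parameters and data, this same objective yields the fluent text generation described above.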

Building a foundation model typically involves a “pre-training” process, i.e., a process during which models are trained on vast amounts of data that may have minimal or no annotations (Table 2). This core aspect of foundation models—the existence of a pre-training stage—is crucial for their downstream performance when foundation models are adapted to perform specific tasks (Fig. 1). For instance, in chatbots, pre-training is typically followed by some form of optimization that makes models more conversational and reduces the possibility of harmful output [52]. LLMs in the medical domain may be pre-trained on very large medical text datasets such as electronic health records (EHRs), clinical notes, and scientific literature. In other cases, fine-tuning general-purpose (non-medical) LLMs with high-quality, curated medical data is preferred [53] to avoid the high computational expense of pre-training. In medical image analysis, pre-training is typically followed by prompting or fine-tuning general models to perform specific tasks on specific anatomies and diseases, such as medical image segmentation, cancer diagnosis, or disease prediction. When pre-training uses no labels (as is the case with LLMs), this process is also known as self-supervised learning, since the “supervision” (typically segmentation or classification labels) is provided by the input data themselves (in the case of LLMs, the next token or word in a sentence provides the supervision). Some more recent approaches focus on what are known as “AI agents” (the moniker given to specialized LLMs, VLMs, and other ML models capable of interacting with one another and with computational tools, functions, or software programs) and show potential application in oncological diagnosis and research [54, 55]. Multi-agent frameworks, which coordinate multiple AI agents, have been further posited as an essential step in advancing the collaboration of clinicians with AI systems by triggering specialist AI agents in an automated or semi-automated fashion [56, 57].

Table 2 Summary of some pre-training strategies that can be used to train foundation models using image data
Fig. 1 Typical foundation model workflow. Foundation models for radiological images are trained using different sources of information (radiological studies, series or images and, optionally, reports or clinical data). After training, foundation models can then be further applied or fine-tuned to a wide variety of clinically relevant downstream tasks
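As a concrete illustration of the adaptation step in this workflow, the sketch below fine-tunes only a small task head on top of a frozen pre-trained backbone. torchvision’s ImageNet-pre-trained ResNet-18 stands in here for a radiology foundation model, and the binary task and random batch are assumptions for demonstration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained backbone (a stand-in for a radiology foundation model).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()           # drop the original classification head
for p in backbone.parameters():
    p.requires_grad = False           # freeze the pre-trained features

head = nn.Linear(512, 2)              # small task head, e.g., benign vs. malignant
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)  # placeholder for a small labelled batch
labels = torch.randint(0, 2, (4,))

features = backbone(images)           # frozen embeddings, shape (4, 512)
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
optimizer.step()
```

Because only the head is trained, a handful of labelled cases can suffice, which is precisely why pre-training lowers the annotation burden discussed throughout this piece.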

In the field of radiology, pre-training with no annotations has been performed extensively. These approaches include report or caption generation for medical images (allowing models to learn which parts of the image are relevant for radiologists; Fig. 2) [58, 59], masked autoencoders (parts of the image are removed and a model is trained to predict the missing parts; Fig. 3) [60], and contrastive learning (a model learns how to numerically characterize an image consistently despite alterations to its content; Fig. 4) [61,62,63,64]. Other approaches make use of annotations to derive foundation models. MedSAM [65], for instance, is a well-performing model for generic assisted medical image segmentation. Based on the Segment Anything Model (SAM) [66], MedSAM was trained to generate segmentation masks for different anatomically relevant regions using bounding boxes (rectangles or cubes that enclose a given anatomical region, which can thus be semi-automatically segmented with a high level of accuracy).

Fig. 2 Pre-training with report generation. Input images must have paired reports. Each image is converted to a numerical embedding, which is used as input for a text generation model. The generated text is compared with the original report, and the model is trained to generate reports that closely resemble those provided as ground truths. This leads models to learn how to generate embeddings that closely relate to the reports, thus capturing relevant diagnostic information
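A toy version of this objective is sketched below: an image embedding conditions a small text decoder, which is trained to reproduce the paired report token by token. The encoder, decoder size and random image/report tensors are illustrative assumptions, not a real architecture.

```python
import torch
import torch.nn as nn

vocab, d = 500, 64
img_encoder = nn.Linear(32 * 32, d)               # toy image encoder
tok_embed = nn.Embedding(vocab, d)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, nhead=4, batch_first=True), num_layers=1)
lm_head = nn.Linear(d, vocab)

image = torch.randn(2, 1, 32, 32)                 # toy grayscale "radiographs"
report = torch.randint(0, vocab, (2, 16))         # tokenized paired reports

memory = img_encoder(image.flatten(1)).unsqueeze(1)  # image embedding as context
mask = nn.Transformer.generate_square_subsequent_mask(report.size(1) - 1)
hidden = decoder(tok_embed(report[:, :-1]), memory, tgt_mask=mask)

# The decoder is penalized whenever it fails to predict the next report token.
loss = nn.functional.cross_entropy(
    lm_head(hidden).reshape(-1, vocab), report[:, 1:].reshape(-1))
loss.backward()
```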

Fig. 3 Pre-training with masked auto-encoders. Small parts (also known as patches) are removed from each image. A model will then learn how to predict the image content in each missing patch, thus creating a model capable of capturing rich contextual information without having access to the entire image
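The sketch below strips this idea to its core: random patches are hidden and a model is trained to reconstruct only those patches. The linear encoder/decoder and the 75% masking ratio are simplifying assumptions (practical masked autoencoders use vision transformers).

```python
import torch
import torch.nn as nn

patches = torch.randn(8, 64, 256)   # batch of 8 images, 64 flattened patches each
mask = torch.rand(8, 64) < 0.75     # hide 75% of patches at random

encoder = nn.Linear(256, 128)
decoder = nn.Linear(128, 256)

visible = patches.clone()
visible[mask] = 0.0                 # simplified masking: zero out hidden patches
recon = decoder(encoder(visible))

# The loss is computed only on the patches the model never saw.
loss = nn.functional.mse_loss(recon[mask], patches[mask])
loss.backward()
```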

Fig. 4 Pre-training with contrastive learning. Input images may have paired text data (i.e., reports), but this is not required. Each image or report is converted to a numerical embedding, and these numerical vectors are brought closer together if they have common characteristics (same condition, patient or modality) and pulled apart if they have something conflicting (different patients or conditions, for example). In the end, embeddings are capable of characterizing both images and reports in a way that is semantically meaningful, and related embeddings cluster in this high-dimensional space
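Below is a minimal sketch of this objective in the CLIP style used by several of the cited works: matching image/report embedding pairs are pulled together and all other pairings in the batch are pushed apart. The linear projections, feature sizes and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_feats = torch.randn(16, 512)   # stand-in image encoder outputs
text_feats = torch.randn(16, 512)    # stand-in report encoder outputs

proj_img = nn.Linear(512, 128)       # projection heads into a shared space
proj_txt = nn.Linear(512, 128)

img = F.normalize(proj_img(image_feats), dim=-1)
txt = F.normalize(proj_txt(text_feats), dim=-1)

logits = img @ txt.t() / 0.07        # cosine similarities, scaled by a temperature
targets = torch.arange(16)           # the i-th image matches the i-th report

# Symmetric cross-entropy: each image must pick its report, and vice versa.
loss = (F.cross_entropy(logits, targets)
        + F.cross_entropy(logits.t(), targets)) / 2
loss.backward()
```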

A particular advantage of vision foundation models is that they typically perform well when applied to external datasets [58, 65, 67]. They also require fewer annotated data to achieve better performance when compared with models trained using supervised learning alone [61, 64, 67]. For text-based tasks, some pre-trained LLM-based applications in impression and finding generation from radiology reports have shown potential [68,69,70]. In particular, through few-shot learning—where image classifications are performed using a pre-trained model and only a few examples—generic foundation models can be adapted to medical image classification tasks [71]. Zero-shot learning—where no task-specific examples are provided at all—has also been shown to be more data-efficient (i.e., to require fewer training examples) when compared with supervised alternatives [72, 73].
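For intuition, a zero-shot classifier over a joint image-text embedding space can be as simple as the sketch below. It assumes a CLIP-like pre-trained model; the `encode_image` and `encode_text` methods and the prompt template are hypothetical placeholders, not a specific library’s API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, image, class_names):
    """Assign `image` to the class whose text prompt embeds closest to it."""
    prompts = [f"a chest X-ray showing {c}" for c in class_names]
    with torch.no_grad():
        img_emb = F.normalize(model.encode_image(image), dim=-1)
        txt_emb = F.normalize(model.encode_text(prompts), dim=-1)
    scores = img_emb @ txt_emb.t()    # cosine similarity to each class prompt
    return class_names[scores.argmax(dim=-1).item()]

# Usage (with a suitable pre-trained model and preprocessed image):
# zero_shot_classify(model, image, ["no finding", "pleural effusion", "nodule"])
```

No gradient step is taken and no labelled training examples are needed, which is what makes the approach data-efficient.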

However, as is the case with any AI tool, the performance of foundation models is bounded by the errors, limitations and biases of both their human creators and the data on which they have been trained. Known limitations of foundation models include:

  1. Lack of transparency and explainability: due to the extremely large number of parameters and probabilistic associations found in the training data that are difficult to interpret, models become harder to explain.

  2. Confabulations/hallucinations: model outputs may appear to be realistic (i.e., follow the typical structure of real data) but be non-factual. This includes the generation of false information [74, 75] and leads to high rates of inaccuracies in several different applications [76].

  3. Catastrophic forgetting [77]: when models are fine-tuned on small datasets, their performance or alignment with human values and safety principles may drop when tested on samples or datasets where the model used to perform well [78,79,80]. In other words, adapting foundation models to specific tasks may lead to them losing performance on tasks where they previously performed well.

  4. Bias: foundation models may be biased against specific categories or characteristics. For instance, foundation chest X-ray models can exhibit biases in terms of gender and race, making their generic application problematic; for example, performance decreased for female individuals when identifying chest radiographs with no findings, and for Black individuals when identifying pleural effusion [81]. A systematic study of foundation models for medical images showed that sex and race biases are pervasive across foundation model pre-training approaches, despite the use of large amounts of data, and that increasing the amount of pre-training data or fine-tuning on balanced datasets only partially mitigates biases [82]. As an illustrative example, we refer to RETFound—a foundation model for retinal images [53]. While remarkable in its performance, a later analysis of whether RETFound, trained on a diverse Western population, could generalize to an Asian population showed that this foundation model did not provide an advantage compared to foundation models trained on natural image (i.e., non-medical imaging) data [83]. A minimal sketch of the kind of subgroup audit used to surface such biases follows this list.

Additionally, while similar in many ways, key differences exist between natural and clinical images, which make training clinical image foundation models more difficult:

  1. Dataset size: ImageNet, a widely available natural image dataset, contains over 14 million images [84], while private datasets like those owned by Meta reach into the billions [85]. In contrast, the largest chest radiograph dataset—MIMIC-CXR [86]—has approximately 350,000 annotated X-rays. To close this gap, datasets like SA-Med2D-20M have compiled 20 million masks across 4.6 million 2D medical images (58.4% from clinical imaging) [87,88,89]. These datasets are built by pooling publicly available sources—over 100 for SA-Med2D-20M and IMed-361M, and 20 for UMIE. Despite these advances, acquiring and annotating new clinical image data remains the primary challenge.

  2. Dimensionality: as noted, large datasets are typically achieved by treating clinical images as 2D. However, many clinical images are inherently 3D, and this dimensionality carries crucial contextual information essential for training clinical imaging foundation models. Some large 3D datasets exist: CT-RATE includes over 25,000 CT studies [90], and the UK Biobank is collecting cardiac, abdominal, and brain MRIs from up to 100,000 individuals [91]. For segmentation, the BraTS challenges offer a few thousand annotated brain tumor studies [92], and the datasets used to train TotalSegmentator (over 1200 CT studies) and TotalSegmentator MRI (over 500 CT and 600 MRI studies) provide high-quality annotations for over 100 anatomical structures [93, 94].

  3. Signal distribution: unlike natural images, which vary widely (e.g., different animals are easily distinguishable), medical images tend to be highly standardized and similar across individuals due to years of clinical protocol development. Moreover, diagnostically relevant features—hepatocellular carcinoma in abdominal CT, prostate lesions in multiparametric MRI, or intracranial hemorrhages in head CT—typically occupy only a small portion of each image. Although not yet well studied, this subtlety and limited variability may complicate model optimization.

  4. Inter-rater variability in annotation and annotation quality: different people may annotate objects and images differently. While this is rarely discussed for natural images, inter-rater variability in annotations is of paramount importance in clinical imaging. Shwartzman et al recently showed that training a model on annotations from a single individual amplified the inter-rater variability of model outputs in brain MRI [95]. Similarly, decreasing the inter-rater variability in cell segmentation in histopathology led to improved performance despite smaller datasets [96]. A short sketch for quantifying this variability follows this list.
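One common way to quantify this variability is the pairwise Dice overlap between annotators’ masks for the same image, as sketched below; the three random masks are toy placeholders for real expert segmentations.

```python
import itertools
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    intersection = np.logical_and(a, b).sum()
    return 2 * intersection / (a.sum() + b.sum() + 1e-8)

# One binary mask per rater for the same image (toy 64x64 masks).
rng = np.random.default_rng(0)
raters = {f"rater_{i}": rng.random((64, 64)) > 0.5 for i in range(3)}

for (n1, m1), (n2, m2) in itertools.combinations(raters.items(), 2):
    print(f"{n1} vs {n2}: Dice = {dice(m1, m2):.3f}")
```

Low pairwise agreement suggests that annotations should be harmonized, or multiple raters’ masks merged, before they are used for training.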

Finally, when researchers assessed LLMs and, more recently, VLMs on medical licensing examinations, their performance was quite remarkable [97,98,99]. This naturally created a flurry of research and publications in the field of radiology using commercial LLMs for a wide array of tasks (such as diagnosis, report summarization or impression generation). However, results have been mixed [76, 100, 101]: LLMs produce biased responses in medical [102, 103] and non-medical contexts [104, 105], and certain medical LLM applications have proven worse than their human expert equivalents at impression generation [106, 107] and medical evidence summarization [108].

Foundation models as part of the future of AI in clinical practice

If the question is how to obtain models that perform well on present-day data, foundation models may very well be the answer. However, if we ask how a robust and consistent clinical AI-supported ecosystem can thrive and practically serve patients and clinicians in the years to come, foundation models are only part of the solution.

Foundation models can indeed bridge, to some extent, an existing gap in the generalization of machine-learning applications in the clinic by making use of the medical knowledge contained in vast datasets. However, they may still be affected by some of the same limitations (outlined above) that have caused non-foundation models to remain largely unadopted in clinical practice. While the development of well-performing models is important, holistic validation of trustworthiness and continuous learning are equally essential to ensure practical clinical utility and patient safety in an ever-evolving healthcare environment.

Modern decentralized approaches to medical data curation and federated storage, such as those applied in the Cancer Image Europe federated network—which aims to provide large amounts of data and federated learning/federated data processing approaches for medical research and experimentation [109]—can act as groundbreaking foundations for the consistent training, validation, monitoring and continuous improvement of foundation models. Such frameworks may act as enablers for state-of-the-art AI modeling approaches, including foundation models and generative AI tools, by providing the data volume, variety, multimodality, and quality required for their extensive validation and retraining while preserving patient privacy.

Recent literature can also provide important insights into what a consistent, multi-centric, and continuous clinical model validation and training framework can look like. VAI-B, a Swedish national project focusing on the external validation of models, collects data from multiple institutions and, through careful orchestration of different models and their input/output requirements, is capable of delivering accurate estimates for the external performance of multiple models [110]. Similarly, RACOON, a German network of medical centers focusing on data collection for federated learning [111], shows promising results in coordinating between model training in institutions with good computational resources and model validation in institutions with fewer computational resources. Such an arrangement ensures that every institution can be involved in model training and validation. Similar approaches of continuous data collection and decentralized model validation and/or training could be deployed to not only train but also validate promising foundation models.
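At the heart of such decentralized training schemes is an aggregation step of the kind sketched below, here in the form of federated averaging (FedAvg): each site updates a copy of the model on local data, and only weights—never images—leave the institution. The `local_train` callable and the equal site weighting are simplifying assumptions.

```python
import copy
import torch
import torch.nn as nn

def fed_avg_round(global_model: nn.Module, site_loaders, local_train) -> nn.Module:
    """One round of federated averaging across participating sites."""
    site_states = []
    for loader in site_loaders:
        local = copy.deepcopy(global_model)   # each site starts from the global model
        local_train(local, loader)            # training data never leaves the site
        site_states.append(local.state_dict())
    # Average the parameters across sites (equal weighting for simplicity).
    averaged = {k: torch.stack([s[k].float() for s in site_states]).mean(0)
                for k in site_states[0]}
    global_model.load_state_dict(averaged)
    return global_model
```

In a validation-only variant, sites would instead run the frozen global model on local data and return performance metrics, mirroring the VAI-B and RACOON arrangements described above.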

Integrating human-in-the-loop approaches and learning from clinical expert users’ feedback [52, 112] to continually improve AI tools—rendering them dynamically adaptable and generalizable to ever-changing clinical practice—is especially important when using foundation models to address specific tasks [113]. Of equal importance is building “self-awareness” into foundation models by integrating awareness of the model’s limitations and uncertainties, e.g., by deploying uncertainty estimation techniques [114, 115] and providing mechanisms that enable AI to ask for human intervention or feedback when uncertainty is high, e.g., through clarification questions [116].
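A minimal version of this “ask for help when uncertain” behavior is sketched below using Monte Carlo dropout: predictions are sampled with dropout active, and cases whose predictive entropy exceeds a threshold are deferred to a clinician. The model, threshold and sample count are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def predict_or_defer(model, image, n_samples=20, threshold=0.5):
    """Return a class prediction, or defer to a human reader if too uncertain."""
    model.train()   # keeps dropout active at inference time (simplified MC dropout)
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(image), dim=-1) for _ in range(n_samples)]).mean(0)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    if entropy.item() > threshold:
        return "defer to human reader"   # trigger clinician review or clarification
    return probs.argmax(-1).item()
```

Note that calling `model.train()` also affects layers such as batch normalization; a production implementation would enable only the dropout modules.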

Holistic AI trustworthiness frameworks such as FUTURE-AI—which provides recommendations and guidelines for adherence to six main principles of trustworthiness: fairness, universality, traceability, usability, robustness and explainability [30]—can also act as critical guidance in the development, validation and deployment of foundation models that are trustworthy and stand a better chance of being used in clinical practice. An important obstacle to trustworthiness lies in LLMs developed by large companies—such as ChatGPT by OpenAI or Gemini by Google. Data documentation for the pre-training of these models is limited or altogether non-existent, while datasets can be proprietary and inscrutable when models are further optimized using reinforcement learning with human feedback [52] or similar approaches. This is in direct conflict with the traceability principle of the FUTURE-AI framework, which suggests that the whole lifecycle of the model—including its training process and data—be adequately documented and monitored. It also conflicts with the European Union’s AI Act, which requires high-risk AI systems (including clinical AI models) to be fully transparent about their training process and data [117].

Addressing the lack of transparency in proprietary models may involve the use of open LLMs, which openly document data—the first LLaMA models documented the data sources used during training [118], while projects such as Pythia go the extra mile by providing the code necessary for full replication [119]. Technical developments in model explainability can also render model outputs easier to understand and increase trust in them—as highlighted in a recent review, LLM explainability can be achieved at several different levels, some of which mimic easily understandable explanations [120].

Conclusion

Here, we outline issues surrounding modern approaches to clinical machine-learning models using medical imaging, and consider how the development of a larger landscape of foundation models could partially address them. However, foundation models on their own are not sufficient to solve inherent biases or subpar generalization, particularly if there is a tendency to assume that these are entirely solvable without appropriate computational and data infrastructure. We thus recommend that efforts focus on building diverse, well-documented datasets that involve clinical experts while enabling collaborative and decentralized training. Finally, the clinical and research communities should strive to ensure that foundation models are transparent, clinically relevant, and broadly applicable. We expand on these efforts in Table 3.

Table 3 Practical recommendations regarding the future development of foundation models for clinical imaging, and which problems they address

We posit that these recommendations can make foundation models more transparent, robust and performant while also increasing trust among medical professionals and patients alike.

References

  1. Fernandez M (2021) High-end global computed tomography purchases to propel the high-end CT segment revenue. In: Frost & Sullivan. Available via https://www.frost.com/news/press-releases/high-end-global-computed-tomography-purchases-to-propel-the-high-end-ct-segment-revenue/. Accessed 24 Dec 2024

  2. Mahesh M, Ansari AJ, Mettler Jr FA (2023) Patient exposure from radiologic and nuclear medicine procedures in the United States and worldwide: 2009–2018. Radiology 307:e221263

  3. Henderson M (2022) Radiology facing a global shortage. Available via https://www.rsna.org/news/2022/may/global-radiologist-shortage. Accessed 22 Oct 2024

  4. Goh CXY, Ho FCH (2023) The growing problem of radiologist shortages: perspectives from Singapore. Korean J Radiol 24:1176–1178

  5. European Society of Radiology (ESR) (2022) Attracting the next generation of radiologists: a statement by the European Society of Radiology (ESR). Insights Imaging 13:84

  6. Bailey CR, Bailey AM, McKenney AS, Weiss CR (2022) Understanding and appreciating burnout in radiologists. Radiographics 42:E137–E139

  7. Fawzy NA, Tahir MJ, Saeed A et al (2023) Incidence and factors associated with burnout in radiologists: a systematic review. Eur J Radiol Open 11:100530

  8. Konstantinidis K (2023) The shortage of radiographers: a global crisis in healthcare. J Med Imaging Radiat Sci 55:101333

  9. Kramer D (2023) Alarm sounded over declining US radiation professional workforce. Phys Today 76:18–21

  10. Lång K, Dustler M, Dahlblom V et al (2021) Identifying normal mammograms in a large screening population using artificial intelligence. Eur Radiol 31:1687–1692

  11. Dahlblom V, Andersson I, Lång K et al (2021) Artificial intelligence detection of missed cancers at digital mammography that were detected at digital breast tomosynthesis. Radiol Artif Intell 3:e200299

  12. Houssami N, Hofvind S, Soerensen AL et al (2021) Interval breast cancer rates for digital breast tomosynthesis versus digital mammography population screening: an individual participant data meta-analysis. EClinicalMedicine 34:100804

  13. Çelik L, Aribal E (2024) The efficacy of artificial intelligence (AI) in detecting interval cancers in the national screening program of a middle-income country. Clin Radiol 79:e885–e891

  14. Nanaa M, Gupta VO, Hickman SE et al (2024) Accuracy of an artificial intelligence system for interval breast cancer detection at screening mammography. Radiology 312:e232303

  15. Anderson AW, Marinovich ML, Houssami N et al (2022) Independent external validation of artificial intelligence algorithms for automated interpretation of screening mammography: a systematic review. J Am Coll Radiol 19:259–273

  16. Bennani S, Regnard N-E, Ventre J et al (2023) Using AI to improve radiologist performance in detection of abnormalities on chest radiographs. Radiology 309:e230860

  17. Farouk S, Osman AM, Awadallah SM, Abdelrahman AS (2023) The added value of using artificial intelligence in adult chest X-rays for nodules and masses detection in daily radiology practice. Egypt J Radiol Nucl Med. https://doi.org/10.1186/s43055-023-01093-y

  18. Arbabshirani MR, Fornwalt BK, Mongelluzzo GJ et al (2018) Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. NPJ Digit Med 1:9

  19. Cai JC, Nakai H, Kuanar S et al (2024) Fully automated deep learning model to detect clinically significant prostate cancer at MRI. Radiology 312:e232635

  20. Saha A, Bosma JS, Twilt JJ et al (2024) Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study. Lancet Oncol 25:879–887

  21. Horvat N, Veeraraghavan H, Nahas CSR et al (2022) Combined artificial intelligence and radiologist model for predicting rectal cancer treatment response from magnetic resonance imaging: an external validation study. Abdom Radiol (NY) 47:2770–2782

  22. Baldwin DR, Gustafson J, Pickup L et al (2020) External validation of a convolutional neural network artificial intelligence tool to predict malignancy in pulmonary nodules. Thorax 75:306–312

  23. Baldwin D, Gustafson J, Pickup L et al (2020) Development and external validation of a new convolutional neural networks algorithm derived artificial intelligence tool to predict malignancy in pulmonary nodules. Lung Cancer 139:S7–S8

  24. Ginn JS, Gay HA, Hilliard J et al (2023) A clinical and time savings evaluation of a deep learning automatic contouring algorithm. Med Dosim 48:55–60

  25. Palazzo G, Mangili P, Deantoni C et al (2023) Real-world validation of artificial intelligence-based computed tomography auto-contouring for prostate cancer radiotherapy planning. Phys Imaging Radiat Oncol 28:100501

  26. Bird D, Speight R, Andersson S et al (2024) Deep learning MRI-only synthetic-CT generation for pelvis, brain and head and neck cancers. Radiother Oncol 191:110052

  27. Marti-Bonmati L, Koh D-M, Riklund K et al (2022) Considerations for artificial intelligence clinical impact in oncologic imaging: an AI4HI position paper. Insights Imaging 13:89

  28. Kondylakis H, Ciarrocchi E, Cerda-Alberich L et al (2022) Position of the AI for Health Imaging (AI4HI) network on metadata models for imaging biobanks. Eur Radiol Exp 6:29

  29. Kondylakis H, Kalokyri V, Sfakianakis S et al (2023) Data infrastructures for AI in medical imaging: a report on the experiences of five EU projects. Eur Radiol Exp 7:20

  30. Lekadir K, Feragen A, Fofanah AJ et al (2023) FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 388:e081554

  31. Bernstein MH, Atalay MK, Dibble EH et al (2023) Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography. Eur Radiol 33:8263–8269

  32. Dratsch T, Chen X, Rezazade Mehrizi M et al (2023) Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307:e222176

  33. Rodrigues NM, Almeida JG de, Verde ASC et al (2024) Analysis of domain shift in whole prostate gland, zonal and lesions segmentation and detection, using multicentric retrospective data. Comput Biol Med 171:108216

  34. Ong Ly C, Unnikrishnan B, Tadic T et al (2024) Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data. NPJ Digit Med 7:124

  35. Rudolph J, Schachtner B, Fink N et al (2022) Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis. Sci Rep 12:12764

  36. Almeida JG de, Rodrigues NM, Castro Verde AS et al (2025) Impact of scanner manufacturer, endorectal coil use, and clinical variables on deep learning-assisted prostate cancer classification using multiparametric MRI. Radiol Artif Intell 7:e230555

  37. Goetz L, Seedat N, Vandersluis R, van der Schaar M (2024) Generalization-a key challenge for responsible AI in patient-facing clinical applications. NPJ Digit Med 7:126

  38. Yang Y, Zhang H, Gichoya JW et al (2024) The limits of fair medical imaging AI in real-world generalization. Nat Med 30:2838–2848

  39. He D, Wang R, Xu Z et al (2024) The use of artificial intelligence in the treatment of rare diseases: a scoping review. Intractable Rare Dis Res 13:12–22

  40. Echeverry-Quiceno LM, Candelo E, Gómez E et al (2023) Population-specific facial traits and diagnosis accuracy of genetic and rare diseases in an admixed Colombian population. Sci Rep 13:6869

  41. Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C (2020) Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. Proc ACM Conf Health Inference Learn 2020:151–159

  42. van Kolfschooten H, van Oirschot J (2024) The EU Artificial Intelligence Act (2024): implications for healthcare. Health Policy 149:105152

  43. Lind Plesner L, Müller FC, Brejnebøl MW et al (2023) Commercially available chest radiograph AI tools for detecting airspace disease, pneumothorax, and pleural effusion. Radiology 308:e231236

  44. Niehoff JH, Kalaitzidis J, Kroeger JR et al (2023) Evaluation of the clinical performance of an AI-based application for the automated analysis of chest X-rays. Sci Rep 13:3680

  45. Roschewitz M, Khara G, Yearsley J et al (2023) Automatic correction of performance drift under acquisition shift in medical image classification. Nat Commun 14:6608

  46. Sahiner B, Chen W, Samala RK, Petrick N (2023) Data drift in medical machine learning: implications and potential remedies. Br J Radiol 96:20220878

  47. Moor M, Banerjee O, Abad ZSH et al (2023) Foundation models for generalist medical artificial intelligence. Nature 616:259–265

  48. Wang J, Wang K, Yu Y et al (2024) Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat Med. https://doi.org/10.1038/s41591-024-03359-y

  49. Alowais SA, Alghamdi SS, Alsuhebany N et al (2023) Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 23:689

  50. Etchemendy J (2021) Introducing the Center for Research on Foundation Models (CRFM). In: Stanford HAI. Available via https://hai.stanford.edu/news/introducing-center-research-foundation-models-crfm. Accessed 24 Dec 2024

  51. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, pp 6000–6010

  52. Bai Y, Jones A, Ndousse K et al (2022) Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2204.05862

  53. Zhou Y, Chia MA, Wagner SK et al (2023) A foundation model for generalizable disease detection from retinal images. Nature 622:156–163

  54. Lee Y, Ferber D, Rood JE et al (2024) How AI agents will change cancer research and oncology. Nat Cancer 5:1765–1767

  55. Gao S, Fang A, Huang Y et al (2024) Empowering biomedical discovery with AI agents. Cell 187:6125–6151

  56. Moritz M, Topol E, Rajpurkar P (2025) Coordinated AI agents for advancing healthcare. Nat Biomed Eng 9:432–438

  57. Zou J, Topol EJ (2025) The rise of agentic AI teammates in medicine. Lancet 405:457

  58. Tiu E, Talius E, Patel P et al (2022) Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng 6:1399–1406

  59. Wu C, Zhang X, Zhang Y et al (2023) Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Preprint at https://doi.org/10.48550/arXiv.2308.02463

  60. Zhou L, Liu H, Bae J et al (2023) Self pre-training with masked autoencoders for medical image classification and segmentation. In: 2023 IEEE 20th international symposium on biomedical imaging (ISBI). IEEE, pp 1–6

  61. Huang S-C, Pareek A, Jensen M et al (2023) Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digit Med 6:74

  62. Wolf D, Payer T, Lisson CS et al (2023) Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging. Sci Rep 13:20260

  63. Lin M, Li T, Sun Z et al (2024) Improving fairness of automated chest radiograph diagnosis by contrastive learning. Radiol Artif Intell 6:e230342

  64. Almeida J, Castro Verde AS, Gaivão A et al (2024) Self-supervised learning for volumetric imaging: a prostate cancer biparametric magnetic resonance imaging case study. Preprint at Social Science Research Network (SSRN)

  65. Ma J, He Y, Li F et al (2024) Segment anything in medical images. Nat Commun 15:654

  66. Kirillov A, Mintun E, Ravi N et al (2023) Segment anything. Preprint at https://doi.org/10.48550/arXiv.2304.02643

  67. Azizi S, Culp L, Freyberg J et al (2023) Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat Biomed Eng 7:756–779

  68. Zhang L, Liu M, Wang L et al (2024) Constructing a large language model to generate impressions from findings in radiology reports. Radiology 312:e240885

  69. Wu W, Li M, Wu J et al (2023) Learning to generate radiology findings from impressions based on large language model. In: 2023 IEEE international conference on big data (BigData). IEEE, pp 2550–2554

  70. Serapio A, Chaudhari G, Savage C et al (2024) An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study. BMC Med Imaging 24:254

  71. Ayed B (2024) Few-shot adaptation of medical vision-language models. In: MICCAI 2024—open access. Available via https://papers.miccai.org/miccai-2024/328-Paper2320.html. Accessed 13 May 2025

  72. Mahapatra D, Bozorgtabar B, Ge Z (2021) Medical image classification using generalized zero shot learning. In: 2021 IEEE/CVF international conference on computer vision workshops (ICCVW). IEEE, pp 3337–3346

  73. Jang J, Kyung D, Kim SH et al (2024) Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders. Sci Rep 14:23199

  74. Giuffrè M, You K, Shung DL (2024) Evaluating ChatGPT in medical contexts: the imperative to guard against hallucinations and partial accuracies. Clin Gastroenterol Hepatol 22:1145–1146

  75. Gilbert S, Harvey H, Melvin T et al (2023) Large language model AI chatbots require approval as medical devices. Nat Med 29:2396–2398

  76. Temperley HC, O’Sullivan NJ, Mac Curtain BM et al (2024) Current applications and future potential of ChatGPT in radiology: a systematic review. J Med Imaging Radiat Oncol 68:257–264

  77. French RM (1999) Catastrophic forgetting in connectionist networks. Trends Cogn Sci 3:128–135

  78. He L, Xia M, Henderson P (2024) What is in your safe data? Identifying benign data that breaks safety. In: First Conference on Language Modeling. https://openreview.net/forum?id=Hi8jKh4HE9

  79. Qi X, Zeng Y, Xie T et al (2024) Fine-tuning aligned language models compromises safety, even when users do not intend to! In: The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria. OpenReview.net. https://openreview.net/forum?id=hTEGyKf0dZ

  80. Soutif A, Magistri S, van de Weijer J, Bagdanov AD (2025) An empirical analysis of forgetting in pre-trained models with incremental low-rank updates. In: Conference on Lifelong Learning Agents. PMLR, pp 996–1012

  81. Glocker B, Jones C, Roschewitz M, Winzeck S (2023) Risk of bias in chest radiography deep learning foundation models. Radiol Artif Intell 5:e230060

  82. Khan MO, Afzal MM, Mirza S, Fang Y (2023) How fair are medical imaging foundation models? PMLR 225:217–231

  83. Xiong Z, Wang X, Zhou Y et al (2025) How generalizable are foundation models when applied to different demographic groups and settings? NEJM AI. https://doi.org/10.1056/aics2400497

  84. ImageNet. Available via https://www.image-net.org/. Accessed 13 May 2025

  85. Simonite T (2018) Your Instagram #dogs and #cats are training Facebook’s AI. In: WIRED. https://www.wired.com/story/your-instagram-dogs-and-cats-are-training-facebooks-ai/

  86. Johnson AEW, Pollard TJ, Berkowitz SJ et al (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. https://doi.org/10.1038/s41597-019-0322-0

  87. Ye J, Cheng J, Chen J et al (2023) SA-Med2D-20M dataset: segment anything in 2D medical imaging with 20 million masks. Preprint at https://doi.org/10.48550/arXiv.2311.11969

  88. Cheng J, Fu B, Ye J et al (2025) Interactive medical image segmentation: a benchmark dataset and baseline. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 20841–20851

  89. Obuchowski A (2024) Universal medical image encoder. TheLionAI V2 (blog). https://www.thelion.ai/post/universal-medical-image-encoder

  90. Hamamci IE, Er S, Wang C et al (2024) Developing generalist foundation models from a multimodal dataset for 3D computed tomography. Preprint at https://doi.org/10.48550/arXiv.2403.17834

  91. Littlejohns TJ, Holliday J, Gibson LM et al (2020) The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat Commun 11:2624

  92. LaBella D, Schumacher K, Mix M et al (2024) Brain tumor segmentation (BraTS) challenge 2024: meningioma radiotherapy planning automated segmentation. Preprint at https://doi.org/10.48550/arXiv.2405.18383

  93. Wasserthal J, Breit H-C, Meyer MT et al (2023) TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiol Artif Intell 5:e230024

  94. Akinci D’Antonoli T, Berger LK, Indrakanti AK et al (2025) TotalSegmentator MRI: robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 314:e241613

  95. Shwartzman O, Gazit H, Ben-Aryeh G et al (2025) The worrisome impact of an inter-rater bias on neural network training. In: Lecture notes in electrical engineering. Springer Nature Singapore, Singapore, pp 463–473

  96. Kang C, Lee C, Song H et al (2023) Variability matters: evaluating inter-rater variability in histopathology for robust cell detection. In: Lecture notes in computer science. Springer Nature Switzerland, Cham, pp 552–565

  97. Gilson A, Safranek CW, Huang T et al (2023) How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 9:e45312

  98. Shieh A, Tran B, He G et al (2024) Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep 14:9330

  99. Newton PM, Summers CJ, Zaheer U et al (2025) Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions. Med Sci Educ 35:721–729

  100. Sonoda Y, Kurokawa R, Nakamura Y et al (2024) Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn J Radiol 42:1231–1235

  101. Chen Z, Chambara N, Wu C et al (2024) Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine. https://doi.org/10.1007/s12020-024-04066-x

  102. Ayoub NF, Balakrishnan K, Ayoub MS et al (2024) Inherent bias in large language models: a random sampling analysis. Mayo Clinic Proc Digit Health 2:186–191

  103. Shah SV (2024) Accuracy, consistency, and hallucination of large language models when analyzing unstructured clinical notes in electronic medical records. JAMA Netw Open 7:e2425953

  104. Kotek H, Dockum R, Sun D (2023) Gender bias and stereotypes in large language models. In: Proceedings of The ACM Collective Intelligence Conference (CI '23). Association for Computing Machinery, New York, NY, USA, pp 12–24. https://doi.org/10.1145/3582269.3615599

  105. Tjuatja L, Chen V, Wu T et al (2024) Do LLMs exhibit human-like response biases? A case study in survey design. Trans Assoc Comput Linguist 12:1011–1026

  106. Sun Z, Ong H, Kennedy P et al (2023) Evaluating GPT4 on impressions generation in radiology reports. Radiology 307:e231259

  107. Ziegelmayer S, Marka AW, Lenhart N et al (2023) Evaluation of GPT-4’s chest X-ray impression generation: a reader study on performance and perception. J Med Internet Res 25:e50865

  108. Tang L, Sun Z, Idnay B et al (2023) Evaluating large language models on medical evidence summarization. NPJ Digit Med 6:158

  109. EUCAIM (2023) Home. In: Cancer Image Europe. Available via https://cancerimage.eu/. Accessed 24 Oct 2024

  110. Cossío F, Schurz H, Engström M et al (2023) VAI-B: a multicenter platform for the external validation of artificial intelligence algorithms in breast imaging. J Med Imaging 10:061404

  111. Bujotzek MR, Akünal Ü, Denner S et al (2025) Real-world federated learning in radiology: hurdles to overcome and benefits to gain. J Am Med Inform Assoc 32:193–205

  112. Ouyang L, Wu J, Jiang X et al (2022) Training language models to follow instructions with human feedback. Preprint at https://doi.org/10.48550/arXiv.2203.02155

  113. Zhu F, Ma S, Cheng Z et al (2024) Open-world machine learning: a review and new outlooks. Preprint at https://doi.org/10.48550/arXiv.2403.01759

  114. Turner M, Ive J, Velupillai S (2021) Linguistic uncertainty in clinical NLP: a taxonomy, dataset and approach. In: Lecture notes in computer science. Springer, Cham, pp 129–141

  115. Ulmer D, Gubri M, Lee H, Yun S, Oh S (2024) Calibrating large language models using their generations only. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, pp 15440–15459

  116. Testoni A, Fernández R (2024) Asking the right question at the right time: human and model uncertainty guidance to ask clarification questions. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), Volume 1: Long Papers. Association for Computational Linguistics, St. Julian’s, Malta, pp 258–275

  117. Edwards L (2022) Expert explainer: the EU AI Act proposal. Available via https://www.adalovelaceinstitute.org/resource/eu-ai-act-explainer/. Accessed 9 Apr 2024

  118. Touvron H, Lavril T, Izacard G et al (2023) LLaMA: open and efficient foundation language models. Preprint at https://doi.org/10.48550/arXiv.2302.13971

  119. Biderman S, Schoelkopf H, Anthony QG et al (2023) Pythia: a suite for analyzing large language models across training and scaling. In: International Conference on Machine Learning, PMLR, pp 2397–2430

  120. Zhao H, Chen H, Yang F et al (2024) Explainability for large language models: a survey. ACM Trans Intell Syst Technol 15:1–38

  121. Isensee F, Rokuss M, Krämer L et al (2025) nnInteractive: redefining 3D promptable segmentation. Preprint at https://doi.org/10.48550/arXiv.2503.08373

  122. Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  123. Zixuan G, Hu X, Tang H, Liu Y (2025) Towards auto-regressive next-token prediction: in-context learning emerges from generalization. In: The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore. OpenReview.net. https://openreview.net/forum?id=gK1rl98VRp

  124. DeepSeek-AI, Bi X, Chen D et al (2024) DeepSeek LLM: scaling open-source language models with longtermism. Preprint at https://doi.org/10.48550/arXiv.2401.02954

  125. Gemma Team, Kamath A, Ferret J et al (2025) Gemma 3 technical report. Preprint at https://doi.org/10.48550/arXiv.2503.19786

  126. Llama 3.2: revolutionizing edge AI and vision with open, customizable models. In: Meta AI. Available via https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed 13 May 2025

  127. Ollama. Available via https://ollama.com/. Accessed 13 May 2025

  128. Gerganov G, ggml-org community (2023) llama.cpp [Software]. GitHub. https://github.com/ggml-org/llama.cpp

  129. Wolf T, Debut L, Sanh V et al (2019) HuggingFace’s Transformers: state-of-the-art natural language processing. Preprint at https://doi.org/10.48550/arXiv.1910.03771

  130. Deng J, Dong W, Socher R et al (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255

  131. Nguyen HQ, Lam K, Le LT et al (2022) VinDr-CXR: an open dataset of chest X-rays with radiologist’s annotations. Sci Data 9:429

  132. Pham HH, Nguyen NH, Tran TT et al (2023) PediCXR: an open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children. Sci Data 10:240

  133. Koitka S, Baldini G, Kroll L et al (2024) SAROS: a dataset for whole-body region and organ segmentation in CT imaging. Sci Data 11:483

  134. Radiological Society of North America (RSNA) (2022) RadLex radiology lexicon (version 1.0.2). Available via http://radlex.org

Funding

J.G.d.A. is funded by the Horizon Health grant (grant ID: 952159).

Author information

Authors and Affiliations

Authors

Contributions

J.G.d.A. wrote the main manuscript. The remaining authors participated in discussions about the content and provided edits to the manuscript.

Corresponding author

Correspondence to José Guilherme de Almeida.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Cite this article

de Almeida, J.G., Alberich, L.C., Tsakou, G. et al. Foundation models for radiology—the position of the AI for Health Imaging (AI4HI) network. Insights Imaging 16, 168 (2025). https://doi.org/10.1186/s13244-025-02056-9
