Abstract
Automated radiology report generation has the potential to improve patient care and reduce the workload of radiologists. However, the path toward real-world adoption has been stymied by the challenge of evaluating the clinical quality of artificial intelligence (AI)-generated reports. We build a state-of-the-art report generation system for chest radiographs, called Flamingo-CXR, and perform an expert evaluation of AI-generated reports by engaging a panel of board-certified radiologists. We observe a wide distribution of preferences across the panel and across clinical settings, with 56.1% of Flamingo-CXR intensive care reports evaluated to be preferable or equivalent to clinician reports, by half or more of the panel, rising to 77.7% for in/outpatient X-rays overall and to 94% for the subset of cases with no pertinent abnormal findings. Errors were observed in human-written reports and Flamingo-CXR reports, with 24.8% of in/outpatient cases containing clinically significant errors in both report types, 22.8% in Flamingo-CXR reports only and 14.0% in human reports only. For reports that contain errors we develop an assistive setting, a demonstration of clinician–AI collaboration for radiology report composition, indicating new possibilities for potential clinical utility.
Main
Radiology plays an integral and increasingly important role in modern medicine, by informing diagnosis, treatment and management of patients through medical imaging. However, the current global shortage of radiologists restricts access to expert care and causes heavy workloads for radiologists, resulting in undesirable delays and errors in clinical decisions1,2. In the past decade, we have witnessed the remarkable promise of AI algorithms as assistive technology for improving the access, efficiency and quality of radiological care, with more than 200 US Food and Drug Administration approved commercial products developed by companies based in more than 20 countries3 and approximately one in every three radiologists in the United States already benefiting from AI as part of their clinical workflow4.
The vast majority of these approved AI applications, however, focus only on the classification and quantification of very specific pathologies5. In practice, clinical radiology is much more than an accumulation of such narrow interpretive tasks, because findings must be communicated with appropriate nuance, synthesized in a broader clinical context and combined with overall impressions and recommendations that are useful for patient care. Radiologist experts use natural language to communicate this synthesis of the imaging findings alongside their overall impression and recommendations in the form of written reports. The recent progress in AI for modeling vision and language data simultaneously6,7,8,9, coupled with the growing availability of digitized multimodal radiology data, has enabled the possibility of developing an automatic report generation system that is capable of producing a complete free-text description of the medical image10,11,12,13,14. Framing report generation as the north star for a useful radiology AI system is more closely aligned to current radiologist practice and patient care, and allows for a more fine-grained and diverse description of the relevant findings that can be tailored to the needs of a given clinical scenario, including aspects such as location, size and severity, ambiguity, relation to clinical context of specific pathologies or their impact on onward care and more15.
Despite the increasing number of publications on AI-based report generation and its potential in improving the radiology workflow, automated report generation has not yet been widely adopted in real practice5. Several unmet needs represent key barriers to automated reporting achieving real-world impact. One notable obstacle is the difficulty of meaningfully evaluating the clinical quality of generated reports. The high degree of freedom in free-form reports introduces a wide range of possible errors to measure and classify. Exacerbating this, the desirable contents of a report differ between clinical settings (for example, an emergency setting versus a medical check-up), geographic regions16 and preferred approaches to standardization17. Previous works have approached this challenge by proposing automated metrics for evaluating the clinical quality of generated reports18,19,20,21 but many limitations remain. First, there has been a paucity of comprehensive evaluation of automated reports against reports produced by human experts (certified radiologists), which are known themselves to have variable style and quality. Despite impressive progress in automated metrics for report quality, only one study22 has directly assessed whether AI-generated reports were considered preferable to those by human experts, whereas others23 have evaluated their utility in practice in a specific clinical setting only. Furthermore, the reasons given for preference choices have not been explored sufficiently. Second, previous work has only evaluated AI-generated reports as stand-alone artifacts, meaning the utility of these systems as assistive tools remains unknown. Evaluation in clinician–AI collaboration scenarios is arguably more realistic, given that most AI tools approved for clinical decision-making have been developed for an assistive rather than autonomous role in care delivery24,25.
In addition to the above evaluation challenges, there remains considerable headroom for improvement in the clinical accuracy of existing AI report generation models21. Recent breakthroughs in multimodal foundation models9,21 have demonstrated that AI systems trained on a vast quantity of unlabeled data can be adapted and achieve state-of-the-art accuracy in a wide range of downstream specialized tasks, including biomedical problems26. However, most existing report generation models10,11,12,13 are built from scratch, neglecting the likely useful transfer of knowledge from such pretrained models. By leveraging advances accrued through large-scale pretraining of vision–language models and tailoring them to a specific medical task, there is an opportunity to build an even more powerful report generation system.
In this work, we directly address these key unmet needs for AI report generation. We present Flamingo-CXR, a system for AI report generation predicated on a recent vision–language foundation model that achieves state-of-the-art performance in multiple automated metrics8. We evaluate Flamingo-CXR on historical, deidentified datasets across a diversity of clinical and geographic settings—both intensive care in the United States and in/outpatient care delivery in India—and move beyond automated metrics to a detailed human evaluation of the generated reports by a pool of 27 radiologists, including a direct comparison of clinicians’ preferences for AI reports versus human reports. Furthermore, we evaluate the system in an autonomous as well as an assistive context. Figure 1 shows an overview of the proposed evaluation framework.
a, To compare radiology reports generated by our AI model with reports written by human experts, we devise two evaluation schemes: (1) a pairwise preference test in which a certified expert is given two reports without knowing the source of the report (one report from our model and the original report from a radiologist) and they are asked to choose which report should be ‘used downstream for the care of this patient’; and (2) an error correction task in which a single report (either AI-generated or the original one) is evaluated carefully and edited if required. The expert is also asked to give the reason for each correction and to indicate whether the error is clinically significant or not. b, We measure the utility of the AI-based report generation system in an assistive scenario in which the AI model first generates a report and the human expert revises as needed. For this task, we repeat the same pairwise preference test as before but this time the expert is asked to compare an AI-generated report corrected with human edits against a report written by human alone. We perform this evaluation on two datasets, one acquired in outpatient care delivery in India and another from intensive care in the United States. Board-certified radiologists are recruited in both countries to study the regional inter-rater variation.
Our contributions richly characterize the wide spectrum of agreement and disagreement that exists between clinical experts, both among themselves and with Flamingo-CXR. Where disparity exists, we have taken this as an opportunity to develop a collaborative assistive setting, with Flamingo-CXR and clinicians working together to improve clinical accuracy.
Results
The Flamingo-CXR report generation model is developed by fine-tuning the Flamingo vision–language foundation model8 on the task of generating a radiology report for a chest X-ray (CXR), using training data from two large deidentified datasets of CXR images and the corresponding radiology reports: (1) the MIMIC-CXR dataset27, which is the largest public CXR dataset, acquired from a US emergency department, and (2) the IND1 dataset28, obtained from in/outpatient settings across India (see Methods and Extended Data Table 1 for further details of model training). To measure the quality of reports generated by our model, we conduct an expert radiologist evaluation of the generated reports, and we also use a set of report generation metrics, including two widely used clinical metrics, (1) the CheXpert F1 score and (2) the RadGraph F1 score, which measure the similarity between generated reports and original reports, as well as a set of widely adopted natural language generation (NLG) metrics (see Methods for further details of these metrics).
Automated report generation metrics
We find that Flamingo-CXR achieves a CheXpert F1 score of 0.519 and a RadGraph F1 score of 0.205 on the MIMIC-CXR dataset (Table 1). Among the methods capable of generating both the ‘findings’ and ‘impression’ sections, Flamingo-CXR outperforms the current state-of-the-art (SoTA) methods by a large margin, attaining a 33% improvement relative to 0.389 as measured by the CheXpert F1 score (R2GenGPT29) and a 33% improvement from 0.154 as measured by the RadGraph F1 score (CvT-21DistillGPT2 (ref. 13)) (see Methods for further details). For the sake of completeness, we also list CheXpert F1 scores and RadGraph F1 scores for models that only generate the ‘findings’ sections of reports. Even though our model is evaluated across a longer portion of text, the overall F1 scores are still competitive, with a CheXpert F1 score that is 1% greater than the current SoTA method (Med-PaLM-M22, 12B), even though the latter was evaluated on the ‘findings’ section alone. In terms of the NLG metrics (CIDEr, BLEU4 and Rouge), the results are mixed; we achieve competitive BLEU4 and Rouge scores while attaining a lower CIDEr score (Extended Data Table 2). This is consistent with the established observation that NLG metrics do not reflect the clinical accuracy of the generated reports18,21,30, for which our model, in particular, confers an improvement over the relevant previous methods.
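As a quick sanity check on the relative improvements quoted above, a minimal calculation using the Table 1 values:

```python
# Relative improvements of Flamingo-CXR over the prior best scores quoted above.
chexpert_gain = (0.519 - 0.389) / 0.389   # ~0.334, i.e. ~33% (CheXpert F1)
radgraph_gain = (0.205 - 0.154) / 0.154   # ~0.331, i.e. ~33% (RadGraph F1)
print(f"{chexpert_gain:.1%}, {radgraph_gain:.1%}")
```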
Disease classification in comparison with human radiologists
For the IND1 dataset, Fig. 2a shows that the generated reports of our model are overall as accurate (in terms of the microaveraged F1 score) as one of the two radiologists in describing six clinical conditions in chest radiographs (namely, cardiomegaly, pleural effusion, lung opacity, edema, enlarged cardiomediastinum and fracture). For conditions that are frequent in the training dataset, such as cardiomegaly and pleural effusion, we attain comparable or even superior agreement with the expert labels (as measured by Kendall’s tau coefficients) with respect to the two held-out radiologists (Fig. 2b). On the other hand, for under-represented conditions such as edema and enlarged cardiomediastinum, with extremely low prevalence rates (0.19% and 0.15%, respectively), the agreement scores of our model are lower than those of the two radiologists. The receiver operating characteristic (ROC) curves for the individual conditions (Extended Data Fig. 2) exhibit patterns consistent with this variation in accuracy across conditions of different prevalence (see Methods for further details).
a, The ROC curve of the Flamingo-CXR report generation model with stochastic generation method (Nucleus) and corresponding area under the curve (AUC), shown along with the sensitivity and 1 − specificity pairs for two certified radiologists. The operating point of our model with the default deterministic inference scheme (Beam 3) is also shown. Details of the two inference algorithms are available in the Methods. The curve and the metrics are microaveraged across six conditions (cardiomegaly, pleural effusion, lung opacity, edema, enlarged cardiomediastinum and fracture) for which the labels were collected (n = 7,995 is the total number of IND1 test set reports). The GT labels are defined as the majority vote among the 5 labels obtained from the pool of 18 certified radiologists. Error bars represent 95% confidence intervals (calculated using bootstrapping with 1,000 repetitions). b, Kendall’s tau coefficients with respect to the expert labels are shown for the two held-out radiologists as well as for two inference schemes of our Flamingo-CXR model. We use the ‘soft’ labels derived by averaging over the available annotations instead of the majority vote labels as the target for computing the metric. On the vertical axis, the prevalence rates (PRs) of the respective conditions in the training set and their sample size in the test set are also shown. The target labels are the probabilities over the presence of the respective conditions calculated by averaging the binary condition labels from the expert pool.
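To make the soft-label comparison concrete, the following is a minimal sketch of how per-condition agreement with the expert pool could be computed; the data layout, variable names and toy values are assumptions for illustration, not the study's analysis code.

```python
# Sketch: Kendall's tau agreement against 'soft' expert labels (illustrative).
import numpy as np
from scipy.stats import kendalltau

def soft_labels(annotations):
    """Average the available binary expert annotations (0/1) for each case."""
    return np.array([np.mean(case) for case in annotations])

def condition_agreement(predictions, annotations):
    """Kendall's tau between per-case predictions and soft expert labels."""
    tau, _ = kendalltau(predictions, soft_labels(annotations))
    return tau

# Toy example: five cases, each with a variable number of expert votes for one condition.
votes = [[1, 1, 0, 1, 1], [0, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0, 1], [1, 1, 1]]
model_probs = [0.9, 0.1, 0.7, 0.2, 0.8]  # e.g. fraction of sampled reports flagging the condition
print(condition_agreement(model_probs, votes))
```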
Expert evaluation of AI-generated and human-written reports
To achieve a more fine-grained and realistic assessment of the clinical quality of radiology reports generated by our model, we conduct an expert evaluation for reports in both the MIMIC-CXR and IND1 datasets. We recruit a group of 11 radiologists in the United States and 16 in India with board certification to perform two complementary evaluation tasks, namely (1) a pairwise preference test and (2) an error correction task (Fig. 1a and Extended Data Fig. 1; see Methods for further details).
Pairwise preference test
In this evaluation task, radiologists are provided with (1) a frontal view of a CXR image, (2) a radiology report generated by our AI system and (3) the original report written by a radiologist. They are asked to describe their preference from three options: report A, report B or equivalence between the two (that is, ‘neither is better than the other’). Furthermore, they are asked to provide a justification for their preference, in free-form text, to better understand strengths and limitations of both reports (Extended Data Fig. 1a).
Across both datasets, generated reports from Flamingo-CXR were often considered preferable or equivalent to the ground truth (GT) report (Fig. 3 and Extended Data Fig. 3). For instance, in 77.7% of IND1 cases (and 56.1% of MIMIC-CXR cases), Flamingo-CXR reports were rated as equivalent or preferred relative to the original clinician report by at least half of the radiologists in our panel (Fig. 3a). Furthermore, in 94% of normal IND1 cases, Flamingo-CXR reports were rated as equivalent or preferred relative to the original clinician report by at least half of the radiologists in our panel (Fig. 3c). For this normal in/outpatient setting, more raters gave an equivalence rating than a preference rating for Flamingo-CXR reports (Extended Data Fig. 3), which is expected, given that normal in/outpatient reports have a relatively stereotypical structure that makes it difficult to discern differences between high-quality reports. In other settings, the majority of raters indicate a preference for Flamingo-CXR reports ahead of equivalence with original reports. Although these are strong results, it is clear that MIMIC-CXR reports are more challenging to model, which is not entirely surprising given that the MIMIC-CXR training dataset is smaller and contains a greater diversity of reports than the in/outpatient IND1 dataset. To better understand the inter-rater diversity, we grouped all of our preference results according to the level of agreement between raters, from unanimity and majority to minority. This analysis reveals substantial disagreement among raters, who only reach unanimity (for Flamingo-CXR reports or GT reports) in their preferences in 27.4% of MIMIC-CXR cases and 44% of IND1 cases. Across rater locations (India and the United States), the distribution of inter-rater variability is reasonably consistent (Fig. 3b). The strongest agreement is observed for normal IND1 cases, where raters reach agreement in 76% of cases (with agreement in favor of the GT report in only 1%). By reporting progressive degrees of agreement and disagreement, our results can be interpreted relative to the desiderata of specific application scenarios, which may require greater or lesser degrees of agreement.
a, Preferences for Flamingo-CXR reports relative to original clinician reports. Reports are grouped according to the level of agreement between reviewers. b, Clinician preferences for Flamingo-CXR reports depending on the location of the clinician, from either the US-based cohort or the India-based cohort. Note that there are two reviews from each location cohort, so in this case, unanimity corresponds to agreement between two clinicians rather than four in the full panel. c, Preferences for normal reports and separately, for abnormal reports. In all panels, data are presented as mean values and error bars show 95% confidence intervals for the cumulative preference scores. d, Examples from MIMIC-CXR with varying degrees of inter-rater preference agreement; for two examples, all four radiologists unanimously preferred the AI report or the clinician’s report, whereas for the remaining one, the preferences were divided equally. AP, anterior–posterior; CABG, coronary artery bypass graft; IJ, internal jugular; PA-C, physician assistant - certified; SVC, superior vena cava.
Last, in Fig. 3d, we provide a comparison of representative examples of AI-generated and human-written reports with varying degrees of inter-rater preference agreement. We also share the corresponding preference reasons from the respective raters. The top example shows a case for which the Flamingo-CXR report was preferred or rated as equivalent to the original clinician’s report by all four radiologists on the panel. In this example, the raters explained that the Flamingo-CXR report correctly ruled out the ‘retrocardiac opacity’ originally noted, and also cautioned that the original report may have over-diagnosed ‘left lower lobe pneumonia/aspiration’, recommending a repeat radiograph if clinically warranted (which is consistent with the conditional request for a repeat radiograph in the Flamingo-CXR report). We also give an example of a report in which all four radiologists prefer the clinician report and another where the panel is split 50:50.
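To illustrate how the agreement-level grouping above could be computed from the raw panel votes, here is a minimal sketch; the vote encoding and bucket names are assumptions made for illustration, not the study's analysis code.

```python
# Sketch: bucket the four panel votes per case by level of agreement (illustrative).
from collections import Counter

def agreement_bucket(votes):
    """votes: list of four panel ratings, each 'ai', 'gt' or 'equivalent'."""
    ai_or_equiv = sum(v in ('ai', 'equivalent') for v in votes)
    if ai_or_equiv == len(votes):
        return 'unanimous: AI preferred or equivalent'
    if ai_or_equiv == 0:
        return 'unanimous: GT preferred'
    if ai_or_equiv >= len(votes) / 2:
        return 'at least half: AI preferred or equivalent'
    return 'majority: GT preferred'

cases = [
    ['ai', 'equivalent', 'ai', 'ai'],
    ['gt', 'gt', 'ai', 'equivalent'],
    ['gt', 'gt', 'gt', 'gt'],
]
print(Counter(agreement_bucket(v) for v in cases))
```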
Error correction
In the error correction evaluation, the expert raters are provided with (1) the CXR image (a frontal view), and (2) a radiology report for this image, consisting of the findings and impression sections. Their task is to assess the accuracy of the given radiology report by identifying errors in the report and providing suggested replacements (Extended Data Fig. 1b).
Our results show that a non-negligible percentage (>10%) of the GT reports contain clinically significant disagreements for both the MIMIC-CXR and IND1 datasets (upper row in Fig. 4a). The frequency of disagreement also differs considerably between the two rater locations; Fig. 4b shows that the US-based radiologists disagree with the GT reports more often than the India-based radiologists. Last, we also observe from Fig. 4c that the GT reports for abnormal cases contain errors more often than those for normal cases, likely because of the higher variability and complexity of report contents.
a–c, The upper row shows the percentage of reports with at least one (clinically significant) error, and the bottom row shows the average number of identified (clinically significant) errors per report computed as the total number of detected errors divided by the number of all reports, including the ones without errors. These two metrics are compared across the IND1 and MIMIC-CXR datasets overall (a), the two rater locations (India and the United States) to illustrate the regional inter-rater variation (b) and the normal and abnormal cases in the respective datasets (c). Error statistics for GT reports and Flamingo-CXR reports are given for each setting and grouped together as indicated by dashed lines. Data are presented as mean values and error bars correspond to 95% confidence intervals across cases and expert assessments.
The relative frequency of errors between the AI system and the human experts varies across the two datasets. Figure 4a (lower row) shows that, for the IND1 dataset, the model makes fewer errors (0.31) on average than the human experts (0.39), although the frequency of clinically significant errors is marginally higher (0.23 versus 0.20). By contrast, for the MIMIC-CXR dataset, more disagreements on average were reported in the AI-generated reports than in the original reports, with 0.49 errors (0.28 clinically significant) versus 0.27 (0.14) per report. Further decomposing this comparison into the distinct rater locations in Fig. 4b reveals that the above patterns are largely preserved between the radiologists in the United States and those in India, but there remain a couple of noteworthy differences. The US-based raters reported considerably more disagreements on average than the India-based raters across the board, with particularly pronounced differences for the IND1 dataset (acquired in India). It is known that there is a wide variety of radiology reporting styles, ranging from semi-structured free-form reports (for example, the MIMIC-CXR reports) through to a more structured style (for example, the IND1 reports), and these stylistic differences reflect the preferences of the clinicians who write those reports, the stylistic preferences taught by their radiology trainers and their hospital and regional guidelines16,17. These regional variations in reporting style are likely to account in part for the observed regional variation in rater preferences. We also highlight that the raters in the two locations disagree on the relative frequency of clinically significant errors for the IND1 dataset; the India-based raters flagged fewer errors in AI-generated reports than in the GT reports, whereas the reverse trend was observed for the US-based raters. Finally, Fig. 4c compares the amount of disagreement between the abnormal and the normal cases. For the abnormal cases of IND1, marginally more clinically significant errors were reported in the human-written GT reports than in the AI-generated reports on average, and vice versa for the MIMIC-CXR dataset.
To compare the distributions of error types across datasets, we explore the disagreement reasons for the edits made in reports (Extended Data Figs. 4a and 5). For both the model-generated reports and the original ones, the most dominant category of errors across the two datasets is the ‘incorrect finding’ category, which is less specific than the other two categories (‘incorrect severity’ and ‘incorrect location’). For the abnormal cases in the MIMIC-CXR dataset, statements with incorrect severity are much more common than those with incorrect locations in the original reports, whereas both are comparably frequent in the AI-generated reports. For the AI-generated reports (with the corresponding values for the human-written GT reports in parentheses), 0.32 (0.14) errors on average correspond to incorrect findings, 0.11 (0.03) to incorrect location of the finding and 0.09 (0.08) to incorrect severity. For the IND1 abnormal cases, however, the second most common error type is related to incorrect severity for both the GT and AI reports. Overall, errors due to incorrect location of findings in the report (for example, opacity in the left versus right lung) are more prevalent for the MIMIC-CXR abnormal cases than for the abnormal cases in IND1.
Lastly, in Extended Data Fig. 4b we show the differences and intersection of cases with errors between the original reports and the ones generated by our model. It is worth noting that, for this analysis, we consider a clinically significant error to be present in a case if at least one of the four raters identified such an error in the corresponding report. Large proportions of the clinically significant errors are nonoverlapping (72.7% of MIMIC-CXR and 59.7% of IND1 cases with at least one clinically significant error), suggesting frequent inconsistency in detected issues between the AI-generated reports and the original ones. Notably, in 27.3% and 22.7% of such cases in the MIMIC-CXR and IND1 datasets, respectively, clinically significant errors were identified only in the human reports, but not in the corresponding AI-generated reports. Some examples are provided in Extended Data Table 3 illustrating the nuanced nature of these differences. By contrast, there are also a considerable number of instances in which the AI-generated reports contain clinically significant errors, but the original reports do not. Examples of such instances are provided in Extended Data Table 4; some of these errors pertain to the limited spatial reasoning and counting capabilities of visual–language models. The presence of such disparities suggests that there may be potential complementarity between the AI system and the human experts in composing accurate radiology reports, which motivates us to investigate the utility of Flamingo-CXR in a clinician–AI collaboration setting.
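The per-report error rates and the overlap analysis above amount to simple count and set operations; a minimal sketch, with assumed data structures and toy values, is given below.

```python
# Sketch: average errors per report and overlap of clinically significant errors
# between AI-generated and original (GT) reports. Illustrative only.

def mean_errors_per_report(error_counts):
    """Average number of flagged errors per report, counting error-free reports."""
    return sum(error_counts) / len(error_counts)

def error_overlap(ai_flagged, gt_flagged):
    """Partition case IDs by where clinically significant errors were flagged."""
    return {
        'both': ai_flagged & gt_flagged,
        'ai_only': ai_flagged - gt_flagged,
        'gt_only': gt_flagged - ai_flagged,
    }

# Toy example with case identifiers.
ai_cases = {'c1', 'c2', 'c5'}   # >=1 rater flagged a significant error in the AI report
gt_cases = {'c2', 'c3'}         # >=1 rater flagged a significant error in the GT report
print(mean_errors_per_report([0, 2, 1, 0, 1]))
print(error_overlap(ai_cases, gt_cases))
```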
Clinician–AI collaboration
In this section we explore collaboration between clinicians and Flamingo-CXR. For this collaboration, Flamingo-CXR produces a first draft report, and then a radiologist edits the report if necessary, by replacing sentences from the first draft with alternative sentences or by adding additional sentences to the report (Fig. 1b). The radiologists can make as many changes to the first draft report as they wish. We use the replacement sentences collected from the error correction task to produce these collaborative reports. To evaluate the quality of these clinician–AI reports, we ask our expert raters to indicate their preference for clinician–AI reports relative to the corresponding original clinician reports (Methods).
In Fig. 5d, we see an example of a clinician–AI report, in which a radiologist decided to replace sentences in the AI report that mentioned ‘pneumothorax’ with new sentences that mention hydropneumothorax instead. All four radiologists in our panel indicated that the clinician–AI report was preferable (or equivalent) to the original MIMIC-CXR clinician report, because the clinician–AI report was ‘more succinct’ and ‘covey’s the clinical findings better’ [sic] and because of the statements concerning ‘Right side pleural effusion and hydropneumothorax’. By contrast, for the AI report without edits, all four radiologists indicated a preference for the original clinician report because there was ‘no residual pneumothorax’ and because of the ‘More accurate lung findings’.
a, Preferences for reports produced from the clinician–AI collaboration relative to the original clinicians’ reports are shown here. The corresponding preference scores for reports produced by Flamingo-CXR without human collaboration are also given. Reports are grouped by the level of agreement between reviewers, and in all cases, we show results for the subset of reports that required editing during the error correction task. Data for all panels are presented as mean values and error bars show 95% confidence intervals for the cumulative preference scores. Significant differences (P < 0.05) between clinician–AI results and AI-only results calculated using a one-sided chi-squared test are indicated by an asterisk (with MIMIC-CXR P values given by *P = 1.3 × 10−2, **P = 5.7 × 10−4, ***P = 3.2 × 10−9; and IND1 P values given by *P = 1.2 × 10−7, **P = 4.4 × 10−9, ***P = 7.7 × 10−6). b, Preferences for reports produced from a collaboration between Flamingo-CXR and radiologists from our US-based cohort and separately, from our India-based cohort. c, Preferences for normal reports and separately, for abnormal reports. d, An example of a pairwise preference test for a clinician–AI report and an AI report, relative to the original clinician’s MIMIC-CXR report. All four radiologists initially indicated a preference for the original clinician’s report to the AI report. Another radiologist revised two sentences in the AI report (indicated in red), resulting in a complete flip in preference in which all four radiologists unanimously expressed the superiority (or equivalence) of the clinician–AI report.
For 53.6% of the MIMIC-CXR cases, we find that clinician–AI reports were rated as equivalent or preferred relative to the original clinician report by at least half of the radiologists in our panel (Fig. 5a). In comparison, for reports generated by Flamingo-CXR alone without collaboration, 44.4% of reports were rated as equivalent or preferred relative to the original clinician report by at least half of the radiologists in our panel. We observe similar findings for IND1, where the reports from the clinician–AI collaboration were rated as preferable or equivalent by half or more of the radiologists in 71.2% of cases, in comparison with 51.2% for reports generated by Flamingo-CXR alone. We also observe variation in the preference results between normal and abnormal reports, and between different cohorts of collaborating clinicians, most likely reflecting variations in stylistic preferences across regions (Fig. 5b and Extended Data Fig. 6).
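The comparison between clinician–AI and AI-only preference rates reported in Fig. 5a relies on a one-sided chi-squared test; a minimal sketch of one way such a test could be run is shown below. The counts are toy values, and the one-sided handling (halving the two-sided p-value of a 2 x 2 chi-squared test when the effect is in the hypothesized direction) is an assumption for illustration, not a description of the study's exact analysis.

```python
# Sketch: one-sided 2x2 chi-squared comparison of "preferred or equivalent by at
# least half the panel" rates, with versus without clinician edits (illustrative).
import numpy as np
from scipy.stats import chi2_contingency

def one_sided_chi2(pref_edited, n_edited, pref_ai_only, n_ai_only):
    table = np.array([
        [pref_edited, n_edited - pref_edited],
        [pref_ai_only, n_ai_only - pref_ai_only],
    ])
    chi2, p_two_sided, _, _ = chi2_contingency(table, correction=False)
    improved = pref_edited / n_edited > pref_ai_only / n_ai_only
    p_one_sided = p_two_sided / 2 if improved else 1 - p_two_sided / 2
    return chi2, p_one_sided

# Toy counts: 140/250 cases preferred-or-equivalent with edits versus 110/250 without.
print(one_sided_chi2(140, 250, 110, 250))
```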
Discussion
In this work, we present Flamingo-CXR, a state-of-the-art AI radiology report generation system for chest radiographs built by specializing a recent vision–language foundation model8 on this challenging task. Our model achieves competitive performance in multiple automated metrics in two clinical contexts and geographical locations, namely intensive care in the United States and in/outpatient care delivery in India. To gauge the clinical quality and potential real-world utility of our report generation system we perform the most comprehensive expert evaluation of AI-generated reports published to date, and compare these with human-written GT reports with a group of certified radiologists. This evaluation is performed both in an autonomous and an assistive AI context. In addition, nuanced feedback from clinicians provides insight into disparities and defines areas for future enhancement.
Previous work has repeatedly reported the shortcomings of automated ‘natural language generation’ metrics for assessing reports of radiology images21. However, the majority of published works on the development of AI systems for this task, including recent approaches with acclaimed state-of-the-art performance, solely report automated metrics, so their proximity to expert accuracy and their potential clinical utility remain unknown. Only a handful of previous works have attempted to evaluate AI systems with human experts. We go further in this work with a fine-grained exploration of the diversity and granularity of expert radiologist evaluations. For example, a similar evaluation schema for the same US dataset (MIMIC-CXR) was previously explored22, but it assumed that the GT report is correct, without evaluating the inter-rater variability inherent in chest radiograph interpretation31. In another recent study, AI-generated reports for in-house emergency chest radiographs were compared against experts, revealing that their quality, on average, was only marginally inferior to that of on-site radiologists and surpassed that of teleradiology reports23. However, both studies only evaluated the AI report generation model as a stand-alone system on a dataset acquired in an emergency department in the United States, whereas our study considers a more diverse setup that encompasses both autonomous and assistive scenarios for datasets from intensive care in the United States as well as in/outpatient care delivery in India, using evaluations from two distinct groups of clinicians, working in India and in the United States. Furthermore, our study enriches this evaluation by collecting granular information on error types (for example, distinguishing between incorrect findings, location and severity), and provides fine-grained insights into how the AI system differs from human experts, which were absent in previous works.
Human evaluation results shed more light on the aspects of our model’s report quality that might inform and enable applications of the technology in future clinical workflows. Notably, for the normal IND1 cases, the raters unanimously viewed the AI-generated reports as at least equivalent to the human reports in 75% of the cases. This strong performance on normal cases suggests potential clinical applicability of the report generation model in this subset of in/outpatient cases (for instance, taken alongside previous works that show AI systems to have strong accuracy in predicting whether CXRs are normal or abnormal28), allowing radiologist attention to be allocated to patients with abnormalities. However, we notice that there is considerable room for improvement for MIMIC-CXR, whose original reports are in general more detailed and less templated than those of IND1.
This inter-dataset discrepancy in report quality highlights the importance of evaluation in different clinical contexts and geographic regions, which was previously not considered. The desired contents of a report are ultimately contingent on the given clinical context, and assuming access to large quantities of training data from every plausible scenario is not realistic. Future work will consider reinforcing our system with the capability to follow user instructions28,32 so the users can control the outputs more flexibly through natural language and the capability to learn efficiently from a small quantity of data through techniques such as in-context learning33 or parameter-efficient optimization32.
The complexity of evaluating the quality of radiology reports is underscored by the observed high inter-rater variability, as evidenced by (1) the errors (including clinically significant ones) identified in the GT reports as part of the error correction task, and (2) the variability in both human evaluation tasks in terms of preferences and disagreements with report statements. For instance, there is unanimous agreement among our panel of raters in only 27.4% of MIMIC-CXR cases and 44% of IND1 cases. This indicates the importance of our approach of obtaining multiple readings per case, unlike previous works that have only evaluated each case once22.
In-depth analysis shows that both human and AI systems can make errors in different ways, hinting at potential complementary properties between the two. Manual inspection unveils some examples in which nuanced clinical errors were detected in the human reports, but not in the corresponding AI-generated reports and vice versa (Methods and Extended Data Tables 3 and 4). Finally, another difference between clinicians and our AI system is the input information at disposal when writing the reports. Integrating such extra information into our AI system will likely enhance the reporting accuracy15 but requires further study.
Moving beyond the autonomous setting, this work evaluates CXR report generation in an assistive setting. Our results indicate that AI-generated reports with expert revisions were reported to be preferable or equivalent to original clinician reports in 71.2% of IND1 cases in comparison with 51.2% of cases without expert revisions, and similarly, in 53.6% of MIMIC-CXR cases in comparison with 44.4% of cases without expert revisions, according to half or more of our raters. Our proof-of-concept evaluation exhibits the initial promise of AI report generation as an assistive system that augments the report writing process of radiologists.
These results are not without limitations. We have demonstrated the ability of Flamingo-CXR to generalize to previously unseen X-ray images from an intensive care setting (given by the standard MIMIC-CXR test set) and to an in/outpatient setting in India (given by the IND1 test set), but for other clinical settings that involve different types of data, such as CXRs with lateral views or other nonfrontal views, CXRs from multiple time points and CXRs containing out-of-distribution conditions that do not appear in the training data, we expect that additional training data will be required for further fine-tuning of our model. We also observe that the AI reports with human edits do not reach perfect preference or equivalence compared with the original reports. There are several possible reasons for this. First, there is a baseline level of inter-rater variability in both the preference decision and the error correction process. Second, the location of the clinician making edits in the assistive setting has an impact on the preference decisions, which may reflect a difference in stylistic preferences across regions. We also observed some variability in the quality of edits (for example, a whole sentence replaced with the single word ‘cardiomegaly’), which can render the resulting reports unnatural despite being clinically correct. Third, it is possible that a clinician working in collaboration with AI may produce a report that is less accurate than a clinician working alone. Indeed, this is a common phenomenon observed in multiple lines of work on CXR classification tasks, where collaboration often results in less accurate predictions3. Clinician–AI collaboration typically becomes unhelpful when the experts overly rely on the AI predictions34,35 or are unduly critical of them36. Development of strategies for identifying when to provide AI-generated reports is likely to be helpful for maximizing the benefits of AI assistance37. Fourth, although it is plausible that revising an AI-generated report may require less time than composing a report from scratch, this work does not assess this explicitly. Quantifying the time-saving aspect warrants another carefully designed human study focused on measuring the reporting time of human experts, which commonly varies between individuals and is influenced by a plethora of factors such as the clinical context, reporting style, expertise and complexity of cases. Finally, clinician–AI collaborations can take more complex forms than our design and should ultimately be bidirectional and interactive, much like an experienced colleague who answers the radiologist’s questions and provides high-quality feedback on their reports (for example, flagging potential errors and missing findings). Although we have witnessed initial signs of such possibilities in recent work on interactive, multimodal medical AI26,33,38, there remains a considerable amount of progress to be made toward building a clinically useful writing assistant for radiology.
Overall, our observation of a positive effect from clinician–AI teamwork is very encouraging, especially given the limitations outlined above, the possibilities for future developments and the clinical relevance of this setting, in which most AI tools approved for clinical decision-making are deployed in an assistive rather than autonomous role24,25. Furthermore, our observation of strong baseline preference ratings for Flamingo-CXR reports without clinician assistance, especially for normal in/outpatient reports, is intriguing and may already raise the possibility of clinical applicability. Finally, by moving beyond automatic evaluation metrics and engaging expert clinicians for evaluations and error correction, across a diversity of regions, clinical settings and data types, we have been able to richly characterize the wide spectrum of agreement and disagreement that exists between clinical experts, both among themselves and with Flamingo-CXR, and where disparity prevails, we have embraced this as an opportunity for Flamingo-CXR and clinicians to work together in an assistive setting. Although there are immediate possibilities for enhancements and applications, Flamingo-CXR is intended as an experimental research-only model, and not as a tool for clinical deployment. However, we hope that this work will encourage and support the wider research community to further explore the full nuance, complexity and variability of the socio-technical landscape induced by the application of visual–language models in radiology report generation and beyond.
Methods
Ethical approval
The use of deidentified retrospective datasets was reviewed by Advarra IRB (Columbia, MD), which determined that it was exempt from further review under 45 CFR 46. The involvement of clinicians in this study, using the same deidentified retrospective data is also covered in this waiver.
Model
Our report generation model is built by fine-tuning a state-of-the-art vision–language foundation model, Flamingo8, which has attained impressive performance on data-efficient adaptation to new tasks. We fine-tune this model on the radiology report generation task, with an effective combination of regularization and adaptation techniques. Flamingo has a flexible transformer-based multimodal sequence-to-sequence architecture that can learn to integrate a mixture of medical images and reports with no model modifications.
Task
Our model is trained to generate both the ‘findings’ and ‘impression’ sections of the report for a frontal view (anterior–posterior or posterior–anterior) of the chest radiograph, which typically captures all the relevant observations the radiologist makes in a study. The model is not provided with additional projections, such as lateral views or prior views, other clinical history data or indication data. Flamingo-CXR only had access to the current radiograph at a lower resolution of 1 megapixel (in contrast with the original resolution of approximately 4 megapixels), whereas the original radiologists additionally had access to contextual information, patient history and previous scans. In the clinical setting, additional data, such as lateral views and prior views are often required, and we expect that fine-tuning our model with this data would enhance the capabilities of our model. However, recent studies do not use this additional data, so in our task formulation, we have also adopted this convention, which allows us to make a fair comparison with previously published benchmarks10,15,29,40.
Architecture
Flamingo is a general-purpose family of transformer-based visual–language models that take visual data as input (for example, images), interleaved with text, and produce free-form text as output. The key architectural components are (1) the language model that operates on the input text and generates the output text, (2) the vision encoder that maps visual data into the same representation space as the text input and (3) the connective module that integrates both modalities. The combination of the perceiver resampler41 and cross-attention layers in this connective component offers an expressive way for the language model to incorporate visual information for the next-token prediction task. There are multiple versions of Flamingo at different scales, and our report generation model, Flamingo-CXR, is built using a parsimonious 400 million parameter version. Flamingo models the likelihood of the radiology report y conditioned on the input image x in an auto-regressive fashion:

\(p\left(\,y\,|\,x\right)={\prod }_{\ell }p\left({y}_{\ell }\,|\,{y}_{ < \ell },x\right),\)

where \({y}_{\ell }\) is the \(\ell\)-th language token of the input report, \({y}_{ < \ell }\) is the set of preceding tokens and p is parameterized by the model.
Optimization
We take a version of Flamingo, pretrained on a large set of interleaved text–image data, and fine-tune it on the specific task of radiology report generation by minimizing a weighted sum of the expected negative log-likelihoods of the report given the chest radiograph over both the MIMIC-CXR (United States) and IND1 (India) datasets:

\({\mathscr{L}}=-{\lambda }_{{\rm{US}}}\,{{\mathbb{E}}}_{\left(x,y\right) \sim {{\mathscr{D}}}_{{\rm{US}}}}\left[w\left(x,y\right)\log p\left(\,y\,|\,x\right)\right]-{\lambda }_{{\rm{India}}}\,{{\mathbb{E}}}_{\left(x,y\right) \sim {{\mathscr{D}}}_{{\rm{India}}}}\left[w\left(x,y\right)\log p\left(\,y\,|\,x\right)\right],\)

where \({{\mathscr{D}}}_{{\rm{US}}}\) and \({{\mathscr{D}}}_{{\rm{India}}}\) denote the MIMIC-CXR and IND1 datasets, respectively, λUS and λIndia are data-specific coefficients that are tuned to maximize the benefits of jointly training on both datasets, and \(w\left(x,y\right)\) is a reweighting function that changes the amount of penalty depending on whether the example \(\left(x,y\right)\) contains any thoracic abnormalities. Specifically, we use importance weighting41 and define \(w\left(x,y\right)\) to output the inverse of the proportion of healthy cases in the corresponding dataset if the given example is normal, or otherwise the inverse of the proportion of abnormal cases. This ensures that the model is equally penalized for composing inaccurate reports across the healthy and the abnormal cases; this is particularly important for the IND1 dataset, in which healthy cases account for more than 90% of the training data. We set the weighting coefficients λUS = 1.0 and λIndia = 0.5.
To further enhance the reporting accuracy on abnormal cases, we augment the above training objective with an auxiliary classification loss for abnormality classification. To this end, we applied a published labeling software, CheXpert42 to extract the presence of multiple thoracic conditions from the training reports, derived binary abnormality labels (1 if any of the conditions is present or else 0), and used them to compute this auxiliary classification loss. We found the addition of this abnormality classification task to be helpful in improving the sensitivity of the generated reports across these conditions.
We optimize the parameters using AdamW43 with an initial learning rate of 10−3, β = [0.9, 0.999] and a batch size of 16 examples, and we train for 150,000 steps. These hyper-parameters were selected based on the overall microaveraged F1 score for detection of CheXpert conditions on the validation set, and the best checkpoint was selected based on the overall CIDEr-d score on the validation set. We freeze the language component and only update the parameters in the vision encoder and the connective component (perceiver resampler and cross-attention layers), because our initial experiments showed that updating the language part resulted in overfitting, whereas fine-tuning the rest of the architecture was important for adapting to the unfamiliar medical domain not represented in the pretraining datasets.
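To make the objective and weighting scheme above concrete, the following is a simplified, framework-agnostic sketch; the function names, the per-example inputs and the auxiliary-loss coefficient lambda_aux are assumptions for illustration, not the actual training code.

```python
# Sketch of the weighted fine-tuning objective described above (illustrative).

def example_weight(is_abnormal, abnormal_fraction):
    """Inverse-frequency weight: 1/P(abnormal) for abnormal cases, 1/P(normal) otherwise."""
    return 1.0 / abnormal_fraction if is_abnormal else 1.0 / (1.0 - abnormal_fraction)

def weighted_nll(nlls, abnormal_flags, abnormal_fraction):
    """Mean reweighted negative log-likelihood over one dataset's batch."""
    weights = [example_weight(a, abnormal_fraction) for a in abnormal_flags]
    return sum(w * l for w, l in zip(weights, nlls)) / len(nlls)

def total_loss(nll_us, abn_us, nll_india, abn_india, aux_cls_loss,
               frac_abn_us, frac_abn_india,
               lambda_us=1.0, lambda_india=0.5, lambda_aux=1.0):
    """Weighted sum of the two dataset losses plus the auxiliary abnormality-classification loss."""
    return (lambda_us * weighted_nll(nll_us, abn_us, frac_abn_us)
            + lambda_india * weighted_nll(nll_india, abn_india, frac_abn_india)
            + lambda_aux * aux_cls_loss)

# Toy batch: per-example report NLLs, abnormality flags and dataset abnormality fractions.
print(total_loss(nll_us=[2.1, 3.0], abn_us=[1, 0],
                 nll_india=[1.5, 1.8], abn_india=[0, 0],
                 aux_cls_loss=0.4, frac_abn_us=0.5, frac_abn_india=0.1))
```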
Inference
Once Flamingo-CXR is trained, we use it to generate the radiology reports on the test chest radiographs with two decoding strategies: beam search with the beam width set to 3, and nucleus sampling44 with P = 0.9. We use the former, deterministic decoding method by default, and the reports it generates are used in calculating the reported NLG and clinical metrics in Table 1 and Extended Data Table 2 as well as in the subsequent expert evaluation. However, we also use the latter, stochastic decoding method when we need to generate multiple reports. For example, to plot the ROC curves in Fig. 2 and Extended Data Fig. 2 for measuring the disease classification accuracy of reports, we used nucleus sampling to generate 250 candidate reports, derived the condition labels from each with the CheXpert labeler and aggregated them to compute the per-condition probability.
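A minimal sketch of this aggregation step is given below; sample_report and label_report are hypothetical placeholders standing in for nucleus-sampled report generation and the CheXpert labeler, respectively.

```python
# Sketch: per-condition probability from stochastic report samples (illustrative).
def condition_probability(image, condition, sample_report, label_report, n_samples=250):
    """Fraction of nucleus-sampled reports in which the labeler flags `condition`."""
    hits = 0
    for _ in range(n_samples):
        report = sample_report(image, top_p=0.9)                  # nucleus sampling, P = 0.9
        hits += int(label_report(report).get(condition, 0) == 1)  # labeler output as {condition: 0/1}
    return hits / n_samples
```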
Datasets and preprocessing
We developed and evaluated our automatic report generation model using two large deidentified datasets of CXR images and corresponding radiology reports from the United States and India. Chest radiography offers a valuable testbed for automatic report generation systems because it is the most widely used thoracic imaging modality in the world28. Even for such a specific domain, the contents of radiology reports differ widely between geographic regions and clinical contexts. To account for these variations, we used the combination of the MIMIC-CXR dataset27, acquired in the emergency department of the Beth Israel Deaconess Medical Center in the United States, and another private research dataset of a similar scale, which we refer to as IND1 (ref. 28), obtained from a large hospital group in India. These datasets do not contain sex or gender information.
IND1
This is a deidentified dataset45 of 263,021 frontal chest radiographs (digital and scanned) with reports obtained from five regional centers across a large hospital group in India (Bangalore, Bhubaneswar, Chennai, Hyderabad and New Delhi) between November 2010 and January 2018. We use the same training, validation and test split as in previous studies28. Thus, a total of 250,066 samples are used for training, 4,960 samples for validation and 7,995 samples for testing of Flamingo-CXR. Furthermore, a small subset of 2,306 cases are annotated with varying numbers of binary labels (0, absent; 1, present) for six thoracic conditions (cardiomegaly, pleural effusion, lung opacity, edema, enlarged cardiomediastinum and fracture) obtained from a pool of 18 certified radiologists in the United States. The agreement labels are derived by calculating the majority vote, and used as the reference labels for evaluation of report quality in classification accuracy (for example, ROC curves in Extended Data Fig. 2 and F1 scores in Extended Data Table 2).
MIMIC-CXR
As the largest public dataset to date, MIMIC-CXR27 contains 377,110 images and 227,835 reports. In our experiments, we use the official split provided by the dataset resulting in 222,758 training examples, 1,808 validation examples and 3,269 test examples. For the reports, we remove redundant whitespaces (line breaks and so on). We only use frontal view scans (anterior–posterior and posterior–anterior views) and discard samples where only lateral views are provided. We only keep the FINDINGS and IMPRESSION sections of reports and filter out cases that do not contain an IMPRESSION section, following previous studies15.
Lastly, more than 50% of the examples in MIMIC-CXR contain previous scans15, and the corresponding reports often describe findings in reference to these prior scans (see the highlighted sentence in the left column of Extended Data Table 1 for an example). Consequently, as also reported in recent work46, naively training on the entirety of the MIMIC-CXR data leads to a model that generates reports with hallucinated references to nonexistent previous reports (see the right column; note that the model only has access to the current radiograph). To ameliorate this issue, we remove all the training examples with references to previous studies (see the middle column for an example of the improved prediction as a result). However, we still report the evaluation metrics on all the test examples for a fair comparison with previous studies. The combination of all the above preprocessing and filtering steps results in 90,968 training, 688 validation and 1,931 test examples.
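The filtering just described can be implemented as a simple text screen; the sketch below is illustrative, and the section marker and prior-study patterns are assumptions rather than the exact rules used in the study.

```python
# Sketch: keep MIMIC-CXR training reports with an IMPRESSION section and no
# reference to prior studies (illustrative filter, not the study's exact rules).
import re

PRIOR_PATTERNS = re.compile(
    r'\b(compared (to|with)|prior (study|exam|radiograph)|previous (study|exam|radiograph)|interval change)\b',
    re.IGNORECASE)

def clean_whitespace(text):
    """Collapse redundant whitespace such as line breaks."""
    return re.sub(r'\s+', ' ', text).strip()

def keep_training_example(report_text):
    text = clean_whitespace(report_text)
    has_impression = 'IMPRESSION' in text.upper()
    mentions_prior = PRIOR_PATTERNS.search(text) is not None
    return has_impression and not mentions_prior

print(keep_training_example("FINDINGS: Clear lungs. IMPRESSION: No acute process."))       # True
print(keep_training_example("FINDINGS: Compared to the prior study, effusion resolved."))  # False
```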
Image processing
All images in both datasets are resized to 320 × 320 while preserving the original aspect ratio, padded if needed, and normalized to zero mean and unit standard deviation. Color jitter and resize/crop transformations are applied as data-augmentation during the training of Flamingo-CXR.
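As an illustration of this preprocessing, here is a minimal sketch using Pillow and NumPy; the library choice and padding placement are assumptions, and the augmentations are omitted.

```python
# Sketch: aspect-preserving resize to 320x320 with zero padding, then
# per-image standardization (illustrative; augmentations omitted).
import numpy as np
from PIL import Image

def preprocess(path, size=320):
    img = Image.open(path).convert('L')                       # grayscale chest radiograph
    scale = size / max(img.size)
    img = img.resize((max(1, round(img.width * scale)),
                      max(1, round(img.height * scale))), Image.BILINEAR)
    canvas = Image.new('L', (size, size))                     # zero-padded square canvas
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    arr = np.asarray(canvas, dtype=np.float32)
    return (arr - arr.mean()) / (arr.std() + 1e-8)            # zero mean, unit standard deviation
```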
Icons in Figs. 1, 3, 4 and 5 and Extended Data Figs. 4 and 6 were sourced from Font Awesome (https://fontawesome.com) under the CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/).
Automated report generation metrics
We report performance on established automated metrics to facilitate comparison with previous studies, using two different categories of metrics. The first category comprises the NLG metrics, which include the CIDEr score47, BLEU score48 and Rouge-L47,49 and are widely used measures of report quality. However, multiple studies18,30,50,51 have recently highlighted the inadequacy of these NLG metrics for assessing factual correctness and consistency, key properties for determining the clinical utility and quality of radiology reports.
We also compute another category of metrics that are specifically designed to measure the accuracy of descriptions of relevant clinical findings, which we refer to as clinical metrics. Specifically, following previous work10,12,15,18, we report the microaveraged F1 score across 14 distinct categories related to thoracic diseases and support devices (atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, no finding, pleural effusion, pleural other, pneumonia, pneumothorax and support devices). To ensure a fair comparison with previous publications on the MIMIC-CXR dataset, we use the CheXpert labeling software42 to extract from the reports the binary labels that indicate the presence of these radiological findings. We refer to this metric as CheXpert F1. For the IND1 dataset, published results on classification performance are unavailable, so we use labels for these findings that were collected in a separate study45 from a group of 18 board-certified radiologists (American Board of Radiology) in the United States, and we use the corresponding consensus labels as GTs. In this way, we aim to mitigate the known inaccuracy of the CheXpert labeler software and obtain a test set with a more reliable metric of clinical factual correctness. Finally, to align with more recent studies21,22, we also report the RadGraph score19,20, which accounts not only for the presence of these findings but also for the relationships between them and other image features (for example, anatomical locations). All these results are reported on held-out test data that were not used to train or tune the model.
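For concreteness, a minimal sketch of the microaveraged F1 computation over the finding categories is given below, assuming binary label matrices extracted from the generated and reference reports; this is illustrative rather than the study's evaluation pipeline.

```python
# Sketch: microaveraged F1 over the CheXpert finding categories (illustrative).
import numpy as np
from sklearn.metrics import f1_score

def chexpert_f1(pred_labels, ref_labels):
    """pred_labels, ref_labels: (n_reports, n_conditions) arrays of 0/1 finding labels
    (14 conditions in the study)."""
    return f1_score(np.asarray(ref_labels), np.asarray(pred_labels), average='micro')

# Toy example with three reports and four of the categories shown.
ref = [[1, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
pred = [[1, 0, 0, 0], [0, 0, 1, 0], [1, 1, 0, 1]]
print(chexpert_f1(pred, ref))
```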
Disease classification in comparison with human radiologists
In Fig. 2 and Extended Data Fig. 2, the GT labels are derived from the majority votes of five annotations per example acquired by a separate group of 18 experts and, thus, should provide more reliable labels than the ones extracted from the CheXpert labeler42 (which was used for the MIMIC-CXR dataset). To generate the binary labels from the generated reports from Flamingo-CXR, the CheXpert labeler is used as before.
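A minimal sketch of the majority-vote label derivation, with an assumed data layout:

```python
# Sketch: majority-vote ground-truth label from the expert annotations (illustrative).
def majority_vote(votes):
    """votes: iterable of 0/1 annotations for one case and condition."""
    votes = list(votes)
    return int(sum(votes) * 2 > len(votes))

print(majority_vote([1, 0, 1, 1, 0]))  # -> 1 (three of five experts marked the condition present)
```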
Expert evaluation of AI-generated and human-written reports
An accumulation of evidence has shown that automatic report generation metrics fail to appropriately evaluate many nuanced issues in radiology reports21. Here we describe how we evaluate AI-generated reports by conducting radiologist evaluation tasks. To document human errors in report writing and to characterize differences in quality with our AI system, we also evaluate the original reports (that we have treated as GTs) by obtaining additional readings from different radiologists than the ones who provided the original reports.
Annotators
We recruited a group of 16 radiologists in India and 11 radiologists in the United States with board certifications (Diplomate of National Board and American Board of Radiology, respectively). All raters completed the required Collaborative Institutional Training Initiative (CITI) training before performing the evaluation tasks on the MIMIC-CXR dataset. None of the raters were coauthors of this work, and the raters were not given any information about the origin of the reports, including the possibility that the reports may have been generated by an AI model. We ask four radiologists to evaluate each report, two from the US cohort and two from the India cohort, which allows us to capture inter-rater preference variability and regional preference variability. We highlight that radiologists who provided annotations for the first phase of the error correction or preference test tasks were excluded from the human–AI collaboration evaluation to avoid annotation bias. Before the large-scale evaluation, we validated the labeling interface with an expert to ensure that instructions were clear and opt-out options were available where essential.
Sample selection
We randomly select a fixed number of normal and abnormal cases from the IND1 and MIMIC-CXR datasets. To ensure good coverage of different abnormalities, the set of abnormal cases reviewed by radiologists was larger than that of normal cases. In total, 606 cases were evaluated by expert radiologists in the two tasks: 34 normal and 272 abnormal cases from the MIMIC-CXR dataset, and 100 normal and 200 abnormal cases from the IND1 dataset. We ensure coverage of multiple abnormal cases for both datasets, because we found classification quality to vary significantly across conditions. It is also worth noting that the same set of cases was annotated in both the error correction and pairwise preference tasks. For the MIMIC-CXR dataset, we include cases annotated in the human evaluation of previous work22 that survived the filtering stage described below.
Annotation interface
We use an internal platform for data collection to perform our expert evaluation. Extended Data Fig. 1 illustrates the labeling interfaces used by our raters to perform the pairwise preference and error correction tasks. The annotators were provided with the following descriptions of the respective tasks along with screenshots of examples:
(1) Instructions for Pairwise Preference Task
You are provided with:
- The CXR image
- Two radiology reports for this image, each consisting of the findings and impression sections.
Your task is to help us assess the relative usefulness of the radiology reports. An example with detailed instructions is shown in Extended Data Fig. 1a.
(2) Instructions for Error Correction Task
You are provided with:
- The CXR image
- A radiology report for this image, consisting of the findings and impression sections.
Your task is to help us assess the accuracy of the radiology report in detail. You will be asked whether there is any part of the report that you do not agree with and, if so, you will then be asked to (a) select the passage that you disagree with, (b) select the reason for disagreement (‘finding I do not agree with is present’; ‘incorrect location of finding’; ‘incorrect severity of finding’), (c) specify whether the error is clinically significant or not, and (d) provide a replacement for the selected passage. An error should be labeled as clinically significant if it is potentially harmful and could change treatment/outcome for a patient. An example is shown in Extended Data Fig. 1b.
In addition, we addressed the annotators’ questions by email on an as-needed basis.
All data were stored in the Digital Imaging and Communications in Medicine (DICOM) format and deidentified before transfer to the external radiologists for annotation. Experts were asked to confirm whether the image provided for each task was of sufficient quality for them to complete the task. In three MIMIC-CXR cases, one of the four raters opted not to complete the task; in those instances, the entire case was discarded. After these exclusions, the MIMIC-CXR evaluation set consisted of 32 normal cases and 271 abnormal cases, with abnormal conditions occurring at the following frequencies (in parentheses): lung opacity (132), cardiomegaly (123), support devices (134), pleural effusion (100), atelectasis (95), edema (75), enlarged cardiomediastinum (68), pneumonia (46), consolidation (25), lung lesion (17), pneumothorax (13), fracture (10) and pleural other (8), with many abnormal cases containing more than one condition. Evaluators were given full-resolution X-ray images but were not given indication or clinical history data, or any other information about the possible origin of a report, consistent with the model task formulation in our study and with previous studies10,15,29,40.
Pairwise preference test
Clinicians were asked, ‘If you had to choose one of these two reports to go into the Picture Archiving and Communication System (PACS) system and be used downstream for the care of this patient, which would be best for the patient?’. For each case, the raters did not know which report was the original, nor that one of the reports had been generated by our AI system. The assignment of the original and generated reports to options A and B was randomized for each case.
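One way to summarize the resulting panel preferences, in the spirit of the ‘preferred or equivalent by half or more of the panel’ figures reported in this work, is sketched below. The data layout is an assumption: each case maps to a list of per-rater verdicts already resolved to 'ai', 'human' or 'neither' (that is, after undoing the random A/B assignment); it is not the authors' analysis code.

```python
# Aggregate four-rater panel preferences per case; a hedged sketch.
def fraction_ai_preferred_or_equivalent(panel_verdicts: dict, min_fraction: float = 0.5) -> float:
    """Fraction of cases in which >= min_fraction of raters chose 'ai' or 'neither'."""
    hits = 0
    for verdicts in panel_verdicts.values():
        share = sum(v in ("ai", "neither") for v in verdicts) / len(verdicts)
        if share >= min_fraction:
            hits += 1
    return hits / len(panel_verdicts)
```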
Error correction
Before each annotation task, clinicians are asked whether the presented image is of sufficient quality for them to complete the task. They are then asked whether there is any part of the report that they do not agree with and, if so, are asked to (1) select the passage that they disagree with, (2) select the reason for disagreement (‘finding I do not agree with is present’; ‘incorrect location of finding’; ‘incorrect severity of finding’), (3) specify whether the error is clinically significant or not, and (4) provide a replacement for the selected passage. We instruct the raters beforehand that a clinically significant error is one that is potentially harmful or that influences the downstream clinical decision (for example, treatment) for the patient. We note that the raters evaluate both the GT reports written by an expert and those generated by our model, without knowledge of their sources. Because the raters performing this task are different from those who wrote the original reports, this also allows us to measure the rate of human error in report writing. Importantly, our evaluation differs from previous work22, in which the original report was additionally provided as a reference and, hence, assumed to be accurate.
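The per-case error annotations can be summarized as a ‘both / AI only / human only’ breakdown of clinically significant errors, as reported in this work. The sketch below assumes a simple structure, case_id -> {'ai': bool, 'human': bool}, where each flag indicates at least one clinically significant error in that report type; the structure and names are illustrative assumptions.

```python
# Tally the overlap of clinically significant errors across report types.
from collections import Counter

def clinically_significant_overlap(case_errors: dict) -> Counter:
    """Count cases with significant errors in both reports, AI only, human only or neither."""
    tally = Counter()
    for flags in case_errors.values():
        if flags["ai"] and flags["human"]:
            tally["both"] += 1
        elif flags["ai"]:
            tally["ai_only"] += 1
        elif flags["human"]:
            tally["human_only"] += 1
        else:
            tally["neither"] += 1
    return tally
```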
Clinician–AI collaboration
We use the pairwise preference interface described above and ensure that the clinician who produces a clinician–AI report is excluded from the group that performs the preference test for that report. We exclude reports for which the raters did not provide replacement sentences as instructed in the error correction task (seven MIMIC-CXR instances and four IND1 instances). We evaluate expert preferences for the IND1 and MIMIC-CXR datasets and, for each report, collect preferences from four radiologists (two from our India cohort and two from our US cohort). As in the previous setup, the raters do not know which report corresponds to the original GT and which was initially generated by the AI model. In all these cases, we report rater preferences for reports that were subject to editing by clinicians, so that the comparisons shed light on the effect of clinician–AI collaboration.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
‘MIMIC-CXR’27, one of the real-world datasets used in the development of Flamingo-CXR, is accessible to researchers and can be downloaded from https://physionet.org/content/mimic-cxr upon completion of the required training. IND1, the other deidentified chest X-ray dataset used in this study, cannot be made publicly available because the authors do not have the rights to do so. Interested researchers should contact info@apollohospitals.com to inquire about access to the IND1 dataset; requests will be subject to Apollo’s consideration and applicable ethical and legal requirements. The radiologist ratings and generated reports are not publicly available because they are inextricably linked to the IND1 and MIMIC-CXR datasets, as described in the Methods. Further inquiries about our benchmarking procedures and data analysis may be addressed to the corresponding authors, with a maximum response time of two weeks.
Code availability
For reproducibility, we have documented the technical details of the implementation while keeping the paper accessible to a clinical and general scientific audience. Several major components of our work are available in open-source repositories, such as the Haiku library (https://github.com/google-deepmind/dm-haiku). Our work builds upon Flamingo, whose implementation details have been described extensively in the corresponding publication8; an open-source implementation of the base model is available through, for instance, the OpenFlamingo project at https://github.com/mlfoundations/open_flamingo. Other components used in our work cannot be shared publicly because of their proprietary nature.
References
Maru, D. S.-R. et al. Turning a blind eye: the mobilization of radiology services in resource-poor regions. Global Health 6, 18 (2010).
Rimmer, A. Radiologist shortage leaves patient care at risk, warns Royal College. BMJ 359, j4683 (2017).
Rajpurkar, P. & Lungren, M. P. The current and future state of AI interpretation of medical images. N. Engl. J. Med. 388, 1981–1990 (2023).
Allen, B., Agarwal, S., Coombs, L., Wald, C. & Dreyer, K. 2020 ACR Data Science Institute artificial intelligence survey. J. Am. Coll. Radiol. 18, 1153–1159 (2021).
Milam, M. E. & Koo, C. W. The current status and future of FDA-approved artificial intelligence tools in chest radiology in the United States. Clin. Radiol. 78, 115–122 (2023).
Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443 (2018).
Guo, W., Wang, J. & Wang, S. Deep multimodal representation learning: a survey. IEEE Access 7, 63373–63394 (2019).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022).
Li, C. et al. Multimodal foundation models: from specialists to general-purpose assistants. Found. Trends Comput. Graph. Vis. 16, 1–214 (2023).
Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1439–1449 (eds Webber, B. et al.) (Association for Computational Linguistics, 2020).
Endo, M. et al. Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. Proc. Mach. Learn. Res. 158, 209–219 (2021).
Miura, Y., Zhang, Y., Tsai, E., Langlotz, C. & Jurafsky, D. Improving factual completeness and consistency of image-to-text radiology report generation. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 5288–5304 (Association for Computational Linguistics, 2021).
Nicolson, A., Dowling, J. & Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 144, 102633 (2023).
Yan, B. et al. Style-aware radiology report generation with RadGraph and few-shot prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023 https://doi.org/10.18653/v1/2023.findings-emnlp.977 (2023).
Bannur, S. et al. Learning to exploit temporal structure for biomedical vision–language processing. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15016–15027 (2023).
Hartung, M. P., Bickle, I. C., Gaillard, F. & Kanne, J. P. How to create a great radiology report. Radiographics 40, 1658–1670 (2020).
Kahn, C. E. Jr et al. Toward best practices in radiology reporting. Radiology 252, 852–856 (2009).
Liu, G. et al. Clinically accurate chest X-ray report generation. Proc. Mach. Learn. Res. 106, 249–269 (2019).
Jain, S. et al. RadGraph: extracting clinical entities and relations from radiology reports (version 1.0.0). PhysioNet https://doi.org/10.13026/HM87-5P47 (2021).
Khanna, S. et al. RadGraph2: modeling disease progression in radiology reports via hierarchical information extraction. Preprint at https://doi.org/10.48550/arXiv.2308.05046 (2023).
Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns (N Y) 4, 100802 (2023).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI https://doi.org/10.1056/AIoa2300138 (2024).
Huang, J. et al. Generative artificial intelligence for chest radiograph interpretation in the emergency department. JAMA Netw. Open 6, e2336100 (2023).
Harvey, H. B. & Gowda, V. How the FDA regulates AI. Acad. Radiol. 27, 58–61 (2020).
Norden, J. G. & Shah, N. R. What AI in health care can learn from the long road to autonomous vehicles. NEJM Catalyst 3 (2022).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Proc. 37th Int. Conf. Neural Information Processing Systems (Curran Associates Inc., 2024).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Nabulsi, Z. et al. Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19. Sci. Rep. 11, 15523 (2021).
Wang, Z., Liu, L., Wang, L. & Zhou, L. R2GenGPT: radiology report generation with frozen LLMs. Preprint at https://arxiv.org/abs/2309.09812 (2023).
Boag, W. et al. Baselines for chest X-ray report generation. In Proc. Machine Learning for Health NeurIPS Workshop Vol. 116 (eds Dalca, A. V. et al.) 126–140 (PMLR, 2020).
Gefter, W. B., Post, B. A. & Hatabu, H. Commonly missed findings on chest radiographs: causes and consequences. Chest 163, 650–661 (2022).
Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. Proc. Mach. Learn. Res. 225, 353–367 (2023).
Rajpurkar, P. et al. CheXaid: deep learning assistance for physician diagnosis of tuberculosis using chest X-rays in patients with HIV. NPJ Digital Med. 3, 115 (2020).
Seah, J. C. Y. et al. Effect of a comprehensive deep-learning model on the accuracy of chest X-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digital Health 3, e496–e506 (2021).
Agarwal, N., Moehring, A., Rajpurkar, P. & Salz, T. Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology (National Bureau of Economic Research Inc., 2023).
Dvijotham, K. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat. Med. 29, 1814–1820 (2023).
Chen, Z. et al. CheXagent: towards a foundation model for chest X-ray interpretation. In AAAI 2024 Spring Symposium on Clinical Foundation Models (AAAI, 2024).
Tanida, T., Müller, P., Kaissis, G. & Rueckert, D. Interactive and explainable region-guided radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 7433–7442 (2023).
Yan, A. et al. Weakly supervised contrastive learning for chest X-ray report generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 4009–4015 (2021).
Jaegle, A. et al. Perceiver IO: a general architecture for structured inputs & outputs. In International Conference on Learning Representations (ICLR, 2022).
Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. AAAI Conference on Artificial Intelligence Vol. 33 590–597 (2019).
Loshchilov, I. & Hutter, F. Fixing weight decay regularization in Adam. Preprint at https://arxiv.org/abs/1711.05101v2 (2018).
Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. Preprint at https://arxiv.org/abs/1904.09751 (2019).
Ahn, J. S. et al. Association of artificial intelligence-aided chest radiograph interpretation with reader performance and efficiency. JAMA Netw. Open 5, e2229289 (2022).
Ramesh, V., Chi, N.A. & Rajpurkar, P. Improving radiology report generation systems by removing hallucinated references to non-existent priors. Proc. Mach. Learn. Res. 193, 456–473 (2022).
Vedantam, R., Zitnick, C. L. & Parikh, D. CIDEr: consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4566–4575 (2015).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).
Lin, C.-Y. in Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. T. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL, 2020).
Pătrăucean, V. et al. Perception Test: a diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 36 (2024).
Horvitz, D. G. & Thompson, D. J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47, 663–685 (1952).
Acknowledgements
This research was funded by Google DeepMind and Google Research. We would like to thank many colleagues for useful discussions, suggestions, feedback and advice, including C. Kelly, G. S. Corrado, J. Krause, N. Tomasev, O. Vinyals, S. Mohamed, R. Hadsell, Z. Ghahramani, T. Cemgil and Y. Chen.
Author information
Authors and Affiliations
Contributions
A.K., P.K. and P.S.-H. initiated the project. R.T., D.G.T.B., I.K., A.K., A. Sellergren, S.G., S.D., A. See, D.B., P.-S.H. and J.W. contributed to the design of the method and experiments. S.G., S.D., A. See, D.G.T.B., P.S.-H., J.W., A.K. and V.N. contributed to the initial problem formulation and engineering setup. R.T., D.G.T.B. and I.K. revised the problem formulation and the engineering setup. I.K., R.T., D.G.T.B., A. Sellergren, S.G., S.D., A. See, P.S.-H. and J.W. contributed to software engineering. R.T. and I.K. evaluated and developed the final report generation model. S.G. ingested and prepared the MIMIC-CXR dataset. S.R.K. provided support for use of the IND1 dataset. A.K., C.L. and S.D. designed the human evaluation rubric. A. Sellergren, I.K. and R.T. implemented the labeling pipeline for human evaluation. S. Man, R.L., D.G.T.B. and R.T. coordinated the labeling jobs. I.K., D.G.T.B. and R.T. contributed to the evaluation of the work and performed analysis. D.G.T.B., I.K., A.K., S.D. and R.T. contributed to the interpretation of the results. T.T., S.A., M.S., V.N., D.B., K.S., Z.A., P.K., Y.M., J.B., Y.L., S.M.A.E. and S.S. advised on the work. R.T., D.G.T.B., I.K. and A.K. wrote the paper. A. Sellergren, S.D., J.W., T.T., V.N., S. Mahdavi, R.M., S.A., S.M.A.E. and A.K. revised the manuscript. D.G.T.B., I.K. and R.T. incorporated feedback during the review process.
Corresponding authors
Ethics declarations
Competing interests
This study was funded by Google LLC and/or a subsidiary thereof (‘Google’). R.T., D.G.T.B., A. Sellergren, S.G., S.D., A. See, J.W., C.L., T.T., S.A., M.S., R.M., R.L., S. Man, Z.A., S. Mahdavi, Y.M., J.B., S.M.A.E., Y.L., S.S., V.N., P.K., P.S.-H., A.K. and I.K. are employees of Google and may own stock as part of the standard compensation package. D.B. was a Google employee and is currently an employee of the AI division of GlaxoSmithKline and may own stock as part of the standard compensation package. Similarly, K.S. was a Google employee and may own stock, but is currently an employee of OpenAI.
Peer review
Peer review information
Nature Medicine thanks Jarrel Seah and Fredrik Strand for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Labelling interface.
(a) In the labelling interface for the pairwise preference test, raters are provided with (i) a frontal view (PA or AP) in the original resolution, (ii) a radiology report generated by our AI system and (iii) the original report written by a radiologist, and are asked to provide their preference. For each case, the raters are unaware of which report is the ground truth and which is generated by our model, and are asked to indicate their preference among three options: report A, report B or equivalence between the two (that is, ‘neither is better than the other’). The interface allows the raters to zoom in and out on the image as needed. They are additionally asked to provide an explanation for their choice. (b) In the labelling interface for the error correction task, raters are provided with (i) the chest X-ray image (a frontal view) and (ii) a radiology report for this image, consisting of the findings and impression sections. Their task is to assess the accuracy of the given radiology report by identifying errors in the report and correcting them. Before each annotation task, clinicians are asked whether the presented image is of sufficient quality for them to complete the task. They are then asked whether there is any part of the report that they do not agree with and, if so, are asked to (a) select the passage that they disagree with, (b) select the reason for disagreement (‘finding I do not agree with is present’; ‘incorrect location of finding’; ‘incorrect severity of finding’), (c) specify whether the error is clinically significant or not, and (d) provide a replacement for the selected passage.
Extended Data Fig. 2 Detection accuracy per condition on the IND1 dataset.
The receiver operating characteristic (ROC) curve of the Flamingo-CXR report generation model is shown, along with the true positive rate (TPR) and false positive rate (FPR) pairs for two certified radiologists, for the six conditions for which expert labels were collected. The operating point of our model with the default inference scheme (Beam 3) is also shown. Error bars represent 95% confidence intervals (calculated using bootstrapping with 1,000 repetitions).
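A minimal sketch of the percentile bootstrap used for such 95% confidence intervals (1,000 resamples of the test cases) is given below. The inputs and `metric_fn` are placeholders; for instance, `metric_fn` could compute the true positive rate at the chosen operating point. This is an illustrative implementation of the general technique, not the authors' exact procedure.

```python
# Percentile bootstrap confidence interval over resampled test cases.
import numpy as np

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Return the (1 - alpha) percentile bootstrap CI of metric_fn(y_true, y_score)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample cases with replacement
        stats.append(metric_fn(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper
```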
Extended Data Fig. 3 Subgroup analysis of preferences for MIMIC-CXR and IND1.
Here the expert preference data presented in Fig. 3 are analysed further, with preferences shown separately for Flamingo-CXR reports, ground truth reports and neutral preference between reports, for (a) MIMIC-CXR reports and (b) IND1 reports. As before, reports are grouped according to the level of agreement between reviewers who rate Flamingo-CXR reports as equivalent or better than ground truth reports. Preferences are further grouped into normal and abnormal subsets.
Extended Data Fig. 4 Types of errors found in the original reports and the AI-generated reports.
(a) During the error correction evaluation, we ask expert raters to explain the identified issues in reports using the following taxonomy: (i) incorrect findings, (ii) incorrect severity (for example, mild vs. severe pulmonary edema) and (iii) incorrect location of finding (for example, left- vs. right-sided pleural effusion). The figure shows the distributions of these error types for the normal and abnormal cases separately in the IND1 and MIMIC-CXR datasets. Data are presented as mean values; 95% confidence intervals across cases are also shown. In total, there are 34 normal and 272 abnormal cases from the MIMIC-CXR dataset, and 100 normal and 200 abnormal cases from the IND1 dataset. (b) Venn diagrams of error counts for reports that contain at least one error, for the MIMIC-CXR dataset and the IND1 dataset. The intersection between the blue and the green segments indicates the number of cases where both the AI-generated report and the ground truth contained errors. The red segment indicates the cases where at least one clinically significant error was detected.
Extended Data Fig. 5 Average number of clinically significant errors and percentage of reports with at least one error reported by experts in human-written and AI-generated reports across conditions for the MIMIC-CXR and IND1 datasets.
(a) For MIMIC-CXR, the average number of clinically significant errors in reports describing cases with pneumothorax is almost double the number for those with edema, but for most other conditions the occurrence of errors does not vary significantly. It is worth noting that the condition labels for MIMIC-CXR cases are obtained by running CheXpert52 on the original human-written reports. Additionally, if more than one condition is associated with a particular chest X-ray image (which is often the case), the clinically significant errors in the corresponding reports are reported for all of these conditions. (b) For IND1, we do not observe striking differences across conditions in terms of clinically significant errors reported in the AI-generated reports, even though more errors are reported on average for cases with pleural effusion than for those with cardiomegaly. Interestingly, no errors are reported in cases with fracture, so we omit this condition from the figure. These findings indicate that condition prevalence in the training data does not necessarily affect report quality.
Extended Data Fig. 6 Clinician-AI collaboration and clinically significant errors.
Subgroup analysis of the data presented in Fig. 5 illustrates that (a) clinician-AI collaboration produced an improvement in ratings for the subgroup of AI reports that had clinically significant errors (with MIMIC-CXR p values given by p* = 2.6x10−3, p** = 1.5x10−7, p*** = 2.9x10−8 and with IND1 p values given by p* = 6.3x10−7, p** = 4.0x10−8, p*** = 1.3x10−5), whereas (b) there was little or no improvement for the subgroup of AI reports that did not have clinically significant errors (with MIMIC-CXR p values given by p* = 1.2x10−2, p** = 1.2x10−2 and with IND1 p values given by p* = 3.2x10−2). As before, significant differences (p < 0.05) between clinician-AI results and AI-only results, calculated using a one-sided chi-squared test, are indicated by asterisks. This suggests that the positive impact of clinician-AI collaboration is largely attributable to edits in AI reports that had clinically significant errors. Data for all panels are presented as mean values and error bars show 95% confidence intervals for the cumulative preference scores.
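The exact construction of the one-sided chi-squared comparison is not spelled out in this caption, so the sketch below shows one plausible reading: a 2×2 contingency test on preference counts whose two-sided p value is halved when the observed effect lies in the tested direction. All counts and names are placeholders, not the authors' analysis code.

```python
# One-sided comparison of two preference proportions via a 2x2 chi-squared test.
from scipy.stats import chi2_contingency

def one_sided_chi2_pvalue(pref_a, total_a, pref_b, total_b):
    """P value for the hypothesis that the preference rate in group A exceeds group B."""
    table = [[pref_a, total_a - pref_a],
             [pref_b, total_b - pref_b]]
    chi2, p_two_sided, dof, expected = chi2_contingency(table, correction=False)
    if pref_a / total_a > pref_b / total_b:
        return p_two_sided / 2          # observed effect in the tested direction
    return 1 - p_two_sided / 2
```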
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tanno, R., Barrett, D.G.T., Sellergren, A. et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat Med 31, 599–608 (2025). https://doi.org/10.1038/s41591-024-03302-1