Fig. 5: Results of pairwise preference test for clinician–AI collaboration. | Nature Medicine


From: Collaboration between clinicians and vision–language models in radiology report generation


a, Preferences for reports produced from the clinician–AI collaboration, relative to the original clinicians’ reports. The corresponding preference scores for reports produced by Flamingo-CXR without human collaboration are also given. Reports are grouped by the level of agreement between reviewers, and in all cases, results are shown for the subset of reports that required editing during the error correction task. Data for all panels are presented as mean values, and error bars show 95% confidence intervals for the cumulative preference scores. Significant differences (P < 0.05) between clinician–AI results and AI-only results, calculated using a one-sided chi-squared test, are indicated by asterisks (MIMIC-CXR: *P = 1.3 × 10⁻², **P = 5.7 × 10⁻⁴, ***P = 3.2 × 10⁻⁹; IND1: *P = 1.2 × 10⁻⁷, **P = 4.4 × 10⁻⁹, ***P = 7.7 × 10⁻⁶). b, Preferences for reports produced from a collaboration between Flamingo-CXR and radiologists from our US-based cohort and, separately, from our India-based cohort. c, Preferences for normal reports and, separately, for abnormal reports. d, An example of a pairwise preference test for a clinician–AI report and an AI report, relative to the original clinician’s MIMIC-CXR report. All four radiologists initially preferred the original clinician’s report over the AI report. Another radiologist revised two sentences in the AI report (indicated in red), resulting in a complete flip in preference: all four radiologists judged the clinician–AI report superior (or equivalent) to the original clinician’s report.
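For readers unfamiliar with the significance test named in the caption, the following is a minimal sketch of a one-sided chi-squared comparison of two preference rates on a 2 × 2 table. It is an illustration under stated assumptions, not the authors' analysis code: the function name and the counts in the usage note are hypothetical, and the caption does not say whether a continuity correction was applied (none is used here). With one degree of freedom, the chi-squared statistic is the square of a standard normal deviate, so the one-sided P value can be taken from the normal tail in the hypothesised direction.

```python
import math


def one_sided_chi2_2x2(pref_a, total_a, pref_b, total_b):
    """Pearson chi-squared test (1 d.f., no continuity correction).

    Tests H1: the preference rate in group A exceeds that in group B.
    Returns (chi2_statistic, one_sided_p_value).
    All argument names are illustrative, not taken from the paper.
    """
    # 2 x 2 table: rows are groups, columns are preferred / not preferred.
    a, b = pref_a, total_a - pref_a
    c, d = pref_b, total_b - pref_b
    n = total_a + total_b

    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("degenerate table: a row or column sums to zero")

    cross = a * d - b * c  # same sign as (rate_a - rate_b)
    chi2 = n * cross ** 2 / denom

    # Signed square root of chi2 is a standard normal deviate under H0.
    z = math.copysign(math.sqrt(chi2), cross)

    # Upper-tail normal probability P(Z > z).
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return chi2, p_one_sided
```

As a usage example with made-up counts, `one_sided_chi2_2x2(60, 100, 40, 100)` compares a hypothetical 60% preference rate against a hypothetical 40% rate; swapping the two groups returns the complementary P value, which is what makes the test one-sided.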
