Extended Data Fig. 6: Clinician-AI collaboration and clinically significant errors.
From: Collaboration between clinicians and vision–language models in radiology report generation
Subgroup analysis of the data presented in Fig. 5 illustrates that (a) clinician-AI collaboration produced an improvement in ratings for the subgroup of AI reports that had clinically significant errors (with MIMIC-CXR p values given by p* = 2.6 × 10⁻³, p** = 1.5 × 10⁻⁷, p*** = 2.9 × 10⁻⁸ and with IND1 p values given by p* = 6.3 × 10⁻⁷, p** = 4.0 × 10⁻⁸, p*** = 1.3 × 10⁻⁵), whereas (b) there was little or no improvement for the subgroup of AI reports that did not have clinically significant errors (with MIMIC-CXR p values given by p* = 1.2 × 10⁻², p** = 1.2 × 10⁻² and with IND1 p values given by p* = 3.2 × 10⁻²). As before, asterisks indicate significant differences (p < 0.05) between clinician-AI results and AI-only results, calculated using a one-sided chi-squared test. These results suggest that the positive impact of clinician-AI collaboration is largely attributable to edits in AI reports that had clinically significant errors. Data for all panels are presented as mean values, and error bars show 95% confidence intervals for the cumulative preference scores.