Extended Data Fig. 5: Average number of clinically significant errors and percentage of reports with at least one error reported by experts in human-written and AI-generated reports across conditions for the MIMIC-CXR and IND1 datasets. | Nature Medicine

From: Collaboration between clinicians and vision–language models in radiology report generation

(a) For MIMIC-CXR, the average number of clinically significant errors in reports capturing cases with pneumothorax is almost double that for cases with edema, but for most other conditions the occurrence of errors does not vary significantly. It is worth noting that the condition labels for MIMIC-CXR cases are obtained by running CheXpert (ref. 52) on the original human-written reports. Additionally, if more than one condition is associated with a particular chest X-ray image (which is often the case), the clinically significant errors in the corresponding reports are counted towards all of those conditions. (b) For IND1, we do not observe striking differences across conditions in the number of clinically significant errors reported in the AI-generated reports, although more errors on average are reported for cases with pleural effusion than for those with cardiomegaly. Interestingly, no errors are reported for cases with fracture, so we omit this condition from the figure. These findings indicate that condition prevalence in the training data does not necessarily affect report quality.
