Fig. 4: Comparison of error correction for the AI-generated reports and the original GT reports. | Nature Medicine

From: Collaboration between clinicians and vision–language models in radiology report generation

a–c, The upper row shows the percentage of reports with at least one (clinically significant) error. The bottom row shows the average number of identified (clinically significant) errors per report, computed as the total number of detected errors divided by the number of all reports, including those without errors. The two metrics are compared across the IND1 and MIMIC-CXR datasets overall (a), across the two rater locations (India and the United States) to illustrate regional inter-rater variation (b), and across the normal and abnormal cases in the respective datasets (c). Error statistics for GT reports and Flamingo-CXR reports are given for each setting and grouped together as indicated by dashed lines. Data are presented as mean values, and error bars correspond to 95% confidence intervals across cases and expert assessments.
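
The two metrics described in the caption can be illustrated with a minimal Python sketch. The function name `error_metrics` and the example counts are hypothetical and are not taken from the study. The sketch only shows that both metrics share the same denominator (all reports, including error-free ones), and it does not attempt to reproduce the confidence intervals.

```python
from typing import Sequence, Tuple


def error_metrics(errors_per_report: Sequence[int]) -> Tuple[float, float]:
    """Return (percentage of reports with >= 1 error, mean errors per report).

    Both values use the number of ALL reports as the denominator,
    including reports in which no error was identified.
    """
    n_reports = len(errors_per_report)
    if n_reports == 0:
        raise ValueError("at least one report is required")

    # Upper-row metric: share of reports with at least one error.
    n_with_error = sum(1 for count in errors_per_report if count > 0)
    pct_with_error = 100.0 * n_with_error / n_reports

    # Bottom-row metric: total detected errors divided by all reports.
    mean_errors = sum(errors_per_report) / n_reports

    return pct_with_error, mean_errors


# Hypothetical error counts for five reports (illustrative only).
counts = [0, 2, 0, 1, 0]
pct, mean = error_metrics(counts)
print(f"{pct:.1f}% of reports contain an error; {mean:.2f} errors per report")
```

Because error-free reports count toward the denominator, the average in the bottom row can fall below 1 even when some reports contain several errors.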
