Extended Data Fig. 1: Labelling interface.
(a) In the labelling interface for the pairwise preference test, raters are provided with (i) a frontal view (PA or AP) at the original resolution, (ii) a radiology report generated by our AI system and (iii) the original report written by a radiologist. For each case, the raters are unaware of which report is the ground truth and which is generated by our model, and are asked to indicate their preference from three options: report A, report B, or equivalence between the two (that is, ‘neither is better than the other’). The interface allows the raters to zoom in and out on the image as needed. They are additionally asked to provide an explanation for their choice.
(b) In the labelling interface for the error correction task, raters are provided with (i) the chest X-ray image (a frontal view) and (ii) a radiology report for this image, consisting of the findings and impression sections. Their task is to assess the accuracy of the report by identifying errors and correcting them. Before each annotation task, clinicians are asked whether the presented image is of sufficient quality for them to complete the task. They are then asked whether there is any part of the report that they do not agree with and, if so, are asked to (1) select the passage that they disagree with, (2) select the reason for disagreement (finding I do not agree with is present; incorrect location of finding; incorrect severity of finding), (3) specify whether the error is clinically significant or not, and (4) provide a replacement for the selected passage.
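To make the structure of the two annotation tasks concrete, the sketch below models the record a rater produces in each task as Python dataclasses: a three-way preference with a free-text explanation for the pairwise preference test, and a list of passage-level corrections for the error correction task. All class, field and value names are hypothetical illustrations for this sketch; the caption does not specify how annotations are stored.

```python
# Minimal sketch of the annotation records implied by panels (a) and (b).
# Names are illustrative only, not the authors' actual schema.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Preference(Enum):
    """Panel (a): three-way choice in the pairwise preference test."""
    REPORT_A = "report_a"
    REPORT_B = "report_b"
    EQUIVALENT = "equivalent"  # 'neither is better than the other'


class DisagreementReason(Enum):
    """Panel (b): reasons offered for disagreeing with a passage."""
    FINDING_NOT_AGREED = "finding I do not agree with is present"
    INCORRECT_LOCATION = "incorrect location of finding"
    INCORRECT_SEVERITY = "incorrect severity of finding"


@dataclass
class PreferenceAnnotation:
    """One rater's judgement for a single case in the pairwise preference test.

    Which of report A / report B is the original radiologist report and which
    is model-generated is hidden from the rater and tracked separately.
    """
    case_id: str
    preference: Preference
    explanation: str  # free-text justification for the choice


@dataclass
class PassageCorrection:
    """One disagreement with a passage in the error correction task."""
    passage: str                     # passage the rater disagrees with
    reason: DisagreementReason
    clinically_significant: bool
    replacement: str                 # rater-provided replacement text


@dataclass
class ErrorCorrectionAnnotation:
    """One rater's assessment of a single report in the error correction task."""
    case_id: str
    image_quality_sufficient: bool   # asked before each annotation task
    corrections: List[PassageCorrection] = field(default_factory=list)
```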