Fig. 2: Comparison of detection accuracy with expert labels on the IND1 dataset.
a, The ROC curve of the Flamingo-CXR report generation model with the stochastic generation method (Nucleus) and the corresponding area under the curve (AUC), shown alongside the sensitivity and 1 − specificity pairs for two certified radiologists. The operating point of our model under the default deterministic inference scheme (Beam 3) is also shown; details of both inference algorithms are given in the Methods. The curve and the metrics are micro-averaged across the six conditions for which labels were collected (cardiomegaly, pleural effusion, lung opacity, edema, enlarged cardiomediastinum and fracture; n = 7,995 IND1 test set reports in total). The ground-truth (GT) labels are defined as the majority vote among the five labels obtained from the pool of 18 certified radiologists. Error bars represent 95% confidence intervals, calculated by bootstrapping with 1,000 repetitions.

b, Kendall's tau coefficients with respect to the expert labels, shown for the two held-out radiologists and for the two inference schemes of our Flamingo-CXR model. Instead of the majority-vote labels, the target for this metric is the 'soft' label of each condition: the probability of its presence, obtained by averaging the available binary annotations from the expert pool. The vertical axis also reports the prevalence rate (PR) of each condition in the training set and its sample size in the test set.
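A minimal sketch of the panel a evaluation, assuming binary majority-vote GT labels and per-condition model scores; the array shapes, variable names and the use of scikit-learn are illustrative assumptions, not the paper's actual pipeline. Micro-averaging flattens the (report, condition) pairs into a single vector before computing the ROC curve and AUC, and the 95% CI is obtained by bootstrapping reports with 1,000 repetitions.

```python
# Sketch: micro-averaged ROC/AUC with a bootstrapped 95% CI (panel a).
# Random placeholder data stands in for the real labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n_reports, n_conditions = 7995, 6  # IND1 test set size, six conditions

# Hypothetical inputs: majority-vote GT labels (0/1) and model scores
# per report and condition.
gt = rng.integers(0, 2, size=(n_reports, n_conditions))
scores = rng.random(size=(n_reports, n_conditions))

# Micro-averaging: pool all (report, condition) pairs into one vector.
y_true, y_score = gt.ravel(), scores.ravel()
fpr, tpr, _ = roc_curve(y_true, y_score)     # points on the ROC curve
auc = roc_auc_score(y_true, y_score)

# Bootstrap the AUC by resampling reports 1,000 times.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n_reports, size=n_reports)
    boot.append(roc_auc_score(gt[idx].ravel(), scores[idx].ravel()))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"micro-averaged AUC = {auc:.3f} (95% CI: {lo:.3f}-{hi:.3f})")
```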
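For panel b, a similarly hedged sketch of Kendall's tau against soft labels, assuming five binary annotations per report for a single condition; the number of annotations, the data and the use of scipy.stats.kendalltau are assumptions for illustration. The soft target is simply the mean of the binary expert labels, that is, the estimated probability that the condition is present.

```python
# Sketch: Kendall's tau between predictions and 'soft' expert labels
# (panel b). Placeholder data; real annotations come from the expert pool.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
n_reports, n_raters = 500, 5  # e.g. five annotations per report

# Hypothetical binary annotations for one condition.
annotations = rng.integers(0, 2, size=(n_reports, n_raters))

# Soft label: average of the binary annotations, i.e. the estimated
# probability that the condition is present.
soft_labels = annotations.mean(axis=1)

# Model (or held-out radiologist) scores for the same condition.
predictions = rng.random(size=n_reports)

tau, p_value = kendalltau(soft_labels, predictions)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3g})")
```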