Table 1 Comparison of automatic report generation metrics on the MIMIC-CXR dataset

From: Collaboration between clinicians and vision–language models in radiology report generation

| Model | Sections | CheXpert F1 (all) | CheXpert F1 (top 5) | RadGraph F1 |
|---|---|---|---|---|
| CXR-RePaiR (ref. 11) | Findings only | – | 0.281 | 0.091 |
| M2 Transformer (ref. 12) | Findings only | – | 0.567 | 0.220 |
| RGRG (ref. 39) | Findings only | 0.447 | 0.547 | – |
| Med-PaLM-M (ref. 22), 12B | Findings only | 0.514 | 0.565 | **0.252** |
| R2Gen (ref. 10) | Findings + Impressions | 0.228 | 0.346 | 0.134 |
| WCT (ref. 14) | Findings + Impressions | 0.294 | – | 0.143 |
| CvT-21DistilGPT2 (ref. 13) | Findings + Impressions | 0.384 | – | 0.154 |
| BioVil-T (ref. 15) | Findings + Impressions | 0.317 | – | – |
| R2GenGPT (ref. 29) | Findings + Impressions | 0.389 | – | – |
| Flamingo-CXR (Ours) | Findings + Impressions | **0.519** | **0.580** | 0.205 |

  1. The clinical metrics for models that generate the ‘Findings’ section only (top) and both the ‘Findings’ and ‘Impressions’ sections (bottom) for MIMIC-CXR radiographs are listed. Flamingo-CXR is trained to generate both ‘Findings’ and ‘Impressions’, and it outperforms the current SoTA method by 33% when compared with other models that also generate both sections. CheXpert F1 (all) denotes the micro-averaged F1 score across all 14 categories of findings, whereas CheXpert F1 (top 5) is the same metric computed over the five most prevalent categories in the MIMIC-CXR dataset (atelectasis, cardiomegaly, edema, consolidation and pleural effusion). All metrics are reported on the preprocessed test set (n = 1,931). For all metrics, higher is better, and the best results are shown in bold. An extended version of this table with NLG metrics is provided in Extended Data Table 2.
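
As a point of reference for the metric definitions in the footnote, the sketch below computes micro-averaged CheXpert F1 over all 14 finding categories and over the five most prevalent categories. This is an illustrative sketch, not the authors' evaluation code: the function name `chexpert_f1`, the category ordering and the random placeholder labels are assumptions, and it presumes that binary CheXpert labels have already been extracted from the reference and generated reports.

```python
# Minimal sketch of the two CheXpert F1 columns above (not the authors' code).
# Assumption: binary presence labels for the 14 CheXpert finding categories have
# already been extracted from the reference and the generated reports (for example
# with the CheXpert or CheXbert labeler, with uncertain/blank labels binarized);
# the label-extraction step itself is not shown.
import numpy as np
from sklearn.metrics import f1_score

CHEXPERT_CATEGORIES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]
TOP_5 = ["Atelectasis", "Cardiomegaly", "Edema", "Consolidation", "Pleural Effusion"]


def chexpert_f1(ref_labels: np.ndarray, gen_labels: np.ndarray, categories=None) -> float:
    """Micro-averaged F1 between reference and generated report labels.

    ref_labels, gen_labels: (n_reports, 14) arrays of 0/1 entries, with columns
    ordered as in CHEXPERT_CATEGORIES. If `categories` is given, the score is
    restricted to those columns (e.g. the five most prevalent findings).
    """
    if categories is not None:
        idx = [CHEXPERT_CATEGORIES.index(c) for c in categories]
        ref_labels, gen_labels = ref_labels[:, idx], gen_labels[:, idx]
    # Micro-averaging pools true/false positives over all reports and categories.
    return f1_score(ref_labels, gen_labels, average="micro")


# Usage with random placeholder labels standing in for the n = 1,931 test reports.
rng = np.random.default_rng(0)
ref = rng.integers(0, 2, size=(1931, len(CHEXPERT_CATEGORIES)))
gen = rng.integers(0, 2, size=(1931, len(CHEXPERT_CATEGORIES)))
print("CheXpert F1 (all):  ", chexpert_f1(ref, gen))
print("CheXpert F1 (top 5):", chexpert_f1(ref, gen, categories=TOP_5))
```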