Table 1 Comparison of automatic report generation metrics on the MIMIC-CXR dataset
From: Collaboration between clinicians and vision–language models in radiology report generation
Model | Sections | CheXpert F1 (all) | CheXpert F1 (top 5) | RadGraph F1
---|---|---|---|---
CXR-RePaiR (ref. 11) | Findings only | 0.281 | – | 0.091
M2 Transformer (ref. 12) | Findings only | – | 0.567 | 0.220
RGRG (ref. 39) | Findings only | 0.447 | 0.547 | –
Med-PaLM M (ref. 22), 12B | Findings only | 0.514 | 0.565 | 0.252
R2Gen (ref. 10) | Findings + Impressions | 0.228 | 0.346 | 0.134
WCT (ref. 14) | Findings + Impressions | 0.294 | – | 0.143
CvT-21DistillGPT2 (ref. 13) | Findings + Impressions | 0.384 | – | 0.154
BioViL-T (ref. 15) | Findings + Impressions | 0.317 | – | –
R2GenGPT (ref. 29) | Findings + Impressions | 0.389 | – | –
Flamingo-CXR (ours) | Findings + Impressions | 0.519 | 0.580 | 0.205

CheXpert F1 (all), CheXpert F1 (top 5) and RadGraph F1 are the clinical metrics; dashes indicate values not reported for that model.
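For readers unfamiliar with the CheXpert F1 columns, the sketch below shows how such a score is typically computed: generated and reference reports are first converted to binary finding labels (for example, with the CheXpert labeler), and the per-class F1 scores are then averaged over either all 14 observation classes ("all") or a 5-class subset ("top 5"). The label lists, macro averaging and function names here are illustrative assumptions, not the exact evaluation code used in the paper.

```python
# Minimal sketch of a CheXpert-style F1 metric, assuming reports have already
# been converted to binary labels per finding (e.g. via the CheXpert labeler).
import numpy as np
from sklearn.metrics import f1_score

# The 14 CheXpert observation classes ("all"); the "top 5" subset is commonly
# taken to be the five CheXpert competition categories (assumption here).
ALL_LABELS = [
    "No Finding", "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity",
    "Lung Lesion", "Edema", "Consolidation", "Pneumonia", "Atelectasis",
    "Pneumothorax", "Pleural Effusion", "Pleural Other", "Fracture",
    "Support Devices",
]
TOP5_LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]


def chexpert_f1(y_true: np.ndarray, y_pred: np.ndarray, labels: list[str]) -> float:
    """Average the per-class F1 over the chosen label subset.

    y_true, y_pred: binary arrays of shape (num_reports, len(ALL_LABELS)),
    holding the labeler output for reference and generated reports.
    """
    idx = [ALL_LABELS.index(label) for label in labels]
    per_class = [f1_score(y_true[:, i], y_pred[:, i], zero_division=0) for i in idx]
    return float(np.mean(per_class))


# Usage with random placeholder labels for 100 report pairs.
rng = np.random.default_rng(0)
y_ref = rng.integers(0, 2, size=(100, len(ALL_LABELS)))
y_gen = rng.integers(0, 2, size=(100, len(ALL_LABELS)))
print("CheXpert F1 (all):  ", chexpert_f1(y_ref, y_gen, ALL_LABELS))
print("CheXpert F1 (top 5):", chexpert_f1(y_ref, y_gen, TOP5_LABELS))
```

The macro average over classes is one common convention; the exact averaging and label handling follow the respective papers being compared.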