Abstract
Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5–15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel clinical settings, we present results showing that CoDoC’s performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
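The core deferral idea can be sketched minimally as follows. This is an illustrative toy, not the published CoDoC method: the function names, the grid search, and the fixed 0.5 AI operating point are assumptions for the sake of the sketch, and the real system learns its deferral policy from a tuning set of (AI confidence score, clinician opinion, ground-truth label) triples.

```python
import numpy as np

def evaluate_deferral(ai_scores, clinician_opinions, labels, low, high):
    """Defer to the clinician when the AI confidence falls inside [low, high];
    otherwise accept the AI's thresholded decision."""
    ai_scores = np.asarray(ai_scores, dtype=float)
    clinician = np.asarray(clinician_opinions, dtype=int)
    labels = np.asarray(labels, dtype=int)

    defer = (ai_scores >= low) & (ai_scores <= high)
    ai_decision = (ai_scores >= 0.5).astype(int)  # illustrative AI operating point
    final = np.where(defer, clinician, ai_decision)

    sensitivity = (final[labels == 1] == 1).mean()
    specificity = (final[labels == 0] == 0).mean()
    deferral_rate = defer.mean()
    return sensitivity, specificity, deferral_rate

def search_thresholds(ai_scores, clinician_opinions, labels, grid=21):
    """Grid-search a deferral band on the tuning set, maximizing
    sensitivity + specificity (a stand-in for the paper's objective)."""
    best, best_score = None, -1.0
    for low in np.linspace(0.0, 1.0, grid):
        for high in np.linspace(low, 1.0, grid):
            sens, spec, _ = evaluate_deferral(
                ai_scores, clinician_opinions, labels, low, high)
            if sens + spec > best_score:
                best_score, best = sens + spec, (low, high)
    return best
```

The point of the sketch is that deferral only helps where the clinician is more reliable than the AI (typically mid-range confidence scores), which is why the band is fit on held-out tuning data rather than chosen a priori.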
Data availability
The mammography datasets from Northwestern Medicine and St. Clair Hospital were used under licenses for the current study and are not publicly available. The tuberculosis datasets from the Stop TB Partnership and icddr,b were used under a license for the current study and are not publicly available. US Mammography Dataset 2 can be requested via email to k.j.geras@nyu.edu for research purposes, and access will be granted within one week. The GitHub repository that hosts the code also contains details on how to obtain access to the data required to reproduce results for the UK mammography dataset, along with the associated timeline. These datasets consist only of the data required to train CoDoC (a database of predictive AI confidence scores, clinician opinion and ground-truth disease label for each case in the tuning/validation/test set). For the UK Mammography Dataset, the images and data used in this publication are derived from the OPTIMAM database (https://pubs.rsna.org/doi/abs/10.1148/ryai.2020200103?journalCode=ai), the creation of which was funded by Cancer Research UK. The full database, including the medical images used to train the predictive AI, can be requested at https://medphys.royalsurrey.nhs.uk/omidb/getting-access/; requests are reviewed by the OPTIMAM steering committee (https://medphys.royalsurrey.nhs.uk/omidb/the-steering-committee/).
Code availability
The code is available at https://github.com/deepmind/codoc.
Acknowledgements
We would like to acknowledge the multiple contributors to this international project: Stop TB Partnership hosted by UNOPS; Cancer Research UK; the OPTIMAM project team and staff at the Royal Surrey County Hospital, who developed the UK Mammography OPTIMAM imaging database; our collaborators at Northwestern Medicine and all members of the Etemadi research group for their continued support of this work; St. Clair Hospital; and B.A. Klepchick, J.M. Andrus, R.J. Schaeffer and J.T. Sullivan. We thank the National Cancer Institute (NCI) for access to NCI data collected by the National Lung Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by the NCI. We also thank L. Peng, D. Webster, U. Telang and D. Belgrave for their valuable feedback and support throughout the course of this project; D. Tran, N. de Freitas and K. Kavukcuoglu for critically reading the manuscript and providing feedback; R. Pilgrim, A. Kiani and J. Rizk for work on partnership formation and engagement; R. May and E. Sutherland Robson for assistance with project coordination; S. Baur and S. Prabhakara for mammography domain expertise; and M. Wilson for early engineering work. The work by S.M., M.B. and N.P. was done at Google DeepMind/Google Research.
Author information
Contributions
K.D., J. Winkens, S.G., N.P., R.S., Y.B., P.K., T.C. and A. Karthikesalingam contributed to study conception and design. J. Witowski, S.M., S.S., M.S., T.S., G.C. and A. Karthikesalingam contributed to data acquisition; K.D., J. Winkens, M.B., S.G., N.P., R.S., M.D., T.S. and T.C. contributed to data analysis; K.D., J. Winkens, M.B., S.G., R.S., C.K., S.M., Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to data interpretation; K.D., J. Winkens, S.G., M.B., N.P., R.S., S.A., L.C., M.D., J.F., A. Kiraly, T.K., S.M., B.M., V.N., S.S., M.S. and T.C. contributed to the creation of new software used in this study; K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., C.K., A. Kiraly, Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to drafting and revising the manuscript; and K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., P.K., T.C. and A. Karthikesalingam contributed to paper organization and team logistics.
Ethics declarations
Competing interests
This study was funded by Google LLC and/or a subsidiary thereof (‘Google’). K.D., J. Winkens, S.G., R.S., P.S., Z.A., S.A., Y.B., L.C., M.D., J.F., C.K., A. Kiraly, T.K., B.M., V.N., S.S., M.S., T.S., G.C., P.K., T.C. and A. Karthikesalingam are employees of Google and own stock as part of the standard compensation package. S.M., M.B. and N.P. are previous employees of Google, N.P. is a current employee of Microsoft and S.M. is a current employee of OpenAI. Z.Z.Q. and J.C. are employees of the Stop TB Partnership and collaborated with Google to support this research effort. K.G. and J. Witowski are employees of the NYU Grossman School of Medicine. K.G. and J. Witowski collaborated with Google to support this research effort.
Peer review information
Nature Medicine thanks Pranav Rajpurkar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Joao Monteiro and Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Additional information on datasets and images from breast examinations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dvijotham, K., Winkens, J., Barsbey, M. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat Med 29, 1814–1820 (2023). https://doi.org/10.1038/s41591-023-02437-x