-
Health AI Developer Foundations
Authors:
Atilla P. Kiraly,
Sebastien Baur,
Kenneth Philbrick,
Fereshteh Mahvar,
Liron Yatziv,
Tiffany Chen,
Bram Sterling,
Nick George,
Fayaz Jamil,
Jing Tang,
Kai Bailey,
Faruk Ahmed,
Akshay Goel,
Abbi Ward,
Lin Yang,
Andrew Sellergren,
Yossi Matias,
Avinatan Hassidim,
Shravya Shetty,
Daniel Golden,
Shekoofeh Azizi,
David F. Steiner,
Yun Liu,
Tim Thelin,
Rory Pilgrim
, et al. (1 additional author not shown)
Abstract:
Robust medical Machine Learning (ML) models have the potential to revolutionize healthcare by accelerating clinical research, improving workflows and outcomes, and producing novel insights or capabilities. Developing such ML models from scratch is cost-prohibitive and requires substantial compute, data, and time (e.g., expert labeling). To address these challenges, we introduce Health AI Developer Foundations (HAI-DEF), a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building ML for health applications. The models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio. These models provide domain-specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs compared to traditional approaches. In addition, we use a common interface and style across these models and prioritize usability so that developers can integrate HAI-DEF efficiently. We present model evaluations across various tasks and conclude with a discussion of their application and evaluation, covering the importance of ensuring efficacy, fairness, and equity. Finally, while HAI-DEF, and specifically the foundation models, lowers the barrier to entry for ML in healthcare, we emphasize the importance of validation with problem- and population-specific data for each desired usage setting. This technical report will be updated over time as more modalities and features are added.
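A minimal sketch of the development pattern HAI-DEF is built around, assuming a hypothetical `encoder` object with an `embed` method standing in for any of the foundation models: compute frozen, domain-specific embeddings once, then train a small classifier on top with limited labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def embed_images(encoder, images):
    # One frozen, domain-specific embedding per image: (n_images, dim).
    return np.stack([encoder.embed(img) for img in images])

def train_linear_probe(encoder, train_images, y_train, val_images, y_val):
    # The encoder is never updated; only the lightweight probe is trained,
    # which is where the data, time, and compute savings come from.
    X_train = embed_images(encoder, train_images)
    X_val = embed_images(encoder, val_images)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_val, probe.predict_proba(X_val)[:, 1])
    return probe, auc
```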
Submitted 26 November, 2024; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Closing the AI generalization gap by adjusting for dermatology condition distribution differences across clinical settings
Authors:
Rajeev V. Rikhye,
Aaron Loh,
Grace Eunhae Hong,
Preeti Singh,
Margaret Ann Smith,
Vijaytha Muralidharan,
Doris Wong,
Rory Sayres,
Michelle Phung,
Nicolas Betancourt,
Bradley Fong,
Rachna Sahasrabudhe,
Khoban Nasim,
Alec Eschholz,
Basil Mustafa,
Jan Freyberg,
Terry Spitz,
Yossi Matias,
Greg S. Corrado,
Katherine Chou,
Dale R. Webster,
Peggy Bui,
Yuan Liu,
Yun Liu,
Justin Ko
, et al. (1 additional author not shown)
Abstract:
Recently, there has been great progress in the ability of artificial intelligence (AI) algorithms to classify dermatological conditions from clinical photographs. However, little is known about the robustness of these algorithms in real-world settings, where several factors can lead to a loss of generalizability. Understanding and overcoming these limitations will permit the development of generalizable AI that can aid in the diagnosis of skin conditions across a variety of clinical settings. In this retrospective study, we demonstrate that differences in skin condition distribution, rather than in demographics or image capture mode, are the main source of errors when an AI algorithm is evaluated on data from a previously unseen source. We demonstrate a series of steps to close this generalization gap, requiring progressively more information about the new source, ranging from its condition distribution to training data enriched for conditions less frequently seen during training. Our results also suggest comparable performance from end-to-end fine-tuning versus fine-tuning solely the classification layer on top of a frozen embedding model. Our approach can inform the adaptation of AI algorithms to new settings, based on the information and resources available.
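The two adaptation regimes compared above can be sketched as follows (PyTorch, illustrative; `backbone` is any pretrained image encoder and the names here are assumptions, not the study's code):

```python
import torch
import torch.nn as nn

def build_classifier(backbone, embedding_dim, num_conditions, freeze_backbone=True):
    # freeze_backbone=True reproduces the head-only regime (classification
    # layer on a frozen embedding model); False is end-to-end fine-tuning.
    for p in backbone.parameters():
        p.requires_grad = not freeze_backbone
    model = nn.Sequential(backbone, nn.Linear(embedding_dim, num_conditions))
    # Only parameters with requires_grad=True are updated, so the same
    # optimizer construction covers both regimes.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    return model, optimizer
```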
Submitted 23 February, 2024;
originally announced February 2024.
-
Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
Authors:
Ryutaro Tanno,
David G. T. Barrett,
Andrew Sellergren,
Sumedh Ghaisas,
Sumanth Dathathri,
Abigail See,
Johannes Welbl,
Karan Singhal,
Shekoofeh Azizi,
Tao Tu,
Mike Schaekermann,
Rhys May,
Roy Lee,
SiWai Man,
Zahra Ahmed,
Sara Mahdavi,
Yossi Matias,
Joelle Barral,
Ali Eslami,
Danielle Belgrave,
Vivek Natarajan,
Shravya Shetty,
Pushmeet Kohli,
Po-Sen Huang,
Alan Karthikesalingam
, et al. (1 additional author not shown)
Abstract:
Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provided detailed evaluations of AI-generated and human-written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground-truth report in over 60% of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human-written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80% of inpatient cases and 60% of intensive care cases.
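The headline preference metric reduces to a few lines; the data layout below (two radiologist preferences per case) is an assumed encoding, not the study's actual format:

```python
def frac_at_least_one_prefers_ai(case_prefs):
    # case_prefs: one (rater_1, rater_2) tuple per case, each entry
    # 'ai' or 'human' depending on which report the radiologist preferred.
    hits = sum(1 for prefs in case_prefs if 'ai' in prefs)
    return hits / len(case_prefs)

example = [('ai', 'human'), ('human', 'human'), ('ai', 'ai')]
print(frac_at_least_one_prefers_ai(example))  # 2 of 3 cases -> ~0.67
```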
Submitted 20 December, 2023; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Domain-specific optimization and diverse evaluation of self-supervised models for histopathology
Authors:
Jeremy Lai,
Faruk Ahmed,
Supriya Vijay,
Tiam Jaroensri,
Jessica Loo,
Saurabh Vyawahare,
Saloni Agarwal,
Fayaz Jamil,
Yossi Matias,
Greg S. Corrado,
Dale R. Webster,
Jonathan Krause,
Yun Liu,
Po-Hsuan Cameron Chen,
Ellery Wulczyn,
David F. Steiner
Abstract:
Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by the availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types, spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods, followed by further evaluation on held-out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks, and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high-quality foundation models to enable further research across diverse applications.
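A sketch of a patch-level linear-probe benchmark of the kind described above, under assumed data structures (`embed` maps a patch to a vector; each task supplies labeled train/validation patches); the paper's actual protocol also covers weakly supervised slide-level tasks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def benchmark_ssl_model(embed, tasks):
    # tasks: {task_name: ((train_patches, train_labels),
    #                     (val_patches, val_labels))}
    results = {}
    for name, ((patches_tr, y_tr), (patches_va, y_va)) in tasks.items():
        X_tr = np.stack([embed(p) for p in patches_tr])
        X_va = np.stack([embed(p) for p in patches_va])
        probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
        results[name] = roc_auc_score(y_va, probe.predict_proba(X_va)[:, 1])
    return results  # one comparable score per tissue/cancer-type task
```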
Submitted 19 October, 2023;
originally announced October 2023.
-
Optimizing Audio Augmentations for Contrastive Learning of Health-Related Acoustic Signals
Authors:
Louis Blankemeier,
Sebastien Baur,
Wei-Hung Weng,
Jake Garrison,
Yossi Matias,
Shruthi Prabhakara,
Diego Ardila,
Zaid Nabulsi
Abstract:
Health-related acoustic signals, such as cough and breathing sounds, are relevant for medical diagnosis and continuous health monitoring. Most existing machine learning approaches for health acoustics are trained and evaluated on specific tasks, limiting their generalizability across various healthcare applications. In this paper, we leverage a self-supervised learning framework, SimCLR with a Slowfast NFNet backbone, for contrastive learning of health acoustics. A crucial aspect of optimizing Slowfast NFNet for this application lies in identifying effective audio augmentations. We conduct an in-depth analysis of various audio augmentation strategies and demonstrate that an appropriate augmentation strategy enhances the performance of the Slowfast NFNet audio encoder across a diverse set of health acoustic tasks. Our findings reveal that when augmentations are combined, they can produce synergistic effects that exceed the benefits seen when each is applied individually.
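An illustrative sketch of how candidate augmentation strategies plug into SimCLR-style training: each strategy is a composition of stochastic transforms that yields two "views" of the same clip as a positive pair. The specific transforms and parameters below are examples, not the tuned configuration from the paper:

```python
import numpy as np

def time_shift(x, max_frac=0.1):
    # Circularly shift the waveform by up to +/-10% of its length.
    k = np.random.randint(-int(len(x) * max_frac), int(len(x) * max_frac) + 1)
    return np.roll(x, k)

def add_noise(x, snr_db=20.0):
    # Mix in Gaussian noise at a target signal-to-noise ratio.
    noise = np.random.randn(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10.0)))
    return x + scale * noise

def compose(*augs):
    def apply(x):
        for aug in augs:
            x = aug(x)
        return x
    return apply

def two_views(clip, augment):
    # A positive pair for the contrastive (SimCLR) loss: two independent
    # stochastic augmentations of the same clip.
    return augment(clip.copy()), augment(clip.copy())

augment = compose(time_shift, add_noise)  # combinations can be synergistic
```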
Submitted 11 September, 2023;
originally announced September 2023.
-
ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders
Authors:
Shawn Xu,
Lin Yang,
Christopher Kelly,
Marcin Sieniek,
Timo Kohlberger,
Martin Ma,
Wei-Hung Weng,
Atilla Kiraly,
Sahar Kazemzadeh,
Zakkai Melamed,
Jungyeon Park,
Patricia Strachan,
Yun Liu,
Chuck Lau,
Preeti Singh,
Christina Chen,
Mozziyar Etemadi,
Sreenivasa Raju Kalidindi,
Yossi Matias,
Katherine Chou,
Greg S. Corrado,
Shravya Shetty,
Daniel Tse,
Shruthi Prabhakara,
Daniel Golden
, et al. (3 additional authors not shown)
Abstract:
In this work, we present an approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, that leverages a language-aligned image encoder combined with, or grafted onto, a fixed LLM, PaLM 2, to perform a broad range of chest X-ray tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) of the training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods, including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
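One way to read the zero-shot classification result is via prompt contrast with language-aligned embeddings; the sketch below assumes hypothetical `embed_image`/`embed_text` encoders and is not ELIXR's exact scoring procedure:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_score(embed_image, embed_text, image, finding):
    # Contrast a positive and a negative text prompt against the image
    # embedding; a higher score favors the finding being present.
    img = embed_image(image)
    pos = embed_text(f"chest x-ray showing {finding}")
    neg = embed_text(f"chest x-ray with no {finding}")
    return cosine(img, pos) - cosine(img, neg)

# Sweeping this score over a labeled set and computing AUC per finding
# yields a zero-shot classification metric of the kind reported above.
```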
Submitted 7 September, 2023; v1 submitted 2 August, 2023;
originally announced August 2023.
-
Using generative AI to investigate medical imagery models and datasets
Authors:
Oran Lang,
Doron Yaya-Stupp,
Ilana Traynis,
Heather Cole-Lewis,
Chloe R. Bennett,
Courtney Lyles,
Charles Lau,
Michal Irani,
Christopher Semturs,
Dale R. Webster,
Greg S. Corrado,
Avinatan Hassidim,
Yossi Matias,
Yun Liu,
Naama Hammel,
Boris Babenko
Abstract:
AI models have shown promise in many medical imaging tasks. However, our ability to explain what signals these models have learned is severely lacking. Explanations are needed to increase trust in AI-based models and could enable novel scientific discovery by uncovering signals in the data that are not yet known to experts. In this paper, we present a method for automatic visual explanations that leverages team-based expertise by generating hypotheses about which visual signals in the images are correlated with the task. We propose the following 4 steps: (i) train a classifier to perform a given task; (ii) train a classifier-guided, StyleGAN-based image generator (StylEx); (iii) automatically detect and visualize the top visual attributes the classifier is sensitive to; and (iv) formulate hypotheses for the underlying mechanisms to stimulate future research. Specifically, we present the discovered attributes to an interdisciplinary panel of experts so that hypotheses can account for social and structural determinants of health. We demonstrate results on eight prediction tasks across three medical imaging modalities: retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples of attributes that capture clinically known features and confounders that arise from factors beyond physiological mechanisms, and reveal a number of physiologically plausible novel attributes. Our approach has the potential to enable researchers to better understand AI-based models, improve their assessment of such models, and extract new knowledge from them. Importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real-world nature of healthcare delivery and socio-cultural factors. Finally, we intend to release code to enable researchers to train their own StylEx models and analyze their predictive tasks.
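Step (iii) can be caricatured as ranking latent style coordinates by their influence on the classifier's output; the sketch below assumes scalar-output `classifier` and `generator` callables and greatly simplifies StylEx's actual attribute-selection mechanism:

```python
import numpy as np

def coordinate_influence(generator, classifier, latents, coord, delta=2.0):
    # Mean change in classifier output when one style coordinate is nudged.
    effects = []
    for w in latents:
        w_shift = w.copy()
        w_shift[coord] += delta
        effects.append(classifier(generator(w_shift)) - classifier(generator(w)))
    return float(np.mean(effects))

def top_attributes(generator, classifier, latents, n_coords, k=10):
    scores = [coordinate_influence(generator, classifier, latents, c)
              for c in range(n_coords)]
    # The most influential coordinates become candidate attributes,
    # visualized as counterfactual image pairs for the expert panel.
    return list(np.argsort(np.abs(scores))[::-1][:k])
```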
Submitted 4 July, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Discovering novel systemic biomarkers in photos of the external eye
Authors:
Boris Babenko,
Ilana Traynis,
Christina Chen,
Preeti Singh,
Akib Uddin,
Jorge Cuadros,
Lauren P. Daskivich,
April Y. Maa,
Ramasamy Kim,
Eugene Yu-Chuan Kang,
Yossi Matias,
Greg S. Corrado,
Lily Peng,
Dale R. Webster,
Christopher Semturs,
Jonathan Krause,
Avinash V. Varadarajan,
Naama Hammel,
Yun Liu
Abstract:
External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate whether external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, the urine ACR); bone & mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles County, CA. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening in 3 independent sites in Los Angeles County, CA, and the greater Atlanta area, GA. We compared against baseline models incorporating available clinicodemographic variables (e.g., age, sex, race/ethnicity, years with diabetes). Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of the DLS exceeded that of the baseline by 5.2-19.4%. On validation sets B and C, with substantial patient population differences compared to the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photos contain important biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact.
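The DLS-versus-baseline comparison boils down to an AUC delta on a held-out set; the sketch below assumes the DLS scores are precomputed and the baseline is a logistic model on clinicodemographic features (data layout illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_gain(dls_val_scores, demo_train, y_train, demo_val, y_val):
    # Baseline: logistic regression on clinicodemographic variables only
    # (e.g., age, sex, years with diabetes), fit on development data.
    baseline = LogisticRegression(max_iter=1000).fit(demo_train, y_train)
    auc_baseline = roc_auc_score(y_val, baseline.predict_proba(demo_val)[:, 1])
    # DLS scores come from the image model; only its validation AUC is needed.
    auc_dls = roc_auc_score(y_val, np.asarray(dls_val_scores))
    return auc_dls - auc_baseline  # positive delta favors the DLS
```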
Submitted 18 July, 2022;
originally announced July 2022.
-
Personalizing ASR for Dysarthric and Accented Speech with Limited Data
Authors:
Joel Shor,
Dotan Emanuel,
Oran Lang,
Omry Tuval,
Michael Brenner,
Julie Cattiau,
Fernando Vieira,
Maeve McNally,
Taylor Charbonneau,
Melissa Nollstadt,
Avinatan Hassidim,
Yossi Matias
Abstract:
Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained on 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, we present and evaluate fine-tuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these two groups, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, down to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 71% of the improvement comes from only 5 minutes of training data. Fine-tuning a particular subset of layers (with many fewer parameters) often gives better results than fine-tuning the entire model. This is the first step towards building state-of-the-art ASR models for dysarthric speech.
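Layer-subset fine-tuning can be sketched as freezing the whole model and unfreezing only a named slice of it (PyTorch, illustrative); the layer-name prefixes below are a hypothetical convention, not the paper's architecture:

```python
import torch

def personalize(model, trainable_prefixes=('encoder.layer.10.', 'encoder.layer.11.')):
    # Freeze everything, then unfreeze only the named layer subset; far
    # fewer parameters are updated than in whole-model fine-tuning.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(trainable_prefixes)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # As little as 5 minutes of a speaker's recordings can drive most of
    # the WER improvement reported above.
    return torch.optim.Adam(trainable, lr=1e-5)
```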
Submitted 31 July, 2019;
originally announced July 2019.