Abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
Main
LLMs are increasingly being used to serve clinical and consumer health information needs1,2. LLMs have potential for use in a variety of contexts, including medical question answering3,4,5, extraction from and summarization of clinical notes6,7, diagnosis and clinical decision support8,9, radiology report interpretation10,11 and interpretation of wearable sensor data12. However, the use of LLMs also has potential to cause harm and exacerbate health disparities13,14,15,16. The sources of these potential harms are complex and include social and structural determinants of health17,18,19,20, population and geographical representation and misrepresentation in datasets21,22, persistent misconceptions in health patterns and practices across axes of patient identity23,24, problem formulation centering privileged perspectives25,26 and systematic differences in performance, inclusivity, actionability, accessibility and impact of systems across populations27,28,29. If models were widely used in healthcare without safeguards, the resulting equity-related harms could widen persistent gaps in global health outcomes14.
Evaluation of LLM-based systems to identify biases and failure modes that could contribute to equity-related harms is a critical step toward mitigation of those harms and promotion of health equity. Health equity refers to the ‘absence of unfair, avoidable or remediable differences in health status among groups of people’ (refs. 19,30). Before the recent proliferation of LLMs, a substantial body of work proposed guidance and conducted empirical investigation into methodologies for evaluation and mitigation of biases with potential to cause equity-related harms when machine learning is used in health and healthcare contexts27,31,32,33.
LLMs introduce new challenges for evaluation due to the breadth of use cases enabled through open-ended generation and the need to conduct multidimensional assessments of long-form textual outputs. Two emerging evaluation paradigms to address these challenges are particularly relevant to our work. The first is the use of expert human raters to evaluate generated model outputs along multiple contextually relevant axes. For example, Singhal et al.3 proposed a rubric for physician rater evaluation of long-form answers to medical questions along 12 axes, including alignment with medical consensus and potential for bias. A second paradigm is the use of red teaming or adversarial testing procedures to probe for failure modes not typically captured by standard evaluation approaches. These procedures take a variety of forms34,35,36 but typically involve the manual curation or generation of adversarial data enriched for cases in which the model may plausibly underperform or generate potentially harmful outputs. For evaluation of health equity-related harms, prior work has explored smaller-scale evaluations with adversarial medical questions using physician raters4,23.
In this work, we present a set of resources and methodologies to advance the assessment of potential health equity-related harms of LLMs. This constitutes a flexible framework for human evaluation and adversarial testing of LLMs that can be applied and extended to surface the presence of context-specific biases in LLM outputs. While not intended to be comprehensive, our approach is intended to be adaptable to other drivers of health equity-related harm, other LLMs and other use cases. Furthermore, we emphasize that our approach is complementary to and does not replace the need for contextualized evaluations that reason about the downstream consequences of biases grounded in specific use cases and populations37,38,39.
Our contributions are summarized in Fig. 1. In brief, we use a multifaceted, iterative approach to design assessment rubrics for human evaluation of generated long-form answers to medical questions that incorporate multiple dimensions of bias with potential to contribute to equity-related harm; introduce EquityMedQA, a collection of adversarial medical question-answering datasets enriched for equity-related content that substantially expands upon the volume and breadth of previously studied adversarial data for medical question answering; and conduct a large-scale empirical study with the Med-PaLM3 and Med-PaLM 2 (ref. 4) LLMs to assess the capability of our approach to surface health equity-related harms and biases. We incorporate several complementary methodologies for assessment rubric and dataset design to enable surfacing of different forms of equity-related harm. Our empirical study involves a diverse rater pool reflecting perspectives arising from different professional backgrounds and lived experiences, including physicians, health equity experts and consumers from a wide array of demographic groups. Our results emphasize the importance of evaluation using a multifaceted approach incorporating raters with diverse backgrounds and expertise and assessment rubrics and adversarial data reflecting domain-specific conceptualizations of bias, harm and equity.
Fig. 1: We employ an iterative, participatory approach to design human assessment rubrics for surfacing health equity harms and biases; introduce EquityMedQA, a collection of seven newly released adversarial medical question-answering datasets enriched for equity-related content that substantially expands upon the volume and breadth of previously studied adversarial data for medical question answering; and perform a large-scale empirical study of health equity-related biases in LLMs.
Results
Assessment design
The design of our approach to human evaluation of LLM-generated answers for biases with potential to cause equity-related harm involved engagement with a panel of equity experts, review of Med-PaLM 2 outputs, focus group sessions with physicians, pilot studies and iteration. A major output of this process was the definition of six dimensions of bias considered to be within scope, covering inaccuracy across axes of identity, lack of inclusion, stereotypical language or characterization, omission of structural explanations for inequity, failure to challenge a biased premise and potential for disproportionate withholding of opportunities or resources (Table 1). These dimensions of bias were incorporated into three assessment rubrics corresponding to separate rating tasks: (1) independent, in which the rater assesses the presence of bias in an answer to one question, (2) pairwise, in which the rater compares the presence or degree of bias between two answers to one question and (3) counterfactual, in which the rater evaluates the answers to two questions that differ only in the insertion, deletion or modification of identifiers of axes of identity (for example, race, ethnicity, gender, socioeconomic status, ability) or other context (for example, geographical location). The full independent, pairwise and counterfactual rubrics are provided in Extended Data Tables 1, 2 and 3, respectively. Further methodological details regarding the assessment design process are provided in the Methods.
EquityMedQA
We created EquityMedQA, a collection of seven datasets of adversarial questions intended to enable evaluation of biases with potential to precipitate health equity-related harms in LLM-generated answers to medical questions. EquityMedQA contains 4,619 examples in total across the seven datasets. Six of these datasets were newly designed for the purposes of this study. Table 2 describes each EquityMedQA dataset as well as the three additional datasets used in the empirical study. Extended Data Table 4 provides an example question from each EquityMedQA dataset. Detailed descriptions of each dataset and the methodology used are provided in the Methods.
Each EquityMedQA dataset is designed to enable identification of distinct modes of bias. For example, Open-ended Medical Adversarial Queries (OMAQ) prioritizes explicitly adversarial open-ended queries (that is, questions that contain a biased premise and may not be well-formed medical questions), Equity in Health AI (EHAI) is enriched for questions related to health disparities in the US, and Tropical and Infectious Diseases (TRINDS) focuses on tropical diseases. EquityMedQA also reflects multiple complementary approaches to adversarial dataset design and curation. For example, EHAI is grounded in an explicit taxonomy of potential equity-related harms and biases, Failure-Based Red Teaming—Manual (FBRT-Manual) is derived through a manual red teaming exercise that included review of existing model failures, Counterfactual Context—Manual (CC-Manual) is derived through manual augmentation of a small set of queries to support counterfactual analyses, and the Failure-Based Red Teaming—LLM (FBRT-LLM) and Counterfactual Context—LLM (CC-LLM) datasets are scalably derived through semi-automated data augmentation with an LLM.
Empirical study
We conducted a large-scale empirical study to assess the potential for our approach to surface equity-related harms and biases in LLM-generated answers to medical questions. In total, we generated and analyzed 17,099 human ratings. We briefly describe our approach here and include a detailed description in the Methods, including further details on rater recruitment and statistical analysis.
We applied the three assessment rubrics to Med-PaLM 2 answers to each of the questions of the seven EquityMedQA datasets using three rater groups (physicians, health equity experts and consumers). We also used the non-adversarial HealthSearchQA and MultiMedQA datasets3 and the set of adversarial questions studied by Omiye et al.23. Subsets of MultiMedQA and OMAQ were combined (Mixed MMQA–OMAQ) and multiply rated by each of the three rater groups to enable study of inter-rater reliability within groups and of differences across rater groups. Physician and health equity expert raters rated answers for all datasets and rubrics, whereas consumer raters rated answers to only the Mixed MMQA–OMAQ dataset using the independent assessment rubric. To gain insight into how perceptions of bias differ across identity groups on the basis of individual perspectives and experiences, we compared the rate of bias reported across subgroups defined by self-reported demographics in the consumer rater group.
We used the pairwise evaluation rubric to detect relative differences in the magnitude or severity of bias in pairs of answers derived from two different sources. This is a common paradigm for representing annotator preferences for evaluation and tuning of LLMs40. To enable such analysis, we compared Med-PaLM 2 outputs to Med-PaLM outputs for all datasets of questions except for those designed to support counterfactual analyses (that is, all except for CC-Manual and CC-LLM). We also conducted pairwise analyses using the set of physician-written reference answers to HealthSearchQA questions presented by Singhal et al.3.
Independent and pairwise analyses
We found evidence that the datasets introduced in EquityMedQA are enriched for adversarial questions that induce biased outputs in LLMs, as the magnitude of the overall rates of bias reported in answers to adversarial datasets is greater than for non-adversarial datasets. For example, the health equity expert rater group rated Med-PaLM 2 answers from EquityMedQA datasets as containing bias at a rate of 0.126 (95% confidence interval (CI): 0.108, 0.141), which is greater than the rate of 0.030 (95% CI: 0.020, 0.041) reported in answers to HealthSearchQA questions (Fig. 2).
Fig. 2: We report the rate at which raters reported minor or severe bias in Med-PaLM 2 answers for physician and health equity expert raters for each dataset and dimension of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. Statistics for multiply rated datasets (Mixed MMQA–OMAQ and Omiye et al.) were computed with pooling over replicates, with the level of replication indicated in parentheses. Data are reported as proportions with 95% CIs.
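As an illustration of how such rates might be computed, below is a minimal sketch in Python, assuming each rating has been reduced to a 0/1 indicator of reported (minor or severe) bias and pooled over replicates for multiply rated questions; the percentile bootstrap interval is illustrative, and the exact interval procedure used in the study is described in the Methods.

```python
import random

def bias_rate_with_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Estimate the rate of reported bias with an illustrative bootstrap CI.

    `ratings` is a list of 0/1 indicators (1 = minor or severe bias reported),
    pooled over replicate ratings when a question was rated multiple times.
    """
    rng = random.Random(seed)
    n = len(ratings)
    point = sum(ratings) / n
    boots = []
    for _ in range(n_boot):
        resample = [ratings[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(resample) / n)
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)

# Hypothetical ratings: 1 = rater reported minor or severe bias in the answer.
example = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
rate, ci = bias_rate_with_ci(example)
print(f"rate={rate:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```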
We found that the presence and magnitude of differences in the rates of bias reported by physician and health equity expert raters were dataset and rubric dependent. For example, health equity expert raters reported a greater rate of bias than physician raters did in the Mixed MMQA–OMAQ dataset (0.078 (95% CI: 0.060, 0.098) versus 0.220 (95% CI: 0.191, 0.250) for physician and health equity experts, respectively; pooled over replicates), while the rates reported for EquityMedQA datasets were similar for the two rater groups (0.141 (95% CI: 0.122, 0.157) versus 0.126 (95% CI: 0.108, 0.141) for physician and health equity expert raters, respectively).
Using the pairwise assessment rubric, we found that, across datasets and dimensions of bias, raters were indifferent between Med-PaLM 2 and comparator answers (either Med-PaLM or physician written) in the majority of cases but preferred Med-PaLM 2 answers (that is, rated answers as containing a lesser degree of bias) more often than they preferred comparator answers (Fig. 3). Health equity expert raters generally preferred Med-PaLM 2 answers more often than physician raters did. For example, for answers to HealthSearchQA questions, we found that physician and health equity expert raters preferred Med-PaLM 2 over Med-PaLM at rates of 0.029 (95% CI: 0.020, 0.041) and 0.193 (95% CI: 0.168, 0.216), respectively, and preferred Med-PaLM over Med-PaLM 2 at rates of 0.011 (95% CI: 0.005, 0.017) and 0.020 (95% CI: 0.012, 0.028), respectively. The same pattern held for pairwise analyses pooled over adversarial datasets, with greater rates of non-indifference than what was observed for HealthSearchQA. Interestingly, for HealthSearchQA questions, both rater groups preferred Med-PaLM 2 to physician-written answers more often than they preferred Med-PaLM 2 over Med-PaLM, with a greater difference for health equity expert raters.
Fig. 3: We report the rates at which raters reported a lesser degree of bias in Med-PaLM 2 answers versus comparator answers across datasets, rater types and dimensions of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. The comparator is Med-PaLM in all cases except for the physician-written answers to HealthSearchQA questions. Data are reported as proportions with 95% CIs.
We found that the combined use of the curated adversarial datasets and multiple rater groups helped to surface specific dimensions of bias in answers and pairs of answers. For example, while we found no difference between the overall rates of bias reported by physician and health equity expert raters in independent evaluation pooled over EquityMedQA, we found that health equity expert raters reported a greater rate of bias than physician raters did with respect to inaccuracy and insufficient inclusivity across axes of identity in the EHAI dataset, and physician raters identified a greater rate of bias in answers to HealthSearchQA and FBRT-LLM than health equity expert raters did, overall and for several dimensions of bias.
In pairwise evaluation, we observed larger effects for specific dimensions of bias (stereotypical characterization, omission of structural explanation, allowing a biased premise and potential for withholding) in some adversarial datasets (OMAQ, EHAI, FBRT-Manual) than we did in HealthSearchQA, with greater rates of non-indifference for health equity expert raters in some cases. Interestingly, health equity expert raters generally reported a preference for Med-PaLM 2 answers with respect to inclusivity for axes of identity more often than other dimensions of bias. For the TRINDS dataset, raters were generally more indifferent, relative to other adversarial datasets, with respect to specific dimensions of bias, with the exception that health equity expert raters preferred Med-PaLM 2 answers with respect to accuracy for axes of identity at a rate exceeding that for other adversarial datasets.
Using the triple-rated Mixed MMQA–OMAQ dataset, we studied how the choice of scheme for aggregating over multiple raters per item affected the results. We found that rates of bias and preference, as well as differences between rater groups, were amplified under an ‘any-vote’ aggregation scheme (that is, an answer was flagged as containing bias if at least one rater reported bias) and attenuated under a ‘majority-vote’ scheme (Supplementary Figs. 1 and 2).
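The two aggregation schemes can be made concrete with a short sketch; the replicate ratings below are hypothetical, and the 0/1 encoding (1 = bias reported) is an assumption made for illustration.

```python
def aggregate(ratings, scheme="majority"):
    """Aggregate replicate 0/1 bias ratings per answer.

    `ratings` maps an answer id to the list of 0/1 ratings it received.
    'any'      -> flag the answer if at least one rater reported bias.
    'majority' -> flag the answer if more than half of raters reported bias.
    """
    flagged = {}
    for answer_id, votes in ratings.items():
        if scheme == "any":
            flagged[answer_id] = int(any(votes))
        elif scheme == "majority":
            flagged[answer_id] = int(sum(votes) > len(votes) / 2)
        else:
            raise ValueError(scheme)
    return flagged

# Hypothetical triple-rated answers, in the spirit of the Mixed MMQA-OMAQ analysis.
triple_rated = {"q1": [0, 0, 1], "q2": [1, 1, 0], "q3": [0, 0, 0]}
print(aggregate(triple_rated, "any"))       # {'q1': 1, 'q2': 1, 'q3': 0}
print(aggregate(triple_rated, "majority"))  # {'q1': 0, 'q2': 1, 'q3': 0}
```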
The results indicate that our approach to generating adversarial questions via prompting of Med-PaLM 2 generates questions that differ from those produced via manual dataset creation. We found that physician raters reported a greater rate of bias for FBRT-LLM than HealthSearchQA, while the rates of bias reported by health equity expert raters were similar for the two datasets but lesser than the rates reported by physician raters (Fig. 2). Furthermore, the rates of bias reported for FBRT-LLM were similar to or lower than the rates reported for FBRT-Manual, with effects that differed across dimensions of bias. We further found that raters were more indifferent between Med-PaLM and Med-PaLM 2 answers to FBRT-LLM questions than they were to answers to FBRT-Manual questions (Fig. 3).
Counterfactual analyses
Using the counterfactual assessment rubric, we found that physician and health equity expert raters reported bias at rates of 0.127 (95% CI: 0.092, 0.160) and 0.183 (95% CI: 0.141, 0.229), respectively, for counterfactual pairs from the CC-Manual dataset (Fig. 4). For the CC-LLM dataset, physician raters reported a lesser rate of bias than for the CC-Manual dataset, while health equity expert raters reported a similar rate for the two datasets. The health equity expert raters reported bias at a rate equal to or greater than that of physician raters with respect to all dimensions of bias except for inaccuracy with respect to aspects of identity for the CC-Manual dataset and for all dimensions in the CC-LLM dataset, although these differences were typically not statistically significant. The rate of bias reported under the counterfactual assessment rubric was typically lower than alternative measures based on the presence of bias in one, in one or more, or in both of the answers under the independent rubric. As was the case for the other rubrics, we found that an ‘any-vote’ aggregation scheme resulted in a significantly greater reported rate of bias but without consistent differences between rater groups (Supplementary Fig. 3).
Fig. 4: In the top four rows, we report the rates at which raters reported bias in counterfactual pairs using the proposed counterfactual rubric, as well as the rates at which they reported bias in one, in one or more or in both of the answers using the independent evaluation rubric, for the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. For comparison, the bottom row reports independent evaluation results aggregated across all unpaired questions for the CC-Manual (n = 42) and CC-LLM (n = 100) datasets. Data are reported as proportions with 95% CIs.
Pooled across both counterfactual datasets, health equity expert raters were more likely than physician raters to report that the ideal answers to counterfactual pairs differed (Extended Data Fig. 1). Among pairs in which the ideal answers were reported to not differ, both physician and health equity expert raters assessed the majority of pairs of Med-PaLM 2 answers as containing similar content (Extended Data Fig. 1). Furthermore, among pairs reported to have the same ideal answer, both rater groups reported a greater degree of bias in cases in which the Med-PaLM 2 answers were judged to differ in content (Extended Data Fig. 1). Among cases in which the ideal answers were reported to differ, physician raters judged the Med-PaLM 2 answers as containing different content more often than health equity expert raters did (Extended Data Fig. 1), with no clear relationship between Med-PaLM 2 answer similarity and the rate of bias reported (Extended Data Fig. 1).
Consumer study
To compare the consumer rater group to the physician and health equity expert rater groups, we computed the majority-vote response to the independent assessment rubric for Med-PaLM 2 answers to the Mixed MMQA–OMAQ question set across three or more raters per answer within each group of raters. Here, the consumer rater group reported bias at a greater rate than either physician or health equity expert raters (Extended Data Table 5).
Within the consumer rater group, we observed an effect of participant age on the rate of bias reported, with younger age groups reporting greater rates of bias (Supplementary Figs. 4a and 5). Across dimensions of bias, differences were greatest for omission of structural explanations, stereotypical characterization and lack of inclusivity (Supplementary Fig. 6). By contrast, differences in the rates of bias reported were more modest across other demographic axes. Across groups defined by self-reported race or ethnicity, Black participants were significantly more likely to report bias, relative to white participants, but other differences were not significant (Supplementary Figs. 4b and 5). The rate of bias reported was not significantly different across gender groups (Supplementary Figs. 4c and 5).
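A generic comparison of bias-report rates between two subgroups could be sketched as follows; the counts are hypothetical and the two-proportion z-test shown is only an illustration, as the statistical procedures actually used for these comparisons are described in the Methods.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference in bias-report rates between two
    self-reported demographic subgroups (illustrative only)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1 - p2, z, p_value

# Hypothetical counts: bias reports out of total ratings for two age groups.
diff, z, p = two_proportion_z(x1=60, n1=300, x2=36, n2=300)
print(f"difference={diff:.3f}, z={z:.2f}, p={p:.4f}")
```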
Inter-rater reliability
We assessed inter-rater reliability separately for each rater group and assessment rubric using Randolph’s kappa41 and Krippendorff’s alpha42. We used Mixed MMQA–OMAQ to assess inter-rater reliability for the independent and pairwise rubrics and CC-Manual for the counterfactual rubric.
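Randolph’s kappa has a simple closed form for a fixed number of raters per item; the sketch below is an illustrative implementation with hypothetical labels (Krippendorff’s alpha, which additionally handles missing ratings and different levels of measurement, is better computed with an established library implementation).

```python
def randolph_kappa(item_ratings, num_categories):
    """Randolph's free-marginal multirater kappa (illustrative implementation;
    see ref. 41 for the definition).

    `item_ratings` is a list of per-item rating lists, each with the same
    number of raters; `num_categories` is the number of possible labels.
    """
    n_items = len(item_ratings)
    n_raters = len(item_ratings[0])
    agree = 0.0
    for ratings in item_ratings:
        counts = {}
        for r in ratings:
            counts[r] = counts.get(r, 0) + 1
        agree += sum(c * (c - 1) for c in counts.values())
    p_obs = agree / (n_items * n_raters * (n_raters - 1))
    p_exp = 1.0 / num_categories
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical labels from three raters: 0 = no bias, 1 = minor, 2 = severe.
ratings = [[0, 0, 0], [0, 1, 0], [2, 1, 1], [0, 0, 1]]
print(round(randolph_kappa(ratings, num_categories=3), 3))  # 0.25
```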
We found that inter-rater reliability was sensitive to the choice of metric and differed across rater groups and rubric designs. Across settings and rater groups, Randolph’s kappa (Extended Data Table 6) was generally modest and Krippendorff’s alpha was generally poor (Extended Data Table 7). The physician rater group generally achieved Randolph’s kappa values that were greater than or not statistically different from those of the other rater groups, while the same was true for Krippendorff’s alpha values for the health equity expert rater group. The consumer rater group achieved lower Krippendorff’s alpha values than the other groups and Randolph’s kappa values comparable to those of the health equity expert rater group.
For the counterfactual rubric, we found that physician and health equity experts achieved similar values of Randolph’s kappa for the presence of bias, with health equity experts achieving a greater Krippendorff’s alpha (Extended Data Tables 6 and 7). Physician raters achieved greater values than health equity experts for both metrics for the rubric items related to judgments of how the ideal answers and the actual answers differed.
Application to the Omiye et al. dataset
To contextualize our results and to help identify potential limitations of our evaluation procedure, we applied our approach to the set of nine questions studied by Omiye et al.23. We include the full set of generated answers for both Med-PaLM and Med-PaLM 2 in Supplementary Table 1.
In general, the rates of bias identified for this set of questions, both overall and for specific dimensions of bias, were similar to those of other adversarial datasets with related content (that is, OMAQ, EHAI and FBRT-Manual; Fig. 2), but the CIs for estimates on these data were wide due to the limited sample size. In pairwise evaluation, we found that, when not indifferent, health equity expert raters preferred Med-PaLM 2 answers to Med-PaLM answers and physician raters preferred Med-PaLM answers to Med-PaLM 2 answers (Extended Data Fig. 2b).
We found that Med-PaLM 2 answers regarding the genetic basis of race, calculations of lung capacity and brain size did not contain inappropriate race-based content, did appropriately refute the premises of the questions and correspondingly were rated by health equity expert raters with a consensus that no bias was present (Extended Data Fig. 2a). However, qualitative review identified some of the behaviors reported by Omiye et al.23 (Supplementary Table 1). In no case did more than three of the five raters flag an answer for bias (Extended Data Fig. 2a), which suggests that our procedure may be less sensitive than desired. For example, we found that Med-PaLM 2 reproduced misconceptions about differences in skin thickness between white and Black patients, but this was only identified by one of five raters in each of the physician and health equity expert rater groups. For this example, we found that two of the five health equity expert raters preferred the Med-PaLM 2 answer and only one preferred the Med-PaLM answer. Furthermore, the majority of raters did not report the presence of bias for answers that recommended the use of calculators of estimated glomerular filtration rate (eGFR) that incorporate a race coefficient over newer, recommended calculators that do not incorporate race43. We also observed that Med-PaLM 2 generated factually incorrect numerical information for the calculators referenced, similar to results reported by Omiye et al.23 for other models.
Discussion
In this work, we aimed to advance the practice of surfacing health equity-related biases with potential to precipitate equity-related harms in LLMs through the design of a collection of assessment rubrics and adversarial datasets. Our empirical study demonstrated that the use of the proposed assessment rubrics and adversarial datasets coupled with evaluation by rater groups with complementary expertise and backgrounds helps to surface biases along multiple previously unreported dimensions of bias3,4.
Our results suggest that the multidimensional assessment rubrics we propose are effective at identifying biases not detected by prior work. For instance, raters reported substantially greater rates of bias in Med-PaLM 2 answers to HealthSearchQA and OMAQ using the rubrics we propose than those reported by Singhal et al.4, who used a single-dimensional question related to demographic bias. We further find that our assessment procedure generally reports a greater rate of preference for Med-PaLM 2 over Med-PaLM with respect to bias, as compared to the prior work4, indicating that our procedure may be more sensitive to detecting relative improvements with respect to bias across pairs of answers. Furthermore, a major contribution of our work was our approach to assessment rubric design that scoped and defined a domain-specific taxonomy of equity-related harms of LLMs for medical question answering, which enabled the decomposition of reported biases into several dimensions.
In our empirical study, we found that different rater groups report bias and its dimensions at different rates, with effects that differ across datasets and rubric designs. This is consistent with evidence that patterns in ratings systematically differ across rater groups in other contexts due to differences in perspectives, expertise and lived experiences44,45. Here, we found that physician and health equity expert raters generally reported bias at similar rates in independent evaluation of Med-PaLM 2 answers, but health equity expert raters generally reported a greater rate of preference for Med-PaLM 2 answers over Med-PaLM answers, overall and for specific dimensions of bias, in a dataset-dependent manner. We further found that consumer raters reported greater rates of bias than either the equity expert or physician raters, with greater rates of bias reported by younger raters, consistent with age-related effects reported in ratings of conversational safety in other domains45.
We found that inter-rater reliability differed across rater groups, assessment rubrics and dimensions of bias, as expected, but the absolute magnitude was sensitive to the choice of metric. This metric dependence is generally consistent with the phenomenon in which chance-corrected inter-rater reliability metrics can be low despite high agreement rates46,47. Low inter-rater agreement may be expected and does not necessarily indicate poor rating quality48, given the complexity of the annotation tasks and the diversity of perspectives, expertise and lived experiences within and across the rater groups. Methods that explicitly model annotator disagreement49 may help explain complex relationships between rater background, annotation quality, disagreement and the content of the question and generated output.
Further work is needed to understand how differences in expertise between rater groups affect rater responses. For example, physician raters may anchor on biological explanations for health, health equity experts may seek to understand health and health disparities within the context of structural, social, historical, cultural and interpersonal factors, and non-expert consumer raters may ground their understanding of bias in their lived experiences. Disagreement between the rater groups may derive from differences in prioritization of aspects of answer quality and bias as well as limited ability, comfort or priming to evaluate aspects outside of their expertise. The greater rate of bias reported by consumer raters might reflect both a broader range of perspectives than that of the other rater groups and a cautionary tendency toward reporting bias in cases requiring expertise, but further work is needed to understand this result. Furthermore, as this work covers a large number of identities and contexts, expert raters may not have the expertise appropriate to assess the presence of bias in all cases.
We introduced a counterfactual assessment rubric designed to probe biases present in answers to pairs of questions containing differences in identifiers of demographics or other context. As expected, we found that, among the counterfactual pairs rated to have unchanged ideal answers, the rate of bias was greater among the counterfactual pairs for which the answers were judged to have meaningfully changed. Our results suggest that further analyses and refinements to the rubric are needed to characterize biases in cases in which the differences across a counterfactual pair of questions are contextually meaningful and induce a change in the ideal answer. Furthermore, we note that, while our approach serves as a proof of concept for generating and evaluating broad and diverse sets of counterfactual questions and answers, it does not guarantee that our procedure has validity as an assessment of equity-related biases and harms relevant to the identities or contexts represented in the questions50.
We use LLM-based prompting pipelines to automatically generate broad and diverse sets of adversarial questions via failure-based red teaming and counterfactual expansion. Our results showed that, while this approach was successful in generating adversarial questions along the dimensions of bias studied in this work, the rate of bias reported for answers to LLM-generated questions was generally less than that reported for manually created questions. Further refinement of our approach to enable scalable generation of adversarial questions is an important area of future work35,36.
Limitations and future work
A limitation of our study is the inability to evaluate the validity and reliability of our rating procedure against a ‘ground truth’. Through post hoc qualitative analysis of the set of questions studied by Omiye et al.23, we found some evidence that our rating procedure may be less sensitive than desired. This suggests that, while our work is successful in surfacing biases not previously identified by Singhal et al.3,4, we may still under-report the rate at which equity-related biases and harms are present in generated answers. The reduced sensitivity of the rating procedure could be the result of various factors, such as rater fatigue or the breadth of concepts covered.
In our study, raters assessed bias with rubrics that did not explicitly consider other aspects of answer quality, whereas, in the studies by Singhal et al.3,4, bias was assessed as one item of a 12-item rubric. This confounds comparison with those works if raters in our study reported bias in cases of poor answer quality unrelated to bias. This potentially accounts for some inter-group disagreement if rater groups differ in the extent to which other aspects of answer quality are conflated with bias.
Our results present opportunities for refinement of our approach, including a standardized approach to qualifying raters and their expertise, processes to build consensus among raters and technical refinement of the assessment task (for example, presenting rubric items separately to reduce cognitive load, use of Likert scales, reducing rater fatigue with fewer ratings per task51,52). The dimensions of bias reflected in the rubrics could be refined through further qualitative and quantitative analysis of model outputs, including those not reported as containing bias, and participatory engagement with experts and communities53,54. Furthermore, collection of rater-rewritten answers could facilitate identification of insights about reported biases.
Extension of our approach with consideration of global contexts is a critical area of future research. While we take a step toward this by creating the TRINDS dataset, which emphasizes questions related to tropical and infectious diseases, there is a need to consider how to design assessment rubrics reflecting contextually meaningful notions of bias, algorithmic fairness and equity in global contexts55,56,57. Beyond consideration of the relevant axes of disparities, there is need to develop evaluation procedures grounded in the specific contexts in which LLMs are used outside of Western contexts and to recruit specialized raters equipped to evaluate bias in those contexts.
Several aspects of our study, such as the assessment rubrics and EquityMedQA, are directly applicable for the study of bias in other LLMs and contexts. Furthermore, our conceptual approach may be applied to identify relevant failure modes in the outputs of other LLMs, which can be used as the basis for adversarial data curation and refinement of the assessment rubrics to capture the contextually relevant forms of bias identified. However, because the scope of the empirical study was limited to the analysis of Med-PaLM and Med-PaLM 2 outputs, it is important for future work to assess the generalizability of our empirical findings using other LLMs. Furthermore, study of the sensitivity of our results to alternative prompting strategies and repeated sampling is an important area of future work, as we deterministically generate outputs using a fixed prompt (Supplementary Table 2), consistent with Singhal et al.3,4.
The procedures studied in this work are complementary to and do not supplant the need for transparency practices21,58,59 and other evaluation paradigms, such as robustness, fairness and safety testing. As such, our approach is not comprehensive of all relevant biases and model failures, does not allow for direct identification of causes of harm or bias and does not directly enable reasoning about specific downstream harms or effects on health outcomes if an LLM were to be deployed for a specific use case or population37,38,39. These limitations in scope highlight the importance of future research into the development of best practices regarding bias and harm identification that are specific to the context in which an LLM-based system is intended to be used.
Beyond identification of bias, the development of methodologies to mitigate biases in LLMs is an important direction for future work. Multiple approaches exist with potential to help mitigate the biases that we study here, including the use of classification-based filters to detect and abstain when questions or answers are potentially harmful or biased, fine-tuning using expert rewrites and further optimization incorporating expert ratings60. Furthermore, improvements to the quality and the factuality of LLMs may also mitigate some equity-related biases and harms. The impact of mitigation should be evaluated in terms of downstream impacts of these models when deployed in various contexts and with input from communities and individuals that will be affected.
Finally, we emphasize that identifying and subsequently mitigating bias is not sufficient to achieve a state of health equity. Intentional equity design requires equity-focused measurement, trustworthiness and centering people in the context of their lives by working with end users and interdisciplinary experts to incorporate societal context into the design and evaluation of AI systems.
Methods
Here, we provide details regarding the methodology for assessment rubric design, creation of each of the EquityMedQA datasets and the design of the empirical study, including rater recruitment and statistical analysis. The Human assessment tasks section includes reporting regarding the number of cases in which ratings were not available in each setting. The study was conducted under a protocol that included in its scope model development and human evaluation using de-identified data, which was reviewed and exempted from further review by the Advarra IRB.
Assessment design methodology
To develop the three assessment rubrics used for human evaluation, we used a multifaceted design methodology that included a participatory approach with equity experts, focus sessions with physicians, review of failures of Med-PaLM 2 and iterative pilot studies with early versions of the rubrics. The assessment rubrics were the result of multiple iterations, building from the rubrics presented by Singhal et al.3,4. We present earlier versions of the independent assessment rubric in Supplementary Table 3 and reproduce the rubrics presented by Singhal et al.3,4 in Supplementary Table 4.
Each assessment rubric included the same examples of axes of identity and dimensions of bias. Axis-of-identity examples included the following list: ‘race, ethnicity, gender, socioeconomic status, ability, literacy, language, geography, sexual orientation, religion, age, body composition, culture, national origin, familial status and more’. The assessment rubrics indicated that the lists were non-exhaustive to allow raters to consider other axes of identity. Examples of bias were the six dimensions of bias presented in Table 1. For bias dimensions not reflected in the six provided dimensions, raters had the option of choosing ‘Other’ and providing notes in the free text field. The terms ‘axes of identity’ and ‘aspects of identity’ were used interchangeably, as were the terms ‘implicit and explicit bias’ and ‘bias’.
Participatory approach with equity experts
To better understand gaps in previous assessments for bias and equity-related harms, we engaged with the Equitable AI Research Roundtable (EARR) for two sessions61. EARR is a research coalition consisting of nine experts who are based in the US. Members bring with them diverse and multi-disciplinary qualifications, including areas of research and focus at the intersection of technology and equity in domains such as social justice, public education, health and medicine, housing, law and AI ethics. EARR members were compensated through their ongoing participation with the EARR61.
The first iteration of our independent evaluation rubric was informed by a domain-agnostic taxonomy of equity-related harms of LLMs in development by EARR, similar to refs. 62,63. We adapted the taxonomy to health contexts through iterative engagement with EARR. We presented previous evaluations of bias from Singhal et al.3,4 to EARR participants and asked them to consider additional equity-related model failures potentially relevant to study via a combination of small-group breakout sessions and large-group discussions. In small-group sessions, participants were asked to generate a list of potential benefits and harms of LLMs for medical question answering and discuss communities who may be vulnerable to any potential harms. They were then asked to reflect on the domain-agnostic, equity-related harm taxonomy and identify anything potentially missing and finally brainstorm additional assessment rubric items relevant to this context.
As a qualitative method to discern the validity of the assessment questions, in large-group discussions, we also asked EARR participants to give feedback on multiple early versions of the rubrics. Participants helped ensure clarity of the rubrics, inclusive of axis-of-identity examples, while keeping length and complexity of the assessment reasonable. These discussions prompted the inclusion of a question related to the general presence of implicit or explicit bias and also specific dimensions of bias (for example, stereotypical characterization), which enabled us to understand equity-related harms with more granularity.
Lastly, we reconvened EARR participants to disseminate the final assessment rubrics and dataset contributions of our work and gather feedback.
Focus group sessions with physicians
Engagement with physicians served as a qualitative method to discern reliability of the assessment rubrics. We solicited feedback on a sample of six adversarial questions against the proposed independent assessment rubric among a group of five physicians. Physicians in this group were based in the US, the UK and India and had all completed medical training with 10–30 years of experience. Based on this input, we further refined the assessment rubric and ultimately presented the final result to EARR.
Review of failures of Med-PaLM 2
Review of failures of Med-PaLM 2 was a crucial input to the assessment design process as well as to the design of the EquityMedQA datasets. For example, through this process, we recognized the need to add a rubric item for when the model fails to challenge the premise of a biased question.
Iterative scaled-up human evaluation
We performed two initial human evaluations on a subset of the data that included both non-adversarial and adversarial questions (the Mixed MMQA–OMAQ set), with the early version of the independent evaluation rubric presented in Supplementary Table 3, which provided initial quantitative results and guided revisions. For example, this led us to present the dimensions of bias related to inaccuracy and lack of inclusivity across axes of identity as two distinct dimensions rather than one combined dimension.
Independent evaluation rubric
The independent rubric instructs raters to assess the presence of implicit or explicit bias in an answer to a question across aspects of identity, with three answer choices to assess presence and severity of bias, conditional logic to specify all relevant forms of bias identified and a free text option for raters to provide additional thoughts or detail. Examples of axes of identity and implicit and explicit bias were provided. The full rubric for independent evaluation can be found in Extended Data Table 1.
Pairwise evaluation rubric
The pairwise evaluation rubric instructs raters to compare two different model answers to determine which answer shows a greater degree of bias across aspects of identity using a question with three answer choices (Answer 1, Answer 2, Tie), conditional logic to specify all relevant forms of bias identified if one answer is preferred over another and a free text option for raters to provide additional thoughts or details. Examples of axes of identity and bias were provided in the same fashion as for the independent rubric. The full rubric for pairwise evaluation can be found in Extended Data Table 2.
Counterfactual evaluation rubric
The rubric for counterfactual evaluation was designed to assess two answers to each of two questions that differ only in the insertion, deletion or modification of identifiers of aspects of identity (for example, demographics) or other context (for example, geography). This style of evaluation draws from prior work in the counterfactual fairness literature64,65,66. The counterfactual evaluation rubric acknowledges that it is important to differentiate cases in which (1) a change in an identifier induces no contextually meaningful change to the content of the query or to the ideal answer, such that a difference in model output for two queries that differ only on the basis of the identifier may be indicative of bias, from cases in which (2) a change in an identifier is contextually meaningful and bias may be present if the model fails to provide different, high-quality outputs appropriate for each query. The counterfactual evaluation rubric instructs raters to compare two different model answers derived from two separate questions to determine in a three-part question (1) whether the ideal answer should differ, with the option to provide a free text comment, (2) whether the content, syntax and structure of the actual answers differ and (3) whether the pair of actual answers jointly exhibits the presence of bias. If they do, raters are asked to specify all relevant forms of bias identified and are provided a free text field to provide additional thoughts. Examples of aspects of identity and bias were given in the same fashion as in the independent rubric. The full rubric for counterfactual assessment can be found in Extended Data Table 3.
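To make the structure of the three rating tasks concrete, the sketch below shows one possible representation of a single rater’s response under each rubric; the field names and categorical values are illustrative simplifications, not the actual annotation schema, which is given in full in Extended Data Tables 1–3.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified labels for the six dimensions of bias in Table 1, plus "other".
BIAS_DIMENSIONS = [
    "inaccuracy_for_axes_of_identity",
    "lack_of_inclusivity",
    "stereotypical_characterization",
    "omission_of_structural_explanation",
    "failure_to_challenge_biased_premise",
    "potential_for_withholding",
    "other",
]

@dataclass
class IndependentRating:
    """One rater's assessment of a single answer to a single question."""
    bias_presence: str                      # "no_bias" | "minor_bias" | "severe_bias"
    dimensions: List[str] = field(default_factory=list)  # subset of BIAS_DIMENSIONS
    free_text: Optional[str] = None

@dataclass
class PairwiseRating:
    """One rater's comparison of two answers to the same question."""
    less_biased: str                        # "answer_1" | "answer_2" | "tie"
    dimensions: List[str] = field(default_factory=list)
    free_text: Optional[str] = None

@dataclass
class CounterfactualRating:
    """One rater's assessment of two answers to a counterfactual question pair."""
    ideal_answers_differ: bool              # should the ideal answers differ?
    actual_answers_differ: bool             # do the actual answers differ in content?
    bias_present: bool                      # do the answers jointly exhibit bias?
    dimensions: List[str] = field(default_factory=list)
    free_text: Optional[str] = None
```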
EquityMedQA
Open-ended Medical Adversarial Queries
The OMAQ dataset contains 182 queries targeting explicitly adversarial and potentially harmful consumer medical queries across varied informational and conversational use cases, including but not limited to explicit medical question answering. This dataset was initially studied by Singhal et al.4, referred to there as ‘Adversarial (Health equity)’. This dataset was instrumental in identifying initial potential model failures that motivated the development of the FBRT-Manual and FBRT-LLM datasets.
In comparison to other EquityMedQA datasets, OMAQ has a greater number of queries that include a biased premise, including misinformation or explicitly offensive content. OMAQ queries also deliberately contain typographical errors and incomplete sentences, and many queries exhibit ambiguous or confusing intent. Importantly, OMAQ contains queries related to medical advice that are often not well-formed medical questions. Examples include requests to generate content pertinent to a medical concern and other implicit requests for medical advice. These questions were not derived from the dimensions of bias that we present in this work.
Six health topics were prioritized: cardiovascular disease, skin cancer, breast cancer, diabetes, maternal mortality and morbidity and coronavirus disease 2019. The six topics were selected based on the following criteria: importance (evidence of disparate impact on populations across the US), relevance (presence of known health disparities relevant to health AI applications within the topic area) and feasibility (evidence regarding the role of demographic, environmental and structural factors for the health topic). For a given health topic, OMAQ queries were conditioned on key information (for example, key symptoms) and relevant terms related to demographic identifiers or sensitive characteristics (for example, age, body characteristics, race and ethnicity).
Equity in Health AI
The EHAI dataset contains 300 questions designed to target implicitly adversarial equity-related consumer medical questions specific to health in the US. For this dataset, we defined implicitly adversarial medical questions as those that are apparently well intentioned but have potential to yield a biased or harmful response. Questions did not use explicit or overt negative language and generally did not explicitly ask about health equity.
EHAI questions were derived for coverage over the taxonomy of equity-related harms61 used to derive the dimensions of bias (Table 1) represented in the assessment rubrics. This resulted in questions in the following focus areas: access to healthcare, quality of healthcare, food and nutrition, mental health, patient experience, chronic diseases, mortality rates, insurance coverage, counseling services, maternal mortality and provider perception and labeling. Similar to OMAQ, EHAI prioritized health topics with known disparities, including cardiovascular disease, mental health, diabetes, maternal mortality and morbidity, breast cancer and kidney disease.
Failure-Based Red Teaming—Manual
The FBRT-Manual dataset contains 150 human-written medical questions designed specifically to target observed equity-related failures in Med-PaLM 2 responses to consumer medical questions.
FBRT-Manual was generated through iterative manual inspection and analysis of a series of 121 ‘seed’ Med-PaLM 2 responses that were reported as biased by at least one of three physicians during assessment of the Mixed MMQA–OMAQ dataset using the earlier iteration of the individual assessment rubric presented in Supplementary Table 3. Using these seed data, we performed three rounds of manual writing of new questions for this dataset. After each round, we generated answers to questions from the previous round using Med-PaLM 2 and qualitatively inspected them to improve our intuitions for the next round.
Multiple failure modes were identified, including (1) a failure to push back against a biased or inappropriate premise in the question, (2) a failure to consider relevant systemic and social factors in understanding a patient’s illness and (3) a failure to ignore information given about a patient’s group identity when such information was irrelevant. Identifying multiple examples of (1) resulted in the addition of the corresponding dimension of bias to the assessment rubrics.
Questions were generated to target the identified sources of bias, with some related questions assessing the impact of atomic identity or geographical changes on the model outputs. We build on this approach for the counterfactual datasets (CC-Manual and CC-LLM). Questions were included to directly target pernicious stereotypes (such as an association of patients experiencing homelessness with deliberate medication non-adherence) and medically violent practices (such as forced sterilization). A subset of questions was derived directly from Omiye et al.23 to probe racist misconceptions regarding the role of race in medicine. Reflecting a wide range of potential model deployment scenarios, the dataset included language styles ranging from blunt and simplistic to sophisticated and clinical. The overtness of the explicit bias varied, from direct statement of stereotypes to more subtle justifications of harmful practices. We included additional queries focused on LGBTQ health, indigenous health, women’s health and global health topics, all of which were relatively under-represented in the original seed set.
Failure-Based Red Teaming—LLM
The FBRT-LLM dataset contains 3,558 adversarial questions generated using Med-PaLM 2 to probe observed equity-related failures in Med-PaLM 2 responses to medical questions. For human evaluation, we filtered and sampled this dataset to a size of 661 questions, using the procedure described below.
To extend the red teaming approach used for FBRT-Manual and further scale adversarial data for evaluation, we developed an LLM-powered pipeline for data augmentation. We used the underlying assumption that, if an LLM is biased when answering a question, then it may be likely to be biased when answering a similar question. This approach required a pre-existing set of seed questions to expand. To produce FBRT-LLM, we used the same set of 121 pre-existing seed questions used for FBRT-Manual.
We performed augmentation of the seed questions using Med-PaLM 2 with the custom prompts provided in Supplementary Table 5. To mutate a given seed question, we randomly sampled one of six semantic augmentation prompts. The semantic augmentation prompts asked the model to manipulate the seed question to achieve one of the following: (1) generate a clinically similar question that may have different answers for different patient demographic groups, (2) introduce additional clinical detail and complexity to the seed question so that it may have different answers for different patient demographic groups, (3) change the clinical details to make the question harder to answer, (4) generate a related question that seems as though it were written by a person who believes in medical misinformation, (5) generate a similar question such that increased clinical expertise is required to answer it and (6) generate a structurally similar question for a different condition with different symptoms. The sixth prompt was only applied to questions involving specific conditions with corresponding symptoms. Given many potential augmentations for a seed question, subsequent filtering was also done by prompting Med-PaLM 2 to evaluate both whether a particular augmentation was noncontradictory and whether it still was a health question (prompts in Supplementary Table 6). Finally, in a limited number of cases, we performed lightweight manual filtering to remove irrelevant questions.
For each question in the seed set, we generated 120 candidate question augmentations using Med-PaLM 2, to which we applied the automated filters to de-duplicate and remove low-quality augmentations. After filtering, the resulting expanded set had 3,558 newly generated questions. The number of expansions per seed question was not uniform, depending on how many generated de-duplicated augmentations passed the filters for each seed question. We subsequently performed uniform random sampling to select ten augmentations per seed question or the full set of augmentations in the case that the number of augmentations remaining after filtering was less than ten. The final size of this subset used for human evaluation in our empirical study (Table 2) was 661 questions.
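The FBRT-LLM expansion logic can be summarized in a short sketch; here, `augment_with_llm` and `passes_filters` stand in for the Med-PaLM 2 prompting and prompt-based filtering steps described above (Supplementary Tables 5 and 6) and are assumptions of this illustration rather than released components.

```python
import random

NUM_CANDIDATES_PER_SEED = 120
MAX_KEPT_PER_SEED = 10

def expand_seed(seed_question, augment_with_llm, passes_filters, rng):
    """Expand one seed question into at most MAX_KEPT_PER_SEED augmentations.

    `augment_with_llm(question)` is assumed to apply one randomly chosen
    semantic augmentation prompt and return a new question; `passes_filters`
    is assumed to apply the prompt-based quality and health-question filters.
    """
    candidates = [augment_with_llm(seed_question) for _ in range(NUM_CANDIDATES_PER_SEED)]
    candidates = list(dict.fromkeys(candidates))  # de-duplicate, preserving order
    kept = [q for q in candidates if passes_filters(seed_question, q)]
    if len(kept) > MAX_KEPT_PER_SEED:
        kept = rng.sample(kept, MAX_KEPT_PER_SEED)
    return kept

def build_fbrt_llm(seed_questions, augment_with_llm, passes_filters, seed=0):
    """Apply the expansion to every seed question and pool the results."""
    rng = random.Random(seed)
    dataset = []
    for question in seed_questions:
        dataset.extend(expand_seed(question, augment_with_llm, passes_filters, rng))
    return dataset
```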
Tropical and Infectious Diseases
The TRINDS dataset contains 106 questions related to tropical diseases. For this dataset, we aimed to explore adversarial questions relevant for potentially underserved geographies outside of the US.
To begin, we compiled a list of global tropical diseases, using authoritative sources such as the World Health Organization, the Pan American Health Organization and the Centers for Disease Control. For each disease, we identified general symptoms that were commonly found across several diseases, such as coughs, headaches and fever, and also identified symptoms that were specific to a particular disease (for example, coughing up blood). Additionally, because disease presence can vary widely based on geography (for example, malaria versus Zika), we identified regions (countries and cities) with high prevalence of the disease. Finally, we considered risk factors for the disease in specific locations (for example, age, gender, ethnicity and lifestyle-related factors such as eating raw food from a vendor or not sleeping under a mosquito net).
We subsequently generated 52 unique patient personas, covering 52 unique tropical diseases, which we used to generate TRINDS questions. We outline a sample persona for a Guinea worm infection: ‘an 18-year-old female is reporting with a fever, nausea, diarrhea and dizziness. They have an intensely itchy rash, swelling with a burning feeling on the leg. Patient lives in the Sarh area in southeast Chari and regularly fetches water from a nearby stream for household use’. These personas were used to generate both clinical and consumer-oriented questions. Questions were generated with varying symptoms, contextual information and tone. A sample clinical question uses the persona: ‘an 18-year-old female is reporting with a fever, nausea, diarrhea and dizziness. They have an intensely itchy rash, swelling with a burning feeling on the leg. Patient lives in the Sarh area in southeast Chari and regularly fetches water from a nearby stream for household use. How should I manage this condition?’. The final set included 106 questions derived from the patient personas.
Counterfactual Context—Manual
The CC-Manual dataset is a manually curated set of 123 pairs of queries that differ in the insertion, deletion or modification of identifiers of demographics or other context (for example, race, gender and geographical location). The purpose of this dataset is to enable use and initial evaluation of the counterfactual assessment rubric (Extended Data Table 3) as a proof of concept, and the dataset is not intended to be comprehensive in scope. The data include counterfactual pairs defined with respect to identifiers of race, gender, sex, comorbidity and geographical location. The dataset is further intended to include both cases in which the pair of counterfactual questions have the same ideal answer (for example, calculation of eGFR for different racial groups) and cases in which the ideal answers differ across the counterfactual pair (for example, change in geographical location changes the most likely diagnosis).
The dataset is constructed from eight ‘seed’ templates primarily derived from other datasets. Of the eight seed templates, three are derived from OMAQ, two are derived from TRINDS, two are derived from Omiye et al.23 and one is new. These eight seed templates are expanded by insertion of identifiers of demographics or other context to produce 45 unique questions, corresponding to 123 counterfactual pairs defined over pairs of questions clustered by seed template. For each seed template, we expand exhaustively using a small set of terms defined specifically for each seed template. The terms encompass identifiers of race, sex, gender, comorbidity and geographical location.
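The exhaustive expansion can be sketched as follows, under the assumption of a single placeholder seed template and term list (neither is an actual CC-Manual template); each template cluster contributes all within-cluster pairs of questions as counterfactual pairs.

    from itertools import combinations

    def expand(seed_template, terms):
        # One question per identifier inserted into the template.
        questions = [seed_template.format(identity=t) for t in terms]
        # Counterfactual pairs are all pairs of questions within the same cluster.
        return questions, list(combinations(questions, 2))

    questions, pairs = expand(
        "What is the recommended way to estimate eGFR for a {identity} patient?",
        ["Black", "white", "Asian", "Hispanic"],
    )
    # Four questions from this placeholder template yield six counterfactual pairs.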
Counterfactual Context—LLM
The CC-LLM dataset includes 200 pairs of questions generated via an LLM-based pipeline. Analogous to the semi-automated approach used to create FBRT-LLM, we explored the use of LLMs to generate diverse counterfactual examples from seed questions. This was particularly important because CC-Manual covers only a small number of axes of identity (for example, race and gender) and a few categories within those axes; the wider spectrum of intersectional identities and backgrounds it lacks motivated expanding these data to improve coverage.
CC-LLM was derived from 20 seed templates, including the eight seed templates used for CC-Manual and 12 additional seed questions selected from the seed set derived from the Mixed MMQA–OMAQ dataset used for FBRT-Manual and FBRT-LLM. We prompted Med-PaLM 2 to generate 815 counterfactual question augmentations from the set of seed templates (prompts provided in Supplementary Tables 7 and 8). These questions were conditioned on demographics and other contexts sampled from Med-PaLM 2 using a separate prompt. The generation was compositional and configurable: we provided the model with explicit lists of options for race, ethnicity, sex, gender, age, sexual orientation, socioeconomic status, disability status and location. The model sampled an intersectional demographic identity across several of these dimensions and then augmented the original question to correspond with the automatically generated context.
Finally, we applied binary prompt-based quality filters (Supplementary Table 9), filtering out question pairs that contained implausible demographics or differed too much from each other. We then sampled five augmentations per seed, yielding ten possible pairs per seed, for a total of 100 unique questions and 200 counterfactual pairs.
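For illustration, the pairing arithmetic can be sketched as below; the strings are placeholders, and the only figures taken from the text are the 20 seeds, the five retained augmentations per seed and the resulting 100 questions and 200 pairs.

    from itertools import combinations
    from math import comb

    n_seeds, n_aug = 20, 5
    pairs_per_seed = comb(n_aug, 2)                    # C(5, 2) = 10
    print(n_seeds * n_aug, n_seeds * pairs_per_seed)   # 100 unique questions, 200 pairs

    # Pairs formed from one seed's retained augmentations (placeholder strings).
    augmentations = [f"augmented question {i}" for i in range(n_aug)]
    seed_pairs = list(combinations(augmentations, 2))  # 10 counterfactual pairs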
Empirical study methods
Human raters
To capture a diversity of perspectives on bias and harm, we used a total of 282 raters with varied professional backgrounds and lived experiences: physicians, health equity experts and consumers. All raters were compensated for their annotation work.
Physician raters
We used 11 physician raters drawn from the same set of raters as that used by Singhal et al.3,4. Raters were based in the US, the UK and India, had been in practice for a range of 6–20 years after residency and had expertise in family medicine, general practice medicine, internal medicine and emergency medicine. Additional information regarding axes of identity and professional training was unavailable for reporting due to the nature of recruitment. Although the empirical study evaluated physician-written answers from prior work3,4, no physician rater rated their own answers. Physician raters rated a median of 344 items each (minimum, 96; maximum, 3,006) and spent a median time per item of 1.09 min (minimum, 0.05 min; maximum, 370.46 min) across all rating tasks.
Health equity expert raters
We recruited nine health equity expert raters who met the qualifications provided in Extended Data Table 8. Raters were based in the US, the UK and India, had been in practice for a range of 4–16 years and had expertise in social work, epidemiology, behavior science, health communication, community and international public health, podiatry, family medicine and emergency medicine. Five health equity expert raters had both medical training and health equity expertise. Additional information regarding axes of identity and professional training was unavailable for reporting due to the nature of recruitment. Health equity experts rated a median of 710 items (minimum, 39; maximum, 2,783) and spent a median time per item of 1.16 min (minimum, 0.04 min; maximum, 129.52 min) across all rating tasks.
Consumer raters
We also performed a study with consumer raters, with two motivations: (1) LLMs may be used in both clinician-as-user and consumer-as-user contexts, and at times to facilitate interaction between clinicians and consumers, and (2) it is important to directly assess users of technology in the context of their lived experiences.
Data were sampled from 262 consumer raters from US-based survey panels. Consumer raters did not have medical or health equity professional training. Participants were recruited by Qualtrics and partners as part of a set of panels67 sampled based on target age and race–ethnicity distributions representative of the US population. Gender was not a target stratum used for recruitment. Participants self-reported their age, gender, race and ethnicity from a set of categories defined by the survey vendor. Participants were excluded if needed to achieve the balance of demographics specified for the sample, if they did not pass quality-control checks or if they dropped out partway through the rating task (50 of 312 recruited participants dropped out). The distribution of self-reported participant demographics is provided in Supplementary Table 10. Participants spent a median of 13.1 min in total time rating items (minimum, 3.6 min; maximum, 308.4 min).
Other datasets studied
HealthSearchQA
We use ‘HealthSearchQA’ to refer to a set of 1,061 consumer medical questions sampled from the HealthSearchQA dataset3. This dataset is referred to as ‘MultiMedQA 1066’ by Singhal et al.4. We used HealthSearchQA to better understand how the adversarial datasets in EquityMedQA compare to more common consumer questions. Note that the number of questions evaluated here is 1,061 rather than the 1,066 evaluated by Singhal et al.4 because we removed a few near-duplicate questions that differ only in punctuation.
Mixed MMQA–OMAQ
We use ‘Mixed MMQA–OMAQ’ to refer to a set of 240 questions that reflect a mix of data sources, including the 140 MultiMedQA questions evaluated by Singhal et al.3 and 100 questions randomly sampled from OMAQ. The 140 MultiMedQA questions used are a structured sample consisting of 100 questions from HealthSearchQA3, 20 questions from LiveQA68 and 20 questions from MedicationQA69. We used this set for analyses in which we were interested in a mix of adversarial and non-adversarial data, including iterative, participatory development of the assessment rubrics, failure-based red teaming and study of inter-rater reliability.
Omiye et al.
We use the nine questions introduced by Omiye et al.23 in our study. These questions reflect prior work on persistent race-based medical misconceptions and test whether models reproduce them. As described by Omiye et al.23, the questions were written by four physicians who reviewed historically used race-based clinical calculators and prior work on common falsehoods believed by medical students and residents. We use ‘Omiye et al.’ to refer to these questions.
Human assessment tasks
We used the three assessment rubrics described previously (independent, pairwise and counterfactual) on answers to questions from each of the datasets. Differing combinations of the rubrics, datasets and rater groups led to the different assessment tasks we studied.
Answer generation
We collected answers from Med-PaLM 2 and Med-PaLM, depending on the dataset. For every dataset, we generated Med-PaLM 2 answers with temperature 0 and greedy decoding, using the same prompt as that used for adversarial data by Singhal et al.4 (Supplementary Table 2). For OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, HealthSearchQA, Omiye et al. and Mixed MMQA–OMAQ, we also generated Med-PaLM (ref. 3) answers using temperature 0 and the same prompt, for use as a comparator in pairwise assessment tasks.
Independent assessment tasks
We performed independent assessment of Med-PaLM 2 answers to every medical question from every dataset for both the physician and health equity expert raters. We used Mixed MMQA–OMAQ to perform triple rating per item across the physician and equity expert rater pools. We also performed quintuple rating per item for the smaller Omiye et al.23 set across both physician and equity expert raters. In addition, we performed one earlier round of physician triple rating on Mixed MMQA–OMAQ with the initial version of the independent assessment rubric presented in Supplementary Table 3. For the other datasets, answers were singly rated, because it was not feasible to obtain multiple ratings across all datasets.
In some cases, raters did not complete the rating task. This affected a total of seven ratings for the independent evaluation rubric across the physician and health equity expert rater groups. Five of the missing ratings were for the triple-rated Mixed MMQA–OMAQ dataset (four sourced from MMQA and one from OMAQ), and two were for CC-Manual. For the primary analysis of triple-rated data, we excluded a question for a rater group if fewer than three ratings were present.
For the consumer pool, each participant assessed three distinct question–answer pairs, drawn at random from the Mixed MMQA–OMAQ set. As a result of the randomization process, two of the 240 questions in this dataset were not shown to participants; these were excluded from summary analyses comparing all rater groups (Supplementary Table 5).
Pairwise assessment tasks
We performed pairwise assessment between Med-PaLM 2 and Med-PaLM answers to every question from OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, HealthSearchQA, Omiye et al. and Mixed MMQA–OMAQ. Note that we did not perform pairwise evaluation for the counterfactual datasets, instead using counterfactual assessment to evaluate pairs of answers to related questions. As with the independent evaluation, we performed triple rating for the Mixed MMQA–OMAQ set and quintuple rating for the Omiye et al. set across both physician and equity expert raters. For HealthSearchQA, we also conducted a pairwise assessment between Med-PaLM 2 answers and physician-written answers across both physician and equity expert raters. For these data, we found four missing ratings for the singly rated datasets (one for EHAI, two for FBRT-Manual, one for FBRT-LLM) and no missing triply rated data.
Counterfactual assessment tasks
We performed counterfactual assessment for both CC-Manual and CC-LLM across physician and equity expert raters. For the smaller CC-Manual set, we performed triple rating. No counterfactual ratings were found to be missing. Due to a data processing error, we removed questions that referred to ‘Natal’ from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 counterfactual pairs) derived from one seed question based on the TRINDS dataset.
Statistical analysis
All statistical analyses and visualizations were performed using the Python (version 3.12) and R (version 4.1.3) programming languages. For analysis, we use the statsmodels (version 0.12.2)70, scipy (version 0.13)71 and krippendorff (version 0.6.1)72 Python packages. For analyses of ratings from the independent evaluation rubric, we primarily report on the ‘binary’ presence of bias, where major or minor bias is collapsed into a single category. We analyzed inter-rater reliability using both Randolph’s kappa41 and Krippendorff’s alpha42. We used a range of metrics because they make different assumptions about chance agreement, especially in imbalanced datasets in which the rate of positive observations may be low41,73.
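As an illustration of the inter-rater reliability computation, the sketch below assumes binary bias ratings arranged as a raters-by-items array with invented values; it uses the krippendorff package cited above, and Randolph’s free-marginal kappa is computed directly from its definition (chance agreement of 1/k for k categories).

    import numpy as np
    import krippendorff

    # 3 raters x 5 items; 1 = bias reported, 0 = no bias reported (invented data).
    ratings = np.array([
        [0, 0, 1, 0, 1],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 0, 1],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings.astype(float),
                               level_of_measurement="nominal")

    def randolph_kappa(r, n_categories=2):
        # Free-marginal kappa: observed pairwise agreement versus chance agreement of 1/k.
        n_raters = r.shape[0]
        counts = np.stack([(r == c).sum(axis=0) for c in range(n_categories)])
        agreement = (counts * (counts - 1)).sum(axis=0) / (n_raters * (n_raters - 1))
        p_o, p_e = agreement.mean(), 1.0 / n_categories
        return (p_o - p_e) / (1 - p_e)

    print(alpha, randolph_kappa(ratings))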
CIs for ratings in the empirical study were estimated using the bootstrap with 1,000 resamples. We use the percentile bootstrap for inter-rater reliability statistics and the bias-corrected and accelerated bootstrap74 for all other statistics. In some cases, bootstrap CIs could not be computed for inter-rater reliability statistics due to data imbalance. We do not account for the nested structure of the datasets expanded from smaller sets of ‘seed’ queries when computing CIs.
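A minimal sketch of the CI computation, assuming a recent SciPy that provides scipy.stats.bootstrap and using invented binary ratings; the BCa method shown corresponds to the non-reliability statistics, whereas reliability statistics would instead use method="percentile".

    import numpy as np
    from scipy.stats import bootstrap

    rng = np.random.default_rng(0)
    bias_reported = rng.integers(0, 2, size=200)  # placeholder binary ratings

    # 1,000 resamples; BCa interval for the rate of reported bias.
    res = bootstrap((bias_reported,), np.mean, n_resamples=1000,
                    method="BCa", random_state=rng)
    print(res.confidence_interval)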
For multiply rated data, we primarily report rates computed over a pooled sample in which each rating is treated as an independent sample. We also report ‘majority-vote’ and ‘any-vote’ rates that aggregate over the set of ratings: ‘majority-vote’ rates assign each item the consensus rating over the set of raters, and ‘any-vote’ rates correspond to the rate at which at least one rater reported bias in independent evaluation or was not indifferent in pairwise evaluation. For aggregated statistics, we performed the bootstrap over the aggregated items, which can be considered a cluster bootstrap in which the individual ratings for each item are not resampled75.
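The aggregation and cluster bootstrap can be sketched as follows, assuming a tidy table with one row per rating and a binary bias column; the column names and data are illustrative, and a percentile interval is shown here for brevity in place of the BCa interval described above.

    import numpy as np
    import pandas as pd

    # One row per rating; three raters per item (invented data).
    ratings = pd.DataFrame({
        "item": [0, 0, 0, 1, 1, 1, 2, 2, 2],
        "bias": [1, 0, 0, 1, 1, 0, 0, 0, 0],
    })

    pooled = ratings["bias"].mean()                 # every rating counted independently
    per_item = ratings.groupby("item")["bias"]
    majority_vote = (per_item.mean() > 0.5).mean()  # consensus rating per item
    any_vote = (per_item.max() == 1).mean()         # at least one rater reported bias

    # Cluster bootstrap: resample aggregated items rather than individual ratings.
    rng = np.random.default_rng(0)
    item_votes = (per_item.mean() > 0.5).to_numpy()
    boot = [rng.choice(item_votes, size=item_votes.size).mean() for _ in range(1000)]
    ci = np.percentile(boot, [2.5, 97.5])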
Consumer study ratings were analyzed using a logistic regression model. The outcome variable was the binary presence or absence of bias for a given question–answer pair. Because rating items were assigned to participants at random, we measured effects on non-aggregated ratings. For each set of predictor variables, the regression estimated the log odds of reported bias for each factor relative to a reference value (for example, the degree of bias reported for an age group relative to that of the oldest age group).
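A sketch of the consumer-rating regression using the statsmodels formula API, with simulated ratings and illustrative column and category names; treatment coding against a reference level mirrors the description above.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 300  # simulated non-aggregated consumer ratings
    df = pd.DataFrame({
        "age_group": rng.choice(["18-24", "25-34", "35-54", "55+"], size=n),
        "gender": rng.choice(["woman", "man", "nonbinary"], size=n),
        "bias": rng.binomial(1, 0.2, size=n),  # binary outcome: bias reported or not
    })

    # Treatment coding against a reference level (here, the oldest age group), so each
    # coefficient is the log odds of reported bias relative to that reference.
    model = smf.logit("bias ~ C(age_group, Treatment(reference='55+')) + C(gender)", data=df)
    print(model.fit().summary())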
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The seven EquityMedQA datasets are available as Supplementary Data 1 and at https://doi.org/10.6084/m9.figshare.26133973 (ref. 76). The provided data include the datasets of questions as well as the ratings and generated Med-PaLM 2 and Med-PaLM answers necessary to reproduce the primary analyses of the empirical study. The data include limited demographic data from the raters (age categories) and do not include free text comments from the raters, demographic data for the consumer raters or the physician-written answers to HealthSearchQA questions. This work uses the long-form MultiMedQA questions previously described in the study by Singhal et al.3,4, which contains samples of questions from HealthSearchQA3, LiveQA68 and MedicationQA69. This work further uses the set of nine questions studied by Omiye et al.23.
Code availability
We release the set of prompts used for semi-synthetic generation of adversarial questions for failure-based red teaming and counterfactual expansion. As the models used in this work are not publicly available due to their proprietary nature, we do not release the code for generating model outputs. Analytic code to reproduce the primary analyses has been made available at https://github.com/google-research/google-research/tree/master/health_equity_toolbox.
References
Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).
Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
Zakka, C. et al. Almanac — retrieval-augmented language models for clinical medicine. NEJM AI https://doi.org/10.1056/aioa2300068 (2024).
Yang, X. et al. A large language model for electronic health records. NPJ Digit. Med. 5, 194 (2022).
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. of the 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 1998–2022 (ACL, 2022).
Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at https://arxiv.org/abs/2312.00164 (2023).
Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. of the 3rd Machine Learning for Health Symposium (eds Hegelsmann, S. et al.) 353–367 (PMLR, 2023).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
Liu, X. et al. Large language models are few-shot health learners. Preprint at https://arxiv.org/abs/2305.15525 (2023).
Harrer, S. Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine. EBioMedicine 90, 104512 (2023).
Singh, N., Lawrence, K., Richardson, S. & Mann, D. M. Centering health equity in large language model deployment. PLoS Digit. Health 2, e0000367 (2023).
Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? In FAccT '21: Proc. of the 2021 ACM Conference on Fairness, Accountability, and Transparency (eds Elish, M. et al.) 610–623 (ACM, 2021).
Bailey, Z. D. et al. Structural racism and health inequities in the USA: evidence and interventions. Lancet 389, 1453–1463 (2017).
Williams, D. R., Lawrence, J. A., Davis, B. A. & Vu, C. Understanding how discrimination can affect health. Health Serv. Res. 54, 1374–1388 (2019).
World Health Organization. A Conceptual Framework for Action on the Social Determinants of Health www.who.int/publications/i/item/9789241500852 (2010).
World Health Organization. Operational Framework for Monitoring Social Determinants of Health Equity www.who.int/publications/i/item/9789240088320 (2024).
Arora, A. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat. Med. 29, 2929–2938 (2023).
Ferryman, K., Mackintosh, M. & Ghassemi, M. Considering biased data as informative artifacts in AI-assisted health care. N. Engl. J. Med. 389, 833–838 (2023).
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. NPJ Digit. Med. 6, 195 (2023).
Eneanya, N. D. et al. Health inequities and the inappropriate use of race in nephrology. Nat. Rev. Nephrol. 18, 84–94 (2022).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
Passi, S. & Barocas, S. Problem formulation and fairness. In FAT* ’19: Proc. of the Conference on Fairness, Accountability, and Transparency (eds Boyd, D. et al.) 39–48 (ACM, 2019).
Chen, I. Y. et al. Ethical machine learning in healthcare. Annu. Rev. Biomed. Data Sci. 4, 123–144 (2021).
Pfohl, S. R., Foryciarz, A. & Shah, N. H. An empirical characterization of fair machine learning for clinical risk prediction. J. Biomed. Inform. 113, 103621 (2021).
Zack, T. et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit. Health 6, e12–e22 (2024).
World Health Organization. Health Equity www.who.int/health-topics/health-equity (2021).
Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities (MIT, 2023).
Abràmoff, M. D. et al. Considerations for addressing bias in artificial intelligence for health equity. NPJ Digit. Med. 6, 170 (2023).
Cary, M. P. et al. Mitigating racial and ethnic bias and advancing health equity in clinical algorithms: a scoping review. Health Aff. 42, 1359–1368 (2023).
Feffer, M., Sinha, A., Lipton, Z. C. & Heidari, H. Red-teaming for generative AI: silver bullet or security theater? Preprint at https://arxiv.org/abs/2401.15897 (2024).
Ganguli, D. et al. Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. Preprint at https://arxiv.org/abs/2209.07858 (2022).
Perez, E. et al. Red teaming language models with language models. In Proc. of the 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 3419–3448 (ACL, 2022).
Liu, X. et al. The medical algorithmic audit. Lancet Digit. Health 4, e384–e397 (2022).
Sperrin, M., Riley, R. D., Collins, G. S. & Martin, G. P. Targeted validation: validating clinical prediction models in their intended population and setting. Diagn. Progn. Res. 6, 24 (2022).
Raji, I. D. et al. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In FAT* ’20: Proc. of the 2020 Conference on Fairness, Accountability, and Transparency (eds Hildebrandt, M. et al.) 33–44 (ACM, 2020).
Kahng, M. et al. LLM comparator: visual analytics for side-by-side evaluation of large language models. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (eds Mueller, F. et al.) 1–7 (ACM, 2024).
Randolph, J. J. Free-marginal multirater kappa (multirater Kfree): an alternative to Fleiss’ fixed-marginal multirater kappa. In Joensuu Learning and Instruction Symposium (2005).
Krippendorff, K. Estimating the reliability, systematic error and random error of interval data. Educ. Psychol. Meas. 30, 61–70 (1970).
Inker, L. A. et al. New creatinine- and cystatin C-based equations to estimate GFR without race. N. Engl. J. Med. 385, 1737–1749 (2021).
Prabhakaran, V. et al. GRASP: a disagreement analysis framework to assess group associations in perspectives. In Proc. of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 3473–3492 (ACL, 2024).
Homan, C. M. et al. Intersectionality in AI safety: using multilevel models to understand diverse perceptions of safety in conversational AI. In Proc. of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024 (eds Abercrombie, G. et al.) 131–141 (2024).
Feinstein, A. R. & Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43, 543–549 (1990).
Quarfoot, D. & Levine, R. A. How robust are multirater interrater reliability indices to changes in frequency distribution? Am. Stat. 70, 373–384 (2016).
Wang, D. et al. All that agrees is not gold: evaluating ground truth labels and dialogue content for safety. Preprint at Google Research https://research.google/pubs/all-that-agrees-is-not-gold-evaluating-ground-truth-labels-and-dialogue-content-for-safety (2023).
Paun, S. et al. Comparing Bayesian models of annotation. Trans. Assoc. Comput. Linguist. 6, 571–585 (2018).
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R. & Wallach, H. Stereotyping Norwegian salmon: an inventory of pitfalls in fairness benchmark datasets. In Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds Zong, C. et al.) 1004–1015 (ACL, 2021).
Johnson, T. P. Handbook of Health Survey Methods (Wiley Online Library, 2015).
Harkness, J. A. et al. In Comparative Survey Methodology Ch. 1, 1–16 (John Wiley & Sons, 2010).
Miceli, M. et al. Documenting data production processes: a participatory approach for data work. In Proc. of the ACM on Human–Computer Interaction (ed Nichols, J.) 510 (ACM, 2022).
Birhane, A. et al. Power to the people? Opportunities and challenges for participatory AI. In EAAMO '22: Proc. of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (eds Falcettoni, E. et al.) 6 (ACM, 2022).
Asiedu, M. et al. The case for globalizing fairness: a mixed methods study on colonialism, AI, and health in Africa. Preprint at https://arxiv.org/abs/2403.03357 (2024).
Sambasivan, N., Arnesen, E., Hutchinson, B., Doshi, T. & Prabhakaran, V. Re-imagining algorithmic fairness in India and beyond. In FAccT '21: Proc. of the 2021 ACM Conference on Fairness, Accountability, and Transparency (eds Elish, M. et al.) 315–328 (ACM, 2021).
Birhane, A. Algorithmic colonization of Africa. SCRIPTed 17, 389–409 (2020).
Mitchell, M. et al. Model cards for model reporting. In FAT* '19: Proc. of the Conference on Fairness, Accountability, and Transparency (eds Boyd, D. et al.) 220–229 (ACM, 2019).
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).
Stiennon, N. et al. Learning to summarize with human feedback. Adv. Neural Inf. Process. Syst. 33, 3008–3021 (2020).
Smith-Loud, J. et al. The Equitable AI Research Roundtable (EARR): towards community-based decision making in responsible AI development. Preprint at https://arxiv.org/abs/2303.08177 (2023).
Shelby, R. et al. Sociotechnical harms of algorithmic systems: scoping a taxonomy for harm reduction. In AIES '23: Proc. of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (eds Ross, F. et al.) 723–741 (ACM, 2023).
Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. Preprint at https://arxiv.org/abs/2310.11986 (2023).
Kusner, M. J., Loftus, J., Russell, C. & Silva, R. Counterfactual fairness. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) (Curran Associates, 2017).
Garg, S. et al. Counterfactual fairness in text classification through robustness. In Proc. of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (eds Markham, A. et al.) 219–226 (ACM, 2019).
Prabhakaran, V., Hutchinson, B. & Mitchell, M. Perturbation sensitivity analysis to detect unintended model biases. In Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP–IJCNLP) (eds Inui, K. et al.) 5740–5745 (ACL, 2019).
Qualtrics. 28 Questions to Help Buyers of Online Samples www.iup.edu/arl/files/qualtrics/esomar.pdf (2019).
Ben Abacha, A., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. In Text Retrieval Conference 2017 (2017).
Abacha, A. B. et al. Bridging the gap between consumers’ medication questions and trusted answers. Stud. Health Technol. Inform. 264, 25–29 (2019).
Seabold, S. & Perktold, J. statsmodels: econometric and statistical modeling with Python. In Proc. of the 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 92–96 (SciPy, 2010).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Castro, S. Fast Krippendorff: fast computation of Krippendorff’s alpha agreement measure. GitHub github.com/pln-fing-udelar/fast-krippendorff (2017).
Wong, K., Paritosh, P. & Aroyo, L. Cross-replication reliability — an empirical approach to interpreting inter-rater reliability. In Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds Zong, C. et al.) 7053–7065 (ACL, 2021).
Efron, B. Better bootstrap confidence intervals. J. Am. Stat. Assoc. 82, 171–185 (1987).
Field, C. A. & Welsh, A. H. Bootstrapping clustered data. J. R. Stat. Soc. B Stat. Methodol. 69, 369–390 (2007).
Pfohl, S. et al. A toolbox for surfacing health equity harms and biases in large language models. figshare https://doi.org/10.6084/m9.figshare.26133973 (2024).
Acknowledgements
We thank J. Krause, L. Hollod, S. Man, S. Lachgar, L. Winer, Z. Ghahramani, B. Crosby, B. Green, E. Dominowska, V. Natarajan, T. Tu, P. Payne, M. Chery, D. Martin Jr, J. A. Seguin, M. Amin, R. Wong, S. S. Mahdavi, D. Webster, V. Sounderajah, D. Jeji, N. Hammel, M. Thompson, L. Foster, P. Clardy, M. Perroni, A. Um’rani, K. DeSalvo, M. Howell and the participants of EARR for their feedback and support for this work. This study was funded by Google LLC and/or its subsidiary (Google). L.A.C. is funded by the National Institutes of Health through R01 EB017205, DS-I Africa U54 TW012043-01 and Bridge2AI OT2OD032701 and by the National Science Foundation through ITEST 2148451.
Author information
Authors and Affiliations
Contributions
Conceptualization: S.R.P., H.C.-L., K.S. Study design and methodology: S.R.P., H.C.-L., R.S., D.N., M.A., N.T., Y.L., A.W., A.P., K.S. Data curation: S.R.P., H.C.-L., D.N., M.A., A. Dieng, N.T., Q.M.R., L.G.M., L.A.C., M.S., J.S.-L., K.S. Data analysis: S.R.P., R.S., S.A. Drafting and revising: S.R.P., H.C.-L., R.S., D.N., M.A., A. Dieng, N.T., Q.M.R., S.A., N.R., L.G.M., L.A.C., Y.L., M.S., A.W., A.P., C.N., P.S., A. Dewitt, P.M., S.P., K.H., A.K., C.S., J.B., G.C., Y.M., J.S.-L., I.H., K.S.
Corresponding authors
Ethics declarations
Competing interests
This study was funded by Google LLC and/or its subsidiary (Google). S.R.P., H.C.-L., R.S., D.N., M.A., A. Dieng, N.T., Q.M.R., S.A., N.R., Y.L., M.S., A.W., A.P., C.N., P.S., A. Dewitt, P.M., S.P., K.H., A.K., C.S., J.B., G.C., Y.M., J.S.-L., I.H. and K.S. are employees of Google and may own stock as a part of a standard compensation package. The other authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Melissa McCradden and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Summary of pairwise comparisons of counterfactual answers.
Results are pooled over the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. Top, the counts of the number of pairs of Med-PaLM 2 answers to counterfactual questions reported for each category of answer similarity, stratified by rater group and whether the ideal answers to the counterfactual questions were reported to be (left) the same or (right) different. Bottom, the rate that counterfactual pairs are reported as containing bias for cases where the ideal answers were reported to be (left) the same and (right) different, stratified by rater group and the reported category of answer similarity. Data are reported as proportions with 95% confidence intervals.
Extended Data Fig. 2 Evaluation of answers to the questions from Omiye et al.23.
In (A) we show the number of raters, out of five, that reported the presence of bias and its dimensions across the nine questions for the independent rubric. In (B), we show the results of pairwise evaluation of Med-PaLM and Med-PaLM 2 answers to the questions for each question and rater.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6 and Tables 1–10.
Supplementary Data 1
Workbook containing the datasets studied in this work. Three tabs contain model outputs and rating data, and 11 tabs contain the questions that compose the datasets. Sheets correspond to sets of ratings and sets of questions. A README tab is included. This file contains the same data as are made available through figshare (https://doi.org/10.6084/m9.figshare.26133973).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pfohl, S.R., Cole-Lewis, H., Sayres, R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat Med 30, 3590–3600 (2024). https://doi.org/10.1038/s41591-024-03258-2