Introduction

Screening mammography has been shown to decrease mortality from breast cancer in multiple long-term prospective trials [1, 2], but screening programs suffer from delays in care following abnormal mammography. These diagnostic delays not only cause undue patient anxiety and postpone treatment, but also exacerbate disparities among racial and ethnic minority groups [3, 4]. For example, one study revealed that Black women were twice as likely as White women to experience a delay in follow-up imaging exceeding 45 days, and such delays were associated with a 1.6-fold increase in breast cancer mortality [4]. Delays have been further compounded by a spike in attendance following the screening slowdown during the COVID-19 pandemic.

Traditionally, most facilities in the United States lack the capacity to provide immediate screening results to their patients and instead interpret mammograms in a “batch” setting, after the patient has left. While more than 85% of cases are found to be normal and do not need follow-up care [5], the remaining patients are asked to return for additional workup due to indeterminate mammographic findings. Previous studies [6,7,8] have demonstrated the benefits of immediate interpretation and same-visit additional diagnostic workup, including reduced patient anxiety, faster diagnosis, improved follow-up adherence, and decreased racial disparities in diagnostic delays. Despite these patient-centered benefits, same-visit workup is not widespread, primarily because it is impractical to offer same-visit interpretation to all patients in order to identify the small percentage who require further imaging.

To help solve challenges in breast cancer screening, healthcare systems have begun looking to artificial intelligence (AI). However, challenges remain in integrating AI into existing clinical workflows to optimize patient care and outcomes. Prior applications of AI in screening mammography have focused primarily on increasing accuracy in computer-aided detection (CAD) or standalone AI interpretation workflows [9,10,11,12], but prospective studies focused on implementation of AI have been scarce [13]. Two recent interventional studies [14, 15] investigated the role of AI in double-reader screening workflows, with a goal of reducing staffing needs and increasing cancer detection rates. Here, we describe a prospective implementation study of AI for triage and its impact on operational outcomes. Our goal was to determine whether real-time AI prioritization can shorten the time to follow-up imaging evaluation and biopsy diagnosis, compared to a standard-of-care workflow. Using AI in this way has the potential to achieve the benefits of immediate interpretation for the small percentage of patients who require additional diagnostic evaluation, while retaining the workflow efficiency of batch screening for the remaining majority.

Materials and methods

Oversight and compliance

This prospective, randomized, unblinded, controlled implementation study was approved by our Institutional Review Board (STU00212646). The study is exempt from National Clinical Trial (NCT) registration as it does not meet all 4 criteria on the clinicaltrials.gov checklist [16].

Participants

Women aged 40–89 years who were scheduled to undergo screening mammography between March 2021 and May 2022 and who met study eligibility criteria were consecutively invited to participate, with informed consent obtained by email, phone, or in person. We obtained patient lists and demographics through the electronic medical record system. We excluded pregnant women, as well as patients with a history of breast cancer, prior mastectomy, or breast implants. In total, 1000 women consented to participate in the study (Fig. 1a). The demographics of the consented cohort, as compared to the institutional and national screening populations [17], can be found in Table 1.

Fig. 1

Summary of study methods. a Participants. b AI prioritization. c Primary metrics TA and TB. TA is the time to additional imaging and TB is the time to biopsy diagnosis. Screening mammograms were assigned BIRADS category 1 (negative), 2 (benign) or 0 (incomplete—additional imaging needed) after radiologist review. After additional diagnostic imaging, a BIRADS category was assigned as 1 (negative), 2 (benign), 3 (probably benign), 4 (suspicious), 5 (highly suggestive of malignancy). BIRADS Breast Imaging Reporting & Data System

Table 1 Study population

AI system

The investigational AI device was based on technology described previously [9] and tuned for the prioritization use case. While the underlying model produces a score between 0 and 1, an operating point (score threshold) was selected in a retrospective population from the same institution to yield a final binary priority categorization (see Online Resource 1-Supplementary Methods for specifics). This AI operating point was held constant throughout the study. See the Online Resource 1-Supplementary Simulations for a post hoc simulation of alternative operating points, which could accommodate a broader set of trade-offs between the fraction of prioritized patients and the algorithm’s sensitivity. The cloud-based device was developed with design controls under an ISO 13485-certified quality management system. Only de-identified mammograms were transferred when invoking the device.
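For illustration, the triage step reduces to thresholding the continuous model score at the chosen operating point. The minimal sketch below makes this concrete; the threshold value and function name are hypothetical placeholders, as the actual operating point is specified only in Online Resource 1.

```python
# Minimal sketch of binary triage from a continuous AI score.
# The threshold below is an assumed placeholder; the study's actual
# operating point was selected retrospectively and held constant.
OPERATING_POINT = 0.30  # hypothetical value, for illustration only

def prioritize(model_score: float) -> str:
    """Map a model score in [0, 1] to the binary priority category."""
    return "Prioritized" if model_score >= OPERATING_POINT else "Not Prioritized"

print(prioritize(0.72))  # -> Prioritized
print(prioritize(0.05))  # -> Not Prioritized
```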

Protocol

Participants were randomly assigned to the control or the experimental group with a standard study software program. In the initial phase of the study (the first 100 participants), a 9:1 assignment ratio in favor of the experimental group was used to accelerate discovery of potential technical or operational issues. A review of operations at the 100-participant mark yielded no corrections to the protocol or technical integration. For the subsequent 900 women, a 1:1 assignment ratio was used.
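As an illustration of the two-phase allocation, the following sketch reproduces the 9:1 and 1:1 ratios described above. The study itself used a standard software program, so this is a reconstruction under stated assumptions, not the actual implementation.

```python
import random

def assign_group(participant_index: int) -> str:
    """Illustrative reconstruction of the two-phase allocation:
    9:1 favoring the experimental group for the first 100 participants,
    then 1:1 for the subsequent 900."""
    threshold = 0.9 if participant_index < 100 else 0.5
    return "experimental" if random.random() < threshold else "control"

random.seed(0)  # for reproducibility of this illustration
groups = [assign_group(i) for i in range(1000)]
print(groups[:5], groups.count("experimental"))
```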

Participants in the control group followed the standard workflow, while those assigned to the experimental group followed the AI-modified workflow (Fig. 2). In both workflows, the 4 standard mammographic views were acquired (Selenia Dimensions, Hologic). Images were interpreted by one of 13 board-certified, fellowship-trained breast radiologists, who assigned a Breast Imaging Reporting and Data System (BIRADS) [18] category to each case. Some participants with high breast density obtained supplemental screening with ultrasound or magnetic resonance imaging (MRI), based on their preference.

Fig. 2

Standard and AI-modified workflow compared. Elimination of the workflow steps depicted in the gray box for patients prioritized by the AI is the primary mechanism driving the hypothesized reduction in diagnostic delays. Screening mammograms were assigned BIRADS category 1 (negative), 2 (benign) or 0 (incomplete—additional imaging needed) after radiologist review. After additional diagnostic imaging, a BIRADS category was assigned as 1 (negative), 2 (benign), 3 (probably benign), 4 (suspicious), 5 (highly suggestive of malignancy). BIRADS Breast Imaging Reporting & Data System

For all participants, the AI yielded a binary categorization (Prioritized, Not Prioritized) to identify cases with a higher risk of malignancy (Fig. 1b, Online Resource 1-Supplementary Fig. 1). The AI prioritization only influenced the workflow of the experimental group participants—in the control group, AI results were analyzed post hoc, only after study completion.

Experimental group participants who were not prioritized by the AI, those who were prioritized but discontinued the AI-modified workflow, and all participants in the control group had their mammograms interpreted according to the standard of care. To decrease the potential for bias, the interpreting radiologists were unaware that these were study participants.

Experimental group participants whose cases were prioritized by the AI were offered the opportunity to remain on site while a radiologist interpreted their mammograms within 30 minutes. The radiologist who performed the immediate interpretation was assigned to diagnostic imaging that day, not screening interpretation, and was responsible for same-visit additional imaging workup if needed. To minimize potential influence on the radiologists, the AI offered no explanation of why a case was prioritized. Participants whose images were deemed normal by the radiologist were immediately informed of their result, while those deemed to need additional imaging were offered same-visit diagnostic imaging. If a biopsy was recommended after diagnostic imaging, it was scheduled for a later date.

Operational endpoints

Primary operational endpoints were the time from screening examination to completion of additional imaging workup (TA) and the time from screening examination to biopsy diagnosis (TB) (Fig. 1c).
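Concretely, both endpoints are day counts anchored at the screening examination; a minimal example with hypothetical dates:

```python
from datetime import date

# Hypothetical visit dates for one participant (illustration only).
screening = date(2021, 6, 1)
additional_imaging = date(2021, 6, 25)
biopsy_diagnosis = date(2021, 7, 20)

TA = (additional_imaging - screening).days  # time to additional imaging
TB = (biopsy_diagnosis - screening).days    # time to biopsy diagnosis
print(TA, TB)  # -> 24 49
```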

Exploratory analyses included: individual TA and TB values in the subset of participants ultimately diagnosed with breast cancer, radiologist recall rates (proportion of screening examinations with a recommendation for additional imaging), cancer detection rates (number of cancers detected per 1000 women), and performance of AI prioritization on participants requiring additional imaging or tissue diagnosis.

Statistical analysis

A sample size of 1000 was selected to allow comparison with standard breast radiology metrics and to assess the implementation of the AI-modified workflow. One-sided Mann-Whitney U tests were employed for TA and TB to test the null hypothesis that the experimental and control samples originated from the same population, against the alternative hypothesis that TA and TB were shorter in the experimental group. The t-test and other tests assuming normality were not used because time durations were not expected to be Gaussian. A p-value of less than 0.025 was considered significant, reflecting the Holm-Bonferroni correction for multiple hypothesis testing.
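A sketch of this test in Python follows; the TA values shown are synthetic, not study data.

```python
from scipy.stats import mannwhitneyu

# Synthetic TA values in days (illustration only, not study data).
ta_experimental = [0, 0, 4, 9, 14, 20, 28]
ta_control = [13, 22, 26, 29, 35, 48, 60]

# alternative="less" tests whether experimental times are stochastically
# shorter than control times (the one-sided alternative described above).
stat, p = mannwhitneyu(ta_experimental, ta_control, alternative="less")

# Holm-Bonferroni step-down over the two primary endpoints (m = 2):
# the smaller of the two p-values is compared against alpha / 2 = 0.025.
print(f"U = {stat:.1f}, p = {p:.4f}, significant at 0.025: {p < 0.025}")
```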

In addition, we estimated confidence intervals for TA and TB separately for the experimental and control groups by bootstrapping with 9999 iterations [19]. Similarly, we estimated one-sided confidence intervals, using the same bootstrapping procedure, for the differences in TA and TB between the experimental and control samples.
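A percentile-bootstrap sketch of the confidence interval for the mean, again on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci_mean(sample, n_iter=9999, alpha=0.05):
    """Percentile bootstrap CI for the mean, mirroring the 9999
    iterations described above (illustrative sketch)."""
    sample = np.asarray(sample, dtype=float)
    means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(n_iter)
    ])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

ta_control = [13, 22, 26, 29, 35, 48, 60]  # synthetic values, days
print(bootstrap_ci_mean(ta_control))
```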

To study a potential effect of AI prioritization on radiologists’ recall rates, we applied a two-sample proportions test [20] to the recall rates of the experimental and control groups.
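Using the recall counts later reported in Results (79/463 experimental vs 56/392 control), this test can be sketched as below. Note that the exact p-value depends on the test variant (e.g., whether a continuity correction is applied), so the output may differ slightly from the reported value.

```python
from statsmodels.stats.proportion import proportions_ztest

# Recall counts from Results: 79/463 (experimental) vs 56/392 (control).
count = [79, 56]    # recalled cases per group
nobs = [463, 392]   # screened participants per group

z, p = proportions_ztest(count, nobs)  # two-sided, pooled-variance z-test
print(f"z = {z:.2f}, p = {p:.2f}")
```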

Results

Participant cohorts and workflow

Of the 1000 participants who consented, 15% (145/1000) were excluded (CONSORT diagram, Online Resource 1-Supplementary Fig. 2); the final cohort for analysis therefore consisted of 855 participants. Of these, 46% (392/855) were randomized into the control group and followed the standard-of-care workflow, while 54% (463/855) were randomized into the experimental group and followed the AI-modified workflow. Patient characteristics were similar between the two groups (Table 1).

In the experimental group, screening mammograms from 72% (332/463) of participants were not prioritized by the AI and were subsequently included in the standard worklist for blinded radiologist review. The remaining 28% (131/463) were prioritized by the AI. For 6% (8/131) of AI-prioritized cases, a radiologist was not available to perform an interpretation within 30 minutes; these participants obtained their final screening results by letter as usual (all had negative or benign findings). For the remaining 94% (123/131) of AI-prioritized cases, a radiologist was available to perform interpretation within 30 minutes. This immediate radiologist interpretation yielded normal or benign results in 73% (90/123) of instances, and these results were communicated to participants before they left the clinic. For the remaining 27% (33/123) of AI-prioritized participants, the immediate radiologist interpretation yielded a recommendation for additional imaging, which was communicated to participants during the same visit. In 15% (5/33) of cases, same-visit additional imaging was not offered for operational reasons (e.g., an unavailable technologist); these participants obtained additional imaging at a later date. Same-visit additional imaging was offered to the remaining 85% (28/33) of participants with indeterminate findings; 14% (4/28) declined the offer and obtained additional imaging at a later date, while 86% (24/28) accepted and obtained same-visit additional imaging, which was completed within 2 hours of their initial screening examination.

Diagnostic outcomes after screening

Diagnostic outcomes after mammography screening, additional imaging workup, and pathology analysis were collected until the study cut-off date of September 6, 2022, allowing for at least 3 months of follow-up after each screening visit (Table 2). Among the 855 participants in the final cohort, 16% (135/855) received a radiologist recommendation for additional imaging. Of these, 24% (33/135) subsequently received a radiologist recommendation for tissue biopsy, and 18% (6/33) of biopsied women were ultimately diagnosed with cancer after tissue pathology analysis. For the 6 participants diagnosed with cancer as a result of mammography screening, final pathology yielded 2 invasive ductal carcinomas (IDC) and 4 cases of ductal carcinoma in situ (DCIS).

Table 2 Diagnostic outcomes of included screening examinations

Three participants (two control, one experimental) with normal mammography screening obtained tissue diagnosis after undergoing additional screening with MRI. Two of the biopsied findings were benign, while one (control group) resulted in a DCIS detected only on MRI and not on mammography (Online Resource 1-Supplementary Fig. 3).

Impact of the AI-modified workflow

The AI-modified workflow resulted in significantly shortened diagnostic delays (Fig. 3). In the control group, the mean TA was 25.6 days [95% CIs: 22.0–29.9] and the mean TB was 55.9 days [95% CIs: 45.5–69.6]. In comparison, mean TA in the experimental group was reduced by 25% to 19.1 days (ΔTA=−6.4 days, upper limit of one-sided 95% CI = −0.3, p<0.001), while the mean TB was reduced by 30% to 39.2 days (ΔTB=−16.8 days, upper limit of one-sided 95% CI = −5.1, p=0.003). Similar reductions were observed when comparing medians (Fig. 3). Times were further shortened for AI-prioritized participants in the experimental group: the mean TA was 86% shorter (3.5 days [95% CI 1.0–7.6]) and the mean TB was 46% shorter (30.2 days [95% CI 25.1–35.3]). This gain was primarily attributable to the fact that 73% (24/33) of AI-prioritized participants who needed additional imaging obtained it during the same visit as their screening exam.

Fig. 3

AI-modified workflow results. All values are intervals relative to the time of screening. TA is the time to additional imaging and TB is the time to biopsy diagnosis. a [i] The mean TA with 95% CIs of the mean is shown in the control group, in the experimental group overall, and in the subsets of AI-prioritized and not prioritized participants in the experimental group. Bootstrapped effect size estimate is shown. [ii] The median TA with 95% CIs of the median is shown in the control group, in the experimental group overall, and in the subsets of AI-prioritized and not prioritized participants in the experimental group. In the control group, the median TA was 22.0 days [95% CIs 21.0–27.9] vs 14.0 days [95% CIs 7.2–20.0] in the experimental group. The p-value for the Mann-Whitney U test is shown. [iii] Individual TA data values are shown for participants ultimately diagnosed with breast cancer, in the control group (participants A, B, C) and in the experimental group (participants D, E, F, all AI-prioritized). b [i] The mean TB with 95% CIs of the mean is shown in the control group, in the experimental group overall, and in the subsets of AI-prioritized and not prioritized participants in the experimental group. Bootstrapped effect size estimate is shown. [ii] The median TB with 95% CIs of the median is shown in the control group, in the experimental group overall, and in the subsets of AI-prioritized and not prioritized participants in the experimental group. The median TB was 49.0 days [95% CIs 39.2–58.1] in the control group vs 34.7 days [95% CIs 28.1–45.0] in the experimental group. The p-value for the Mann-Whitney U test is shown. [iii] Individual TB data values are shown for participants ultimately diagnosed with breast cancer, in the control group (participants A, B, C) and in the experimental group (participants D, E, F, all AI-prioritized)

The reduction in diagnostic delays experienced by participants ultimately diagnosed with breast cancer was especially marked. The three cancer patients in the control group obtained additional imaging at 13, 26 and 29 days, and tissue diagnosis at 39, 57 and 58 days, respectively. In comparison, the three cancer patients in the experimental group obtained additional imaging at 0, 0 and 4 days, and tissue diagnosis at 12, 27 and 33 days, respectively.

Radiologists recalled 17.0% (79/463, [95% CI 13.8–20.8]) of cases for additional imaging in the experimental group, compared to 14.3% (56/392, [95% CI 11.0–18.1]) in the control group. This difference was not statistically significant (p=0.36). The radiologist cancer detection rate was 6.5 per 1000 (3 cancers detected in 463 participants) in the experimental group, and 7.7 per 1000 (3 cancers detected in 392 participants) in the control group.

The performance of AI prioritization relative to screening outcomes was analyzed for the experimental and control groups separately, and as a combined measure. For the control group, AI results were computed for the purpose of post hoc data analysis only (Online Resource 1-Supplementary Table 1). In the whole study cohort, the AI prioritized 29% (245/855) of participants, 39% (53/135) of those subsequently recommended by the radiologist for additional imaging, 64% (21/33) of those recommended for tissue biopsy, and 100% (6/6) of cancers. Similar proportions were measured when considering the experimental and control groups separately (Online Resource 1-Supplementary Table 2). In addition to the 6 cancers diagnosed as a result of mammography screening, the AI software identified one additional cancer in the control group; this AI finding was not presented to the radiologist (per the protocol for the control group), and the case was interpreted by the radiologist as negative. This cancer was later detected on supplemental MRI screening (Online Resource 1-Supplementary Fig. 3).

Discussion

In an environment burdened with staffing shortages and confounded by the post-COVID backlog, we have implemented an AI-modified workflow to triage breast cancer screening mammograms, resulting in a streamlined patient journey and significantly reduced time to diagnostic imaging and biopsy diagnosis. Importantly, all participants ultimately diagnosed with breast cancer were prioritized by AI and those in the experimental group obtained additional imaging within 4 days.

Fundamentally, this AI implementation approach leverages the fact that relatively few patients will require diagnostic workup post-screening and even fewer will be diagnosed with cancer. The AI serves to selectively identify those who may benefit most from immediate interpretation and does so more accurately than if radiologists had randomly selected a similar proportion of cases for prioritized review. To demonstrate this concept, we conducted a post hoc simulation comparing AI-based and random prioritization (Online Resource 1-Supplementary Simulations). The simulation showed that AI prioritization identified a substantially higher proportion of biopsies and cancers. Thus, the observed reduction in diagnostic delays is likely due to the AI-modified workflow, rather than to immediate reading only.
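A simplified Monte Carlo version of that comparison, using the whole-cohort counts reported in Results (245 of 855 participants prioritized, 6 cancers, all 6 prioritized by the AI), is sketched below; the study's actual simulation is described in Online Resource 1.

```python
import random

# Monte Carlo sketch: how many of the 6 cancers would random
# prioritization of 245 of 855 participants catch, on average?
N, N_PRIORITIZED = 855, 245
CANCER_IDS = set(range(6))  # label the 6 cancer cases as ids 0..5

random.seed(0)
trials = 10_000
caught = [
    len(CANCER_IDS & set(random.sample(range(N), N_PRIORITIZED)))
    for _ in range(trials)
]
print(f"random: {sum(caught) / trials:.2f}/6 cancers prioritized on average "
      f"(expected {6 * N_PRIORITIZED / N:.2f}); the AI prioritized 6/6")
```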

The importance of studying real-world implementations should be emphasized. For example, while retrospective reader studies suggested that CAD improved cancer detection, data gathered by Lehman et al. [21] demonstrated no overall improvement in cancer detection once CAD was implemented. We similarly identified implementation challenges and insights that would not have been apparent in a retrospective setting.

First, the proportion of cases prioritized by AI for immediate radiologist review was 29%, much higher than the proportion anticipated based on retrospective testing (14%). The underlying reasons for this discrepancy are not clear. This implementation insight shows that the retrospective performance of AI may not translate to a real-world setting. For the triage use case, this meant that considerably more immediate screening interpretations were needed than initially anticipated. To address the diverse operational capacities of different clinical settings, the operating point of an AI triage system can be adjusted. Use of a more specific AI operating point (i.e., one with a more stringent threshold) could prioritize fewer examinations and thus mitigate excessive disruptions. We explored this concept in a post hoc simulation, finding that a more specific operating point that prioritized only 10% of the participants would still have selected 2 of 3 experimental group cancers (Online Resource 1-Supplementary Simulations), suggesting institutions could calibrate the AI’s prioritization rate to match available resources.
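The trade-off behind such calibration can be illustrated by sweeping operating points over a score distribution. The sketch below uses simulated scores and prevalence (both are assumptions for illustration; the study's simulation used actual retrospective scores, per Online Resource 1).

```python
import numpy as np

# Simulated scores: a large low-scoring benign population and a few
# higher-scoring cancers (distributions are illustrative assumptions).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(2, 8, 994), rng.beta(6, 3, 6)])
is_cancer = np.concatenate([np.zeros(994, bool), np.ones(6, bool)])

# Stricter thresholds prioritize fewer cases but may lower sensitivity.
for threshold in (0.2, 0.4, 0.6):
    flagged = scores >= threshold
    print(f"threshold {threshold:.1f}: {flagged.mean():.0%} prioritized, "
          f"sensitivity {is_cancer[flagged].sum() / is_cancer.sum():.0%}")
```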

A second implementation insight was that, despite the study design’s intention to prevent AI from influencing radiologist interpretation, qualitative feedback highlighted the discomfort radiologists felt when interpreting AI-prioritized cases as normal, especially initially. However, after multiple rounds of exposure to AI prioritization, radiologists realized that relying on their own expertise remained equally important. Implementation of AI explainability features beyond a binary prioritization, such as regions of interest or confidence scores, could improve AI-radiologist collaboration and decrease bias.

Finally, immediate radiologist review and same-visit additional imaging can be disruptive to traditional clinical workflows and may not be feasible for all clinical practices. While introducing a small number of ad hoc AI-prioritized interpretations can be less efficient than batch reading, other efficiency gains from same-visit workup are likely (no re-interpretation ahead of diagnostic workup, less time spent scheduling and checking in the patient, etc.), so further study of this trade-off is necessary. Consistent with previous publications, not all participants who were offered same-visit additional imaging accepted, which introduced additional workflow variation. Additionally, email notification of an AI-prioritized case is likely only feasible in the research setting. Methods to incorporate prioritized cases directly into a radiologist’s worklist and provide real-time notification would require collaboration with information technology (IT) resources, electronic medical records, and picture archiving and communication systems (PACS). Several commercially available AI systems already prioritize cases based on complexity, so real-time AI prioritization and notification could potentially be integrated into modern systems.

This investigation has limitations. Although our sample size of 1000 participants is larger than the median of 294 seen in AI healthcare trials [22], it is insufficient for robust subgroup analysis, particularly for cancers, given their low prevalence in the screening population. Additionally, the study population represents only a small proportion of the total screening population at our institution, limiting the generalizability of this AI implementation to full clinical practice.

The study was aimed at assessing an AI implementation strategy and its impact on operational outcomes, and as such was not powered to measure diagnostic accuracy of the modified workflow. Moreover, it did not directly evaluate the benefits of the AI-modified workflow in terms of reducing patient anxiety or addressing racial disparities. Instead, we rely on previously published work showing that immediate mammography interpretation and same-visit workup are associated with improved patient experience and reduced inequities in timely follow-up. Finally, this study was conducted at a single site using one AI model, and therefore the generalizability of this implementation strategy to other screening environments remains unknown.

In conclusion, our implementation study prospectively demonstrated an AI-modified workflow that was attainable in clinical practice and yielded a statistically significantly shorter time to additional imaging and biopsy diagnosis for patients undergoing screening mammography, compared to the standard-of-care workflow. The broader benefits of reducing such diagnostic delays include improved patient adherence, decreased anxiety, and reduced disparities in access to timely care. Additionally, introducing AI in lower-resource settings, where patients are often lost to follow-up care, could further amplify the benefit of capturing patients with concerning imaging findings while they are still in the breast center. Consequently, this demonstration is an early but important step towards identifying an AI implementation strategy that can improve the efficiency and timeliness of breast cancer screening. Further implementation studies in diverse clinical settings are needed to assess generalizability and the potential impact on patient outcomes.