Introduction

Despite significant progress in classification and treatment over the past two decades, breast cancer (BC) remains the leading cause of cancer death for women worldwide1. Proposing an optimal therapeutic strategy to each patient requires systematic and accurate characterization of each disease. Specifically, estrogen receptor-positive (ER+), HER2-negative (HER2−) invasive BC, which accounts for approximately 70% of all invasive BC, is associated with a wide spectrum of outcomes and treatment requirements. For many of these women, a key question remains whether adjuvant chemotherapy with the burden of acute side effects and the potential long-term persistent quality of life (QoL) deterioration2 can be safely avoided. Furthermore, women with a predicted high risk of metastatic relapse despite current standard treatment could be offered more intensive or extended adjuvant strategies, including the addition of a CDK4/6 inhibitor3,4.

Prognosis definition has been traditionally based on clinical and histopathological factors, such as the patient’s age and the histological classification and grade. Biomarker assessment, mainly by immunohistochemistry (ER, progesterone receptor (PR), HER2, and the proliferation marker KI67), was added to this estimation and later refined with the inclusion of molecular signatures. Results from the TAILORx trial showed that Oncotype DX®, a 21-gene test that predicts 10-year metastasis-free survival (MFS), could help avoid unnecessary chemotherapy in up to 85% of women with early-stage ER+/HER2− node-negative BC, without affecting outcomes5,6,7. Currently, several gene expression signatures assessed on the primary tumor material are endorsed by international guidelines to support clinicians in refining the prognosis of patients with early BC (EBC) and taking adjuvant treatment decisions8. Besides this molecular characterization, prognostic tools using classical factors and embedded into publicly available websites may be used as an aid in clinical decision making. For instance, Predict Breast Cancer, a widely used online prognostication software, uses known prognostic factors such as tumor size, KI67 index, tumor grade, and lymph node status to predict overall survival at 5 and 10 years9,10. However, sensitive markers such as Ki67 may be subject to reproducibility and expertise biases11,12,13,14,15.

Disease prognosis has advanced with the integration of sophisticated computational methods. Artificial intelligence (AI), particularly machine learning (ML), is increasingly used to tackle biological and clinical challenges. Recent studies demonstrate the potential of deep learning (DL) models applied to histopathological whole-slide images (WSI) to predict patient outcomes and identify prognostic features, particularly in BC16,17,18,19,20,21.

In this study, we aimed to investigate whether AI applied to tumor WSI could: (i) provide additional prognostic information beyond clinico-pathological prognostic criteria, and (ii) identify patients who have a substantial risk of metastatic relapse despite receiving standard treatments. The ultimate goal was to develop an AI-based digital pathology tool to allow assessment of the risk of metastatic relapse.

Results

The primary objective of this study was to evaluate the additional 5-year MFS prognostic value of RlapsRisk BC score, an AI-derived prognostic score based exclusively on WSI, relative to that of the current clinico-pathological criteria, in patients with ER+/HER2− EBC. The secondary objective consisted of comparing the capacity of a model combining standard clinico-pathological criteria and RlapsRisk BC to dichotomize patients between high risk and low risk of developing 5-year MFS events to that of a model based on clinico-pathological factors only. This comparison was assessed on the entire population of the validation cohort and in different subgroups of clinical interest (histological grade 2, intermediate clinical risk of relapse as defined in the French guidelines displayed in Supplementary Table 4, pre- versus post-menopausal status, with or without lymph node invasion, treated with or without adjuvant chemotherapy).

Model development and datasets

To build our model, we used the GrandTMA cohort as a discovery dataset. This cohort has been collected retrospectively from patients diagnosed in the “One-stop breast clinic” program and treated at Gustave Roussy Cancer Center (Villejuif, France) between October 10, 2005 and February 7, 201322. The cohort comprised 1802 patients diagnosed with early invasive BC (1429 ER+/HER2–, 110 ER+/HER2+, 70 ER−/HER2+, and 193 ER−/HER2− tumors) who underwent surgery as first treatment and had at least one available tumor slide from the surgery specimen. For the ER+/HER2− BC patients, we had for each case one hematoxylin, eosin, and saffron (HES) slide from primary diagnosis and one additional hematoxylin and eosin (H&E) slide.

For the purpose of an external validation of our model, we used a dataset from the French multicenter observational prospective CANTO cohort (NCT01993498)23. Out of 14,000 patients accrued in CANTO so far, 1703 ER+/HER2− EBC patients had a minimum follow-up of 5 years and were eligible for the present study. Tumor slides and full clinico-pathological features were exploitable from 1229 patients: 676 patients with HES-stained slides included at Gustave Roussy (hereafter CANTO HES), and 553 patients with H&E-stained slides coming from other UNICANCER centers (hereafter CANTO H&E). After applying the calibration protocol (see Supplementary Methods and Materials), CANTO H&E and CANTO HES were merged into one larger cohort, hereafter the CANTO cohort (N = 1229). There was no overlap of patients between the GrandTMA and the CANTO cohorts. See Supplementary Methods for additional information on the two cohorts.

Our approach consisted of developing an algorithm that learned from the WSIs to predict the 5-year MFS without any local annotation provided by pathologists. To build this model, we used a method composed of four steps: (i) tissue tiling, (ii) feature extraction, (iii) creation of a risk score, (iv) binary classification (Fig. 1). All these steps, including model architectures and training methods, are detailed in the Supplementary Methods. The RlapsRisk BC risk score was optimized using the discovery cohort GrandTMA (HES WSIs) with a stratified threefold cross-validation approach, repeated five times. Due to the limited number of events, stratification was performed based on event occurrence to ensure a minimum number of events in each fold.

Fig. 1: RlapsRisk BC algorithm overview.
figure 1

The left panel illustrates the algorithm’s processing steps, while the right panel details the training procedure and the overall model architecture designed to predict the risk group.

We then developed a clinical score to predict the 5-year MFS from a multivariable Cox model trained using the discovery dataset (hereafter the Clinical Score), based on the following clinical variables: age, tumor size, histological grade, lymph node invasion, and Ki67 expression. We used this score as the baseline reference and compared it to a score derived from a Cox model adjusted for clinical factors and the RlapsRisk BC score (referred to as the RR Combined). The coefficients of this Cox model, which provides the RR Combined score, were also trained on the discovery dataset, combining the Clinical and RlapsRisk BC scores. This comparison allowed us to evaluate the added prognostic value of the RlapsRisk BC score beyond standard clinical factors. We excluded chemotherapy treatment from the clinical score because it was prescribed to higher-risk patients based on clinical best practices, introducing prescription bias. Since this risk was already captured through clinical variables, including chemotherapy would have led to redundancy and multicollinearity in the prognostic models.

RlapsRisk BC score and prognosis

On both the discovery cohort and the external validation cohort CANTO (HES and H&E WSIs), RlapsRisk BC score was an independent prognostic factor for MFS (Table 1 and Supplementary Table 5) (hazard ratio (HR) = 1.67 [1.31–2.13], p value < 0.005) in a multivariable Cox model that was fitted directly on the cohorts of interest, including histological grade, age, lymph node invasion, tumor size, and Ki67 expression. The results per staining for the CANTO cohort are also available in Supplementary Tables 7 and 8, showing notably equivalent performances on H&E and HES. Similarly, RlapsRisk showed strong performances in the intermediate clinical risk group (Supplementary Table 6) (HR = 1.81 [1.29–2.53], p value < 0.005). All variables, except histological grade, were considered continuous. Only patients with ER+/HER2− disease were included.

Table 1 Multivariable Cox proportional hazard models estimating the contribution of several prognostic variables on MFS on CANTO

The discrimination power of the scores was then compared using the Harrell’s C-index24 (Table 2) on the external validation dataset. Data from the CANTO cohort were held out from the training dataset and were used only for external validation and the assessment of the discrimination of each model. The RR Combined model, obtained by integrating RlapsRisk BC with the Clinical Score, significantly outperformed the Clinical Score alone in the external validation dataset, demonstrating that RlapsRisk BC provides additional prognostic information beyond clinical factors. Importantly, RR Combined was entirely fitted on the training cohort before validation, without any recalibration on the external validation dataset. It achieved a Harrell’s C-index of 0.81 (confidence interval (CI) = [0.76–0.86]), compared to 0.76 (CI = [0.69–0.81]) for the Clinical Score alone, reflecting an improvement of +0.05 (permutation test p value < 0.05). This added prognostic value of RlapsRisk BC was particularly notable in the intermediate clinical risk group, where RR Combined achieved an improvement of +0.08 (permutation test p value < 0.05).

Table 2 Performances in the C-index of the Clinical and RR Combined models

We further compared the RR Combined model with the PREDICT Breast (PB) model, a widely used prognostic tool that estimates survival outcomes based on clinical and treatment-related factors25. Consistent with previous observations, RR Combined achieved superior prognostic performance, yielding the highest C-index of 0.81 (CI = [0.75–0.86]) (Supplementary Table 9). In contrast, PB models showed lower discrimination, with C-index values ranging from 0.70 to 0.75 across their different predicted outcomes.

We then assessed the prognostic performance of RlapsRisk BC score and the clinical factors in patients subgroups defined by standard clinico-pathological factors, by the clinical risk groups (Supplementary Table 4 for descriptions) and by adjuvant treatment regimen (Fig. 2). The assessment of potential heterogeneities in these subgroups was conducted by Cox regression analyses. When a factor was used to build a subgroup, it was removed from the associated Cox multivariable model. A higher RlapsRisk BC score was associated with an increased risk of distant recurrence (HR > 1, corrected p value < 0.05) in the majority of subgroups depicted in Fig. 2. However, for pre-menopausal patients, RlapsRisk was not associated with an increased risk of relapse. In patients with pN ≥ 2, those with pT ≥ 3, and histological grade 3, the sample sizes were insufficient to obtain significant results.

Fig. 2: Forest plot of the adjusted RlapsRisk BC score HRs on the prediction of 5-year metastasis-free survival on the CANTO cohort.
figure 2

Each square of the forest plot represents the HR of the RlapsRisk BC score (a continuous variable) adjusted for the prognostic clinico-pathological factors in the subgroup of patients defined by the variable category in the first column of the table. The 95% HR confidence intervals, computed using the Wald method, are represented by the horizontal lines. Histological tumor grade 1 and low clinical risk group were removed as no MFS events were recorded in these subgroups. Hazard ratios (HRs) and 95% confidence intervals were estimated using a Cox proportional hazards model. p Values correspond to two-sided Wald tests evaluating the null hypothesis HR = 1 and were adjusted for multiple comparisons using the Benjamini–Hochberg procedure. *p < 0.1, **p < 0.05, ***p < 0.01.

Clinical validity of RlapsRisk BC

To compare the capacity of the models to dichotomize patients between high risk and low risk of developing 5-year MFS events, thresholds corresponding to a probability of MFS event of 5% at 5 years were set for each risk score in the training set and prespecified accordingly for validation (see “Methods” section for details). In the next section, when the scores are dichotomized, they are referred to as “classifiers” (e.g., RlapsRisk BC, Clinical Score, and RR Combined classifiers). To assess the prognostic power of the RlapsRisk BC classifier, we replicated selected multivariable analyses from the “RlapsRisk BC score and Prognosis” section, using the high/low classification as the input variable instead of the continuous score (Supplementary Table 10).

After applying the MFS risk stratification according to the previously defined classifiers, Kaplan–Meier analyses showed significant differences in distant recurrence events between low- and high-risk patients in the external validation dataset (Table 3 and Fig. 3). Kaplan–Meier analyses were also performed on the intermediate clinical risk subgroup and comparisons between low- and high-risk groups were also significant (Supplementary Table 11 and Fig. 4).

Fig. 3: Metastases-free survival of patients stratified according to RlapsRisk BC classifier (left), Clinical score classifier (middle), and RR Combined classifier (right) among patients from the CANTO validation cohort.
figure 3

p Values were computed using the two-sided log-rank test to compare survival distributions between groups. Shaded areas represent 95% confidence intervals estimated using the Greenwood formula.

Fig. 4: Metastases-free survival of patients stratified according to RlapsRisk BC classifier (left), Clinical score classifier (center), and RR Combined classifier (right) among patients from the CANTO intermediate clinical risk validation cohort.
figure 4

p Values were computed using the two-sided log-rank test to compare survival distributions between groups. Shaded areas represent 95% confidence intervals estimated using the Greenwood formula.

Table 3 Classification of patients according to each classifier (RlapsRisk BC, Clinical score, and RR Combined)

We also assessed the performance of RlapsRisk across subgroups of patients at different risk of recurrence, as defined according to known prognostic factors, such as menopausal status, presence of lymph node invasion vs. not, treatment by chemo-endocrine therapy vs. endocrine therapy alone. On the CANTO Cohort, the RR Combined classifier increased the cumulative sensitivity by six points and the dynamic specificity by four points in comparison to the Clinical score classifier alone (Table 4). When looking at the histological grade 2 subgroup and the intermediate clinical risk subgroup (defined in the Supplementary Table 4), the RR Combined classifier outperformed the clinical score classifier by 21, resp. 28 points, in sensitivity and 2, resp. 3 points, in specificity. These figures show that adding RlapsRisk BC score to the current clinical factors improves prognosis within subgroups with a difficult prognosis estimation, except for premenopausal patients. Additional Kaplan–Meier analyses highlighted significant differences between high- and low-risk in all these subgroups (see Supplementary Information Figs. 36).

Table 4 Performance of the classifiers in different subgroups of the CANTO cohort (external validation)

Model interpretability: histological feature assessment

To make sure our model leveraged reliable histological features, even though it was not based on handcrafted features, we computed the marginal contribution of clusters of similar tiles to the overall risk score to assess their positive or negative contribution to the final risk score predicted by the multilayer perceptron (MLP), on WSI from the CANTO cohort. The most impacting clusters were further reviewed and annotated by an expert pathologist blinded to the predicted outcome (see Fig. 5 for methodology and Supplementary Methods for more details).

Fig. 5: Interpretability of the RlapsRisk BC model.
figure 5

The methodology applied to generate the data used for this analysis is outlined. To streamline pathologist review, slides were sampled to express as much diversity as possible in the high, low, and intermediary predicted risks. Shapley values were computed to evaluate the positive or negative contributions of clusters of similar tiles. A selection of the most contributing clusters of tiles was made, and random clusters were also sampled from the distribution of Shapley values. An expert pathologist annotated the data, blinded to any information about contribution or risk.

Some histological patterns, as annotated by a pathologist, were associated with an increase of the predicted risk of relapse like the presence of high tumor cell content (p < 0.05), high degree of nuclear pleomorphism (p < 0.05), massive architecture (p < 0.05), low tubule formation (p < 0.05) and trabecular structures (p < 0.05) as well as mitotic activity (p < 0.05) (Fig. 6). Some features were also associated with a lower predicted risk of relapse such as fibrosis (p < 0.05), the presence of vascular structures (p < 0.05) or regions in the adjacent normal tissue (p < 0.05). The complete analysis is available in Supplementary Information Table 13. These results showed that our model was relying on well-established histological factors to predict patient outcomes. The majority of histological features significantly associated with an increase or decrease of the risk were statistically significant both on H&E and HES stainings individually (Supplementary Material Tables 14 and 15), supporting the robustness of our model across different staining platforms.

Fig. 6: A selection of the histological features annotated by an expert pathologist.
figure 6

The proportion of tiles annotated as containing a particular histological feature within clusters of tiles linked to high and low risk, as well as randomly selected ones, is presented alongside the significance of p values from the Chi-square test of independence computed between each group, adjusted for multiple testing using the Benjamini–Hochberg procedure. For each histological feature, the region of a WSI containing the annotated cluster of tiles is displayed, extracted both from the CANTO H&E dataset (upper histological image) and from the CANTO HES dataset (lower histological image). ns p > 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

Discussion

Digital pathology is increasingly being integrated into routine clinical practice, and workflows that include digitization of glass slides are no longer an exception in pathology laboratories. This ongoing transition is paving the way for the implementation of AI-based digital pathology medical devices in clinical practice.

In this study, we developed and validated a digital pathology score, which predicts distant recurrence at 5 years after surgery in adequately treated early-stage BC patients, bearing ER-positive and HER2-negative tumors. Our method uses just a standard-stained H&E or HES and a scanned tumor slide already available for diagnostic purposes in the pathology laboratory.

RlapsRisk BC score has a strong prognostic value that is independent of established clinico-pathological factors, validated in an independent multicentric cohort. Although 55% of the patients of the validation cohort originate from the same hospital that provided the samples for the training of the algorithm, the performance obtained on the other 45% showed similar results. The RR Combined model, integrating clinico-pathological factors and RlapsRisk BC, effectively dichotomized patients into low- and high-risk groups with strong discriminative power across the entire population and most clinical subgroups, except for the premenopausal subgroup. The utilization of RlapsRisk BC demonstrated notable efficacy in enhancing risk assessment accuracy within the intermediate clinical risk group, a segment that poses considerable complexity for treatment decision-making.

Unlike earlier digital pathology approaches that primarily predict morphological features such as histological grade or Ki67 index26,27,28 or leverage tumor microenvironment characteristics20, we trained our DL model to directly predict 5-year MFS from WSI without requiring local annotations. Recent studies have demonstrated the feasibility of DL models for outcome prediction from histopathology images29,30,31. However, these models are built using predefined scores (Nottingham Grading System), predefined histological features, or rely on the integration of molecular recurrence scores as indirect proxies for prognosis32,33,34. In contrast, our model is initially trained to estimate MFS exclusively from WSI, without incorporating manual annotations or molecular data. While we later evaluate its integration with clinical prognostic factors to enhance prognostic performance, the WSI-based score is learned in an end-to-end manner, ensuring that prognostic patterns are directly extracted from histological slides under the supervision of a survival endpoint. This fully automated pipeline allows the model to independently extract relevant morphological patterns before combination with standard clinical risk factors.

Our approach, which directly predicts a survival endpoint from H&E slides, has already been successfully applied in prostate cancer. Notably, AI models using digital histopathology analysis have improved prognostic stratification and therapy personalization in prostate cancer, outperforming standard classification tools35. These findings support the relevance of applying a similar approach in BC, where integrating AI into digital pathology could further refine risk assessment.

A different approach presented by Wang et al.30 attained a risk prediction from Nottingham histological grade through the re-stratification of the intermediate category (NHG2). NHG2, which encompassed the largest group of patients, was dichotomized into a low-grade subgroup and a high-grade subgroup, achieving a HR of 2.94 (95% CI 1.24–6.97, p = 0.015) for the stratification between the two groups according to recurrence-free survival30. Similarly, in our study, RlapsRisk BC classifier achieved on the validation cohort a high discriminative power on NHG2 patients with a HR of 5.67 (95% CI = [2.34–11.74], p value < 0.05) for MFS when stratifying patients into low- and high-risk groups (Supplementary Fig. 3).

The interpretability analysis of the pathological features associated with an increased or decreased risk of relapse, as predicted by RlapsRisk BC, supports the validity of our model. The model’s reliance on well-established histological prognostic factors—such as nuclear pleomorphism, tumor architecture, and mitotic activity—demonstrates its alignment with known pathology-driven risk assessments. Our interpretability analysis confirmed that RlapsRisk BC relies on meaningful pathological features to predict an increased or decreased risk of relapse, supporting the robustness of our approach. However, a more comprehensive study would be required to establish the causal relationship between certain pathological features and the model’s predictions. Aligned with other recent studies36,37,38,39, our analysis highlights the potential role of tumor-adjacent microenvironment features, such as vascular structures and fibrosis, in prognosis assessment. Our analysis brought out into relief the microenvironment features that could play a role in prognosis assessment, such as vascular structures or the presence of fibrosis. These findings support the approach of taking into consideration the tissue as a whole, rather than focusing on tumor cells only.

Currently, one of the challenges in ER+/HER2− EBC management is the adaptation of the treatment according to the risk of the patient. Reducing the number of unnecessary adjuvant chemotherapies or the length of endocrine therapies to improve QoL while maintaining an equivalent survival rate of the patients is the key challenge. This group is currently the target population for molecular signatures, where different genomic scores aim to guide the decision to avoid chemotherapy in certain patients5,31. However, variations in test results, restrictive indications, and limited reimbursement in many healthcare systems present challenges to their widespread use40,41.

Multimodal histopathologic models have been investigated as a means to stratify hormone receptor-positive EBC and could serve as pre-screening tools to identify patients who may or may not require more expensive genomic testing42. While our study does not provide a direct comparison with genomic assays such as Oncotype DX, our results suggest that a WSI-based approach could serve as a complementary tool in clinical decision-making. Head-to-head comparison of RlapsRisk BC and molecular signatures in a cohort similar to TRANSATAC would provide further insights into its potential clinical utility, including defining optimal thresholds for genomic test endorsement43. In addition, assessment of the impact of a decision-making strategy based on RlapsRisk BC would require a randomized clinical trial to eliminate potential biases that may arise from observational studies.

While our study introduces an innovative and valuable tool for patient stratification, it also has certain limitations.

While the training and validation cohorts include a limited number of premenopausal patients, RlapsRisk BC appears to exhibit suboptimal performance within this subgroup. Addressing this issue requires dedicated studies with larger sample sizes to thoroughly investigate its clinical implications. Of note, current prognostic tools to stratify risk such as gene expression signatures have also shown limited performance in this premenopausal setting. Future studies incorporating comparisons to molecular test scores would be beneficial to strengthen the evidence supporting our tool’s utility and to offer complementary insights on its prognostic accuracy relative to established standards.

To use RlapsRisk BC in clinical practice, the calibration protocol requires 30 BC slides, which may pose challenges, particularly in smaller pathology laboratories. While calibration is a standard procedure for many medical devices, it could impair the clinical deployment of such AI tools. Furthermore, variations in staining protocols, scanner settings, and other technical factors between healthcare centers could introduce domain shifts, potentially affecting the model’s reliability. Here, prognosis results were robust with respect to the staining protocols (H&E and HES) since concordant prognostic results were found in two centers using two different staining protocols. However, the diversity of worldwide staining protocols is larger than the two protocols tested here, stressing the importance of analytical validations.

In conclusion, RlapsRisk BC resulted to be an independent prognostic factor of MFS and added significant prognostic information to clinico-pathological variables. After patients’ dichotomization into low- and high-risk groups, RR Combined with classical clinico-pathological risk factors showed higher discrimination power compared to clinico-pathological risk factors alone. A prospective observational study comparing RlapsRisk BC to molecular prognostic signatures is currently ongoing and will determine the impact of the implementation of this AI-based tool into the practice workflow. Future extensions of our research include the development of novel algorithms, notably adapted to biopsy specimens. To deepen interpretability issues, an exhaustive analysis of tiles is contemplated with a focus on the spatiality notion that could provide novel insights into tumor biology.

Methods

This study complies with all relevant ethical regulations. The need for informed consent was waived due to the retrospective nature of the study and anonymization of all data.

Dataset description

GrandTMA

To build our models, we used a discovery dataset collected retrospectively from patients treated at Gustave Roussy in France and included in the “GrandTMA” cohort. This cohort comprises all patients newly diagnosed with a breast carcinoma as part of the “One-stop breast clinic” program at Gustave Roussy between October 10, 2005 and February 7, 201322. The inclusion criteria for the present study were (i) diagnosis of invasive breast carcinoma, with or without associated in situ carcinoma, (ii) any type of treatment but neoadjuvant chemotherapy, (iii) availability of a surgical specimen with a formalin-fixed, paraffin-embedded (FFPE) tumor sample available, (iv) complete clinical and therapeutic data, (v) follow-up over at least 4 years and updated annually. The exclusion criteria were (i) exclusive non-invasive tumors, (ii) cytology-only available cases, (iii) absence of follow-up, (iv) other non-adenocarcinomatous lesions of the breast. This led to the inclusion of 1802 patients diagnosed with early invasive BC (1429 ER+/HER2−, 110 ER+/HER2+, 70 ER−/HER2+, 193 ER−/HER2−), with at least 1 available HES-stained tumor slide from the surgical specimen at the Pathology Department (details in Supplementary Fig. 1). For the model training, we used slides from the primary diagnosis whenever possible. For each of the 1429 ER+/HER2− patients, we also recut the FFPE block to obtain a new H&E slide that was digitized for the purpose of the study. Biomarker status (ER, PR, and Ki67 immunohistochemistry (IHC) expression, and HER2 protein expression/gene amplification) was defined and determined locally according to the current recommendations of the College of American Pathologists and the American Society of Clinical Oncology44,45, and the French recommendations of the Study Group on Immunohistochemical Prognostic Factors in Breast Cancer46. ER and PR expression positivity was defined as an IHC staining of at least 10% of tumor cells, as is standard in several European countries. Lesions were considered positive for HER2 (score 3+) if the number of tumor cells with a complete and intense membrane IHC staining exceeded 10% of the whole invasive tumor cells population; equivocal (score 2+) if the number of tumor cells showing a complete and moderately intense membrane IHC staining, or an incomplete basolateral membrane IHC staining of moderate to severe intensity exceeded 10% of the total infiltrating tumor population; and negative in the remaining cases (scores 0 and 1+). When dichotomized, Ki67 cut-off was defined according to the current recommendations (i.e., cut-off set at 20%)45,47.

Image acquisition was performed using an Olympus VS120 slide scanner at ×20 magnification for the HES slides and Olympus VS200 at ×20 for the H&E slides. In order to avoid scanning issues that might affect subsequent image analysis, all slides were checked by a pathologist after digitization to discard slides with insufficient quality and rescanned when necessary (blurred images, need for slides re-mounting when the coverslip was damaged).

This work was carried out in accordance with the provisions of the Public Health Code applicable to research not involving the human person (Public Health Code - Article R1121-1 amended by Decree no. 2017-884, May 9, 2017) and therefore it does not come under the jurisdiction of a Committee for the Protection of Persons. It obtained the favorable opinion of the expert Committee for Research, Studies and Evaluations in the field of breast pathology, as well as of the Ethics Committee (Data Protection Office, Gustave Roussy). It has been submitted to the National Commission for Computing and Liberties under reference No. F20220121170839 and has been declared in accordance with the reference methodology MR-004. The patients involved were informed of the research via an information letter distributed by postal mail with the possibility of opposing the study.

CANTO

For the purpose of an external validation of our model, we used a dataset from the French observational and prospective CANTO cohort (NCT01993498)23, enrolling patients from 26 cancer centers. In this cohort, patients were included at diagnosis of their invasive BC, before any treatment, following the given criteria: (i) women only, (ii) aged over 45 years old, (iii) HER2− and ER+ (same definition as for the GrandTMA cohort), (iv) with a histologically invasive BC diagnosed, (v) with no clinical evidence of metastasis at the time of inclusion. Out of 12,000 patients accrued in CANTO so far, 1703 ER+/HER2− EBC patients had a minimum follow-up of 5 years and were eligible for the present study (707 patients from Gustave Roussy and 997 patients from other cancer centers of the UNICANCER group). None of these patients were also included in the GrandTMA cohort. Thirty-one patients from Gustave Roussy had incomplete data and were excluded from the study, resulting in 676 HES slides available from primary diagnosis that were digitized for the purpose of the study with an Olympus VS200 scanner at ×20 magnification. From the other centers of the CANTO cohort, we had access to 553 H&E slides from resection that were recut, restained, and digitized for the purpose of the study with a Hamamatsu Nanozoomer S60 at a ×20 magnification (details in Supplementary Fig. 2). In total, 1229 patients had exploitable WSI from HES or H&E slides together with full clinico-pathological features (described in Supplementary Table 1).

All data collections were performed in the framework of the CANTO clinical trial (NCT01993498), in compliance with all legal requirements. All patients included in the study were informed through the website https://mesdonnees.unicancer.fr/ on the reuse of their data for a separate objective with the possibility of opposing the study.

Endpoint

The chosen endpoint for survival data analysis was MFS at 5 years, defined as the time from initial surgery to the occurrence of a metastatic event or death before 5 years. Operable local relapse or axillary lymph node recurrence events were ignored. Patient’s follow-up was censored at the time of contralateral BC, second non-breast primary cancer, or last available date of follow-up.

Model description

To develop our risk score from histology slides, we used a method composed of three steps: (i) tissue tiling, (ii) feature extraction, (iii) creation of a risk score. The transformation of the score into a probability of occurrence of a MFS event before 5 years and the selection of a threshold are two additional steps, detailed in “Statistical analysis.”

Tissue segmentation and tiling

Each of the WSI was first divided into small squares, 76 × 76 micrometers in size (224 × 224 pixels) called “tiles.” This tiling was performed by first segmenting the tissue, using a pre-trained U-Net neural network48 that discarded the background and artifacts of scanning or preparation. This segmented tissue was then divided into N (ranging from 10,000 to 75,000) tiles.

Feature extraction

The N tiles were embedded into D-dimensional feature vectors using a pre-trained Vision Transformer (Fig. 1). We implemented iBOT ViT-B, a self-supervised learning transformer framework that improved performance for various prediction tasks in previous studies49 trained on the Cancer Genome Atlas PanCancer40M dataset, which covers 13 anatomic sites and 16 cancer subtypes for 5558 patients, representing a total of 6093 slides. Multiple data augmentations (random cropping, random flips, color jitter, random grayscale, Gaussian blur) were applied while the model was optimized for 153,000 iterations (approximately 50 h) on 32 NVIDIA Tesla V100 graphics processing units (GPU). The following hyperparameters were used to train the model: teacher temperature was set to 0.04 with an initial value of 0.04 and 30 warm-up epochs. AdamW optimizer50 was used and learning rate linearly ramped up during the first ten epochs to its base value scaled with the total batch size according to: 0.0005 × batch-size/25651. The final learning rate was set to 0.000002 through a cosine schedule. This frozen pre-trained algorithm was then used to extract features during training and inference.

Risk prediction using multiple instance learning (MIL)

The N feature vectors were then aggregated using a MIL model trained to predict MFS at 5 years with a stratified threefold cross-validation approach, repeated five times. For each split, five models were also trained with random initialization of the weights. Given the limited number of events, stratification was performed based on event occurrence to ensure a minimum number of events in each fold (Fig. 1). For inference on the external cohorts, predictions were generated by ensembling all 75 trained models through output averaging.

We reimplemented the attention-based model called DeepMIL proposed by Ilse et al.52. DeepMIL is a deep MIL framework designed for weakly supervised classification. It aggregates instance-level (the WSI’s tiles in our case) features using an attention-based pooling mechanism, which assigns learned importance weights to each instance within a bag. This allows the model to focus on the most relevant instances for prediction. The weighted instance representations are then combined into a single bag-level feature vector, which is processed by a fully connected layer to generate the final classification output. This approach enhances interpretability while effectively handling variability within each bag. A linear layer with L neurons (L = 256 here) was applied to the embedded features followed by a Gated Attention layer with L hidden neurons. A MLP with 128 neurons was then applied to the output. To speed-up training and fit all the data in memory, only a random subset of 8000 tiles per WSI was used, while all tiles of a slide are processed for inference. DeepMIL was trained utilizing a loss function calculated with a smoothed variant of the concordance index, as suggested by Mayr and Schmid in this study53. We used a smooth, differentiable loss function based on a sigmoid approximation of the standard C-index indicator function. This smoothing introduces a parameter σ that controls the transition sharpness, enabling gradient-based optimization. The loss function is integrated into a gradient boosting framework, iteratively minimizing the smoothed empirical risk to improve model discrimination. This approach ensures that the learned biomarker combinations are directly aligned with survival prediction performance. The model was trained with the following hyperparameters: batch size = 128, learning rate = 0.001, and MLP dropout = 0.4 (a comprehensive list of hyperparameters is provided in the Supplementary Methods). Given the 7.69% relapse rate in our training cohort, this resulted in an average of 9.84 events per minibatch, providing adequate supervision for model optimization. While minibatches without observed events were possible, their occurrence was rare (~0.0036% probability). Although such batches do not directly contribute to differentiating survival outcomes, their impact was mitigated by the smoothed C-index loss, which allows gradient updates even when minibatches contain few or no events.

The models were trained for 15 epochs with a batch size of 128. Each epoch required approximately 26 s to complete, resulting in a total training time of 6 min and 30 s per fold. The entire training of the models with a five-times-repeated threefold cross-validation process with multiple random weight initializations took approximately 1 h 38 min. Training was conducted on a Tesla T4 Nvidia GPU using the PyTorch framework. The CPU memory usage for loading all features of the training data was 45GB, with an additional 13GB of GPU memory utilized during training.

Threshold determination

For the RlapsRisk BC score, as well as the clinical and RR Combined risk scores, we fitted a Weibull AFT (Accelerated Failure Time) model on the training dataset to transform risk scores into probabilities of occurrence of MFS event before 5 years. This step was used to identify the threshold of each risk score corresponding to a probability of 5-year MFS event of 5% defined by the Weibull Models. This 5% MFS rate threshold corresponds to the 5-year interpolation of an exponential model from the 10-year MFS of 10%, which is the most common output of the molecular signatures currently used in clinical practice54. The identified threshold was fixed before applying RlapsRisk on the validation dataset.

Method for interpretability feature assessment

Training an AI model on digital slides from diagnosis to predict metastatic relapse is an original approach compared to recent research works in Digital Pathology that generally predict well-known pathological features (e.g., histological grading or KI67 index). However, bypassing human knowledge in the training phase requires even more explanations on the functioning of the model. We detail herein our method to identify and characterize typical areas on a well-defined set of slides that had extreme risk scores. Interpretability relies on the possibility of accessing the relevant information for our model that is learned by the model itself.

In the model, each tile was associated with an attention score that was used in the final weighted average to obtain the input vector for the risk predictor (see Fig. 4). However, this score did not provide information about the impact on prognosis of the tile. To overcome this limitation, we aggregated clusters of tiles based on features’ similarity (using SLIC algorithm55) on all slides of the dataset and computed the Shapley value56 associated with each cluster. We thus measured the marginal contribution of each cluster to the overall risk score to assess its positive or negative contribution to the final risk score predicted by the final MLP. We aggregated tile features to accelerate the computation of Shapley values, ensuring that our methodology maintained coherence between the contribution of clusters of tiles’ features and the contribution of individual tiles alone on the RlapsRisk BC output.

For each staining of the CANTO dataset (HE and HES), we sampled 120 slides (40 classified with the highest RlapsRisk BC scores, 40 classified with the lowest RlapsRisk BC scores by our model, and 40 slides with intermediary scores predicted risk by RlapsRisk BC). We computed the Shapley values of each cluster and extracted those with the 2 highest computed contributions from the 50 slides with the highest predicted risk available (10 slides are taken from the intermediary scores here) and those with the 2 lowest computed contributions from the 50 slides with the lowest predicted risk available (10 slides are taken from the intermediary scores here). We also sampled 100 clusters randomly from the entire set of clusters generated from the 120 slides. For each dataset, we thus obtained a total of 300 clusters of tiles, 100 associated with high risk, 100 associated with low risk and 100 randomly selected. This sampling method allowed us to get a broad representation of which features were associated with a higher or lower predicted risk in slides with high, intermediate or low predicted risk.

Those clusters of tiles were further reviewed and annotated by an expert pathologist blinded to the predicted outcome.

Forty-two histological features were recorded, encompassing tumor architecture patterns, stroma features, presence of different cell types and tiles’ localization. The proportions of appearance of each feature in the highest and lowest contribution groups were compared with the Chi-square test of independence applied to contingency tables derived from the annotated data. Statistical significance was calculated using the Benjamini–Hochberg procedure adjustment and was also assessed when comparing high and low contribution groups to the randomly sampled clusters group.

Calibration protocol

To guarantee the robustness of our stratification in high or low risk across diverse data acquisition protocols like in CANTO H&E or CANTO HES, we implemented Uniform Piecewise Approximation (UPA) as a calibration strategy57. Inspired by image processing’s histogram matching, UPA aligns the unseen dataset’s model prediction distribution with a predefined reference distribution.

The reference distribution was defined as the set of scores obtained from inference of RlapsRisk BC on the H&E samples of the training cohort, from which we derived the cumulative distribution.

For each new dataset or new laboratory setting, a set of 30 samples is selected, comprising 10, 15, and 5 samples, respectively, from histological grades 1, 2, and 3. This selection aims to mirror the histological grade distribution observed in the discovery dataset. Histological grade was considered as the main prognostic factor that is routinely captured by the pathologist’s analysis of the tumor slide only. Thus, we ensure a good representativity of morphologies and an ease of implementation in clinical practice. These 30 samples are used to generate a new set of RlapsRisk BC predictions, from which we derived an estimated cumulative distribution of the unseen dataset (corresponding to the data acquisition protocol of the laboratory).

Next, a linear function is fitted to map the unseen set’s cumulative distribution onto the reference distribution, essentially aligning the prediction “shapes,” without impacting the relative order of patients’ risk. This mapping function was then applied to calibrate any new prediction from the unseen set, ensuring consistency with the desired reference distribution and enhancing generalizability across acquisition protocols. By leveraging UPA, we could achieve consistent model predictions even when applied to data acquired differently from the reference set.

One-shot validation and bootstrapping

For the purposes of validation of the UPA calibration, we performed the calibration from only one subsample (randomly selecting the 10 grade 1 patients, 5 grade 2 patients and 5 grade 3 patients) to calibrate the score for each subcohort (CANTO H&E and CANTO HES) corresponding to two different data acquisition protocols. Once calibrated, the scores were merged as only one larger set forming the CANTO cohort (including both WSI from H&E and HES).

To assess the impact of the sample selection for the calibration step, we performed a repeated sensitivity analysis, reported in Supplementary Table 12 and Supplementary Fig. 3. CIs are derived from the point estimates obtained from the repetitions of the resampling (N = 1000).

Statistical analysis

The survival analyses were performed using the Cox model considering the well-known BC risk factors as a reference score. We compared the performance for the prediction of the MFS of this “clinical score,” RlapsRisk BC, and a model combining RlapsRisk BC and the clinical factors. We used Harrell’s C-index to assess the discrimination performance of the different scores. The comparison of the C-indices of the different models was based on permutation tests using 10,000 permutations. We computed the cumulative sensitivity as well as the dynamic specificity58 of each model for the 5-year MFS prediction. Those metrics are natural extensions of the so-called sensitivity/specificity to the particular setting of time-to-event outcomes that may be censored. CIs were computed using bootstrapping with nonparametric, unstratified resampling (10,000 samples). To illustrate the discrimination on the survival for the test dataset, we used the Kaplan–Meier estimate based on the binarized scores using the threshold defined previously and compared the risk groups using the log-rank test. All tests were two-tailed, and p values < 0.05 were considered statistically significant. The C-index, Kaplan–Meier, and multivariable Cox proportional hazards models were implemented in the lifelines (0.27.4) package of Python. Cumulative sensitivities and dynamic specificities were computed using the scikit-survival package (0.19.0). We followed the recent MI-Claim checklist to improve the reporting of our ML methods. The checklist is available in the Supplementary Information.