Introduction

Endometrial cancer (EC), the most common gynecologic malignancy, comprises a diverse landscape of histologic subtypes1. Tumor morphology and grade, determined through histopathological examination, remain crucial for EC management. Under the revised FIGO staging system, non-aggressive histological subtypes include low-grade (G1 and G2) endometrioid carcinomas, accounting for approximately 65%2. These tumors express estrogen and progesterone receptors, show hormonal sensitivity, and are composed of low-grade cells, resulting in a favorable prognosis3. In contrast, aggressive EC mainly includes FIGO grade 3 (G3) endometrioid carcinomas, serous carcinomas (SC), and clear cell carcinomas. These malignancies are characterized by hormone independence and a lack of estrogen and progesterone receptor expression. They consist of high-grade cells and frequently present at advanced stages, which indicates a poorer prognosis4,5. The 2023 revised FIGO staging introduced significant changes, and correctly distinguishing between aggressive and non-aggressive types is crucial for accurate staging.

While the traditional diagnosis of EC relies on tumor morphology and grade, which provide a foundation for treatment decisions, the rise of personalized therapy requires a deeper understanding of tumor mutational burden (TMB), an established predictive biomarker for immune checkpoint inhibitors (ICIs)6,7. TMB provides precise and comprehensive information for determining the efficacy of immunotherapies in EC8,9. TMB is usually defined as the number of somatic mutations per megabase (mut/Mb) of interrogated genomic sequence. High TMB (TMB-H) has been associated with improved patient response rates and survival benefits from ICIs, making it a promising predictive biomarker for immunotherapy8,10,11,12. Currently, the Food and Drug Administration (FDA) in the United States is considering approving TMB-based testing as a companion diagnostic for determining the suitability of ICI therapy. In mismatch repair deficient (dMMR) solid tumors, which are defined as having a high TMB, recent research has shown a high objective response rate (ORR) of 53% to anti-PD-1 therapy13,14. Clinical trials also indicate the suitability of using ICIs in TMB-H subtypes of EC8. Traditionally, TMB is quantified through various next-generation sequencing (NGS)-based technologies. NGS provides comprehensive genomic analysis, enabling detailed TMB evaluation15. However, it is technically sophisticated and requires advanced equipment and infrastructure, which can increase the cost of testing16. NGS also requires substantial sequencing data, making it a time-consuming process17.

Recently, there has been growing interest in predicting TMB status directly from hematoxylin and eosin (H&E)-stained whole slide images (WSIs) using deep learning (DL). Studies have demonstrated the feasibility of DL methods in predicting TMB status from histopathological images in lung cancer18,19,20, colorectal cancer21,22, bladder cancer23, gastric cancer24, and gliomas25.

Although existing DL models like CLAM26, TOAD27, and TransMIL28 achieve promising results across various WSI tasks, their reliance on pre-trained features or unsupervised learning can limit their ability to capture the potential crucial details within patches29. Xiang et al.29 tackled this challenge with a multi-scale representation attention-based network (MRAN), which has demonstrated superior performance in the detection of various kinds of cancer, including lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), stomach adenocarcinoma (STAD) and breast cancer lymph node metastasis.

Our literature review revealed a gap in research on aggressive versus non-aggressive EC classification and TMB prediction directly from H&E-stained WSIs. To address this gap, we propose a truncated ResNet-based multilayer attention multiple instance DL framework (TR-MAMIL) with four key components for the classification of aggressive and non-aggressive EC and for TMB prediction within the aggressive and non-aggressive ECs, respectively (see Fig. 1(a)). First, we built an effective and efficient truncated ResNet-based (TR) feature encoder for capturing the relevant morphological and molecular features. This encoder demonstrated an average of 10% improvement in metrics over the original ResNet (see Table 7), with 21% and 2% faster training and inference times, respectively (see Fig. 6(b)). Second, we developed a multilayered attention MIL (MAMIL) module to identify informative regions in the slide automatically and efficiently. The MAMIL module takes into account the neighboring patches or instances of each analyzed patch in a bag, leading to a more diverse feature representation of patches and a combined representation of individual patches and their neighbors30. This feeds comprehensive information into the decision-making process of the proposed framework and eliminates the need for extensive manual annotation. Third, we integrated the gender information as a stable covariate with the slide-level features to refine the final slide-level probability prediction score for each WSI. This allows the model to learn more complex features and forces it to learn more generalized relationships, thereby improving the average performance of TR-MAMIL by 7% and 9% in the mean of sensitivity and specificity (MeanSS) and the area under the receiver operating characteristic curve (AUROC), respectively (see Table 9). Finally, to address potential overfitting and improve model generalizability, we devised model selection strategies with early stopping mechanisms to help produce the best models.
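
To make the second and third components more concrete, the following is a minimal sketch of generic attention-based MIL pooling followed by covariate concatenation. It is not the authors' MAMIL module: it uses a single attention layer, ignores neighboring patches, and the scoring vector `w`, the toy bag, and the covariate value are hypothetical placeholders.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(bag: Sequence[Sequence[float]],
                   w: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Pool patch embeddings (instances) into one slide-level embedding.

    bag: one feature vector per patch.
    w:   hypothetical learned scoring vector; score_i = tanh(w . h_i).
    Returns (slide_embedding, attention_weights).
    """
    scores = [math.tanh(sum(wj * hj for wj, hj in zip(w, h))) for h in bag]
    weights = softmax(scores)
    dim = len(bag[0])
    slide = [sum(a * h[d] for a, h in zip(weights, bag)) for d in range(dim)]
    return slide, weights


def append_covariate(slide: Sequence[float], covariate: float) -> List[float]:
    """Concatenate a patient-level covariate to the slide embedding."""
    return list(slide) + [covariate]


# Toy bag of three 2-dimensional patch embeddings (illustrative values only).
bag = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
slide, weights = attention_pool(bag, w=[1.0, -1.0])
features = append_covariate(slide, covariate=1.0)
```

In this sketch the attention weights double as a per-patch importance score, which is the quantity that attention heatmaps typically visualize; the concatenated `features` vector would then be passed to a slide-level classifier.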

Fig. 1: Data information of the two EC cohorts.

a The TCGA whole-section WSI cohort from 29 tissue source sites, b Length distribution in pixels, c Age distribution, d Race distribution, e Subtype distribution, f image diversity of the data, g The TP53 distribution of the TMA cohort, h Visualizations of TMAs.

To assess the efficacy of our proposed framework, we compared the performance of TR-MAMIL with seven state-of-the-art (SOTA) DL approaches that have achieved notable success in computational pathology: ClassicMIL31 for the detection of prostate cancer, skin cancer, basal cell carcinoma, and breast cancer lymph node metastasis; CLAM26 and TransMIL28, both for subtyping of non-small cell lung cancer (NSCLC) and renal cell carcinoma (RCC) and detection of breast cancer lymph node metastasis; Wang et al.32 for patient response prediction in personalized ovarian cancer treatment; Improved_InceptionV3_MS33 for predicting therapeutic effect and assessing MSI status in ovarian cancer; TOAD27 for identifying the origins of 18 distinct metastatic tumors; and MRAN29 for detection of LUSC, BRCA, STAD, and breast cancer lymph node metastasis.

Through extensive experimental evaluation, our framework demonstrated remarkable performance in the prediction of both the EC subtype and TMB status from histopathological WSIs and consistently outperformed the benchmark methods. TR-MAMIL achieved outstanding performance in the classification of the aggressive and non-aggressive EC. For TMB prediction in the aggressive EC, TR-MAMIL obtains 73% and 82 ± 13% for the MeanSS and AUROC, respectively. In predicting TMB status in the non-aggressive EC, TR-MAMIL also demonstrated the best performance in comparison to the seven SOTA approaches. Importantly, according to the Kaplan–Meier survival analysis, the results show that TR-MAMIL successfully differentiates patients with longer disease-specific survival (DSS) and overall survival (OS) with significant difference (p < 0.01 for DSS, p < 0.05 for OS) between the TMB predicted classes in the aggressive EC. Additionally, the framework also distinguishes disease-free survival (DFS) outcomes, showing a significant difference (p < 0.05) when using TP53 predictions. These compelling findings highlight the potential of TR-MAMIL to guide personalized treatment decisions by accurately predicting the EC cancer subtype and the TMB status for effective immunotherapy planning for EC patients.

Results

Materials: Patient cohorts

In this study, we utilized 918 anonymized whole-section H&E-stained WSIs collected from 529 patients of the TCGA cohort. The TMB status of all EC images from TCGA, comprising 759 frozen, 144 diagnostic formalin-fixed paraffin-embedded (FFPE), and 15 error WSIs, was determined from NGS results, which provide the number of somatic mutations per megabase (mut/Mb) of interrogated genomic sequence. The patients’ sequencing results were classified into TMB-low or TMB-high categories, with a score of 10 mut/Mb or higher defining high-TMB status [TMB-H: TMB ≥ 10 (186 patients); TMB-low: TMB < 10 (331 patients); NA: TMB data not available (12 patients)]. The TCGA cohort was collected from 29 tissue source sites available in the public repositories at the National Institutes of Health, USA (https://portal.gdc.cancer.gov/), and the tissue source site was accounted for in dataset sampling (see Fig. 1(a)). The dataset comprises H&E-stained pathological WSIs with varying dimensions, ranging from 7967 to 174,281 pixels in width and 11,672 to 85,452 pixels in height (see Fig. 1(b)). The resolution of the WSIs is 0.252 microns per pixel (MPP). The 529 patients diagnosed with EC were aged 31 to 91 and came from over seven different races, as illustrated in Fig. 1(c, d). The dataset consists of various morphological subtypes, including endometrioid carcinoma G1 (n = 97), endometrioid carcinoma G2 (n = 117), endometrioid carcinoma G3 (n = 185), serous carcinoma (SC, n = 109), combined endometrioid carcinoma G2 and SC (n = 1), and combined endometrioid carcinoma G3 and SC (n = 20) (see Fig. 1(e, f)). As shown in Fig. 1(e), we excluded the rare type with only a single sample, i.e., the hybrid G2+SC type, from our experimental analysis.
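
The TMB labeling rule described above can be sketched as follows. The 10 mut/Mb cutoff comes from the text; the function names and the 38 Mb interrogated region in the example are hypothetical and chosen only for illustration.

```python
from typing import Optional

TMB_HIGH_CUTOFF = 10.0  # mut/Mb, the cutoff stated in the text


def tmb_per_megabase(somatic_mutations: int, interrogated_bases: int) -> float:
    """Somatic mutations per megabase of interrogated genomic sequence."""
    return somatic_mutations / (interrogated_bases / 1_000_000)


def tmb_status(tmb: Optional[float]) -> str:
    """Label a patient as TMB-H, TMB-low, or NA when no TMB value exists."""
    if tmb is None:
        return "NA"
    return "TMB-H" if tmb >= TMB_HIGH_CUTOFF else "TMB-low"


# Hypothetical example: 380 somatic mutations over a 38 Mb region.
example_tmb = tmb_per_megabase(380, 38_000_000)
example_label = tmb_status(example_tmb)
```

Note that a value exactly at the cutoff is labeled TMB-H, matching the "10 mut/Mb or higher" definition.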

In addition, a separate tissue microarray (TMA) cohort of patients’ paraffin-embedded tissues was retrospectively retrieved from the Department of Pathology at Tri-Service General Hospital, Taipei, Taiwan. The resolution of the TMA WSIs is 0.503 microns per pixel (MPP). The TMA cohort contained 242 EC tissue cores, including 127 cores of TP53 wild type and 115 cores of TP53 mutation, as shown in Fig. 1(g, h). The TMA sections were incubated in 3% hydrogen peroxide for 10 minutes to suppress the activity of endogenous peroxidase. They were then incubated with anti-p53 primary antibody (ready-to-use; cat# D0-7, Roche) for 1 hour at room temperature, followed by incubation with horseradish peroxidase-labeled immunoglobulin (Dako, Carpinteria, CA, USA) for 1 hour at room temperature. Peroxidase activity was visualized using a solution of diaminobenzidine (DAB) at room temperature. The WSIs were then acquired with a digital slide scanner (Leica AT Turbo) with a 20x objective lens. Abnormal p53 (mutation-type) staining is defined as either strong nuclear expression in tumor cells (>80%), the complete absence of expression in tumor cells, or unequivocal cytoplasmic expression. Normal p53 (wild-type) expression was defined as nuclear staining of variable intensity in the tumor cells34.

The TCGA and TMA datasets were processed separately. Stratified sampling was employed to divide the individual cohorts into patient-independent training sets (2/3) and testing sets (1/3) to ensure the proportional representation of important characteristics in each group. Furthermore, for training, we have divided the whole training set into training (9/10) and validation (1/10) subsets. Ethical approvals have been obtained from the research ethics committee of the Tri-Service General Hospital (TSGHIRB No.1-107-05-171 and No.B202005070).
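
A patient-independent stratified split of this kind can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: stratification here uses a single label per patient, the toy cohort and seeds are hypothetical, and any additional stratification factors (for example tissue source site) are omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(labels: Dict[str, str], held_out_fraction: float,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split patient IDs so each label keeps roughly the same share in both parts.

    Splitting by patient ID (not by slide) keeps all slides of one patient
    on the same side of the split.
    """
    by_label = defaultdict(list)
    for patient_id, label in sorted(labels.items()):
        by_label[label].append(patient_id)
    rng = random.Random(seed)
    kept, held_out = [], []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_held = round(len(ids) * held_out_fraction)
        held_out.extend(ids[:n_held])
        kept.extend(ids[n_held:])
    return kept, held_out


# Hypothetical toy cohort: 15 TMB-H and 30 TMB-low patients.
labels = {f"P{i:03d}": ("TMB-H" if i < 15 else "TMB-low") for i in range(45)}

# 2/3 training, 1/3 testing, then 9/10 training and 1/10 validation.
train_val, test = stratified_split(labels, held_out_fraction=1 / 3)
train, val = stratified_split({p: labels[p] for p in train_val},
                              held_out_fraction=1 / 10, seed=1)
```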

Overall results

In this study, the quantitative evaluation on the testing sets was performed in three parts: (1) the efficacy in EC subtyping, (2) the efficacy in TMB prediction for the aggressive and non-aggressive ECs individually, and (3) the efficacy as an indicator of patient prognosis using K–M survival analysis. In each part, TR-MAMIL was compared with the seven SOTA DL approaches, namely ClassicMIL31, CLAM26, Wang et al.32, Improved_InceptionV3_MS33, TOAD27, TransMIL28, and MRAN29, using the TCGA dataset. For the statistical analysis, we applied Fisher’s exact test to examine the associations between TR-MAMIL predictions and the actual cancer subtype or TMB status in aggressive and non-aggressive EC (see Fisher’s exact test below). Furthermore, we investigated the potential of the models as indicators of DSS and OS using K–M survival analysis (see K–M survival analysis below).

Quantitative evaluation in the classification of aggressive and non-aggressive endometrial cancer

In the evaluation on the testing set for the classification of aggressive (G3 endometrioid carcinoma and serous carcinoma: G3SC) and non-aggressive (G1 and G2 endometrioid carcinoma: G1G2) endometrial cancer, TR-MAMIL equipped with ResNet-50 Truncated as the backbone and the F1-score metric for model selection and early stopping achieved the best overall performance and outperformed all seven SOTA benchmark methods, obtaining the highest AUROC (88 ± 5%), sensitivity (93%), mean of sensitivity and specificity (MeanSS) (89%), and accuracy (89%) (see Table 1(a)). These results suggest the efficacy of TR-MAMIL for the classification of aggressive and non-aggressive EC. Furthermore, we compared the receiver operating characteristic (ROC) curves of all eight methods, and Fig. 2(a) shows that TR-MAMIL achieved superior performance in the identification of aggressive and non-aggressive EC.
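
For reference, MeanSS is the arithmetic mean of sensitivity and specificity. A minimal sketch of its computation from binary labels follows; the label vectors are hypothetical toy data, with 1 standing for the positive class (for example aggressive EC).

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def mean_ss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity (MeanSS)."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


# Toy data: 4 positives (3 detected) and 6 negatives (5 correctly rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
score = mean_ss(y_true, y_pred)
```

Because it weights both classes equally, MeanSS is less sensitive to class imbalance than plain accuracy.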

Table 1 Overall results with statistical analysis on the TCGA testing sets in (a) classification of aggressive and non-aggressive EC, (b) TMB prediction of the aggressive EC, and (c) TMB prediction of the non-aggressive EC
Fig. 2: Receiver operating characteristic curves (ROC).

a classification of aggressive and non-aggressive EC, b TMB prediction in the aggressive EC, c TMB prediction in the non-aggressive EC.

Quantitative evaluation in TMB and TP53 prediction

Secondly, we evaluated the model performance on the testing sets in predicting the TMB status of the aggressive and non-aggressive EC samples. For TMB prediction in the aggressive EC, TR-MAMIL equipped with ResNet-152 Truncated as the backbone and the Cross-Entropy metric for model selection and early stopping achieved the highest AUROC (82 ± 13%) and MeanSS (73%) (see Table 1(b)). For TMB prediction in the non-aggressive EC, TR-MAMIL equipped with ResNet-152 Truncated as the backbone and the F1-score metric for model selection and early stopping also demonstrated the best performance in comparison to the seven SOTA approaches, achieving 76% sensitivity and an AUROC of 56 ± 8% (see Table 1(c)). We also compared the ROC curves of all eight methods for TMB prediction in both aggressive and non-aggressive ECs; Fig. 2(b, c) shows that TR-MAMIL achieved the highest AUROC in both groups, outperforming the benchmark methods. These findings demonstrate the promising ability of our method to predict TMB status in both aggressive and non-aggressive ECs. Additionally, we evaluated the model performance on an independent TMA testing set in predicting the TP53 status (mutation type or wild type) of the EC samples (Table 2(a)). For TP53 prediction in the EC TMA testing set, TR-MAMIL equipped with ResNet-152 Truncated as the backbone and the Cross-Entropy metric for model selection and early stopping achieved the best performance compared to the seven SOTA methods, with accuracy, sensitivity, specificity, MeanSS, and AUROC reaching 78%, 70%, 86%, 78%, and 78 ± 4%, respectively.
For TP53 mutation prediction in the EC TCGA testing set (Table 2(b)), the TR-MAMIL framework, configured with the same ResNet-152 Truncated backbone but using the F1-score metric for model selection and early stopping, achieved an accuracy of 70%, sensitivity of 80%, specificity of 65%, MeanSS of 73%, and AUROC of 68 ± 8%.

Table 2 Overall results with statistical analysis on an independent TMA testing set and a TCGA testing set in (a) classification of abnormal p53 (mutation-type) and normal p53 (wild type), and (b) TP53 mutation prediction in EC

In addition, we have further evaluated the performance of the proposed method in prediction of four MSI biomarkers using the TMA dataset. Table 3 shows that the proposed method consistently obtains excellent performance in prediction of four MSI biomarkers.

Table 3 Evaluation in the prediction of four MSI biomarkers of EC, including (a) MLH1, (b) MSH2, (c) MSH6 and (d) PMS2

To demonstrate the model interpretability, Fig. 3(a, b) visualizes the attention maps of TR-MAMIL in the classification of aggressive and non-aggressive EC and in TMB prediction on two sample slides, one aggressive and one non-aggressive EC. Using a colormap overlaid on the images with 0.5 transparency, we highlight regions with high attention scores in red, indicating their significant contribution to the model’s prediction. In contrast, blue regions correspond to low attention scores and lower predicted influence.
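
The overlay described above can be sketched as a per-pixel alpha blend. The 0.5 transparency and the red-high, blue-low convention come from the text; the linear blue-to-red ramp and all function names are assumptions, since the exact colormap is not specified.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def attention_color(score: float) -> RGB:
    """Map an attention score in [0, 1] to a blue (low) to red (high) color."""
    s = min(max(score, 0.0), 1.0)
    return (round(255 * s), 0, round(255 * (1.0 - s)))


def blend(pixel: RGB, overlay: RGB, alpha: float = 0.5) -> RGB:
    """Alpha-blend an overlay color onto an image pixel."""
    return tuple(round((1 - alpha) * p + alpha * o)
                 for p, o in zip(pixel, overlay))


def overlay_heatmap(image: Sequence[Sequence[RGB]],
                    scores: Sequence[Sequence[float]],
                    alpha: float = 0.5) -> List[List[RGB]]:
    """Blend a per-pixel attention map onto an RGB image of the same shape."""
    return [[blend(px, attention_color(s), alpha)
             for px, s in zip(img_row, score_row)]
            for img_row, score_row in zip(image, scores)]


# Toy 1 x 2 image: the left pixel has high attention, the right pixel low.
image = [[(101, 100, 100), (101, 100, 100)]]
scores = [[1.0, 0.0]]
heatmap = overlay_heatmap(image, scores)
```

In practice the attention scores are computed per patch, so each score would be broadcast over the pixels of its patch before blending.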

Fig. 3: Attention heatmaps generated by the proposed model with K-M survival analysis results.

a Model interpretability in an aggressive EC and b a non-aggressive EC. c Kaplan–Meier survival analysis in disease-specific survival and d overall survival for TMB prediction in the aggressive EC.

Quantitative evaluation in FFPE and frozen tissue samples in the classification of aggressive and non-aggressive EC and TMB prediction

We further evaluated the model performance on the testing sets for FFPE and frozen tissue samples separately. Table 4 presents the experimental results of the proposed TR-MAMIL in the three tasks: classification of aggressive and non-aggressive EC, TMB prediction in the aggressive EC, and TMB prediction in the non-aggressive EC. For cancer subtyping, TR-MAMIL employs a truncated ResNet-50 backbone with the F1-score metric for model selection and early stopping. For TMB prediction in the aggressive subtypes, TR-MAMIL uses a truncated ResNet-152 backbone with Cross-Entropy as the metric for model selection and early stopping. For TMB prediction in the non-aggressive subtypes, a truncated ResNet-152 backbone is also employed, with Cross-Entropy as the metric for model selection and early stopping. In cancer subtyping, TR-MAMIL consistently performs well on both types of tissue slides. In TMB prediction, however, the TCGA cohort contains only 144 (16%) FFPE WSIs against 759 (84%) frozen WSIs, and TR-MAMIL obtains higher specificity on the frozen samples. The shortage of FFPE data could have restricted the ability of the model to learn a balanced decision boundary, resulting in lower specificity for FFPE tissues; adding more training data could resolve this.

Table 4 FFPE and Frozen tissue sample results with statistical analysis on the TCGA testing sets in (a) classification of aggressive and non-aggressive EC, (b) TMB prediction of the aggressive EC, and (c) TMB prediction of the non-aggressive EC

Statistical analysis

To assess the clinical potential of TR-MAMIL and further validate its performance, we performed a comprehensive statistical analysis with comparison to the seven SOTA DL approaches in the three tasks: classification of aggressive and non-aggressive EC, TMB prediction in the aggressive EC, and TMB prediction in the non-aggressive EC. Statistical analyses, comprising the two-tailed Fisher’s exact test and K–M survival analysis of DSS and OS, were conducted using the SPSS software35.

Fisher’s exact test

According to the two-tailed Fisher’s exact test, the associations between TR-MAMIL predictions and the cancer subtype or TMB status in the aggressive group are both extremely strong (p < 0.001), and the association between TR-MAMIL predictions and TMB status in the non-aggressive group is strong (p < 0.01) (see Table 1(a–c)). These findings support the performance of TR-MAMIL across all tasks: classification of the aggressive and non-aggressive EC subtypes and TMB prediction in the aggressive and non-aggressive EC subtypes.
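
As an illustration of the test itself (the analysis reported here was run in SPSS), the two-tailed Fisher’s exact test on a 2 × 2 table of predicted versus actual classes can be sketched from the hypergeometric distribution. The function name and the example table are hypothetical, and the two-tailed convention used below (summing all tables no more probable than the observed one) is one common choice that may differ from other software.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-7))  # tolerance for ties
    return min(1.0, total)


# Hypothetical table: rows are predicted classes, columns are actual classes.
p_value = fisher_exact_two_tailed(8, 2, 1, 5)
```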

Kaplan–Meier (K–M) survival analysis

As shown in Fig. 3(c, d), the proposed TR-MAMIL framework, equipped with a ResNet50 backbone and F1-score as the model selection and early stopping metric, successfully differentiates patients with longer disease-specific survival (DSS) and overall survival (OS), with significant differences (p < 0.01 for DSS, p < 0.05 for OS) between the TMB predicted classes in the aggressive EC. Furthermore, we assessed the proposed model’s capability in patient prognosis using K–M survival analysis of DSS, OS, and disease-free survival (DFS) for both TP53 prediction and combined TP53/TMB prediction outcomes, as presented in Tables 5 and 6, respectively. The TR-MAMIL framework, equipped with a ResNet152 backbone and Cross-Entropy as the model selection and early stopping metric, yields a significant difference in DFS (p < 0.05) using TP53 predictions in aggressive EC, as shown in Fig. 4. As shown in Fig. 5, integrating TP53 mutation and TMB predictions in aggressive EC yields a marginal difference in DFS (p = 0.087). These findings highlight the potential of the TR-MAMIL framework for improving patient prognostication in clinical applications.
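
For readers unfamiliar with the method, the Kaplan–Meier product-limit estimator behind such survival curves can be sketched as below. This is an illustrative implementation with hypothetical follow-up data, not the SPSS procedure used in this study, and it omits the log-rank test that produces the reported p-values.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the event (e.g. death) was observed at times[i]
    and 0 if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months) for one predicted class.
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
```

Computing one such curve per predicted class (for example predicted TMB-H versus predicted TMB-low) gives the stratified curves that a log-rank test then compares.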

Table 5 Kaplan–Meier survival analysis of TP53 predictions in aggressive EC
Table 6 Kaplan–Meier survival analysis by integrating TMB and TP53 predictions in aggressive EC
Fig. 4: Kaplan-Meier curves for survival outcomes in aggressive EC patients stratified by the predictive TP53 mutation status (wild-type TP53 vs. mutated TP53) generated from the proposed TR-MAMIL.

K-M survival analysis includes disease-free survival (DFS), disease-specific survival (DSS), and overall survival (OS).

Fig. 5: Kaplan-Meier survival analysis integrating TMB and TP53 predictions by the proposed TR-MAMIL in aggressive EC.

K-M survival analysis includes disease-free survival (DFS), disease-specific survival (DSS), and overall survival (OS).

These compelling findings highlight the great potential of TR-MAMIL in TMB prediction as an indicator of a patient’s DSS and OS for the aggressive EC.

Discussion

We present the first interpretable DL models to predict TMB status and classify the EC subtype directly from H&E-stained WSIs, enabling effective personalized immunotherapy planning and prognostic refinement for EC patients. Although the majority of EC cases are diagnosed early with a good prognosis, 16% of EC patients still experience disease metastasis, with a 5-year survival rate of only 16.8%. For patients who develop extra-uterine lesions after recurrence, systemic chemotherapy combined with radiotherapy is necessary. However, effective treatment options remain limited for patients who experience disease progression after standard therapy36. Histopathological subtyping and molecular profiling of EC have now become important indicators for guiding treatment and assessing prognosis. In the 2023 revised staging system for EC, aggressive tumor types limited to the endometrium (previously classified as stage IA) are upgraded to stage IC, and aggressive tumor types with any myometrial invasion (previously classified as stage IA or IB) are staged as IIC2. TR-MAMIL performs remarkably well in distinguishing between aggressive and non-aggressive EC, which can help improve the accuracy of pathological diagnosis and clinical staging. The identification of TMB, an established predictive biomarker for cancer immunotherapy, has further enhanced the precision of treatment strategies6,7,37, as TMB provides precise and comprehensive information for determining the efficacy of immunotherapies in EC8,9. A major advance in the diagnosis and treatment of EC has been the ability to molecularly segregate and classify these carcinomas. Molecular features can be used to estimate the risk of recurrence and hence impact survival.
Perhaps the most impactful molecular classification is that proposed by TCGA, which classifies EC into four categories: (1) POLE/ultramutated, with an excellent prognosis; (2) microsatellite instability-high (MSI-H)/hypermutated, with an intermediate prognosis; (3) somatic copy-number alteration high (SCNA-high)/serous-like, with nearly universal (95%) TP53 mutations and a highly unfavorable prognosis; and (4) somatic copy-number alteration low (SCNA-low), with low copy-number alterations and low mutational burden2. The pooled overall prevalence of dMMR was high for EC (26.8%)38. The pooled overall prevalence of high TMB (≥10 mutations/Mb) was also high in EC (43.0%)38. We show that TR-MAMIL can make reasonably accurate predictions, particularly in aggressive and non-aggressive EC classification, and can predict TMB status directly from H&E-stained WSIs for EC samples. The TCGA molecular-based classification can be practically applied in clinical settings by using a simplified surrogate that includes three immunohistochemical markers (TP53, MSH6, and PMS2) and one molecular test (analysis for pathogenic POLE mutations). This surrogate approach classifies EC into four groups: POLE-mutant, MMR-deficient, TP53-abnormal, and no specific molecular profile (NSMP)2. In terms of systemic treatment for primary advanced and recurrent EC, two randomized Phase III trials (ENGOT-en6/GOG-3031/RUBY and NRG-GY018/Keynote-868) have demonstrated a statistically significant and unprecedented progression-free survival (PFS) advantage from adding ICIs (dostarlimab or pembrolizumab, respectively) to standard carboplatin/paclitaxel chemotherapy, followed by ICI maintenance therapy, in dMMR patients, with hazard ratios (HR) of 0.28 (95% confidence interval [CI] 0.16–0.5) and 0.30 (95% CI 0.19–0.48), respectively. The success of ICI therapy has marked a crucial milestone in treating advanced-stage cancers39,40,41,42.
Immune cells play dual roles in tumor development, promoting tumor progression while also clearing tumor cells39. ICI therapy has achieved many significant positive research outcomes. It is well known that, in physiological conditions, checkpoint pathways prevent excessive T-cell activation that may result in loss of self-tolerance43; for example, PD-L1 and CTLA-4 regulate the stimulation of the immune microenvironment. Malignancies are able to adapt to these responses and to exploit checkpoint pathways to promote tumor growth, and thus, ICIs can help to reactivate T-cell function, leading to the killing of the tumor cells44. In many malignant tumors, ICI therapy can induce type I inflammation and enhance cytotoxic T cells, type 1 T helper cells, and M1 macrophages to eliminate tumor cells45,46. Since CD8+ T cells are the primary direct effectors of cytotoxic responses to cancer cells, after PD-1 blockade, CD8+ T cells expand, and the change is considered the result of effective anti-tumor immunity, also correlating with positive clinical outcomes47. TMB has emerged as a quantitative genomic biomarker capable of predicting response to ICIs beyond MSI48,49.

In the study by Goodman et al.50, the response rates to anti-PD-1/PD-L1 therapy were 58% for TMB-H patients and 20% for TMB-low patients, indicating an association between TMB and improved response to immunotherapy. Similarly, in the pivotal KEYNOTE-158 study, the ORR was 29% for TMB-H tumors compared to 6% for non-TMB-H tumors51. High TMB is prevalent across almost every type of cancer; as a result, patients identified as having high TMB may benefit from immunotherapy in nearly all types of cancer17. Based on the findings of the KEYNOTE-158 study, pembrolizumab has recently been approved by the U.S. FDA for use in TMB-H tumors regardless of histological type, including EC10. In the molecular subtypes of EC, the TMB in the MSI-H group typically exceeds 50 mutations/Mb, suggesting that immunotherapy may be used. In the SCNA-high group, TP53 mutations are commonly present, and the prognosis is the worst. These tumors usually have a low mutation rate and respond poorly to immunotherapy. However, a minority of specimens are microsatellite stable but exhibit exceptionally high TMB. If TMB testing is not performed, these patients may miss the opportunity to receive ICI treatment. Additionally, in POLE mutant types, TMB often exceeds 300 mutations/Mb. Due to the excellent prognosis, these patients generally do not require immunotherapy, but a small number of advanced-stage EC patients with POLE mutations and TMB-H may still have the opportunity to receive ICI treatment2,52,53. Recently, pembrolizumab, a monoclonal antibody targeting PD-1, was also approved for use in any TMB-high (≥10) tumor, regardless of histology54.

Evaluated in TCGA EC data, TR-MAMIL demonstrated superior performances compared to seven state-of-the-art methods26,27,28,29,31,32,33. TR-MAMIL achieved outstanding performance in the classification of the aggressive and non-aggressive EC, with 97%, 93%, 89%, and 89% for the area under the receiver operating characteristic curve (AUROC), sensitivity, mean of sensitivity and specificity (MeanSS) and accuracy, respectively. For TMB prediction in the aggressive EC, TR-MAMIL achieves 73% and 78% for the MeanSS and AUROC, respectively. In predicting TMB status in the non-aggressive EC, TR-MAMIL also demonstrated the best performance compared to the seven SOTA approaches, achieving 76% sensitivity and 70% AUROC. Furthermore, according to Fisher’s exact test, the associations between TR-MAMIL prediction and the actual cancer subtype or TMB status in the aggressive group are both extremely strong (p < 0.001), and the association between the prediction of TR-MAMIL and the actual TMB status in the non-aggressive group is strong (p < 0.01). According to the 2020 edition of the World Health Organization Classification, in EC, TP53 mutation indicates a poor prognosis. High-grade endometrioid carcinoma (G3) exhibits diversity in prognosis, clinical presentation, and molecular characteristics, and it is also the tumor type most likely to benefit from molecular classification. In our study, analysis of the high-grade endometrioid carcinoma (G3) group can predict whether it belongs to the group with poor prognosis (TP53 mutation)2,55. Notably, our methods consistently outperformed seven state-of-the-art benchmarked approaches in computational pathology. Importantly, according to the Kaplan–Meier survival analysis, the results show that TR-MAMIL successfully differentiates patients with longer DSS and OS with significant differences (p < 0.01 for DSS, p < 0.05 for OS) between the TMB predicted classes in the aggressive EC. 
These compelling findings highlight the potential of TR-MAMIL to guide personalized treatment decisions by accurately predicting the EC cancer subtype and the TMB status for effective immunotherapy planning for EC patients. Moreover, a run-time analysis demonstrates that TR-MAMIL achieves high efficiency in inference time, taking only 26.21 seconds per slide on average, which makes TR-MAMIL feasible for practical clinical usage.

TR-MAMIL was evaluated using a variety of backbones for the feature encoder and a variety of metrics for the model selection with early stopping mechanisms. We found that employing a 1024-dimensional truncated ResNet outperformed the original 2048-dimensional ResNet with an average of 10% improvement in metrics over the original ResNet, with 21% and 2% faster training and inference times, respectively. This observation aligns with the tendency of deeper neural network layers to specialize in filters particular to features relevant to the source task and data. Interestingly, our study revealed a task-dependent optimal network depth for classifying endometrial cancer. For morphological classification (i.e., classification of aggressive and non-aggressive EC), a shallower network architecture achieved superior performance. This likely arises from the inherent simplicity of the features used in this task, such as cell morphology, texture, and spatial patterns. Shallower networks effectively capture these patterns without requiring complex non-linear transformations56. Conversely, molecular classification, predicting TMB within both aggressive and non-aggressive ECs, benefited from the increased representational power of deeper networks. These architectures excel at extracting subtle relationships between diverse genomic features and phenotypic expression, uncovering complex non-linear patterns and hidden relationships within the data. Additionally, integrating a stable covariate with the deep features extracted from the histological slide proved to be beneficial for enhancing the model performance, resulting in an average performance improvement of 7%, 10%, and 9% in terms of the mean of sensitivity and specificity, AUPRC and AUROC, respectively. 
While this study provides compelling evidence for the benefits of incorporating a stable covariate, larger-scale clinical validation is needed to understand the underlying mechanisms and to optimize the integration.

Given the relatively high cost of pembrolizumab and other potential targeted treatments, a key question to inform health system planning and budget impact evaluations is how many patients might be eligible for these treatments based on the presence of the biomarkers38. Overall, our framework demonstrates the potential to accurately identify aggressive and non-aggressive ECs and predict TMB status in the aggressive and non-aggressive EC subtypes directly from H&E slides. This capability holds promise for improving patient care and treatment outcomes. Here, we utilized DL to analyze WSIs and found that it can distinguish between non-aggressive and aggressive ECs. Simultaneously, we leveraged TMB data provided by TCGA to train the model for predicting TMB status. Within the molecular classification of EC, the dMMR group shows a response to ICI. In clinical practice, dMMR can be assessed through relatively simple immunohistochemical staining (IHC) and polymerase chain reaction (PCR), replacing the need for complex and expensive NGS. MSI-H/dMMR can be easily assessed through IHC for the loss of expression of one of the four MMR proteins (MLH1, PMS2, MSH2, and MSH6). For MSI testing through PCR, instability at ≥2 loci is defined as MSI-high, instability at a single locus is defined as MSI-low, and no instability at any of the tested loci is defined as microsatellite stable (MSS). However, TMB analysis requires the use of NGS, making the use of DL to analyze pathology WSIs a promising approach57. While TR-MAMIL offers promising results for direct TMB prediction from H&E-stained WSIs of EC, the inherent complexity of this task necessitates further refinement. Future studies exploring novel approaches and data integration strategies hold the key to unlocking improved performance. Beyond the successful application in this study, TR-MAMIL holds promise for broader utilization. We aim to explore its potential in various histopathological image analysis tasks. 
Further validation on a larger and more diverse database is crucial to demonstrate the robustness and generalizability of TR-MAMIL across clinical settings. Overcoming these challenges could lead to more effective treatment decisions and improved patient outcomes. Successful development of these models could revolutionize cancer care, enabling personalized and precise treatment strategies.

Methods

This study introduced three highly effective and efficient multilayer attention multiple instance DL approaches for the classification of aggressive and non-aggressive EC and for TMB prediction in the aggressive and non-aggressive EC directly from H&E-stained WSIs. Ethical approvals were obtained from the research ethics committee of the Tri-Service General Hospital (TSGHIRB No.1-107-05-171 and No.B202005070). All the slides were directly downloaded from the TCGA platform, and we did not apply any pre-processing techniques such as stain normalization or data augmentation. Firstly, we established a uniform data representation by consistently segmenting tissue and extracting patch coordinates at 20× magnification, employing a simple thresholding method for foreground segmentation before extracting the patches (see Fig. 7(a)). Secondly, two effective and efficient truncated ResNet-based feature encoders, including truncated ResNet50 and truncated ResNet152, were devised for capturing the relevant morphological and molecular features (see Fig. 7(b)). These encoders demonstrated an average of 10% improvement in all metrics over the original ResNet (see Table 7), with 21% and 2% faster training and inference times, respectively (see Fig. 6(b)). Thirdly, a multilayered attention MIL module was proposed to identify informative regions in the slide automatically and efficiently. Fourthly, a stable factor was integrated with the slide-level features as a covariate, not only to refine the final slide-level probability prediction score for each WSI but also to allow the model to learn more complex features by forcing it to learn more generalized relationships, thereby improving the average performance of TR-MAMIL by 7% and 9% in MeanSS and AUROC, respectively (see Table 9 and Fig. 7(c)). 
Finally, to address the potential overfitting issue, which can affect the model generalizability, we devised model selection strategies with early stopping mechanisms to help produce the best models (see Fig. 7(d)). The workflow diagram in this study is provided in Fig. 7.

Table 7 Performance comparison among the original and truncated ResNets on the testing set in the classification of aggressive and non-aggressive EC
Fig. 6: Ablation studies on truncated backbones and the stable covariate.

a Architecture comparison between the original and truncated ResNets, b Run-time analysis between the original and truncated ResNets in classification of aggressive and non-aggressive EC, c Performance comparison with or without truncated backbones and the covariate in classification of aggressive and non-aggressive EC, d Loss comparison on the validation sets for: d-i classification of aggressive and non-aggressive EC, d-ii TMB prediction in the aggressive EC, and d-iii TMB prediction in the non-aggressive EC.

Fig. 7: Overview of the whole process for TR-MAMIL.

a Foreground segmentation and patching process, b feature extraction with truncated ResNet backbone: b-i the backbone architecture of the truncated ResNet50, b-ii the backbone architecture of the truncated ResNet152, c the multilayer attention-based MIL and the final classification process, d the model selection and early stop mechanism for the proposed methods.

Tissue segmentation and patching

In this study, our pipeline begins by automatically segmenting tissue regions in each digitized slide. We established a uniform data representation by consistently segmenting and extracting tissue patches at 20× magnification. This greatly helps TR-MAMIL achieve optimal model performances. The WSI is initially loaded into memory at a downsampled resolution (e.g., 32× downscale). We then convert the image from RGB to the HSV color space and generate a binary mask outlining tissue regions (foreground) by thresholding the saturation channel. To refine the mask, we apply median blurring and morphological closing operations to fill gaps and holes. Detected foreground objects with contours above a specified area threshold are stored for downstream processing. The segmentation mask is also available for visual inspection. After segmentation, our algorithm extracts 256 × 256-pixel patches from the segmented foreground contours at a user-defined magnification level. The coordinates of these patches and the slide metadata are stored in the HDF5 hierarchical data format (see Fig. 7(a)).
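The thresholding and patching steps described above can be illustrated with a minimal, dependency-free sketch. It is not the pipeline used in this study: the saturation threshold (0.08), the minimum tissue fraction (0.5), and the helper names `saturation_mask` and `patch_coordinates` are assumptions chosen for illustration, and the median blurring, morphological closing, and contour filtering steps are omitted.

```python
import colorsys

def saturation_mask(rgb_rows, sat_threshold=0.08):
    """Binary tissue mask: True where the HSV saturation exceeds the threshold.
    The threshold value is an illustrative assumption."""
    return [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] > sat_threshold
         for (r, g, b) in row]
        for row in rgb_rows
    ]

def patch_coordinates(mask, patch_size, min_tissue_fraction=0.5):
    """Top-left (x, y) of non-overlapping patches that are mostly tissue."""
    height, width = len(mask), len(mask[0])
    coords = []
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            tissue = sum(
                mask[yy][xx]
                for yy in range(y, y + patch_size)
                for xx in range(x, x + patch_size)
            )
            if tissue / patch_size ** 2 >= min_tissue_fraction:
                coords.append((x, y))
    return coords

# Toy 4x4 "thumbnail": left half pink (stained tissue), right half white (glass).
pink, white = (230, 120, 170), (255, 255, 255)
thumbnail = [[pink, pink, white, white] for _ in range(4)]

mask = saturation_mask(thumbnail)
coords = patch_coordinates(mask, patch_size=2)  # only the left-hand patches survive
```

In a real setting the mask would be computed on the downsampled slide and the retained coordinates scaled back to the 20× level before 256 × 256-pixel patches are read.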

Feature extraction

After patching, a deep CNN is utilized to compute a concise feature representation for each image patch within every slide (see Fig. 7(b)). We build effective and efficient truncated ResNet-based feature encoders pre-trained on ImageNet for capturing the relevant morphological and molecular features. Specifically, a truncated ResNet50 serves as the backbone in the classification of aggressive and non-aggressive EC (see Fig. 7(b-i)), while a truncated ResNet152 is employed for TMB prediction in both the aggressive and the non-aggressive EC (see Fig. 7(b-ii)). These models are truncated after the third residual block, followed by an adaptive average-spatial pooling layer, transforming each 256 × 256-pixel patch into a 1024-dimensional feature vector (see Fig. 6(a) for the detailed comparison between the proposed 1024-dimensional truncated ResNet and the 2048-dimensional original ResNet for ResNet50 and ResNet152, respectively). Utilizing these features in supervised learning provides advantages such as accelerated training times, reduced computational costs, and the ability to train on thousands of WSIs efficiently within a few hours. Moreover, using low-dimensional features allows simultaneous processing of all patches within a slide (up to 150,000) on a single consumer-grade GPU, eliminating the need for patch sampling and addressing issues related to noisy labels.
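The 1024- versus 2048-dimensional feature sizes follow from the standard bottleneck ResNet layout, as the short calculation below illustrates. It assumes the usual ResNet50/101/152 geometry (stage widths of 64, 128, 256, and 512 with a bottleneck expansion of 4, and a stem that downsamples by 4) and 32-bit floating-point storage; the function names are hypothetical.

```python
# Standard bottleneck ResNet (ResNet50/101/152) geometry, assumed here.
STAGE_WIDTHS = (64, 128, 256, 512)  # bottleneck widths of the four residual stages
EXPANSION = 4                       # each stage outputs width * 4 channels
STEM_STRIDE = 4                     # 7x7 conv (stride 2) followed by max-pool (stride 2)

def feature_shape(input_size, num_stages):
    """(channels, spatial size) after the first `num_stages` residual stages."""
    channels = STAGE_WIDTHS[num_stages - 1] * EXPANSION
    # Stage 1 keeps the stem resolution; every later stage halves it.
    spatial = input_size // (STEM_STRIDE * 2 ** (num_stages - 1))
    return channels, spatial

def bag_megabytes(num_patches, dim, bytes_per_value=4):
    """Memory of one slide-level bag of patch features (float32 assumed)."""
    return num_patches * dim * bytes_per_value / 1e6

truncated = feature_shape(256, num_stages=3)  # three stages kept
original = feature_shape(256, num_stages=4)   # all four stages

# Largest bag mentioned in the text: 150,000 patches of pooled features.
largest_bag_mb = bag_megabytes(150_000, truncated[0])
```

Under these assumptions, a 256 × 256 patch yields a 1024 × 16 × 16 map after the third stage, which average pooling reduces to a 1024-dimensional vector, and the largest bag occupies roughly 0.6 GB, which is consistent with processing a whole slide on one consumer-grade GPU.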

Ablation studies

We conducted ablation experiments to validate the efficacy and efficiency of the core components in TR-MAMIL, including (1) a comparison of the original and the proposed truncated ResNet as the backbone in the feature extraction process for EC subtyping, (2) a run-time analysis comparing the original and the proposed truncated ResNet, and (3) a comparison of different feature extraction methods.

Comparison of truncated and original ResNets

Firstly, we compared the performances of TR-MAMIL using the original or truncated ResNet with three different depths, including ResNet50, ResNet101, and ResNet152, in the classification of aggressive and non-aggressive EC to investigate the most suitable backbone architecture. The original ResNet architecture with four residual blocks generates 2048-dimensional feature vectors, leading to high computational costs in feature extraction for both model training and inference. To address this, we truncated the ResNet by removing the fourth residual block, resulting in 1024-dimensional feature vectors (see Fig. 6(a) for a comparison of the truncated and the original ResNets). The results evaluated on the testing set demonstrated that the proposed 1024-dimensional truncated ResNet models consistently outperformed the original full 2048-dimensional ResNets with 9%, 3%, 18.67%, 13.67%, and 8% improvement in accuracy, sensitivity, specificity, MeanSS, and AUROC, respectively, and an average of 10% overall improvement in all metrics (see Table 7). The outcome could be attributed to the observation that later layers of a deep neural network tend to learn patterns that are increasingly specific to the source task and data of the pre-trained model, which in this study is natural image classification on ImageNet. In contrast, features from earlier layers are characterized by their generality and applicability to diverse datasets. Since histopathology images differ significantly from natural images, using features from later ResNet layers did not necessarily improve performance compared to earlier layers. Hence, the worse performance of full ResNets may be caused by features of the later layers being too specialized for natural images and not suitable for the different textures and patterns in histopathology. 
The refinement strategy of truncated ResNets not only enhances informative representations for downstream tasks but also reduces the feature dimension significantly and lowers computational costs (run-time analysis is provided in the next section).
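For reference, the evaluation metrics used in this comparison can be computed from a binary confusion matrix as sketched below, where MeanSS is the mean of sensitivity and specificity. The confusion-matrix counts are hypothetical and serve only to show the calculation; the list of per-metric gains repeats the values reported above.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MeanSS from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mean_ss": (sensitivity + specificity) / 2,
    }

# Hypothetical counts, for illustration only (not results from this study).
metrics = binary_metrics(tp=18, fn=2, tn=24, fp=6)

# Per-metric gains reported in the text (accuracy, sensitivity, specificity,
# MeanSS, AUROC), in percent; their mean rounds to the quoted 10%.
reported_gains = [9, 3, 18.67, 13.67, 8]
average_gain = sum(reported_gains) / len(reported_gains)
```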

Run-time analysis with comparison of truncated and original ResNets

Secondly, we performed a run-time analysis comparing the original 2048-dimensional ResNet and the 1024-dimensional truncated ResNet on a single local workstation with an Intel(R) Xeon(R) CPU E5-2650 v2 (2.60 GHz) and an NVIDIA GeForce GTX 1080 Ti GPU. In evaluation, the training and inference time includes the time to perform tissue segmentation, patch extraction, feature extraction, and model training or inference. The run-time analysis presented in Fig. 6(b) reveals that all truncated ResNet models run faster than the original ResNets, by up to 1.46 times, with an overall average of 52.21 seconds/slide for training and 26.21 seconds/slide for inference. In contrast, the 2048-dimensional original ResNet takes 73.57 seconds/slide for training and 28.44 seconds/slide for inference, which is 21% and 2% longer than the training and inference time of the 1024-dimensional truncated ResNet, respectively. These findings indicate that using 1024-dimensional truncated ResNet features from an earlier convolution layer of the pre-trained ResNet offers more efficient data processing and training and reduces disk storage requirements. Therefore, truncated ResNets are used as the backbone for TR-MAMIL in feature extraction.

Comparison with SSL-based backbones (ResNets and ViTs)

Thirdly, we compared the proposed feature encoders with self-supervised alternatives. Self-supervised learning (SSL) has demonstrated its effectiveness in utilizing unlabeled data, and its application to pathology could greatly benefit downstream tasks58. Recent studies also highlight the growing popularity of Transformer-based models in medical applications28,59,60,61,62. Therefore, we compare our proposed truncated ResNet feature encoders with five SSL-based feature encoders, including two SSL-based ResNets and three SSL-based Vision Transformer (ViT) feature encoders. For the two SSL-based ResNets, we compared two clustering-guided contrastive learning (CCL)-based feature encoders, i.e., CCL-ResNet5063 and CCL-ResNet152. For SSL-based ViTs, we evaluated three models trained with a simple self-supervised method, namely self-distillation with no labels (DINO), including (1) a DINO-ViT-S/16 pre-trained on ImageNet64, (2) Lunit’s DINO-ViT-S/16 pre-trained on pathology images58 and (3) Lunit’s DINO-ViT-S/16 with additional data normalization. All models were evaluated on the testing sets with identical model selection and early stopping setups on each task, i.e., F1-score for both EC subtyping and TMB prediction in the non-aggressive EC, and cross-entropy for TMB prediction in the aggressive EC.

As demonstrated in Table 8, the feature encoder with the truncated ResNet as the backbone outperforms the SSL-based feature encoders in the classification of aggressive and non-aggressive ECs (see Table 8(a)), TMB prediction in the aggressive EC (see Table 8(b)) and TMB prediction in the non-aggressive EC (see Table 8(c)). With a CCL-based feature encoder, i.e., CCL-ResNet5063 or CCL-ResNet152, the proposed framework tends to have high sensitivity with very low specificity. Lunit’s DINO-ViT-S/16 with and without data normalization58 tends to perform well in the classification of aggressive and non-aggressive EC and in TMB prediction for the aggressive EC but performs worse in TMB prediction for the non-aggressive EC. Meanwhile, DINO-ViT-S/1664 tends to have a stable performance in all of the tasks but still cannot surpass the proposed truncated ResNet feature encoders. These results further underscore the efficacy of our proposed truncated ResNet feature encoders as part of our proposed framework.

Table 8 Performance comparison of TR-MAMIL using different feature extractor methods on the testing sets in (a) classification of aggressive and non-aggressive EC, (b) TMB prediction for the aggressive EC, and (c) TMB prediction for the non-aggressive EC

Multilayered attention module

Each slide is processed as a bag of feature embeddings from its constituent patches during training. The entire bag, containing N patches, is fed as a single input of dimension N × 1,024 to the MIL network, where 1,024 is the dimension of the fixed feature vector produced previously in the feature extraction step. The network employs two stacked fully connected layers, Fc1 and Fc2, to transform these patch embeddings into histology-specific feature representations \(\left\{{{\bf{h}}}_{k}\right\}\). These layers are characterized by weight matrices and bias vectors of \({W}_{1}\in {{\mathbb{R}}}^{512\times 1,024},{{\bf{b}}}_{1}\in {{\mathbb{R}}}^{512}\) and \({W}_{2}\in {{\mathbb{R}}}^{512\times 512},{{\bf{b}}}_{2}\in {{\mathbb{R}}}^{512}\), respectively, and each is followed by a rectified linear unit (ReLU) activation function. This multilayer architecture allows the model to learn deep features suitable for WSI analysis by tuning the representations extracted through transfer learning, mapping the set of patch feature embeddings \(\left\{{{\bf{z}}}_{k}\right\},{{\bf{z}}}_{k}\in {{\mathbb{R}}}^{1,024}\) in a given WSI to 512-dimensional vectors:

$${{\bf{h}}}_{k}=ReLU\left({W}_{2}\left(ReLU\left({W}_{1}{{\bf{z}}}_{k}+{{\bf{b}}}_{1}\right)\right)+{{\bf{b}}}_{2}\right)$$
(1)

Our approach utilized a multilayered attention module consisting of an Attention Fully Connected 1 (Attn-Fc1) layer and an Attention Fully Connected 2 (Attn-Fc2) layer with weight parameters \({V}_{a}\in {{\mathbb{R}}}^{384\times 512}\) and \({U}_{a}\in {{\mathbb{R}}}^{384\times 512}\), together with task-specific weights, denoted as \({W}_{a,t}\in {{\mathbb{R}}}^{1\times 384}\) for each task t (see Fig. 7(c)). The module is trained to assign an attention score \({a}_{k,t}\) to each patch k. After softmax activation, a high score (near 1) indicates the importance of the region for the slide-level classification task, while a low score (near 0) suggests that the region lacks diagnostic or prognostic value.

$${a}_{k,t}=\frac{\exp \left\{{W}_{a,t}\left(\tanh ({V}_{a}{{\bf{h}}}_{k})\odot {\rm{sigm}}({U}_{a}{{\bf{h}}}_{k})\right)\right\}}{\mathop{\sum }\nolimits_{j = 1}^{N}\exp \left\{{W}_{a,t}\left(\tanh ({V}_{a}{{\bf{h}}}_{j})\odot {\rm{sigm}}({U}_{a}{{\bf{h}}}_{j})\right)\right\}}$$
(2)

where the bias parameters are omitted for simplicity, \(\odot\) denotes the element-wise product, sigm denotes the sigmoid activation function, and N represents the total number of patch embeddings in each slide.

Attention pooling aggregates the feature representations \(\left\{{{\bf{h}}}_{k}\right\}\) of all patches in the slide through averaging, with weights determined by their respective predicted attention scores \(\left\{{a}_{k,t}\right\}\). The resulting feature vector, \({{\bf{h}}}_{slide,t}\in {{\mathbb{R}}}^{512}\), serves as the histology deep features, representing the entire slide for task t.

$${{\bf{h}}}_{slide,t}=\mathop{\sum }\limits_{k=1}^{N}{a}_{k,t}{{\bf{h}}}_{k}$$
(3)

This trainable aggregation function intuitively enables the network to automatically identify informative regions in the slide, facilitating the classification of aggressive and non-aggressive EC and TMB prediction without the need for detailed annotations outlining precise tumor regions.
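Equations (1) to (3) can be traced end to end with the following plain-Python sketch for a single task. It is an illustration under stated assumptions rather than the implementation used in this study: the toy dimensions stand in for 1024, 512, and 384, the weights are random, the attention biases are omitted as in Eq. (2), dropout is left out, and all function names are hypothetical.

```python
import math
import random

def linear(W, x, b=None):
    """Matrix-vector product W x (+ b) on plain Python lists."""
    out = [sum(w * v for w, v in zip(row, x)) for row in W]
    return out if b is None else [o + bi for o, bi in zip(out, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def attention_pool(Z, W1, b1, W2, b2, Va, Ua, Wa):
    # Eq. (1): two stacked fully connected layers with ReLU.
    H = [relu(linear(W2, relu(linear(W1, z, b1)), b2)) for z in Z]
    # Eq. (2): gated attention logits, then a softmax over the patches.
    logits = []
    for h in H:
        gated = [math.tanh(v) * sigmoid(u)
                 for v, u in zip(linear(Va, h), linear(Ua, h))]
        logits.append(linear(Wa, gated)[0])
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Eq. (3): attention-weighted sum of the patch representations.
    h_slide = [sum(a * h[i] for a, h in zip(attn, H)) for i in range(len(H[0]))]
    return attn, h_slide

def random_matrix(rng, rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D_IN, D_HID, D_ATT, N = 8, 4, 3, 5  # toy stand-ins for 1024, 512, 384 and the bag size
Z = random_matrix(rng, N, D_IN, scale=1.0)
W1, b1 = random_matrix(rng, D_HID, D_IN), [0.0] * D_HID
W2, b2 = random_matrix(rng, D_HID, D_HID), [0.0] * D_HID
Va, Ua = random_matrix(rng, D_ATT, D_HID), random_matrix(rng, D_ATT, D_HID)
Wa = random_matrix(rng, 1, D_ATT)

attn, h_slide = attention_pool(Z, W1, b1, W2, b2, Va, Ua, Wa)
```

The attention scores form a probability distribution over the patches of one slide, so the slide-level vector is a convex combination of the patch representations.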

Stable covariate integration and classification

A stable factor s is encoded as an additional covariate with a binary value and concatenated to the slide-level deep features extracted from the histology slide (see Fig. 7(c)), generating a feature vector \(concat\left(\left[{{\bf{h}}}_{slide,t},s\right]\right)\). For the final classification (cls) layer, the slide-level probability prediction score \({p}_{t}\) for task t is computed with a softmax function as follows:

$${p}_{t}=softmax\left({W}_{cls,t}\left(concat\left(\left[{{\bf{h}}}_{slide,t},s\right]\right)\right)+{{\bf{b}}}_{cls,t}\right)$$
(4)

where softmax denotes the softmax activation and concat denotes concatenation. This study treats all tasks, including classification of the aggressive and non-aggressive EC and TMB prediction for the aggressive and non-aggressive ECs, as binary classification problems. Therefore, the task-specific classification layers are defined by \({W}_{cls,t}\in {{\mathbb{R}}}^{2\times 513}\).
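A minimal sketch of Eq. (4) is given below, assuming a toy 3-dimensional slide feature in place of the 512-dimensional one, so that the classification weights are 2 × 4 rather than 2 × 513. The weights and the function name `classify_slide` are hypothetical and chosen only to show how the binary covariate enters the final layer.

```python
import math

def softmax(v):
    peak = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def classify_slide(h_slide, s, W_cls, b_cls):
    """Eq. (4): softmax(W_cls * concat([h_slide, s]) + b_cls)."""
    x = list(h_slide) + [float(s)]  # 512 + 1 = 513 values in the paper's setting
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(W_cls, b_cls)]
    return softmax(logits)

# Toy slide feature and hypothetical weights (last column multiplies the covariate).
h_slide = [0.2, 0.0, 1.0]
W_cls = [[0.5, -0.3, 0.1, 0.4],
         [-0.2, 0.6, 0.3, -0.4]]
b_cls = [0.0, 0.0]

p_with_s1 = classify_slide(h_slide, 1, W_cls, b_cls)
p_with_s0 = classify_slide(h_slide, 0, W_cls, b_cls)
```

Because the covariate only adds one column to the weight matrix, it shifts the two class logits by a learned constant whenever s = 1 and leaves them unchanged when s = 0.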

Impact of incorporating a stable factor as a covariate

We further investigated the impact of incorporating the gender information as a stable factor covariate with the slide-level features by comparing model performance with or without the covariate in EC subtyping and TMB prediction on the testing sets. As shown in Table 9, adding the covariate improved the model performance by an average of 7% and 9% in terms of MeanSS and AUROC, respectively. The improvement in the aggressive and non-aggressive EC classification was even larger, with increases of 24% in both MeanSS and AUROC. Furthermore, as demonstrated in Fig. 6(c), TR-MAMIL with the covariate consistently performed better in all metrics than the variant without the covariate on the same backbone.

Table 9 Performance comparison of models with or without the covariate on the testing sets in (a) classification of aggressive and non-aggressive EC, (b) TMB prediction for aggressive ECs, and (c) TMB prediction for non-aggressive ECs

Integrating a stable factor adds a concatenation layer, potentially allowing the model to learn more complex features by combining information from different network layers. This additional concatenation layer could enhance the expressive power of the model and improve performance in various tasks65. Further investigation focused on cases where both models (with and without the covariate) failed in prediction. Figure 6(d) demonstrates that models with the stable covariate consistently achieve lower losses than those without the covariate on the validation sets for all three applications. This lower loss suggests that the additional covariate acts as a regularizer, preventing overfitting to the training data. Overfitting occurs when the model memorizes training data too well and fails to generalize to unseen examples. By incorporating the stable covariate, the model is forced to learn more generalizable relationships, resulting in lower training and testing losses.

Model selection with early stop mechanism

We devised two model selection strategies with early stop mechanisms for different tasks (see Fig. 7(d)). The performance of the model was evaluated on the validation set in each epoch d, generating a performance measurement indicator \({Q}_{d}\). If \({Q}_{d}\) on the task t had not improved for δ1 epochs, the early stop mechanism was triggered at the epoch \({d}^{s}\); the training then continued for another δ2 epochs and stopped at the epoch \({d}^{e}\), where δ1 = 50 and δ2 = 20 in this study. The proposed model selection chose the model \({M}_{{i}^{*}}\) with the optimal \({Q}_{d}\) between \({d}^{s}\) and \({d}^{e}\). If multiple models had the same best score, the model from the earliest training epoch was chosen for the final model prediction through per-slide inference.

For EC subtyping and for TMB prediction in the non-aggressive EC, the F1-score (\({F}_{d}\)) is used as the model performance measurement indicator \({Q}_{d}\) for model selection and early stop.

$$\left\{i\right\}=\arg \mathop{\max }\limits_{{d}^{s}\le j\le {d}^{e}}{F}_{j}$$
(5)
$${i}^{* }=\arg \mathop{\min }\limits_{i}\left\{i\right\}$$
(6)
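The selection rule in Eqs. (5) and (6) can be sketched as a function over a recorded list of per-epoch validation scores. This is one reading of the description, not the code used in this study: it assumes that the trigger fires once the indicator has failed to improve on its running best for δ1 consecutive epochs, that epochs are indexed from zero, and that a `maximize` flag switches between a score-like indicator such as the F1-score and a loss-like indicator such as cross-entropy. The function name and the toy scores are hypothetical.

```python
def select_epoch(scores, delta1=50, delta2=20, maximize=True):
    """Return (d_s, d_e, i_star), or None if early stop is never triggered.

    d_s: epoch at which the indicator has not improved for delta1 epochs.
    d_e: epoch at which training stops (delta2 epochs after d_s).
    i_star: earliest epoch in [d_s, d_e] with the best indicator value.
    """
    better = (lambda a, b: a > b) if maximize else (lambda a, b: a < b)
    best, stale, d_s = None, 0, None
    for d, q in enumerate(scores):
        if best is None or better(q, best):
            best, stale = q, 0
        else:
            stale += 1
        if stale >= delta1:
            d_s = d
            break
    if d_s is None:
        return None
    d_e = min(d_s + delta2, len(scores) - 1)
    window = range(d_s, d_e + 1)
    pick = max if maximize else min
    target = pick(scores[j] for j in window)               # Eq. (5): best value in the window
    i_star = min(j for j in window if scores[j] == target)  # Eq. (6): earliest such epoch
    return d_s, d_e, i_star

# Toy run with small patience values so that the trigger is visible.
f1_scores = [0.60, 0.70, 0.65, 0.68, 0.69, 0.72, 0.71, 0.72, 0.66, 0.50]
f1_choice = select_epoch(f1_scores, delta1=3, delta2=4)

losses = [0.90, 0.80, 0.85, 0.82, 0.81, 0.70, 0.75]
loss_choice = select_epoch(losses, delta1=3, delta2=2, maximize=False)
```

In the toy F1 run, the tie between two epochs with the same best score inside the window is resolved in favour of the earlier one, matching the tie-breaking rule stated above.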

Comparison of backbones and strategies for model selection and early stop mechanisms

We examined various combinations of three truncated backbones with different depths and three different metrics for both model selection and early stop of TR-MAMIL for EC subtyping and TMB prediction. In the classification of aggressive and non-aggressive EC, TR-MAMIL obtained excellent results in all setups, with AUROC greater than 90% in most cases, and the optimal setup used truncated ResNet50 as the backbone and F1-score as the metric for model selection and early stopping (see Table 10(a)).

Table 10 Comparison of backbones and strategies for model selection with early stop mechanisms on the testing sets in (a) the classification of aggressive and non-aggressive EC, (b) TMB prediction for the aggressive EC, and (c) TMB prediction for the non-aggressive EC

In TMB prediction of the aggressive EC, the best performance was observed when using truncated ResNet152 as the backbone and cross-entropy value as the metric (see Table 10(b)). In TMB prediction of the non-aggressive EC, the optimal strategy employed the truncated ResNet152 backbone with F1-score as the evaluation metric (see Table 10(c)).

For TMB prediction in the aggressive EC, cross-entropy (\({{\mathcal{L}}}_{d}\)) is used as the model performance measurement indicator \({Q}_{d}\) for model selection and early stop.

$$\left\{i\right\}=\arg \mathop{\min }\limits_{{d}^{s}\le j\le {d}^{e}}{{\mathcal{L}}}_{j}$$
(7)
$${i}^{* }=\arg \mathop{\min }\limits_{i}\left\{i\right\}$$
(8)

Implementation details

For all experiments, we performed a random sampling of slides with a mini-batch size of one. The sampling frequency for each slide was determined based on the inverse relative proportion of one class to another in the training set. The model parameters were updated using the Adam optimizer with a weight decay of 1 × 10⁻⁵ and a learning rate of 2 × 10⁻⁴; the exponential decay rates for the first and second moment estimates of the gradient were set to 0.9 and 0.999, and epsilon was set to 1 × 10⁻⁸. We applied dropout layers with a dropout rate of 0.25 after every hidden layer to mitigate potential overfitting.
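The class-balanced slide sampling can be illustrated with the sketch below. It assumes that each slide is weighted by the inverse of its class count, which is one common way to realize sampling by inverse class proportion and may differ in detail from the scheme used in this study; the labels and the function name are hypothetical.

```python
import random
from collections import Counter

def slide_sampling_weights(labels):
    """Weight each slide by the inverse of its class frequency in the training set."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical training labels: 8 slides of class 0 and 2 slides of class 1.
labels = [0] * 8 + [1] * 2
weights = slide_sampling_weights(labels)

# Total sampling mass per class; equal values mean both classes are drawn
# equally often in expectation.
class_mass = {c: sum(w for w, y in zip(weights, labels) if y == c) for c in set(labels)}

# One epoch of slide indices drawn with replacement (mini-batch size of one).
rng = random.Random(0)
epoch_indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))
```

With these weights, a slide from the minority class is drawn four times as often as a slide from the majority class, so both classes contribute equally to the gradient updates in expectation.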