Introduction

Esophageal cancer, ranking as the eighth most prevalent malignancy and the sixth leading cause of cancer-related mortality, is characterized by a low five-year survival rate1,2. It is primarily classified into two histological subtypes: esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma, with the former constituting the majority of global cases (approximately 84%)3.

In terms of therapeutic approach, endoscopic resection is recommended for esophageal neoplasia ranging from epithelial (EP) to minimal submucosal invasion (SM1), owing to its lower complication rate and shorter hospitalization compared with surgical intervention4. Consequently, accurate preoperative evaluation of the depth of invasion is essential for treatment decision-making.

The Japan Esophageal Society (JES) recognizes magnifying endoscopy with narrow-band imaging (ME-NBI) as a highly effective technique for the preoperative assessment of invasion depth in ESCC5,6. Building on multiple previous classification systems, the JES classification describes the patterns of intrapapillary capillary loops (IPCLs) and avascular areas (AVAs), and has now gained widespread clinical adoption6,7. However, accurate application of the JES classification in routine practice requires long-term training for endoscopists.

Advances in deep learning have transformed numerous facets of clinical practice. In gastrointestinal endoscopy, artificial intelligence (AI), trained on large amounts of labeled data, is progressively being incorporated into computer-aided diagnosis systems, thereby enhancing the detection and classification of lesions8,9. However, the procurement of such comprehensive and meticulously annotated datasets requires laborious, time-intensive curation, which often presents a significant impediment to training10. Self-supervised learning (SSL) is a machine learning paradigm that harnesses unsupervised learning to equip state-of-the-art AI models for tasks traditionally reliant on extensive datasets annotated by human experts11,12.

As AI evolves toward higher complexity, it becomes formidable for humans to grasp the logic and steps that lead an algorithm to its conclusions. The computational processes often become encapsulated within an opaque framework known as a “blackbox,” which is inherently resistant to interpretation13,14. A comprehensive grasp of AI’s decision-making mechanisms, achieved through rigorous model monitoring and accountability, is therefore imperative. Consequently, there is a burgeoning demand for explainable AI methodologies aimed at bolstering confidence in AI models. Explainable AI is designed to demystify and elucidate the operations of machine learning algorithms, deep learning structures, and neural networks, and in recent years it has emerged as a prominent area of inquiry within AI research15,16.

In this study, we aimed to develop a series of semi-supervised models for predicting the invasion depth of ESCC based on IPCL/AVA patterns. The models were pretrained on a large unlabeled dataset using self-supervised contrastive learning and fine-tuned on a small labeled dataset. Two fine-tuning approaches were adopted: traditional blackbox learning and explainable AI. Finally, the models were evaluated on an external test dataset in comparison with two endoscopists.

Methods

Study design

This retrospective multicenter study was conducted in two hospitals: the First Affiliated Hospital of Soochow University (Suzhou, the training dataset) and Jintan Affiliated Hospital of Jiangsu University (Jintan, the test dataset). Patients who underwent ME-NBI examinations for precancerous lesions or superficial ESCC confirmed by histology of endoscopically or surgically resected specimens between November 2016 and December 2023 were included. Each lesion contributed three images in the training dataset and one image in the test dataset. Non-magnified images, low-quality images, and white-light images were excluded. The images were captured using Olympus equipment (GIF-H260Z, Olympus Medical Systems, Tokyo, Japan) and saved in BMP format. Two endoscopists (L.K. and J.Z.), each with more than 10 years of experience, independently reviewed eligible images to ensure image quality and labeled the IPCL/AVA patterns. Disagreements between the two endoscopists were resolved by X.S., who has 20 years of experience. This study was approved by the ethics committee of the First Affiliated Hospital of Soochow University (approval number 2022098)17. Owing to the retrospective nature of the study, the need to obtain informed consent was waived by the same ethics committee. All procedures involving human participants were conducted in accordance with ethical standards and the Declaration of Helsinki. Figure 1 presents the flowchart of the study. The characteristics of patients/lesions/images are listed in Table 1.

Fig. 1
figure 1

The flowchart of the study. Step #1: self-supervised contrastive learning on large unlabeled images from Suzhou. Step #2: fine-tuning (two methods: blackbox or explainable) on a small set of labeled images from Suzhou. Step #3: test on labeled images from Jintan. This retrospective multicenter study was conducted in two hospitals: the First Affiliated Hospital of Soochow University (Suzhou, the training dataset) and Jintan Affiliated Hospital of Jiangsu University (Jintan, the test dataset).

Table 1 Characteristics of patients in the study.

Based on the pathological results of invasion depth, the ME-NBI images were labeled as: (1) epithelium (EP) to lamina propria (LPM); (2) muscularis mucosae (MM) to minimal submucosal invasion (less than 200 μm; SM1); or (3) deeper submucosal invasion (200 μm or more; SM2 or deeper). Detailed information is provided in Supplementary Methods 1.

Based on the IPCL/AVA patterns, each ME-NBI image was labeled in four dimensions: (1) severe irregularity (yes or no); (2) loop-like formation (yes or no); (3) highly dilated vessels whose calibers appear to be more than 3 times greater (yes or no); and (4) AVA size (small < 0.5 mm; middle 0.5–3 mm; large ≥ 3 mm).
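For concreteness, the two label schemes could be encoded as in the following minimal sketch; all names and the field layout are illustrative assumptions, not taken from the study’s code.

```python
from dataclasses import dataclass

# Hypothetical encoding of the two label schemes described above;
# names are illustrative, not from the study's code.

DEPTH_CLASSES = {0: "EP-LPM", 1: "MM-SM1", 2: ">=SM2"}  # 3-way pathology label

@dataclass
class IPCLAVALabel:
    severe_irregularity: bool  # (1) severe irregularity: yes/no
    loop_formation: bool       # (2) loop-like formation: yes/no
    dilation_3x: bool          # (3) caliber appears > 3 times dilated: yes/no
    ava_size: int              # (4) 0 = small (<0.5 mm), 1 = middle (0.5-3 mm), 2 = large (>=3 mm)

# Example: irregular, non-looped, dilated vessels with a middle-sized AVA
label = IPCLAVALabel(severe_irregularity=True, loop_formation=False,
                     dilation_3x=True, ava_size=1)
```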

The proposed semi-supervised learning framework

The proposed framework consists of the upstream SSL task (Fig. 2) and the downstream fine-tuning task (Fig. 3). The development of the framework was introduced in our previous study18.

Fig. 2
figure 2

The flowchart of Step #1 self-supervised contrastive learning. Self-supervised contrastive learning on large unlabeled images from Suzhou. The self-supervised contrastive learning is characterized by several integral elements: (1) a data augmentation component; (2) a neural network encoder; (3) a concise neural network projection layer; and (4) a contrastive loss function.

Fig. 3
figure 3

The flowchart of Step #2 fine-tuning. Fine-tuning followed two approaches: (1) blackbox models were trained with traditional supervised learning labeled by pathology; or (2) four feature models were trained on the IPCL/AVA patterns, and their outputs were then integrated by an XGBoost classifier following the principles of explainable AI.

The upstream task: self-supervised contrastive learning

In the SSL (contrastive learning), an anchor image is used to create a positive example by applying various data augmentation strategies, whereas a negative example is derived from another randomly selected image within the same batch. Figure 2 illustrates the SSL process, showcasing augmentation techniques such as color modification and cropping followed by resizing18. The approach simplifies contrastive learning of visual representations to its core principles, eschewing the need for specialized architectural configurations or a memory bank19. The framework is characterized by several integral elements: (1) a data augmentation component that randomly alters an image sample to generate a related pair as a positive match, while negative matches are generated by augmenting different images; (2) a neural network encoder that extracts feature vectors from the augmented data instances; (3) a concise neural network projection layer that translates these features into a space conducive to applying the contrastive loss; and (4) a contrastive loss function formulated for the contrastive prediction task, facilitating the refinement of the learning process.
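To make the four elements concrete, the following is a minimal SimCLR-style sketch in TensorFlow/Keras, in the spirit of the referenced repository; the augmentation choices, layer sizes, and temperature are assumptions for illustration, and the loss is a simplified cross-view variant of NT-Xent rather than the study’s exact implementation.

```python
import tensorflow as tf

def augment(image):
    """(1) Data augmentation: crop-then-resize and color modification.
    Crop size and color ranges are illustrative assumptions."""
    x = tf.image.random_crop(image, size=(180, 180, 3))
    x = tf.image.resize(x, (224, 224))
    x = tf.image.random_brightness(x, max_delta=0.4)
    return tf.image.random_saturation(x, 0.6, 1.4)

def build_encoder_with_projection():
    """(2) Backbone encoder plus (3) a small projection head."""
    backbone = tf.keras.applications.Xception(
        include_top=False, weights=None, pooling="avg", input_shape=(224, 224, 3))
    head = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                                tf.keras.layers.Dense(128)])
    inputs = tf.keras.Input((224, 224, 3))
    return tf.keras.Model(inputs, head(backbone(inputs)))

def nt_xent_loss(z1, z2, temperature=0.1):
    """(4) Contrastive loss: two augmented views of the same image are
    positives; every other image in the batch serves as a negative."""
    z1 = tf.math.l2_normalize(z1, axis=1)
    z2 = tf.math.l2_normalize(z2, axis=1)
    logits = tf.matmul(z1, z2, transpose_b=True) / temperature  # (batch, batch)
    labels = tf.range(tf.shape(z1)[0])  # positive pairs sit on the diagonal
    loss_a = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    loss_b = tf.keras.losses.sparse_categorical_crossentropy(labels, tf.transpose(logits), from_logits=True)
    return tf.reduce_mean(loss_a + loss_b)
```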

The downstream task: Fine-tuning

As shown in Fig. 3, following the upstream SSL task, the pretrained backbone models were submitted to the downstream fine-tuning task, which followed two approaches: (1) traditional supervised learning labeled by pathology (blackbox) or (2) explainable learning based on the IPCL/AVA patterns (explainable AI). The former was a 3-way supervised training based on the pathological invasion depth, i.e., EP-LPM, MM-SM1, and ≥ SM2, with the blackbox model producing outputs via a softmax classifier. The latter comprised four feature models based on the IPCL/AVA patterns, i.e., irregularity, loop formation, dilation ≥ 3 times, and size of AVA. The outputs of the four feature models were integrated by an XGBoost classifier, which was also trained under pathology-based supervision18.
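A minimal sketch of the two fine-tuning routes follows; in practice the backbone weights would come from Step #1’s SSL pretraining, and all names, layer sizes, and the commented training call are illustrative assumptions rather than the study’s exact implementation.

```python
import numpy as np
import tensorflow as tf
from xgboost import XGBClassifier

# Backbone as in Step #1; in practice its weights are loaded from SSL pretraining
backbone = tf.keras.applications.Xception(include_top=False, weights=None,
                                          pooling="avg", input_shape=(224, 224, 3))

def classification_head(backbone, n_classes):
    """Attach a softmax head to the pretrained backbone (projection head
    discarded). Used for the blackbox model (n_classes=3) and for each of
    the four feature models (e.g., 2 classes for yes/no features)."""
    inputs = tf.keras.Input((224, 224, 3))
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(backbone(inputs))
    return tf.keras.Model(inputs, outputs)

# Route (1), blackbox: one 3-way head supervised directly by pathology
blackbox = classification_head(backbone, n_classes=3)  # EP-LPM / MM-SM1 / >=SM2

# Route (2), explainable: four feature heads (irregularity, loop formation,
# dilation, AVA size) whose outputs are stacked and fed to XGBoost, itself
# supervised by the 3-way pathology label.
def explainable_features(feature_models, images):
    return np.hstack([m.predict(images) for m in feature_models])

xgb = XGBClassifier()
# xgb.fit(explainable_features(feature_models, train_images), train_depth_labels)
```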

Evaluation

A total of 2,643 ME-NBI images from Suzhou were used for the upstream SSL task (n = 2,175) and the downstream fine-tuning task (n = 468). The 468 images were randomly divided into training and validation sets at a ratio of 7:3; thus, 140 images were used to evaluate the models’ performance during training (Supplementary Table 2). The test images were from Jintan (n = 60), as shown in Fig. 4. To compare with the models, the test images were evaluated by two independent endoscopists (junior, four years of endoscopic experience; senior, eleven years of experience), who were blinded to the collection and labeling of the images. They first labeled the test images independently and, one week later, relabeled them with awareness of the predictions of the best AI model.

Fig. 4
figure 4

The flowchart of Step #3 test. Models were evaluated on an external test dataset and compared with endoscopists. The metrics included accuracy, Matthews correlation coefficient (MCC), and weighted Cohen’s kappa. Furthermore, for visualized explanation, Grad-CAM was conducted for the computer vision models on endoscopic images; local interpretable model-agnostic explanations (LIME), SHapley Additive exPlanations (SHAP), and partial dependence plots (PDP) were conducted for the XGBoost classifier; and t-SNE was conducted to visualize feature vectors (blackbox models vs. explainable models) in a two-dimensional space.

Model training

Keras (version 3.8.0, TensorFlow version 2.8.0) was used to train the models. The training parameters are listed in Supplementary Table 1. Images were resized to 224 × 224 pixels and input into the framework as RGB channels. The training code for SSL was inspired by that of Sayak Paul, which can be accessed at https://github.com/sayakpaul/SimCLR-in-TensorFlow-2. The training code is available at https://osf.io/t3g8n18.
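As a minimal sketch of the stated preprocessing (BMP input, 224 × 224 RGB), the following could be used; the scaling to [0, 1] is an assumption, since the exact normalization is not stated here.

```python
import tensorflow as tf

def load_image(path):
    """Load one ME-NBI image and prepare it for the framework."""
    raw = tf.io.read_file(path)
    img = tf.io.decode_bmp(raw, channels=3)    # images were saved in BMP format
    img = tf.image.resize(img, (224, 224))     # resize to the model input size
    return tf.cast(img, tf.float32) / 255.0    # assumed scaling to [0, 1]
```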

Statistical analysis and explanation

The primary outcome was the 3-way classification of ESCC invasion depth. To evaluate the performance of the models and endoscopists, three metrics were calculated: accuracy, Matthews correlation coefficient (MCC), and weighted Cohen’s kappa18. Detailed information on the metrics is provided in Supplementary Methods 2. Furthermore, Grad-CAM was conducted for visualized explanation on endoscopic images20; variable importance, local interpretation, and partial dependence plots (PDP) were generated for the XGBoost classifier21; and t-SNE was used to visualize feature vectors (blackbox models vs. explainable models) in a two-dimensional space18,22.
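For reference, the three metrics can be computed with scikit-learn as sketched below; the use of scikit-learn and the quadratic weighting for Cohen’s kappa are assumptions, with the study’s exact definitions given in Supplementary Methods 2.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef, cohen_kappa_score

def evaluate(y_true, y_pred):
    """Compute the three reported metrics for 3-way depth predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        # quadratic weighting is one common choice for ordinal labels;
        # the study's exact weighting scheme is in Supplementary Methods 2
        "weighted_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }
```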

Results

Performance of the models in the downstream task

As shown in Supplementary Table 2, the Xception-backboned explainable model presented the best performance, with an accuracy of 0.850, an MCC of 0.768, and a weighted Cohen’s kappa of 0.770.

Performance of the models in the evaluation

The performances of the four explainable models and four blackbox models on the test set are shown in Table 2. Among the models, the Xception-backboned explainable model achieved the highest accuracy (0.817), MCC (0.701), and weighted Cohen’s kappa (0.780). The confusion matrices are plotted in Fig. 5.

Table 2 The performance of models and endoscopists in the evaluation.
Fig. 5
figure 5

The confusion matrices of the models and endoscopists in the test dataset. (A) blackbox models. (B) explainable models. (C) endoscopists and AI-assisted endoscopists.

Comparison with endoscopists

The performances of the junior and senior endoscopists are listed in Table 2. The senior endoscopist showed higher accuracy, MCC, and weighted Cohen’s kappa (0.883, 0.809, and 0.890) than the Xception-backboned explainable model. The junior endoscopist had an accuracy of 0.733, an MCC of 0.571, and a weighted Cohen’s kappa of 0.750.

Performance of AI-assisted endoscopists

With awareness of the Xception-backboned explainable model’s predictions, the performance of both endoscopists improved. The senior endoscopist’s accuracy rose to 0.917, a relative improvement of 3.85% ((0.917 − 0.883)/0.883), while the junior endoscopist’s accuracy rose to 0.833, a relative improvement of 13.64% ((0.833 − 0.733)/0.733).

Visualized interpretation of the models

As shown in Fig. 6, t-SNE visualization of the feature vectors revealed distinct clustering patterns between the two models. For the blackbox model, classes exhibited partial overlap, particularly between EP-LPM and MM-SM1, suggesting ambiguity in distinguishing early invasion depths. In contrast, the explainable model produced markedly separable clusters. These findings confirm that integrating IPCL/AVA patterns during fine-tuning enforces pathologically relevant feature learning, reducing diagnostic uncertainty in borderline lesions.

Fig. 6
figure 6

t-SNE visualization of the models’ feature vectors in the test. t-SNE is an unsupervised machine learning algorithm for dimensionality reduction. In this study, it was used to map high-dimensional data to a two-dimensional space. Each point represents an image, and the distance between points reflects the similarity between images in the reduced space. (A) Xception-backboned blackbox model. (B) Xception-backboned explainable model.
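A minimal sketch of such a t-SNE projection, assuming `features` holds the encoder’s feature vectors for the test images and `labels` the 3-way depth classes (both hypothetical names):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """Map high-dimensional feature vectors to 2-D and color by depth class."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis")
    plt.legend(*scatter.legend_elements(), title="Invasion depth")
    plt.show()
```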

Figure 7 visualizes the association between the IPCL/AVA patterns and the predictions of the XGBoost classifier within the explainable AI.

Fig. 7
figure 7

Visualized explanation for the XGBoost classifier in the explainable model. (A) feature importance plots; (B) partial dependence plots (PDP); (C) local interpretation plots. The feature importance plots and the PDPs indicate the general association between the IPCL/AVA features and the prediction, whereas the local interpretation plots reflect the local association between the IPCL/AVA features and the prediction within individual cases.

In Fig. 7A, the feature importances indicate the general association between the IPCL/AVA patterns and the prediction, as do the PDPs in Fig. 7B. They show that loop formation contributed significantly to invasion depth, as did irregularity and dilation, whereas the contribution of AVA was non-significant. The break-down plots (local interpretation, Fig. 7C) reflect the contribution of features to the prediction within individual cases. In case #1, with invasion depth EP, the XGBoost prediction was 0.003 (category 0 = EP-LPM). The most important variable was irregularity (prediction value = 0.278), which decreased the overall XGBoost prediction by 0.536. The second and third most important variables were loop formation (0.710) and dilation (0.438), which decreased the prediction by 0.453 and 0.002, respectively. In case #2 (invasion depth MM), the XGBoost prediction was 0.997 (category 1 = MM-SM1). Dilation (0.367), irregularity (0.490), and loop formation (0.345) were the key features, contributing −0.333, +0.284, and +0.051, respectively. In case #3 (invasion depth SM2), the XGBoost prediction was 1.990 (category 2 = ≥ SM2); loop formation (0.099), dilation (0.907), and irregularity (0.902) increased the prediction by 0.578, 0.399, and 0.018, respectively.
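The three views in Fig. 7 can be reproduced along the following lines, assuming a fitted XGBoost classifier `xgb` and a pandas DataFrame `X` of the four feature models’ outputs (both hypothetical names); SHAP is used here for the local break-down, as one common tooling choice, not necessarily the study’s exact one.

```python
import shap
from sklearn.inspection import PartialDependenceDisplay

def explain_xgb(xgb, X):
    """xgb: fitted XGBClassifier; X: DataFrame of the four feature outputs."""
    # (A) global feature importance, scored by gain
    importance = xgb.get_booster().get_score(importance_type="gain")
    # (B) partial dependence of class 0 (EP-LPM) on one named column;
    # the column name is a hypothetical example
    PartialDependenceDisplay.from_estimator(xgb, X, features=["loop_formation"], target=0)
    # (C) local per-case contributions (a SHAP break-down of the first case)
    shap_values = shap.TreeExplainer(xgb).shap_values(X.iloc[[0]])
    return importance, shap_values
```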

Lastly, based on the outputs of the four feature models within the Xception-backboned explainable AI, Grad-CAM was conducted for inferential explanation, as shown in Fig. 8. The highlighted areas of the four feature models constitute their inferential evidence.

Fig. 8
figure 8

Visualized inference of the four feature models within the explainable fine-tuning via Grad-CAM. Left column: the original endoscopic images; middle column: heatmaps based on the outputs of the feature models’ last layer; right column: the Grad-CAM heatmap overlaid on the original images, highlighting the models’ inferential evidence.
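A minimal Grad-CAM sketch for one Keras feature model follows; `layer_name` is a placeholder for the model’s last convolutional layer, and the ReLU-and-normalize step is a standard choice assumed here rather than taken from the study’s code.

```python
import tensorflow as tf

def grad_cam(model, image, layer_name):
    """Compute a Grad-CAM heatmap for the model's top predicted class.
    `image` is a (224, 224, 3) float tensor."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_idx = tf.argmax(preds[0])
        top_score = tf.gather(preds, class_idx, axis=1)
    grads = tape.gradient(top_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))             # pool gradients per channel
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)   # weighted sum of feature maps
    cam = tf.maximum(cam, 0) / (tf.reduce_max(cam) + 1e-8)   # ReLU, normalize to [0, 1]
    return cam.numpy()  # upsample and overlay on the original image for display
```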

Discussion

In this study, we presented explainable semi-supervised models developed to predict the invasion depth of ESCC based on IPCL/AVA patterns. The novel framework enables AI models to achieve improved transparency and performance in the face of the opacity of traditional supervised learning and the limited availability of labeled endoscopic images.

Deep supervised learning algorithms are often contingent upon a substantial corpus of labeled data to attain optimal performance23. Nonetheless, the assembly and annotation of such datasets can entail considerable financial and temporal expenditure. SSL is a niche within the unsupervised learning spectrum, dedicated to extracting informative features from data that lack human-provided labels24. As a prevalent approach within SSL, contrastive learning facilitates the training of encoders on expansive unlabeled datasets. It operates by enhancing the congruence between different augmented views of the same data instance within the latent space, through the optimization of a contrastive loss function19.

The imperative for explainable AI is driven by the inherent complexity and obscurity of traditional AI models, which frequently operate as impenetrable black boxes25: they generate predictions grounded in input data yet fail to elucidate the rationale underpinning them. The elucidation algorithm is the pivotal element of explainable AI, tasked with furnishing clarity and revealing the salient factors that inform the model’s predictions. This mechanism can draw upon diverse methodologies, encompassing techniques such as feature significance, contribution assessment, and data visualization, thereby imparting profound insight into the inner workings of machine learning models26.

The subtle endoscopic manifestations of early ESCC lesions are frequently overlooked, with the literature indicating a considerable miss rate for upper gastrointestinal tract cancers during endoscopic examinations. Precise identification of ESCC lesions is essential for predicting histology and invasion depth, and consequently guides therapeutic interventions6. Mucosal lesions exhibit a low propensity for local lymph node metastasis (less than 2%) compared with those that have invaded the submucosa (8 to 45.9%), making them suitable candidates for endoscopic treatment2,27. In accordance with guidelines from Japan and Europe, lesions confined to the EP-LPM are clear indications for endoscopic resection, while those invading MM-SM1 are considered relative indications. For lesions with ≥ SM2 invasion, esophagectomy or chemoradiotherapy is the recommended course of treatment5.

A variety of endoscopic classification criteria have been proposed for the diagnosis of ESCC, including those based on mucosal surface characteristics and the JES classification based on IPCL and AVA5,6. Within this classification system, Type A vessels are indicative of normal mucosa or low-grade intraepithelial neoplasia, while Type B1, B2, and B3 vessels suggest progression to high-grade intraepithelial neoplasia or invasion into EP-LPM, MM-SM1, and ≥ SM2, respectively. However, the assessment of IPCL and AVA patterns is highly dependent on the expertise of endoscopists and is subject to interobserver variability. Thus, there is a need for computer-aided diagnostic (CADx) approaches that can reduce the complexity and variability inherent in IPCL/AVA classification.

The past five years have witnessed a series of AI studies concerning deep learning for the endoscopic diagnosis and detection of ESCC. In 2019, Everson et al.28 collected a dataset comprising 7,046 ME-NBI images from 17 subjects, including 10 with ESCC and 7 controls, to train a convolutional neural network (CNN) model to classify 2-way IPCL patterns. It achieved a high level of accuracy, correctly distinguishing abnormal from normal IPCL patterns in 93.7% of cases. In 2020, Fukuda et al.29 developed a CADx system for differentiating cancerous from non-cancerous SCC on NBI/BLI images and reported an accuracy of 88%. Similarly, a CADx system by Guo et al.30 reported remarkable sensitivity (98.04%) and specificity (95.03%) on endoscopic NBI images. Tokai et al. developed an AI diagnostic system to determine the invasion depth of ESCC; analyzing 279 white-light images, it estimated the invasion depth with a sensitivity of 84.1% and an overall accuracy of 80.9% within 6 s. Uema et al.31 constructed a ResNeXt-101-backboned model to classify microvessels in ESCC. With a dataset of 2,524 ME-NBI images encompassing 393 lesions, the system achieved a microvessel classification accuracy of 84.2%, surpassing the average accuracy of eight endoscopists (77.8%). Wang and colleagues32 proposed an AI-assisted endoscopic diagnostic approach for the detection and localization of IPCLs in early-stage ESCC using ME-BLI and ME-NBI images. They employed an enhanced Faster region-based CNN with a polarized self-attention-HRNetV2p backbone for automatic IPCL detection, achieving a recall of 79.25%, a precision of 75.54%, an F1-score of 0.764, and a mean average precision of 74.95%. Meanwhile, Zhang et al.33 collected 5,119 ME-NBI images from 581 ESCC patients and developed a multi-model diagnostic system for feature extraction and integration. This diagnostic system, grounded in a variety of endoscopic diagnostic methods, outperformed traditional DL models and endoscopists in distinguishing SM2-3 lesions, achieving sensitivity, specificity, and accuracy of 85.7%, 86.3%, and 86.2% in image validation, and 87.5%, 84%, and 84.9% in consecutive video analysis, respectively.

This study has some limitations. First, the dataset employed for training and testing was limited in size, which may undermine the robustness and generalizability of the findings. Second, methodological diversity and dataset heterogeneity hinder comparative analysis with previous reports. Future efforts should prioritize standardized evaluation frameworks and multicenter collaborative datasets to enable robust benchmarking and clinical translation.

This study introduces a novel framework that combines semi-supervised learning with explainable AI to address the challenges of data scarcity and model interpretability in the endoscopic assessment of ESCC. By leveraging semi-supervised learning, the model reduces its reliance on large labeled datasets, effectively utilizing abundant unlabeled ME-NBI images to enhance performance. This approach maintains competitive accuracy while providing interpretable predictions, addressing the traditional “blackbox” critique of deep learning models. Clinically, the model demonstrates real-world utility by significantly enhancing endoscopists’ diagnostic accuracy, in alignment with treatment guidelines for ESCC stratification. Technically, this study pioneers the integration of self-supervised contrastive learning with multi-feature explainable AI for predicting ESCC invasion, introducing visualization methods such as t-SNE for feature clustering and Grad-CAM for region-of-interest localization, tailored specifically to endoscopic IPCL/AVA patterns. These advances collectively position the model as a powerful tool for improving diagnostic accuracy and trust in AI-driven endoscopic practice.