Introduction

Acute ischemic stroke (AIS) is a major global public health issue. Non-contrast computed tomography (NCCT) is the first-line imaging technique for determining necessary treatment options, offering valuable insight into the severity assessment and prognosis of AIS [1, 2]. In the early stage of ischemic stroke, loss of the normal gray/white matter interface and effacement of the cortical sulci may be observed on NCCT; however, some image changes may be too subtle to detect. This creates a challenge for radiologists in providing a sensitive and accurate diagnosis of AIS and in annotating the definite AIS area on NCCT images.

Compared to NCCT, diffusion-weighted magnetic resonance imaging (DW-MRI or DWI) is a more sensitive technique for depicting infarcts and is the optimal predictor of the ischemic core. Diffusion abnormalities with reduced apparent diffusion coefficient (ADC) values represent the best imaging marker for the ischemic core [3, 4]. The DWI signal demonstrates the presence of intracellular water accumulation, or cytotoxic edema, with an overall decreased rate of water molecular diffusion within the affected tissue. The DWI signal of the infarct area increases within a few minutes after arterial occlusion, persists for 10–14 days, and then fades out [5, 6]. Nonetheless, DWI is more time-consuming for both scheduling and scanning. Routine use of DWI in all patients with suspected AIS symptoms is not cost-effective [7], and DWI is not immediately available in all hospitals. In clinical practice, NCCT is the first-line imaging study for AIS patients, and DWI may follow within hours to days. Despite the differing sensitivities of NCCT and DWI, the agreement between the Alberta Stroke Program Early CT Scores (ASPECTS) [8] estimated from the two imaging modalities is reasonable [9]. Thus, the purpose of this investigation was to combine the immediacy of NCCT with the sensitivity of DWI using a deep learning model.

In recent years, newly emerged deep learning techniques have been widely applied in medical image segmentation. U-Net [10] is one of the most widely used deep learning models for biomedical image segmentation, and most current methods for biomedical image segmentation are improvements upon the original U-Net structure. One notable method is R2U-Net [11], in which recurrent residual convolutions are used instead of the regular convolutions of the original U-Net. In various experiments, R2U-Net outperformed U-Net on three benchmark datasets, covering retina blood vessel segmentation, skin cancer segmentation, and lung lesion segmentation [11]. Since their emergence, deep learning-based segmentation techniques have been widely applied to AIS lesion segmentation. Most of these studies incorporate diffusion-weighted images (DWIs) [12,13,14,15] or CT perfusion images [16,17,18,19,20], while some are multimodal approaches [21,22,23]. However, AIS lesion segmentation using NCCT from early onset remains a challenge, even given the latest developments in deep learning techniques. To the best of our knowledge, only a handful of studies have achieved automated segmentation of AIS lesions using NCCT [24,25,26], and there is still a substantial gap in the reported performance between methods using NCCT and those using modalities such as DWI and CT perfusion. Although numerous AIS segmentation models have been proposed for follow-up NCCTs [27,28,29], follow-up NCCTs generally exhibit more observable differences between healthy tissue and AIS lesions compared to NCCTs acquired at early onset, and the performance of these models on AIS in early NCCT requires further validation.

Another major challenge in training segmentation models for AIS is the potential uncertainty and inaccuracy of the manual annotations, especially in equivocal lesion areas, which may arise from the subjectivity of radiologists or from errors in the image registration process. For an AIS segmentation model to achieve good generalizability, the ability to account for label uncertainty is essential. However, the designs of segmentation models such as U-Net and R2U-Net do not incorporate mechanisms to handle uncertain labels. Previous studies have proposed several methods to address this issue, such as reweighting the loss of each pixel based on its estimated reliability [30] or modeling label uncertainty implicitly by adding random flip noise to labels during training [31] (see [32] for a detailed review). Although the latter method is simple and effective, it was originally designed for image-level labels, as in image classification. Adapting such methods to image segmentation tasks requires resolving issues such as estimating the uncertainty of each pixel.

In this paper, we propose R2U-RNet, a novel model for AIS lesion segmentation in NCCT. We incorporated NCCT scans from patients with AIS, and manual annotations from follow-up DWI scans. The proposed model is based on an R2U-Net backbone architecture with a novel residual refinement unit (RRU) for further refining the prediction results. To incorporate the advantages of traditional image processing, each NCCT image underwent two different preprocessing procedures and was used as a multichannel input image. The model was trained using a multiscale focal loss to mitigate the class imbalance problem and to leverage the importance of different levels of detail. In addition, to account for the uncertainty of manual ROIs, we propose a noisy-label training scheme, in which label noise is applied to the pixels near the boundary of ROIs. By further improving the segmentation accuracy over existing methods, the proposed stroke lesion segmentation model can provide essential information for radiologists to make a faster and more accurate AIS diagnosis, which in turn increases the treatment quality for AIS patients.

Materials and methods

Data description

We used a retrospective dataset from a single medical center located in Taiwan. The data were acquired from January 2015 to December 2017. The dataset contains a total of 261 subjects, with 1780 NCCT slices of AIS lesions. Each AIS patient underwent an initial brain NCCT scan and a follow-up brain MRI scan within one week after presenting stroke symptoms. The DWIs acquired during the MRI scans were used as the gold standard for the diagnosis of AIS. The NCCTs were acquired with matrix size = 512 × 512, pixel size = 0.43 × 0.43 mm², and slice thickness = 5 mm; DWIs were acquired with matrix size = 256 × 256, slice thickness = 5 mm, and gap = 2 mm. Manual segmentation of restricted diffusion (defined as AIS) areas on DWI was annotated by one neuroradiologist with two years of experience. The ground-truth region of interest (ROI) for the NCCT segmentation was obtained by registering the annotated lesion ROIs to the NCCT image (see Supplementary Materials S1 for exclusion criteria and acquisition parameters).

Image preprocessing

In general, convolutional neural networks (CNNs) are less data-efficient at learning intensity adjustments than traditional digital image processing. Therefore, we expected that combining traditional digital image processing with deep learning would enhance the prediction performance of the model. We applied two different intensity adjustment procedures to the NCCT images (intensity normalization using a Z-transform followed by a first-order smoothstep function [33]; and histogram equalization [34]), resulting in a model with two input channels (see Supplementary Materials S2 for details).
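The two-channel preprocessing can be sketched as follows. The ±3σ normalization window, the bin count, and all function names are illustrative assumptions for this sketch, not the exact parameters used in the study:

```python
import numpy as np

def smoothstep(x):
    """First-order smoothstep (3x^2 - 2x^3), clamped to [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return 3.0 * x**2 - 2.0 * x**3

def znorm_smoothstep(img):
    """Z-transform the slice, then map it to [0, 1] with a smoothstep.
    The +-3-sigma window is an assumption for illustration."""
    z = (img - img.mean()) / (img.std() + 1e-8)
    return smoothstep((z + 3.0) / 6.0)  # map [-3, 3] -> [0, 1]

def hist_equalize(img, n_bins=256):
    """Plain histogram equalization mapping intensities to [0, 1]."""
    hist, bin_edges = np.histogram(img.ravel(), bins=n_bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]
    return np.interp(img.ravel(), bin_edges[:-1], cdf).reshape(img.shape)

def make_two_channel(img):
    """Stack both adjusted images as a 2-channel input of shape (C, H, W)."""
    return np.stack([znorm_smoothstep(img), hist_equalize(img)], axis=0)
```

The two adjusted images are then fed to the network together, so the model can draw on whichever contrast rendering is more informative for a given slice.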

NCCT segmentation model with residual refine units

Figure 1 illustrates the overall architecture of the proposed R2U-RNet, which is based on the design of R2U-Net [11]. R2U-Net was chosen as the backbone architecture because its recurrent residual convolutions give it superior accuracy over the original U-Net [11]. Inspired by the image reconstruction branch of LapSRN [35], the proposed R2U-RNet further improves upon R2U-Net by appending the RRU. The R2U-RNet generates an intermediate output image at each level of resolution in the expansion path, yielding a set of multiscale output images. Each intermediate output image is obtained by refining its lower-level counterpart: a residual image is estimated and added to the up-sampled intermediate output from the level below. By incorporating this residual learning strategy, we aimed to greatly reduce the training difficulty of the model.
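The refinement step described above can be sketched as a small module. This is a minimal illustration assuming a single 3 × 3 convolution as the residual estimator; the actual RRU configuration in the paper may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualRefinementUnit(nn.Module):
    """Refines an up-sampled lower-resolution prediction by adding a
    residual image estimated from the decoder features at the current
    level. The layer configuration here is illustrative only."""

    def __init__(self, in_channels):
        super().__init__()
        # Hypothetical residual estimator: one conv producing a 1-channel map.
        self.residual_head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, features, coarse_pred):
        # Up-sample the intermediate prediction from the level below ...
        up = F.interpolate(coarse_pred, scale_factor=2, mode="bilinear",
                           align_corners=False)
        # ... and refine it with a learned residual image.
        return up + self.residual_head(features)
```

Because the unit only has to learn a correction to an already reasonable coarse prediction, the residual path carries most of the fine detail while the training burden stays small.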

Fig. 1

The architecture of the proposed segmentation model. Each input image consists of two image channels derived from two separate preprocessing procedures. The proposed R2U-RNet is based on an R2U-Net backbone architecture with an additional RRU and generates a set of predicted images at multiple scales. A noisy-label training scheme was proposed to alleviate label uncertainties by applying random label flipping during training. A multiscale focal loss was employed to leverage the importance at different levels of detail

Multiscale focal loss

Infarction usually occupies only a small portion of the overall image volume; there is thus an extreme imbalance between the volumes of the normal and infarcted brain areas. For this reason, the α-balanced focal loss (FL) proposed by Lin et al. [36] has been used in numerous studies on ischemic stroke segmentation to address such class imbalance [18, 37,38,39]. FL addresses the class imbalance problem via a modulating factor in addition to a class weighting factor:

$$FL\left(p,y\right)=\left\{\begin{array}{ll}-\alpha {\left(1-p\right)}^{\gamma }\log\left(p\right), & y=1\\ -\left(1-\alpha \right){p}^{\gamma }\log\left(1-p\right), & \mathrm{otherwise}\end{array}\right.$$

where \(\alpha \), \(y\), and \(p\) denote the weight, class label, and predicted probability, respectively, and \(\gamma \) is a non-negative focusing parameter. In this study, we set \(\alpha \) to the volume ratio between the normal and infarcted brain areas across all training data and set \(\gamma \) = 2 based on the empirical findings of Lin et al. [36].
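A per-pixel implementation of the α-balanced focal loss above may look as follows. The default α = 0.25 here is purely illustrative; as stated above, the study sets α from the class-volume ratio of the training data:

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Alpha-balanced focal loss, averaged over pixels.
    p: predicted foreground probability; y: binary label (same shape).
    alpha=0.25 is an illustrative default, not the study's setting."""
    p = p.clamp(eps, 1.0 - eps)                               # numerical safety
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p)          # case y == 1
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)    # case y == 0
    return torch.where(y == 1, pos, neg).mean()
```

The modulating factor (1 − p)^γ down-weights pixels that are already classified confidently, so the gradient concentrates on the hard, often lesion-boundary, pixels.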

In this study, the focal loss was applied in a multiscale manner to leverage the importance of different levels of detail. With the multiscale output images generated by the RRUs, we propose the multiscale focal loss (MFL), which is the weighted sum of the α-balanced focal loss calculated at all levels of resolution:

$$\mathrm{MFL}\left(\mathrm{P},\mathrm{Y}\right)=\sum_{d\in \mathrm{D}}\left[\sum_{i,{\mathrm{Y}}_{d,i}\in {\mathrm{Y}}_{d}}{\lambda }_{d}\mathrm{FL}({\mathrm{P}}_{d,i}, {\mathrm{Y}}_{d,i})\right],$$

where \(\mathrm{P}=\left\{{\mathrm{P}}_{d}|d\in D\right\}\) and \(\mathrm{Y}=\left\{{\mathrm{Y}}_{d}|d\in D\right\}\), in which \({\mathrm{P}}_{d}\) and \({\mathrm{Y}}_{d}\) denote the model prediction and label at subsampling rate \(d\), respectively. Subsampling of labels was performed using nearest-neighbor interpolation. \(D\) denotes the set of subsampling rates, which corresponds to the subsampling rates of the R2U-RNet outputs in this study (\(D=\left\{1, 2, 4, 8\right\}\)). To enhance segmentation at low resolutions, we assigned a higher weight to the loss at lower resolutions by setting each weight linearly proportional to the subsampling rate of its output level:

$$\left({\lambda }_{1}=\frac{1}{15}, {\lambda }_{2}=\frac{2}{15}, {\lambda }_{4}=\frac{4}{15}, {\lambda }_{8}=\frac{8}{15}\right).$$
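Putting the pieces together, the MFL can be sketched as below, assuming the multiscale predictions are provided as a dictionary keyed by subsampling rate. The single-scale focal loss is repeated here so the sketch is self-contained:

```python
import torch
import torch.nn.functional as F

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Single-scale alpha-balanced focal loss (mean over pixels)."""
    p = p.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p)
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)
    return torch.where(y == 1, pos, neg).mean()

# Weights lambda_d = d / 15, as specified in the text.
LAMBDA = {1: 1 / 15, 2: 2 / 15, 4: 4 / 15, 8: 8 / 15}

def multiscale_focal_loss(preds, label):
    """preds: dict {subsampling rate d: (N, 1, H/d, W/d) probability map}.
    label: full-resolution binary label of shape (N, 1, H, W)."""
    total = 0.0
    for d, p in preds.items():
        # Nearest-neighbor subsampling of the label, as in the text.
        y = label if d == 1 else F.interpolate(label, scale_factor=1.0 / d,
                                               mode="nearest")
        total = total + LAMBDA[d] * focal_loss(p, y)
    return total
```

Weighting the coarsest level most heavily (λ₈ = 8/15) encourages the model to get the overall lesion location right before refining its contour.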

Model training with noisy label

We proposed a training method inspired by the concept of noisy label [31] to deal with potential label uncertainty and improve the reliability of the model. We designed a function to model the flip probability based on the distance to the ROI border so that voxels near the ROI border have higher flip probability:

$${P}_{\mathrm{flip}}=0.5-\left|{G}_{\sigma }\left(L\right)-0.5\right|.$$

Herein, \({P}_{\mathrm{flip}}\in {R}^{N\times N}\) represents the probability map of label flipping, \(L\in {R}^{N\times N}\) denotes the label image, \({G}_{\sigma }\) denotes Gaussian smoothing with standard deviation \(\sigma \), and \(\left|\bullet \right|\) denotes the elementwise absolute value. An example of the label flipping process is illustrated in Fig. 2. During each iteration of the training process, a noisy label was randomly generated from the original label based on the calculated flip probability, and the model was trained using the model prediction and the generated noisy label (see Fig. 1).
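The flip-probability map and the per-iteration noisy-label sampling can be sketched as follows; σ = 2 is an illustrative choice, not necessarily the value used in the study:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def flip_probability(label, sigma=2.0):
    """P_flip = 0.5 - |G_sigma(L) - 0.5|: the flip probability peaks (at 0.5)
    on the ROI border and decays toward 0 deep inside or far outside the ROI."""
    smoothed = gaussian_filter(label.astype(np.float64), sigma=sigma)
    return 0.5 - np.abs(smoothed - 0.5)

def sample_noisy_label(label, sigma=2.0, rng=None):
    """Draw a noisy label by independently flipping each pixel of the
    binary label image with its computed flip probability."""
    if rng is None:
        rng = np.random.default_rng()
    p = flip_probability(label, sigma)
    flips = rng.random(label.shape) < p
    return np.where(flips, 1 - label, label)
```

A fresh noisy label would be sampled at every training iteration, so the model sees many plausible borders rather than a single, possibly misregistered, contour.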

Fig. 2

An illustration of the proposed noisy-label training scheme. a The DWI of an AIS patient. b The lesion area annotated by radiologists using DWI. c The flip probability map. d A noisy label image was generated by applying random label flip according to the flip probability map

Training and evaluation

The model was trained using the parameters described in Supplementary Materials S3. Evaluation metrics include Intersection over Union (IoU) [40], Dice Similarity Coefficient (DSC) [41], mean Hausdorff Distance (mHD) [42], Average Symmetric Surface Distance (ASSD) [42], True Positive Rate (TPR), True Negative Rate (TNR), and Positive Predictive Value (PPV). Several analyses were performed to evaluate the performance of various aspects of the proposed model:

1. First, we conducted a comparison of performance with other models. The models evaluated in this study included U-Net and R2U-Net. We also present selected examples of automated segmentation by the proposed model and R2U-Net to demonstrate the effectiveness of the proposed model.

2. In addition, we performed an ablation study to validate the efficacy of the components of the proposed R2U-RNet. The components evaluated in the ablation study included: a) recurrent residual convolution units; b) multichannel input image; c) noisy-label training scheme; d) residual refinement unit; e) multiscale loss design; and f) focal loss.

3. We also investigated potential factors contributing to differences in segmentation performance across the individual brain regions defined in the Alberta Stroke Program Early CT Score (ASPECTS) [8]. ASPECTS is a quantitative scoring system widely used in clinical practice for AIS diagnosis and the evaluation of treatment options, involving stroke assessment of ten important regions supplied by the middle cerebral artery. To investigate segmentation performance, the ASPECTS regions were delineated by an experienced radiologist on the ICBM152 template and spatially registered to each individual's image using the Segment function of SPM12 [43]. Only ASPECTS regions with significant lesion volume were included in the analysis. We used involvement criteria of > 30% and > 50%, under which the lesion covers more than 30% or more than 50% of a given ASPECTS region, respectively. General linear model (GLM) analysis was performed by modeling regional IoU according to 1) the number of lesion occurrences in a given ASPECTS region in the training data, 2) the squared number of lesion occurrences in a given ASPECTS region in the training data, and 3) the side (left or right hemisphere) of the lesion:

    $$\mathrm{IoU} \sim {b}_{0}+{b}_{1}N+{b}_{2}{N}^{2}+ {{b}_{3}\delta }_{R}$$

    where \(N\) denotes the number of lesion occurrences, and \({\delta }_{R}\) is a binary indicator variable whose value is 1 if the given region is in the right hemisphere and 0 otherwise.
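As a sketch, the coefficient fit for the GLM above can be reproduced with ordinary least squares; all variable names are illustrative, and the p-values reported in Table 3 would require additional statistics omitted here:

```python
import numpy as np

def fit_glm(iou, n_occ, is_right):
    """Ordinary least-squares fit of IoU ~ b0 + b1*N + b2*N^2 + b3*delta_R.
    iou: regional IoU values; n_occ: lesion occurrence counts N;
    is_right: 1 for right-hemisphere regions, 0 for left."""
    n = np.asarray(n_occ, dtype=float)
    r = np.asarray(is_right, dtype=float)
    # Design matrix with columns [1, N, N^2, delta_R].
    X = np.column_stack([np.ones_like(n), n, n**2, r])
    coef, *_ = np.linalg.lstsq(X, np.asarray(iou, dtype=float), rcond=None)
    return coef  # [b0, b1, b2, b3]
```

With noiseless synthetic data the least-squares fit recovers the generating coefficients exactly, which is a convenient sanity check on the design matrix.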

Results

Performance comparison with other models

We compared the performance of the proposed model for AIS segmentation using NCCT with U-Net and R2U-Net, with the results shown in Table 1. Compared to the other two methods, the proposed model achieved significantly higher overlap with the ground-truth ROIs (IoU = 42.34%, DSC = 54.25%), significantly lower surface distance (ASSD = 9.02 mm, mHD = 9.69 mm), and significantly higher TPR (62.04%). In contrast, it yielded slightly lower TNR (98.72%) and PPV (52.80%) than R2U-Net. Overall, the proposed model demonstrates significant improvement over the other two models in NCCT AIS segmentation. We also observed a large variance in segmentation performance across all models, which was especially prominent for smaller lesions (see Section S4 in the Supplementary Material for a detailed discussion). This may be due to the intrinsic difficulty of detecting smaller lesions, or to disease progression during the period between the NCCT acquisition at early onset and the follow-up DWI. Figure 3 shows segmentation results of selected participants for which the proposed model and R2U-Net demonstrated notable differences. The results from the proposed method were generally less detailed in terms of shape, yet effectively outlined the DWI annotations and yielded higher overlaps. By contrast, although the segmentation results from R2U-Net were more finely shaped, they generally showed poor overlap with the DWI ROIs, either falsely highlighting normal regions or missing large portions of infarction areas.
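For reference, the two overlap metrics reported above can be computed from binary masks as follows; this is a generic sketch rather than the study's evaluation code:

```python
import numpy as np

def iou_dsc(pred, truth):
    """Intersection over Union and Dice Similarity Coefficient
    between binary prediction and ground-truth masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    iou = inter / union if union else 1.0   # both masks empty -> perfect
    dsc = 2 * inter / total if total else 1.0
    return iou, dsc
```

Note that DSC ≥ IoU for any pair of masks, which is consistent with the DSC (54.25%) exceeding the IoU (42.34%) reported above.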

Table 1 The segmentation performance of U-Net, R2U-Net, and the proposed R2U-RNet
Fig. 3

Selected examples of segmentation results. The red and green regions are the ground truths and the prediction, respectively, and their intersections are colored in yellow. a DWI images. b The input NCCT images. c The predicted infarcted regions by R2U-Net. d The predicted infarcted regions by the proposed R2U-RNet

Ablation study

Table 2 shows the segmentation performance after each component of R2U-RNet was removed. The most prominent difference can be observed between model g (baseline model) and model f (focal loss replaced by α-balanced cross-entropy), where IoU and DSC dropped by more than ten percentage points. This indicates the necessity of handling class imbalance and of leveraging the importance of different levels of detail when learning segmentation tasks. Adding RRUs or the multiscale loss also increased IoU and DSC by approximately five percentage points (comparing models d and e with g, respectively), while including the histogram-equalized input image or using the noisy-label training scheme improved IoU and DSC by about three percentage points (comparing models b and c with g, respectively). Overall, this ablation study verifies that removing any component of the proposed network architecture resulted in a significant decrease in IoU, DSC, mHD, and ASSD, further demonstrating the efficacy of the proposed model design.

Table 2 Results of ablation study using different R2U-RNet variants

Segmentation accuracy in ASPECTS regions

Figure 4a, b shows bar plots of IoU in different ASPECTS regions under the involvement criteria of > 30% and > 50%. In addition, Fig. 5a, b shows scatter plots of the IoU values in hemispheric ASPECTS regions against their respective numbers of samples under the same criteria. Our results showed that ASPECTS regions in the left hemisphere consistently demonstrated higher IoU values than their contralateral counterparts. In addition, the number of samples in some ASPECTS regions showed a left/right imbalance, which might also contribute to the variations in segmentation performance. The potential causes of the observed performance variations were further investigated through the GLM analysis shown in Table 3. The results revealed that for both involvement criteria, the number of samples, the squared number of samples, and the side of the region all exhibited significant effects on the resulting IoU (p = 0.0014, 0.0051, and 0.0002, respectively, for the > 30% involvement criterion; p = 0.0188, 0.0367, and 0.0021, respectively, for the > 50% involvement criterion).

Fig. 4

IoU of NCCT segmentation in ASPECTS regions. The error bars signify the confidence interval of 95%. a Results using involvement > 30%. b Results using involvement > 50%

Fig. 5

Scatter plot of IoU of NCCT segmentation in ASPECTS regions against the number of samples. Crosses and circles denote the results from left and right hemispheres, respectively, and the dotted and solid lines are the regression line from ASPECTS regions in the left and right hemispheres, respectively. a Involvement > 30%. b Involvement > 50%

Table 3 The statistical results of the GLM analysis

Discussion

Comparison of segmentation results

Our results showed that the proposed model outperformed U-Net and R2U-Net in the NCCT AIS segmentation task. As shown in Fig. 3, the shapes of the manual annotations from follow-up DWIs were finely detailed. However, such details could not be easily observed in NCCT images, possibly due to poorer image contrast or the time difference between the early NCCT scans and follow-up MRI scans. As described above, the proposed method produced less finely detailed shapes yet effectively outlined the DWI annotations, whereas R2U-Net produced more finely shaped results that overlapped poorly with the DWI ROIs. This poor performance may be attributed to the difficulty of detecting infarctions in NCCT images at the same level of detail as in DWIs. Imposing finely detailed labels with no tolerance for labeling error during training may thus have caused R2U-Net to overfit. This demonstrates the need to leverage the importance of different levels of detail for segmentation. In our method, the multiscale design of the loss function, together with the noisy-label training scheme, enabled us to implicitly specify the importance of different levels of detail, thereby achieving improved segmentation performance.

In addition, Table 4 shows the performance reported by two recent studies on AIS segmentation using early NCCTs [24, 25]. Notably, the ground-truth images in our study were acquired much later after the NCCT scans (within one week) than those used in the listed literature (within one hour [24] and three hours [25] after the NCCT scan, respectively). The DSC of the proposed method is comparable or superior to those reported in these recent studies, while the average surface distance of our model is much larger than that reported by Kuang et al. [25]. The larger surface distance may reflect the inherent challenge of accurately predicting lesion outcome contours from early onset NCCTs. The accuracy of the ground-truth annotations used in our study is potentially affected by factors such as disease progression and errors in registration between thick-sliced NCCTs and DWIs. Nonetheless, by prioritizing overlap of general areas over alignment of detailed contours, our model achieves DSC comparable to studies that use early onset ground-truth images.

Table 4 Reported model performance of AIS lesion segmentation using early onset NCCTs in some recent studies

Factors behind segmentation performance

Our GLM analysis showed that ASPECTS regions with fewer lesion occurrences in the training data generally exhibited lower segmentation performance. This suggests that some crucial stroke-relevant information cannot be effectively generalized across brain regions. We expect that incorporating anatomical information or other region-specific prior knowledge may improve generalizability across brain regions and further enhance segmentation performance. For example, incorporating hemispheric differences or a stroke frequency map has been shown to be effective in previous studies on automatic segmentation [24, 25, 27].

Our GLM analysis also revealed significantly lower IoU in the right hemisphere compared to the left hemisphere. The observed hemispheric difference in segmentation accuracy may be linked to differences between right and left hemispheric stroke reported in previous clinical studies [44,45,46,47], including the number of cases, age, stroke severity, time from symptom onset to admission, functional outcomes, and recognizability. Such hemispheric differences may be attributable to the more complex right-sided symptoms and the lack of self-awareness in right-hemisphere stroke patients, leading to selection effects in patient admission or data acquisition [46]. Understanding how such selection bias affects segmentation accuracy, and how to mitigate the effect of the inherent hemispheric differences in clinical data on model performance, should be crucial topics for future research.

Limitations

Several limitations exist in our current work. First and foremost, predicting the outcome of AIS in real-world clinical scenarios is complicated by many factors. For example, the aforementioned difference in the noticeability of symptoms between left and right hemispheric stroke may lead to selection effects in the inclusion of AIS cases [45]. In addition, clinical diagnosis is subject to the decisions of different clinicians, which in turn affect treatment options and subsequent disease outcomes. The second limitation is that the time interval between the initial NCCT scan and the follow-up DWI scan was not perfectly controlled, which may result in slight variations in the observed disease outcomes. These factors further complicate the already challenging issue of NCCT AIS segmentation. Thirdly, although we used the proposed noisy-label training scheme to mitigate the effect of registration error, the magnitude of the registration error in this study and how it affects the performance of the proposed segmentation model still require further validation. Lastly, comparisons between reported performance measures should be interpreted conservatively because they are not directly comparable. Previous studies listed in our comparisons require additional imaging data, such as the stroke frequency map used by Kuang et al. [25], and thus cannot be directly evaluated on our in-house dataset. We expect that an openly available dataset for this emergent research topic would serve as a better benchmark for performance comparison.

Conclusions

We herein propose a novel model for AIS lesion segmentation in NCCT, incorporating NCCT scans from patients with AIS and manual annotations from follow-up DWI scans. The proposed model demonstrates superior accuracy compared to other deep neural network models, and our ablation study further reveals the efficacy of the model design. We anticipate that future development of a model able to account for variables such as the treatment undertaken and the aforementioned time intervals would be valuable. By including such additional clinical data, the model could predict the outcomes of different treatment options and serve as an effective tool for assisting clinicians in the diagnosis of AIS. Other future work should involve NCCT AIS segmentation using ground-truth DWIs acquired at early onset, and an automated ASPECTS scoring system based on the proposed segmentation model.