Introduction

Lung cancer is a significant global health challenge, accounting for 2.48 million new cases and 1.82 million deaths every year, making it the most prevalent and deadliest cancer worldwide according to the GLOBOCAN 2022 report1. It is by far the most lethal tumor in industrialized countries, killing more individuals than breast, prostate, and colon cancers combined. In Canada alone, lung cancer is responsible for 21,000 deaths yearly and represented 25% of all cancer deaths in 2019, with a dismal 5-year survival of 19%2. The majority of these patients are diagnosed with advanced-stage disease, when interventions are limited, emphasizing the need to improve early detection through screening, which can improve outcomes for this patient population.

Early detection is vital for improving lung cancer outcomes because it enables timely intervention and treatment3. In recent years, low-dose computed tomography (LDCT) screening has emerged as an early detection method with promising performance in reducing lung cancer mortality. The National Lung Screening Trial (NLST)4 provided convincing evidence of the efficacy of LDCT in the early detection of lung cancer. The NLST enrolled 53,454 high-risk participants and demonstrated an approximately 15–20% reduction in lung cancer deaths among individuals screened with LDCT compared with those screened with chest X-ray5. LDCT detected 649 lung cancer cases, 63% of which were stage I, whereas chest X-ray detected 279 cases, of which 47.6% were stage I6,7. According to findings from the NLST dataset, LDCT screening programs have the potential to prevent 12,000 lung cancer deaths annually in the United States alone8. These results underscore the essential role of early detection in reducing cancer mortality worldwide.

Recently, deep learning (DL) approaches have played a pivotal role in improving lung cancer diagnosis through the analysis of CT scan data, especially in malignancy classification. Deep learning approaches in medical imaging can be roughly categorized into two groups: 2D and 3D models. A 2D deep learning model accepts two-dimensional images of size (channel, width, height), where the channel dimension is the number of color channels (usually three for natural images in RGB and one for radiological images in grayscale). 2D models are widely used in image classification9, segmentation10, object detection11, and medical image analysis12. In the context of lung cancer detection, 2D models can be combined in ensemble approaches for precise malignancy classification13. Parameter optimization is another research focus: Lin et al.14 proposed Taguchi parametric optimization to automatically determine optimal parameters for a 2D deep learning model in lung cancer recognition. Unlike 2D models, a 3D deep learning model requires a three-dimensional volume of size (channel, width, height, depth) as input. These 3D models are frequently employed in areas such as video surveillance15, human action recognition16, and computer-aided diagnosis (CAD)17. Compared to 2D models, 3D models are more popular in lung cancer detection. More recently, Saravanaprasad et al.18 proposed SAACNet, a 3D segmentation attention network with asymmetric convolution, combined with a gradient boosting machine to improve pulmonary nodule classification. Mohandass et al.19 proposed AtCNN-DenseNet-201 TL-NBOA-CT, a lung cancer classification method using 3D CT volumes that features modified pre-processing (MSHKF), improved feature extraction (IEWT), and an attention-based CNN with transfer learning from DenseNet-201, optimized by the Namib Beetle Optimization Algorithm. While both 2D and 3D models for lung cancer detection achieve notable results, the scarcity of publicly available code and pretrained models hinders reproducibility and verification by other researchers.

One challenge in building a deep learning-based lung cancer risk prediction system is the wide range of model choices, each with its own strengths and weaknesses. InceptionV320 and ViTs21 excel in 2D image classification but are computationally intensive, while SqueezeNet22 and MobileNets23,24 perform well in 3D video classification but lack readily available pretrained models due to training complexity and niche applications. Because it remains unclear how their relative merits affect lung cancer risk prediction performance, a comprehensive evaluation of these 2D and 3D models is necessary for clinical translation.

Another aspect that may hinder the application of deep learning methods to lung cancer risk prediction is that most of these models were originally designed for general computer vision tasks. Even though these models achieve remarkable success in their respective domains, the different data distributions between lung cancer CT scans and general computer vision datasets can limit the transferability of the learned features and representations25. Although past studies26 have shown that fine-tuning these models on a radiological dataset can achieve acceptable results, their effectiveness in cancer risk prediction is still uncertain. Therefore, it is necessary to evaluate these deep learning models, which were designed for general computer vision tasks and fine-tuned on various datasets, in the specific context of lung cancer risk prediction. While the literature highlights the potential of deep learning models as a promising tool for various tasks, there is no evidence yet of successfully building clinically translatable diagnostic deep learning tools for lung cancer screening.

With the above premises, in this study we performed a systematic comparison of ten 3D and eleven 2D state-of-the-art (SOTA) deep learning models from different computer vision areas and evaluated them in the context of lung cancer risk prediction (malignant-benign classification). Each model was trained and tested on a well-labeled chest CT dataset drawn from a subset of the NLST dataset. We assessed model performance on the lung cancer risk prediction task, compared the models using different metrics, and evaluated their efficiency. Finally, we analyzed prediction performance when fine-tuning from different large-scale pretrained models. This enables us to identify suitable deep learning methods that can be used to develop clinically relevant, robust CT image-based models for future studies.

Materials and methods

Description of cohorts

The National Lung Screening Trial (NLST)4 is a comprehensive, multi-site study that evaluated the effectiveness of low-dose computed tomography (LDCT) versus chest X-ray (CXR) in screening individuals at high risk for lung cancer. Participants included current or former smokers aged 55–74 who had smoked at least 30 pack-years. Former smokers were eligible if they had quit within the preceding 15 years. Previous studies have detailed the study design and main findings of the NLST. We accessed radiological and clinical data from the LDCT arm through the National Cancer Institute (NCI) Cancer Image Archive27. Following the case-control analysis of Cherezov et al.28, we selected a total of 253 participants who had non-contrast CT scans: 150 of the 255 patients from the first follow-up screen (T1) and 103 of the 212 patients from the second follow-up screen (T2). Participants from T1 were assigned to the training cohort (N = 150), and those from T2 were assigned to the test cohort (N = 103). The nodule annotations were obtained from the Cherezov et al.28 study, in which they were manually delineated by a radiologist. The dataset construction consort diagram of lung cancers and nodule-positive controls is shown in Fig. 1.

Fig. 1

Dataset construction consort diagram.

Deep learning model candidates

We implemented both 2D and 3D deep learning models to leverage the slice-wise features and volumetric spatial information present in chest CT scans, respectively. Table 1 shows the details of the implemented architectures, which can be further categorized into convolutional neural networks and vision transformers.

Convolutional neural networks

The 3D deep learning candidates from the CNN category were selected from the popular ResNet family, consisting of ResNet1829,30,31, ResNet5029,30,31, ResNet10129,30,31, ResNeXt10129,30,31, and R2Plus1D30, which represent a significant advancement in 3D deep learning architectures. These models leverage residual connections, which enable the training of deeper networks by alleviating the vanishing gradient problem29. The diverse depth configurations (18, 50 and 101 layers) of the ResNet models provide a trade-off between accuracy and computational complexity, making them well suited to various 3D computer vision tasks. ResNeXt extends ResNet by introducing a new hyperparameter that represents the number of parallel paths in each residual block. We also investigated the R2Plus1D model, which extends the residual learning concept to video data by decomposing 3D convolutions into separate spatial and temporal components to achieve better video representations for tasks such as action recognition and video understanding30. Evaluating the ResNet family models enables us to gain a deeper understanding of how their depths and network designs impact performance in lung cancer risk prediction.
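To make the residual-connection idea concrete, the following is a minimal PyTorch sketch of a 3D basic residual block. It is an illustrative simplification; the channel configuration and block layout are ours, not the exact blocks of the ResNet variants evaluated in this study.

```python
import torch
import torch.nn as nn

class BasicResidualBlock3D(nn.Module):
    """Minimal 3D residual block: output = ReLU(F(x) + x).

    The identity shortcut lets gradients bypass the convolutions,
    which alleviates the vanishing-gradient problem in deep networks.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)  # skip connection

# Example: one 64x64x64 nodule volume with 32 feature channels
x = torch.randn(1, 32, 64, 64, 64)
print(BasicResidualBlock3D(32)(x).shape)  # torch.Size([1, 32, 64, 64, 64])
```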

In addition, the 3D candidates from the CNN category include efficient architectures: ShuffleNetv131,32, ShuffleNetv231,33, SqueezeNet22,31, MobileNetv123,31 and MobileNetv224,31. ShuffleNetv1 reduces computational demands by shuffling channels and using grouped convolutions while maintaining high performance. ShuffleNetv2 further improves efficiency by introducing new channel-splitting and channel-shuffling operations. SqueezeNet takes a different approach, squeezing and expanding features to achieve high accuracy with a lighter network structure. MobileNetV1 leverages depthwise separable convolutions to significantly reduce the number of model parameters. Its successor, MobileNetV2, introduces inverted residual blocks and linear bottlenecks, which allow the network to learn more expressive features while maintaining a low parameter count. Evaluating these efficient architectures allows us to better understand how to balance efficiency and performance in lung cancer risk prediction.
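The parameter savings of depthwise separable convolutions can be illustrated with a short PyTorch sketch; the layer below is a hypothetical 3D analogue of the MobileNetV1 building block, not the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3D(nn.Module):
    """Depthwise separable 3D convolution (illustrative MobileNet-style block).

    A depthwise convolution filters each channel independently (groups=in_channels),
    then a 1x1x1 pointwise convolution mixes channels. This factorization uses far
    fewer parameters than a standard dense 3D convolution.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.depthwise = nn.Conv3d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3x3 convolution (32 -> 64 channels)
standard = nn.Conv3d(32, 64, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv3D(32, 64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 55296 vs. 2912 parameters
```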

The 2D deep learning candidates in this category include ResNet5029, DenseNet12134, and InceptionV320, which have been widely applied to 2D image classification and have demonstrated strong performance in other computer vision tasks. CNNs effectively learn hierarchical features from input images, which makes them suitable for medical imaging tasks. ResNet50 introduces residual connections that make training deep networks efficient. DenseNet121 employs dense connectivity to improve feature reuse within the model. InceptionV3 utilizes inception modules with different kernel sizes to capture multi-scale features. These CNN-based models serve as strong baselines and provide a comprehensive comparison with the other competitors in lung cancer risk prediction.

Vision transformers (ViT): As an emerging family of deep learning architectures with promising performance in many computer vision tasks, vision transformers leverage the self-attention mechanism to capture global dependencies in input images, making them an alternative to traditional CNNs. In this study, we investigated several 2D deep learning candidates from the ViT category: ViT-Large21,35, BEiTv221,36, BEiTv137, CAFormer-B3638, ConvFormer-B3638, DeiT3-Large39, Swin-Large40 and VOLO-D441. ViT-Large is the original Vision Transformer design, which uses self-attention to capture global patterns in images. BEiTv2 improves on self-supervised pre-training by using masked semantic modeling to learn better visual features, whereas BEiTv1 uses only masked image modeling for pre-training. CAFormer-B36 combines convolution and self-attention to enhance both local and global feature learning. ConvFormer-B36 integrates convolution into the transformer to capture local spatial details. DeiT3-Large builds on Data-efficient Image Transformers (DeiT) with design enhancements and improved training approaches. Swin-Large utilizes a hierarchical transformer with shifted-window attention for efficient local and global context modeling. VOLO-D4 presents a vision outlooker that learns detailed visual representations by attending at both the token and patch levels. Evaluating these emerging ViT structures enhances our understanding of how modern deep learning architectures perform in the context of lung cancer risk prediction.
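As a minimal illustration of the self-attention operation at the core of these ViT variants, the sketch below applies PyTorch's built-in multi-head attention to a hypothetical sequence of patch embeddings from a single 224 × 224 slice; the dimensions are illustrative, not those of any specific model above.

```python
import torch
import torch.nn as nn

# A 224x224 slice split into 16x16 patches gives 14*14 = 196 tokens.
# Self-attention lets every token attend to every other token, capturing
# global context in one layer (unlike the local receptive field of a convolution).
embed_dim, num_tokens = 768, 196
patch_tokens = torch.randn(1, num_tokens, embed_dim)          # (batch, tokens, dim)

attention = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)
out, weights = attention(patch_tokens, patch_tokens, patch_tokens)
print(out.shape, weights.shape)  # (1, 196, 768) and (1, 196, 196) attention map
```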

Pre-trained models

All of the described models achieve top performance on various benchmarks, such as image classification on ImageNet42 and video classification on Kinetics43. Each candidate model is pretrained on a large-scale general-purpose dataset: Kinetics for 3D models and ImageNet for 2D models. To investigate the impact of pretraining on a radiological dataset, we also leveraged pretrained models from 3DSeg-844 and nnUnet10,44 for 3D models and from RadImageNet45 for 2D models. Note that each pretrained model was pretrained on only one dataset. We briefly introduce these datasets below; a sketch of how such pretrained backbones can be loaded follows the list:

  • ImageNet is a large-scale visual database designed for visual object recognition tasks. It contains over 14 million images organized into more than 20,000 categories, making it a crucial resource for training and benchmarking 2D computer vision models.

  • RadImageNet is a medical imaging dataset specifically curated for radiology applications, consisting of over 1.3 million radiologic images across various modalities and anatomical regions. It aims to improve the performance and generalizability of AI models in medical imaging tasks.

  • Kinetics is a large-scale, high-quality dataset of YouTube videos for human action recognition. It contains over 650,000 video clips spanning 700 action classes, each lasting around 10 seconds, making it a valuable resource for training and evaluating deep learning models in video understanding tasks. Note that the model weights were pretrained by Hara et al.30.

  • 3DSeg-8 is a comprehensive collection of existing 3D medical image datasets specifically curated for segmentation tasks. It covers eight different anatomical structures or regions: brain, hippocampus, prostate, liver, heart, pancreas, vessel, and spleen.

  • nnU-Net’s training set includes a variety of medical imaging datasets such as the Medical Segmentation Decathlon, which comprises ten different medical image segmentation tasks across various organs and modalities. Note that the model weights were pretrained by Med3D31.
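Below is a hedged sketch of how such pretrained backbones could be loaded and adapted for binary malignancy prediction. torchvision's Kinetics-400-pretrained r3d_18 is used here as a convenient stand-in for the Hara et al.30 weights actually used in the study, and the timm model name should be verified against the installed version.

```python
import torch.nn as nn
import timm
from torchvision.models.video import r3d_18, R3D_18_Weights

# 3D backbone: torchvision ships a Kinetics-400-pretrained 3D ResNet18
# (stand-in for the Hara et al. weights used in the study).
model_3d = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model_3d.fc = nn.Linear(model_3d.fc.in_features, 2)   # benign vs. malignant head
# Note: the stem convolution expects 3-channel input; grayscale CT volumes
# would need channel replication or stem adaptation.

# 2D backbone: ImageNet-pretrained DeiT3-Large via PyTorch Image Models (timm);
# in_chans=1 adapts the patch embedding to single-channel CT slices.
model_2d = timm.create_model("deit3_large_patch16_224", pretrained=True,
                             num_classes=2, in_chans=1)
```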

Table 1 Summary of the deep learning models implemented in this study. Params is the abbreviation for the number of parameters.

Lung cancer risk prediction workflow

The workflow for lung cancer risk prediction using 2D and 3D deep learning models is presented in Fig. 2. The workflow can be divided into three parts. During data preprocessing, the nodule region-of-interest (ROI) slices are extracted from the chest CT scans according to the maximum bounding rectangle of the annotations. For 3D models, the ROI slices are stacked into a 3D volume that serves as the model input, and the model predicts the probability of malignancy for each nodule ROI volume. For 2D models, the 3D volume is decomposed into 2D slices; the model predicts a score for each slice, and the final score is the average of these per-slice scores. To implement and validate the deep learning models effectively, all models were developed on the training cohort using a nested three-fold cross-validation design. After development, all models were trained on the full training cohort, and model performance was reported on the test cohort. Cross-validation results were obtained by training on two folds of the training set and evaluating on the remaining fold; test-set results were obtained by training on the entire training set and evaluating on the test set.
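The two inference paths can be summarized in a short sketch. The function below is illustrative (names and tensor shapes are ours): 3D models score a stacked ROI volume directly, while 2D models average per-slice malignancy probabilities.

```python
import torch
import torch.nn.functional as F

def predict_malignancy(roi_slices: torch.Tensor, model: torch.nn.Module, mode: str) -> float:
    """Return a malignancy probability for one nodule ROI.

    roi_slices: tensor of shape (num_slices, H, W), one grayscale slice per row.
    mode: "3d" stacks the slices into a single volume; "2d" scores each slice
    independently and averages the per-slice probabilities.
    """
    model.eval()
    with torch.no_grad():
        if mode == "3d":
            volume = roi_slices.unsqueeze(0).unsqueeze(0)      # (1, 1, D, H, W)
            logits = model(volume)                             # (1, 2)
            return F.softmax(logits, dim=1)[0, 1].item()
        else:
            slices = roi_slices.unsqueeze(1)                   # (D, 1, H, W): batch of slices
            logits = model(slices)                             # (D, 2)
            probs = F.softmax(logits, dim=1)[:, 1]             # per-slice malignancy probability
            return probs.mean().item()                         # final score = slice average
```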

Evaluation metrics

Fig. 2

Workflow of 2D and 3D prediction models for lung cancer risk assessment using chest CT scans. The workflow is divided into three main parts: data preprocessing (orange dashed rectangle), 3D model workflow (blue dashed rectangle), and 2D model workflow (green dashed rectangle).

In the experiments, we report model performance in malignancy classification using three evaluation metrics: accuracy, F1 score, and area under the receiver operating characteristic curve (AUROC). For the three-fold cross-validation, we report the average performance across the three folds, using the results from the epoch with the best AUROC score in each fold. When reporting test performance, we use the results from the epoch with the highest AUROC score.
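A minimal sketch of how these metrics can be computed with scikit-learn is shown below; the 0.5 decision threshold for accuracy and F1 is an assumption, as the threshold is not stated in the text.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Accuracy, F1, and AUROC for predicted malignancy probabilities.

    A 0.5 decision threshold is assumed for accuracy/F1; cross-validation
    performance is then averaged across the three folds.
    """
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_prob),
    }

# Toy usage on synthetic labels/probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=100), 0, 1)
print(evaluate(y_true, y_prob))
```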

Implementation details

All 3D models were trained for 120 epochs with a learning rate of 1e-4. We used the Adam optimizer with parameters \(\beta_1 = 0.9\) and \(\beta_2 = 0.99\). The batch size used for training the 3D models was 4, and all 3D volumes were resized to \(64 \times 64 \times 64\). For the 2D models, ResNet50, InceptionV3, and DenseNet121 follow the same hyperparameters as the 3D models, while the remaining 2D models follow the fine-tuning parameters provided by PyTorch Image Models46. All 2D slices were resized to \(224 \times 224\). Since 2D models only accept 2D slices, we set the batch size to the number of slices in the input 3D volume: each 3D volume is preprocessed into a batch of 2D slices before being fed into the model. We applied a simple data augmentation strategy consisting of random 90-degree rotation and random flipping. All experiments were run on the HPC of Université Laval using either an Nvidia Tesla V100 16GB GPU (for 3D model training) or an Nvidia Tesla A100 80GB GPU (for 2D model training).
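A hedged sketch of the 3D training configuration described above is given below. The loss function (cross-entropy) and the `model`/`train_loader` objects are assumptions introduced for illustration; only the optimizer settings, batch and input sizes, epoch count, and augmentations come from the text.

```python
import random
import torch
import torch.nn.functional as F
from torch.optim import Adam

def augment(volume: torch.Tensor) -> torch.Tensor:
    """Random 90-degree rotation and random flipping of a (C, D, H, W) volume."""
    k = random.randint(0, 3)
    volume = torch.rot90(volume, k, dims=(2, 3))        # rotate in the in-plane (H, W) axes
    if random.random() < 0.5:
        volume = torch.flip(volume, dims=(3,))          # random horizontal flip
    return volume

# `model` and `train_loader` (yielding 64x64x64 volumes in batches of 4)
# are assumed to exist; cross-entropy is an assumed choice of loss.
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))

for epoch in range(120):
    model.train()
    for volumes, labels in train_loader:                # volumes: (4, 1, 64, 64, 64)
        volumes = torch.stack([augment(v) for v in volumes])
        logits = model(volumes)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```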

Results

Patient characteristics

The study population characteristics of the training and test cohorts are summarized in Table 2. None of the characteristics differed significantly between the training and test cohorts. The cohorts share similar average ages, with the training cohort slightly older on average (63.80 vs. 63.24 years). Regarding sex distribution, the majority of patients are male, with a slightly higher percentage in the training cohort (57.33%) than in the test cohort (54.37%). In both cohorts the majority are current smokers, with a higher percentage in the test cohort (58.25%) than in the training cohort (52.00%). As for nodule location, the training cohort has a slightly higher percentage of right-lobe nodules (58.00% vs. 53.40%) and a lower percentage of left-lobe nodules (42.00% vs. 45.63%). The distribution of nodule sizes (< 6 mm, 6–16 mm, ≥ 16 mm) is similar between the two cohorts, with comparable percentages in each size range. The distribution of cancer stages is also similar between the two cohorts, with small differences across stages.
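The specific statistical tests behind this cohort comparison are not stated in the text; the sketch below assumes a two-sample t-test for continuous variables and a chi-square test for categorical variables, with placeholder data generated from the reported summary statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical example: compare age (continuous) and sex (categorical) between cohorts.
# Placeholder samples are simulated from the reported means; the actual per-patient
# values and the tests used in the study may differ.
age_train = np.random.normal(63.80, 5.0, 150)
age_test = np.random.normal(63.24, 5.0, 103)
t_stat, p_age = stats.ttest_ind(age_train, age_test)

#                        male  female
contingency = np.array([[86, 64],      # training cohort (57.33% male of 150)
                        [56, 47]])     # test cohort (54.37% male of 103)
chi2, p_sex, dof, _ = stats.chi2_contingency(contingency)
print(f"age p-value: {p_age:.3f}, sex p-value: {p_sex:.3f}")
```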

Table 2 Study population characteristics of training and test cohorts. Continuous data were reported as mean ± standard deviation (SD), and categorical data as counts and percentages.

Lung cancer risk prediction performance

In this sub-section, we present the comparative study of 3D and 2D deep learning models. Supplementary Tables 1 and 2 summarize the performance of the 3D and 2D models, respectively, on the cross-validation set and the test set. In Supplementary Table 1, ShuffleNetv2 edged out the other 3D models on all three metrics (accuracy = 0.83, F1 score = 0.72, AUROC = 0.86) in the cross-validation experiments. On the test set, ResNet50 achieved the best accuracy (0.81), while the MobileNet family demonstrated superior performance on both the F1 score (0.70) and AUROC (0.86) metrics. In all cases, the best 3D models were pretrained on the Kinetics dataset. From Supplementary Table 2, Swin-Large obtained the best cross-validation accuracy (0.81), and VOLO-D4 performed best in terms of F1 score (0.74) and AUROC (0.85). On the test set, DeiT3 marginally exceeded the other 2D models across all evaluation metrics (accuracy = 0.77, F1 score = 0.64, AUROC = 0.79). The best 2D models were pretrained on the ImageNet dataset.

3D vs. 2D models

Fig. 3

The best models and performance statistics for 3D and 2D models. The best models' performance is summarized in Fig. 3A and B for the cross-validation set and the test set, respectively. The performance statistics are summarized in Fig. 3C and D for the cross-validation set and the test set, respectively.

To provide a clear picture of the overall performance of 3D and 2D models, we summarize the best models and the overall performance statistics in Fig. 3. The best models were selected according to their AUROC and were pretrained on the Kinetics dataset for 3D models and on the ImageNet dataset for 2D models. The best performance on the cross-validation set (Fig. 3A) was achieved by 3D ShuffleNetv2 (accuracy = 0.83, F1 score = 0.72, AUROC = 0.86) and 2D VOLO-D4 (accuracy = 0.80, F1 score = 0.74, AUROC = 0.85). The best performance on the test set (Fig. 3B) was achieved by 3D MobileNetv1 (accuracy = 0.79, F1 score = 0.67, AUROC = 0.86) and 2D DeiT3-Large (accuracy = 0.77, F1 score = 0.64, AUROC = 0.79). The performance statistics on the cross-validation set and the test set (Fig. 3C and D) show that 3D models generally perform better than 2D models, with higher metrics and greater stability.

Fig. 4

AUROC curves of 3D and 2D models on the test set. The curves in cold tones represent the AUROC performance of 3D models, while the curves in warm tones depict the AUROC performance of 2D models. The models in the legend are listed in descending order of their AUROC scores.

Pretrained on general-purpose vs. radiological datasets

In Fig. 4, we present the AUROC curves for the 3D and 2D models pretrained on the Kinetics and ImageNet datasets, respectively, and evaluated on the test set. Overall, the 3D models performed modestly better than the 2D models. The highest and lowest performances among the 3D models were obtained by MobileNetv1 (AUROC = 0.86) and R2Plus1D (AUROC = 0.78), respectively, while the highest and lowest performances among the 2D models were exhibited by DeiT3-Large (AUROC = 0.79) and InceptionV3 (AUROC = 0.62), respectively.

Table 3 summarizes the prediction performance of three 3D deep learning models pretrained on various datasets. For ResNet18, the model pretrained on nnUnet achieved the best accuracy (0.79) and F1 score (0.67) on the training set, as well as the best F1 score (0.65) and AUROC (0.84) on the test set. The Kinetics-pretrained ResNet18 obtained the best AUROC (0.83) on the training set and the best accuracy (0.77) on the test set. For ResNet50, the best performance on the training set was obtained by the model pretrained on nnUnet (accuracy = 0.78, F1 score = 0.66, AUROC = 0.82), while the best performance on the test set was obtained by the Kinetics-pretrained model (accuracy = 0.81, F1 score = 0.69, AUROC = 0.82). For ResNet101, the Kinetics-pretrained model performed better on 5 of 6 metrics across the training and test sets (training set: accuracy = 0.73, AUROC = 0.80; test set: accuracy = 0.79, F1 score = 0.67, AUROC = 0.82), while the model pretrained on 3DSeg-8 achieved the best F1 score on the training set (0.64).

Table 3 Performance comparison for different pretrained 3D deep learning models. The best evaluation score for each model is marked in bold.

Table 4 presents the prediction performance of three 2D deep learning models pretrained on the ImageNet and RadImageNet datasets. For ResNet50, the ImageNet-pretrained model had the best performance on both the training and test sets (training set: accuracy = 0.59, F1 score = 0.59, AUROC = 0.78; test set: accuracy = 0.74, F1 score = 0.56, AUROC = 0.77). For DenseNet121, the ImageNet-pretrained model exhibited the best performance on the training set (accuracy = 0.57, F1 score = 0.53, AUROC = 0.66) and the best accuracy (0.62) on the test set, whereas the best test-set F1 score (0.47) and AUROC (0.72) were achieved by the RadImageNet-pretrained model. For InceptionV3, the best metrics on both the training and test sets were obtained by the RadImageNet-pretrained model (training set: accuracy = 0.74, F1 score = 0.57, AUROC = 0.77; test set: accuracy = 0.61, F1 score = 0.56, AUROC = 0.76).

Table 4 Performance comparison for different pretrained 2D deep learning models. The best evaluation score for each model is marked in bold.

Comparison among models in different scales

Fig. 5

Efficiency of the 3D/2D prediction models. The models are evaluated on two metrics: F1 score on the x-axis and AUROC on the y-axis. The size of each dot represents the approximate number of parameters in millions (M); larger dots indicate models with more parameters. The models in the legend are listed in descending order of parameter count. Dots in cold tones represent 3D models, and dots in warm tones represent 2D models. The index of each model is shown at the center of its dot.

Figure 5 illustrates model efficiency by considering the trade-off between performance (F1 score and AUROC) and model scale (number of parameters). Models plotted toward the upper right corner indicate better performance, while those toward the lower left corner indicate lower performance. 2D models with ViT structures generally have larger scales than the 3D models yet are slightly outperformed by their 3D competitors. BEiTv1 and BEiTv2 are the largest models (303 million parameters) but achieved only mid-range performance. In contrast, SqueezeNet is the lightest model (1 million parameters) yet delivers a top-level F1 score and AUROC.
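Parameter counts like those behind Fig. 5 can be reproduced for the 2D candidates with a short timm sketch; the model names below follow recent timm conventions and should be verified against the installed version (they are illustrative, not a verbatim list of the study's configurations).

```python
import timm

# Approximate parameter counts for some of the 2D candidates (cf. Table 1).
for name in ["resnet50", "densenet121", "deit3_large_patch16_224",
             "beit_large_patch16_224", "volo_d4_224"]:
    model = timm.create_model(name, pretrained=False, num_classes=2)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```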

Discussion

Deep learning approaches have shown superior capability in various medical image analysis tasks, such as lesion segmentation47, cancer detection48 and survival analysis49. Although previous work50 on deep learning-based lung cancer risk prediction suggests that this emerging technology has promising potential in both diagnostic accuracy and efficiency, it has primarily focused on model design for specific datasets and has rarely taken advantage of currently successful deep learning models from the computer vision domain. With this premise, we conducted a comprehensive evaluation and analysis of state-of-the-art deep learning models originally designed for general-purpose computer vision in the context of lung cancer risk prediction. We analyzed both 2D and 3D model designs, the impact of pretraining on general-purpose datasets versus radiological datasets, and model efficiency. Our results highlight the significance of model selection and pretraining dataset choice in achieving optimal performance. By addressing this gap in the current literature, our findings offer guidance for selecting appropriate model architectures and pretraining datasets, facilitating the development of more accurate and efficient lung cancer risk prediction tools.

A prior study51 reported closely matched preferences for 2D and 3D model architectures in deep learning-based lung cancer diagnosis. Interestingly, our results suggest that 3D models generally perform better than their 2D competitors from several perspectives. According to the performance on the cross-validation set (Fig. 3A and C), 3D models demonstrate stronger stability and higher resistance to potential bias and overfitting, despite the small gap in F1 score between the best 3D and 2D models. Furthermore, Fig. 3B and D indicate an overall leading performance of 3D models on the test set, which suggests that 3D models generalize better and have stronger overall capability. Since our cohorts are imbalanced (81 malignant and 172 benign cases), the AUROC metric is particularly informative because of its sensitivity to the true positive rate (correctly identifying malignant cases) and false positive rate (incorrectly identifying benign cases as malignant), and because it evaluates performance across thresholds. 3D models generally show higher AUROC scores, which indicates that they handle imbalanced datasets better and are less likely to produce biased results. The superior performance of 3D models highlights the importance of capturing the spatial context and continuity present in volumetric data, which leads to a better understanding of the relationships between adjacent structures and the ability to differentiate tissues that may appear similar in 2D slices but have different 3D configurations. Nevertheless, 3D models may not outperform 2D models in other tasks. For instance, Kakigi et al.52 compared thin-slice 2D fat-saturated proton density-weighted images with deep learning-based reconstruction (dDLR) to 3D fat-saturated proton density multi-planar voxel images for shoulder joint MRI and found that the 2D approach with dDLR provides superior image quality and anatomical visualization. Therefore, we have only verified that 3D models perform better in lung cancer risk prediction using CT scans; their performance in other medical tasks remains to be investigated.

The choice of pretraining dataset is a vital factor that can heavily impact prediction performance. Models pretrained on datasets from specific medical domains can improve performance in corresponding downstream diagnostic tasks44,45,50. However, in our study we found that general-purpose video/image datasets (Kinetics, ImageNet) provide a good foundation for transfer learning across different models and tasks. The specialized radiological datasets (3DSeg-8, nnUnet, and RadImageNet) show mixed results, improving performance in some cases (e.g., 3D ResNet18 and 2D InceptionV3) while decreasing it in others. The varying impact of pretraining datasets across architectures indicates that certain models are more sensitive to the domain-specific features of the pretraining dataset. Radiological datasets whose scan regions differ from the lung can negatively impact transfer learning and generalization. Although the general-purpose datasets were not originally designed for lung cancer prediction, they are still capable of providing sufficient generalization by helping models learn a broad range of features applicable to various medical tasks.

In addition to model design, model scale (i.e., the number of parameters) is another factor that can affect deep learning-based lung cancer risk prediction models. Our results align with previous findings31,46 showing that model scale is not strongly correlated with model performance. Figure 5 illustrates that 3D MobileNetV1 (12 million parameters), 3D ShuffleNetV2 (3 million parameters), and 3D SqueezeNet (1 million parameters) demonstrate higher model efficiency by achieving higher AUROC with fewer parameters. In contrast, the largest models, such as 2D BEiTv1 and BEiTv2 (> 300 M parameters), do not necessarily yield the best overall performance. This indicates potential diminishing returns at high parameter counts and highlights that model performance does not depend solely on scale. Our results also suggest that effective model design enables lightweight models to edge out heavier competitors. However, Fig. 5 also reveals that extremely light models (< 10 million parameters) may struggle to match the performance of medium- to large-sized models, indicating a potential lower bound on model size for the lung cancer risk prediction task.

The strengths of this study include a comprehensive analysis and evaluation of SOTA deep learning models that are representative and already widely used in various vision tasks. The evaluation covers multiple aspects, from prediction performance and pretraining datasets to model efficiency. Our findings can guide a more effective model selection process for lung cancer risk prediction. However, our study has three important limitations. First, the cohorts are relatively small, making it difficult to reliably evaluate model generalization in a small patient population. Second, the data distribution of the existing cohorts is imbalanced, so potential bias and overfitting are unavoidable during model training. Third, as this study was conducted on patients with a smoking history, lung cancer risk prediction performance may vary when patient characteristics change. To mitigate the impact of these limitations, further investigation in a larger, more diverse cross-institutional cohort is necessary. In conclusion, deep learning models may serve as an effective method for lung cancer risk prediction using CT scans. 3D deep learning architectures are more suitable for this task due to their outstanding prediction performance and better trade-off between model efficiency and performance. Further investigation with a larger, cross-institutional cohort is necessary to confirm these findings.