Introduction

Lung cancer is a significant global health challenge, accounting for 2.48 million new cases and 1.82 million deaths every year, making it the most prevalent and deadliest cancer worldwide according to the GLOBOCAN 2022 report1. It is by far the most lethal tumor in industrialized countries, killing more individuals than breast, prostate, and colon cancers combined. In Canada alone, lung cancer is responsible for 21,000 deaths yearly and represented 25% of all cancer deaths in 2019, with a dismal 5-year survival of 19%2. The majority of these patients are diagnosed with advanced-stage disease, when interventions are limited, emphasizing the need to improve early detection through screening, which can improve outcomes for this patient population.

Early detection is vital for improving lung cancer outcomes because it enables timely intervention and treatment3. In recent years, low-dose computed tomography (LDCT) screening has emerged as an early detection method with promising performance in reducing lung cancer mortality. The National Lung Screening Trial (NLST)4 provided convincing evidence of the efficacy of LDCT in the early detection of lung cancer. The NLST enrolled 53,454 high-risk participants and demonstrated an approximately 15–20% reduction in lung cancer deaths among individuals screened with LDCT compared with those screened with chest X-ray5. LDCT detected 649 lung cancer cases, 63% of which were stage I, whereas chest X-ray detected 279 cases, of which 47.6% were stage I6,7. According to findings from the NLST dataset, LDCT screening programs have the potential to prevent 12,000 lung cancer deaths annually in the United States alone8. These results underscore the essential role of early detection in reducing cancer mortality worldwide.

Recently, deep learning (DL) approaches have played a pivotal role in improving lung cancer diagnosis through the analysis of CT scan data, especially in malignancy classification. Deep learning approaches in medical imaging can be roughly categorized into two groups: 2D and 3D models. A 2D deep learning model accepts two-dimensional images of size (channel, width, height), where the channel dimension is the number of color channels (usually three for natural images in RGB and one for radiological images in grayscale). 2D models are widely used in image classification9, segmentation10, object detection11, and medical image analysis12. In the context of lung cancer detection, 2D models can be combined in ensemble approaches for precise malignancy classification13. Parameter optimization is another research focus: Lin et al.14 proposed Taguchi parametric optimization to automatically determine optimal parameters for a 2D deep learning model in lung cancer recognition. Unlike 2D models, a 3D deep learning model requires a three-dimensional volume of size (channel, width, height, depth) as input. These 3D models are frequently employed in areas such as video surveillance15, human action recognition16, and computer-aided diagnosis (CAD)17. Compared to 2D models, 3D models are more popular in lung cancer detection. More recently, Saravanaprasad et al.18 proposed SAACNet, a 3D segmentation attention network with asymmetric convolution, combined with a gradient boosting machine to improve pulmonary nodule classification. Mohandass et al.19 proposed AtCNN-DenseNet-201 TL-NBOA-CT, a lung cancer classification method using 3D CT volumes that features modified pre-processing (MSHKF), improved feature extraction (IEWT), and an attention-based CNN with transfer learning from DenseNet-201, optimized by the Namib Beetle Optimization Algorithm. While both 2D and 3D models for lung cancer detection achieve notable results, the scarcity of publicly available code and pretrained models hinders reproducibility and verification by other researchers.

One challenge in building a deep learning-based lung cancer risk prediction system is the wide range of model choices, each with its own strengths and weaknesses. InceptionV320 and ViTs21 excel in 2D image classification but are computationally intensive, while SqueezeNet22 and MobileNets23,24 perform well in 3D video classification but lack readily available pretrained models due to training complexity and niche applications. Because it remains unclear how their relative merits affect lung cancer risk prediction performance, a comprehensive evaluation of these 2D and 3D models is necessary for clinical translation.

Another aspect that may hinder the application of deep learning methods to lung cancer risk prediction is that most of these models were originally designed for general computer vision tasks. Even though these models achieve remarkable success in their respective domains, the different data distributions between lung cancer CT scans and general computer vision datasets can limit the transferability of the learned features and representations25. Although past studies26 have shown that fine-tuning these models on a radiological dataset can achieve acceptable results, their effectiveness in cancer risk prediction is still uncertain. Therefore, it is necessary to evaluate these deep learning models, which were designed for general computer vision tasks and fine-tuned on various datasets, in the specific context of lung cancer risk prediction. While the literature highlights the potential of deep learning models as a promising tool for various tasks, there is no evidence yet of successfully building clinically translatable diagnostic deep learning tools for lung cancer screening.

With the above premises, in this study we performed a systematic comparison of ten 3D and eleven 2D state-of-the-art (SOTA) deep learning models from different computer vision areas and evaluated them in the context of lung cancer risk prediction (malignant-benign classification). Each model was trained and tested on a well-labeled chest CT dataset drawn from a subset of the NLST dataset. We assessed model performance on the lung cancer risk prediction task, compared the models using different metrics, and evaluated their efficiency. Finally, we analyzed prediction performance when fine-tuning from different large-scale pretrained models. This enables us to identify suitable deep learning methods that can be used to develop clinically relevant, robust CT image-based models for future studies.

Materials and methods

Description of cohorts

The National Lung Screening Trial (NLST)4 is a comprehensive, multi-site study that evaluated the effectiveness of low-dose computed tomography (LDCT) versus chest X-ray (CXR) in screening individuals at high risk for lung cancer. Participants included current or former smokers aged 55–74 who had smoked at least 30 pack-years. Former smokers were eligible if they had quit within the preceding 15 years. Previous studies have detailed the study design and main findings of the NLST. We accessed radiological and clinical data from the LDCT arm through the National Cancer Institute (NCI) Cancer Image Archive27. Following the case-control analysis of Cherezov et al.28, we selected a total of 253 participants who had non-contrast CT scans: 150 of the 255 patients from the first follow-up screen (T1) and 103 of the 212 patients from the second follow-up screen (T2). Participants from T1 were assigned to the training cohort (N = 150), and those from T2 were assigned to the test cohort (N = 103). The nodule annotations were obtained from the Cherezov et al.28 study, in which they were manually delineated by a radiologist. The dataset construction consort diagram of lung cancers and nodule-positive controls is shown in Fig. 1.

Fig. 1

Dataset construction consort diagram.

Deep learning model candidates

We implemented both 2D and 3D deep learning models to leverage the slice-wise features and volumetric spatial information present in chest CT scans, respectively. Table 1 shows the details of the implemented architectures, which can be further categorized into convolutional neural networks and vision transformers.

Convolutional neural networks

The 3D deep learning candidates from the CNN category were selected from the popular ResNet family, consisting of ResNet1829,30,31, ResNet5029,30,31, ResNet10129,30,31, ResNeXt10129,30,31, and R2Plus1D30, which represent a significant advancement in 3D deep learning architectures. These models leverage residual connections, which enable the training of deeper networks by alleviating the vanishing gradient problem29. The diverse depth configurations (18, 50 and 101 layers) of the ResNet models provide a trade-off between accuracy and computational complexity, making them well suited to various 3D computer vision tasks. ResNeXt extends ResNet by introducing a new hyperparameter that represents the number of parallel paths in each residual block. We also investigated the R2Plus1D model, which extends the residual learning concept to video data by decomposing 3D convolutions into separate spatial and temporal components to achieve better video representations for tasks such as action recognition and video understanding30. Evaluating the ResNet family models enables us to gain a deeper understanding of how their depths and network designs impact performance in lung cancer risk prediction.
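To make the residual-connection idea concrete, the following is a minimal PyTorch sketch of a 3D basic residual block. It is an illustrative simplification; the channel configuration and block layout are ours, not the exact blocks of the ResNet variants evaluated in this study.

```python
import torch
import torch.nn as nn

class BasicResidualBlock3D(nn.Module):
    """Minimal 3D residual block: output = ReLU(F(x) + x).

    The identity shortcut lets gradients bypass the convolutions,
    which alleviates the vanishing-gradient problem in deep networks.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)  # skip connection

# Example: one 64x64x64 nodule volume with 32 feature channels
x = torch.randn(1, 32, 64, 64, 64)
print(BasicResidualBlock3D(32)(x).shape)  # torch.Size([1, 32, 64, 64, 64])
```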

In addition, the 3D candidates from the CNN category include efficient architectures: ShuffleNetv131,32, ShuffleNetv231,33, SqueezeNet22,31, MobileNetv123,31 and MobileNetv224,31. ShuffleNetv1 reduces computational demands by shuffling channels and using grouped convolutions while maintaining high performance. ShuffleNetv2 further improves efficiency by introducing new channel-splitting and channel-shuffling operations. SqueezeNet takes a different approach, squeezing and expanding features to achieve high accuracy with a lighter network structure. MobileNetV1 leverages depthwise separable convolutions to significantly reduce the number of model parameters. Its successor, MobileNetV2, introduces inverted residual blocks and linear bottlenecks, which allow the network to learn more expressive features while maintaining a low parameter count. Evaluating these efficient architectures allows us to better understand how to balance efficiency and performance in lung cancer risk prediction.
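The parameter savings of depthwise separable convolutions can be illustrated with a short PyTorch sketch; the layer below is a hypothetical 3D analogue of the MobileNetV1 building block, not the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3D(nn.Module):
    """Depthwise separable 3D convolution (illustrative MobileNet-style block).

    A depthwise convolution filters each channel independently (groups=in_channels),
    then a 1x1x1 pointwise convolution mixes channels. This factorization uses far
    fewer parameters than a standard dense 3D convolution.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.depthwise = nn.Conv3d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3x3 convolution (32 -> 64 channels)
standard = nn.Conv3d(32, 64, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv3D(32, 64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 55296 vs. 2912 parameters
```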

The 2D deep learning candidates in this category include ResNet5029, DenseNet12134, and InceptionV320, which have been widely applied to 2D image classification and have demonstrated strong performance in other computer vision tasks. CNNs effectively learn hierarchical features from input images, which makes them suitable for medical imaging tasks. ResNet50 introduces residual connections that make training deep networks efficient. DenseNet121 employs dense connectivity to improve feature reuse within the model. InceptionV3 utilizes inception modules with different kernel sizes to capture multi-scale features. These CNN-based models serve as strong baselines and provide a comprehensive comparison with the other competitors in lung cancer risk prediction.

Vision transformers (ViT): As an emerging family of deep learning architectures with promising performance in many computer vision tasks, vision transformers leverage the self-attention mechanism to capture global dependencies in input images, making them an alternative to traditional CNNs. In this study, we investigated several 2D deep learning candidates from the ViT category: ViT-Large21,35, BEiTv221,36, BEiTv137, CAFormer-B3638, ConvFormer-B3638, DeiT3-Large39, Swin-Large40 and VOLO-D441. ViT-Large is the original Vision Transformer design, which uses self-attention to capture global patterns in images. BEiTv2 improves on self-supervised pre-training by using masked semantic modeling to learn better visual features, whereas BEiTv1 uses only masked image modeling for pre-training. CAFormer-B36 combines convolution and self-attention to enhance both local and global feature learning. ConvFormer-B36 integrates convolution into the transformer to capture local spatial details. DeiT3-Large builds on Data-efficient Image Transformers (DeiT) with design enhancements and improved training approaches. Swin-Large utilizes a hierarchical transformer with shifted-window attention for efficient local and global context modeling. VOLO-D4 presents a vision outlooker that learns detailed visual representations by attending at both the token and patch levels. Evaluating these emerging ViT structures enhances our understanding of how modern deep learning architectures perform in the context of lung cancer risk prediction.
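As a minimal illustration of the self-attention operation at the core of these ViT variants, the sketch below applies PyTorch's built-in multi-head attention to a hypothetical sequence of patch embeddings from a single 224 × 224 slice; the dimensions are illustrative, not those of any specific model above.

```python
import torch
import torch.nn as nn

# A 224x224 slice split into 16x16 patches gives 14*14 = 196 tokens.
# Self-attention lets every token attend to every other token, capturing
# global context in one layer (unlike the local receptive field of a convolution).
embed_dim, num_tokens = 768, 196
patch_tokens = torch.randn(1, num_tokens, embed_dim)          # (batch, tokens, dim)

attention = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)
out, weights = attention(patch_tokens, patch_tokens, patch_tokens)
print(out.shape, weights.shape)  # (1, 196, 768) and (1, 196, 196) attention map
```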

Pre-trained models

All of the described models achieve top performance on various benchmarks, such as image classification on ImageNet42 and video classification on Kinetics43. Each candidate model is pretrained on a large-scale general-purpose dataset: Kinetics for 3D models and ImageNet for 2D models. To investigate the impact of pretraining on a radiological dataset, we also leveraged pretrained models from 3DSeg-844 and nnUnet10,44 for 3D models and from RadImageNet45 for 2D models. Note that each pretrained model was pretrained on only one dataset. We briefly introduce these datasets below; a sketch of how such pretrained backbones can be loaded follows the list:

  • ImageNet is a large-scale visual database designed for visual object recognition tasks. It contains over 14 million images organized into more than 20,000 categories, making it a crucial resource for training and benchmarking 2D computer vision models.

  • RadImageNet is a medical imaging dataset specifically curated for radiology applications, consisting of over 1.3 million radiologic images across various modalities and anatomical regions. It aims to improve the performance and generalizability of AI models in medical imaging tasks.

  • Kinetics is a large-scale, high-quality dataset of YouTube videos for human action recognition. It contains over 650,000 video clips spanning 700 action classes, each lasting around 10 seconds, making it a valuable resource for training and evaluating deep learning models in video understanding tasks. Note that the model weights were pretrained by Hara et al.30.

  • 3DSeg-8 is a comprehensive collection of existing 3D medical image datasets specifically curated for segmentation tasks. It covers eight different anatomical structures or regions: brain, hippocampus, prostate, liver, heart, pancreas, vessel, and spleen.

  • nnU-Net’s training set includes a variety of medical imaging datasets such as the Medical Segmentation Decathlon, which comprises ten different medical image segmentation tasks across various organs and modalities. Note that the model weights were pretrained by Med3D31.
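Below is a hedged sketch of how such pretrained backbones could be loaded and adapted for binary malignancy prediction. torchvision's Kinetics-400-pretrained r3d_18 is used here as a convenient stand-in for the Hara et al.30 weights actually used in the study, and the timm model name should be verified against the installed version.

```python
import torch.nn as nn
import timm
from torchvision.models.video import r3d_18, R3D_18_Weights

# 3D backbone: torchvision ships a Kinetics-400-pretrained 3D ResNet18
# (stand-in for the Hara et al. weights used in the study).
model_3d = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model_3d.fc = nn.Linear(model_3d.fc.in_features, 2)   # benign vs. malignant head
# Note: the stem convolution expects 3-channel input; grayscale CT volumes
# would need channel replication or stem adaptation.

# 2D backbone: ImageNet-pretrained DeiT3-Large via PyTorch Image Models (timm);
# in_chans=1 adapts the patch embedding to single-channel CT slices.
model_2d = timm.create_model("deit3_large_patch16_224", pretrained=True,
                             num_classes=2, in_chans=1)
```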

Table 1 Summary of the deep learning models implemented in this study. Params is the abbreviation for the number of parameters.

Lung cancer risk prediction workflow

The workflow for lung cancer risk prediction using 2D and 3D deep learning models is presented in Fig. 2. The workflow can be divided into three parts. During data preprocessing, the nodule region-of-interest (ROI) slices are extracted from the chest CT scans according to the maximum bounding rectangle of the annotations. For 3D models, the ROI slices are stacked into a 3D volume that serves as the model input, and the model predicts the probability of malignancy for each nodule ROI volume. For 2D models, the 3D volume is decomposed into 2D slices; the model predicts a score for each slice, and the final score is the average of these per-slice scores. To implement and validate the deep learning models effectively, all models were developed on the training cohort using a nested three-fold cross-validation design. After development, all models were trained on the full training cohort, and model performance was reported on the test cohort. Cross-validation results were obtained by training on two folds of the training set and evaluating on the remaining fold; test-set results were obtained by training on the entire training set and evaluating on the test set.
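The two inference paths can be summarized in a short sketch. The function below is illustrative (names and tensor shapes are ours): 3D models score a stacked ROI volume directly, while 2D models average per-slice malignancy probabilities.

```python
import torch
import torch.nn.functional as F

def predict_malignancy(roi_slices: torch.Tensor, model: torch.nn.Module, mode: str) -> float:
    """Return a malignancy probability for one nodule ROI.

    roi_slices: tensor of shape (num_slices, H, W), one grayscale slice per row.
    mode: "3d" stacks the slices into a single volume; "2d" scores each slice
    independently and averages the per-slice probabilities.
    """
    model.eval()
    with torch.no_grad():
        if mode == "3d":
            volume = roi_slices.unsqueeze(0).unsqueeze(0)      # (1, 1, D, H, W)
            logits = model(volume)                             # (1, 2)
            return F.softmax(logits, dim=1)[0, 1].item()
        else:
            slices = roi_slices.unsqueeze(1)                   # (D, 1, H, W): batch of slices
            logits = model(slices)                             # (D, 2)
            probs = F.softmax(logits, dim=1)[:, 1]             # per-slice malignancy probability
            return probs.mean().item()                         # final score = slice average
```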

Evaluation metrics

Fig. 2

Workflow of 2D and 3D prediction models for lung cancer risk assessment using chest CT scans. The workflow is divided into three main parts: data preprocessing (orange dashed rectangle), 3D model workflow (blue dashed rectangle), and 2D model workflow (green dashed rectangle).

In the experiments, we report model performance in malignancy classification using three evaluation metrics: accuracy, F1 score, and area under the receiver operating characteristic curve (AUROC). For the three-fold cross-validation, we report the average performance across the three folds, using the results from the epoch with the best AUROC score in each fold. When reporting test performance, we use the results from the epoch with the highest AUROC score.
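A minimal sketch of how these metrics can be computed with scikit-learn is shown below; the 0.5 decision threshold for accuracy and F1 is an assumption, as the threshold is not stated in the text.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Accuracy, F1, and AUROC for predicted malignancy probabilities.

    A 0.5 decision threshold is assumed for accuracy/F1; cross-validation
    performance is then averaged across the three folds.
    """
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_prob),
    }

# Toy usage on synthetic labels/probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=100), 0, 1)
print(evaluate(y_true, y_prob))
```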

Implementation details

All 3D models were trained for 120 epochs with a learning rate of 1e-4. We used the Adam optimizer with parameters \(\beta_1 = 0.9\) and \(\beta_2 = 0.99\). The batch size used for training the 3D models was 4, and all 3D volumes were resized to \(64 \times 64 \times 64\). For the 2D models, ResNet50, InceptionV3, and DenseNet121 follow the same hyperparameters as the 3D models, while the remaining 2D models follow the fine-tuning parameters provided by PyTorch Image Models46. All 2D slices were resized to \(224 \times 224\). Since 2D models only accept 2D slices, we set the batch size to the number of slices in the input 3D volume: each 3D volume is preprocessed into a batch of 2D slices before being fed into the model. We applied a simple data augmentation strategy consisting of random 90-degree rotation and random flipping. All experiments were run on the HPC of Université Laval using either an Nvidia Tesla V100 16GB GPU (for 3D model training) or an Nvidia Tesla A100 80GB GPU (for 2D model training).
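A hedged sketch of the 3D training configuration described above is given below. The loss function (cross-entropy) and the `model`/`train_loader` objects are assumptions introduced for illustration; only the optimizer settings, batch and input sizes, epoch count, and augmentations come from the text.

```python
import random
import torch
import torch.nn.functional as F
from torch.optim import Adam

def augment(volume: torch.Tensor) -> torch.Tensor:
    """Random 90-degree rotation and random flipping of a (C, D, H, W) volume."""
    k = random.randint(0, 3)
    volume = torch.rot90(volume, k, dims=(2, 3))        # rotate in the in-plane (H, W) axes
    if random.random() < 0.5:
        volume = torch.flip(volume, dims=(3,))          # random horizontal flip
    return volume

# `model` and `train_loader` (yielding 64x64x64 volumes in batches of 4)
# are assumed to exist; cross-entropy is an assumed choice of loss.
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))

for epoch in range(120):
    model.train()
    for volumes, labels in train_loader:                # volumes: (4, 1, 64, 64, 64)
        volumes = torch.stack([augment(v) for v in volumes])
        logits = model(volumes)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```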

Results

Patient characteristics

The study population characteristics of the training and test cohorts are summarized in Table 2. None of the characteristics differed significantly between the training and test cohorts. The cohorts share similar average ages, with the training cohort slightly older on average (63.80 vs. 63.24 years). Regarding sex distribution, the majority of patients are male, with a slightly higher percentage in the training cohort (57.33%) than in the test cohort (54.37%). In both cohorts the majority are current smokers, with a higher percentage in the test cohort (58.25%) than in the training cohort (52.00%). As for nodule location, the training cohort has a slightly higher percentage of right-lobe nodules (58.00% vs. 53.40%) and a lower percentage of left-lobe nodules (42.00% vs. 45.63%). The distribution of nodule sizes (< 6 mm, 6–16 mm, ≥ 16 mm) is similar between the two cohorts, with comparable percentages in each size range. The distribution of cancer stages is also similar between the two cohorts, with small differences across stages.
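The specific statistical tests behind this cohort comparison are not stated in the text; the sketch below assumes a two-sample t-test for continuous variables and a chi-square test for categorical variables, with placeholder data generated from the reported summary statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical example: compare age (continuous) and sex (categorical) between cohorts.
# Placeholder samples are simulated from the reported means; the actual per-patient
# values and the tests used in the study may differ.
age_train = np.random.normal(63.80, 5.0, 150)
age_test = np.random.normal(63.24, 5.0, 103)
t_stat, p_age = stats.ttest_ind(age_train, age_test)

#                        male  female
contingency = np.array([[86, 64],      # training cohort (57.33% male of 150)
                        [56, 47]])     # test cohort (54.37% male of 103)
chi2, p_sex, dof, _ = stats.chi2_contingency(contingency)
print(f"age p-value: {p_age:.3f}, sex p-value: {p_sex:.3f}")
```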

Table 2 Study population characteristics of training and test cohorts. Continuous data were reported as mean ± standard deviation (SD), and categorical data as counts and percentages.

Lung cancer risk prediction performance

In this sub-section, we present the comparative study of 3D and 2D deep learning models. Supplementary Tables 1 and 2 summarize the performance of the 3D and 2D models, respectively, on the cross-validation set and the test set. In Supplementary Table 1, ShuffleNetv2 edged out the other 3D models on all three metrics (accuracy = 0.83, F1 score = 0.72, AUROC = 0.86) in the cross-validation experiments. On the test set, ResNet50 achieved the best accuracy (0.81), while the MobileNet family demonstrated superior performance on both the F1 score (0.70) and AUROC (0.86) metrics. In all cases, the best 3D models were pretrained on the Kinetics dataset. From Supplementary Table 2, Swin-Large obtained the best cross-validation accuracy (0.81), and VOLO-D4 performed best in terms of F1 score (0.74) and AUROC (0.85). On the test set, DeiT3 marginally exceeded the other 2D models across all evaluation metrics (accuracy = 0.77, F1 score = 0.64, AUROC = 0.79). The best 2D models were pretrained on the ImageNet dataset.

3D vs. 2D models

Fig. 3

The best models and performance statistics for 3D and 2D models. The best models' performance is summarized in Fig. 3A and B for the cross-validation set and the test set, respectively. The performance statistics are summarized in Fig. 3C and D for the cross-validation set and the test set, respectively.

To provide a clear picture of the overall performance of 3D and 2D models, we summarize the best models and the overall performance statistics in Fig. 3. The best models were selected according to their AUROC and were pretrained on the Kinetics dataset for 3D models and on the ImageNet dataset for 2D models. The best performance on the cross-validation set (Fig. 3A) was achieved by 3D ShuffleNetv2 (accuracy = 0.83, F1 score = 0.72, AUROC = 0.86) and 2D VOLO-D4 (accuracy = 0.80, F1 score = 0.74, AUROC = 0.85). The best performance on the test set (Fig. 3B) was achieved by 3D MobileNetv1 (accuracy = 0.79, F1 score = 0.67, AUROC = 0.86) and 2D DeiT3-Large (accuracy = 0.77, F1 score = 0.64, AUROC = 0.79). The performance statistics on the cross-validation set and the test set (Fig. 3C and D) show that 3D models generally perform better than 2D models, with higher metrics and greater stability.

Fig. 4

AUROC curves of 3D and 2D models on the test set. The curves in cold tones represent the AUROC performance of 3D models, while the curves in warm tones depict the AUROC performance of 2D models. The models in the legend are listed in descending order of their AUROC scores.

Pretrained on general-purpose vs. radiological datasets

In Fig. 4, we present the AUROC curves for the 3D and 2D models pretrained on the Kinetics and ImageNet datasets, respectively, and evaluated on the test set. Overall, the 3D models performed modestly better than the 2D models. The highest and lowest performances among the 3D models were obtained by MobileNetv1 (AUROC = 0.86) and R2Plus1D (AUROC = 0.78), respectively, while the highest and lowest performances among the 2D models were exhibited by DeiT3-Large (AUROC = 0.79) and InceptionV3 (AUROC = 0.62), respectively.

Table 3 summarizes the prediction performance of three 3D deep learning models pretrained on various datasets. For ResNet18, the model pretrained on nnUnet achieved the best accuracy (0.79) and F1 score (0.67) on the training set, as well as the best F1 score (0.65) and AUROC (0.84) on the test set. The Kinetics-pretrained ResNet18 obtained the best AUROC (0.83) on the training set and the best accuracy (0.77) on the test set. For ResNet50, the best performance on the training set was obtained by the model pretrained on nnUnet (accuracy = 0.78, F1 score = 0.66, AUROC = 0.82), while the best performance on the test set was obtained by the Kinetics-pretrained model (accuracy = 0.81, F1 score = 0.69, AUROC = 0.82). For ResNet101, the Kinetics-pretrained model performed better on 5 of 6 metrics across the training and test sets (training set: accuracy = 0.73, AUROC = 0.80; test set: accuracy = 0.79, F1 score = 0.67, AUROC = 0.82), while the model pretrained on 3DSeg-8 achieved the best F1 score on the training set (0.64).

Table 3 Performance comparison for different pretrained 3D deep learning models. The best evaluation score for each model is marked in bold.

Table 4 presents the prediction performance of three 2D deep learning models pretrained on the ImageNet and RadImageNet datasets. For ResNet50, the ImageNet-pretrained model had the best performance on both the training and test sets (training set: accuracy = 0.59, F1 score = 0.59, AUROC = 0.78; test set: accuracy = 0.74, F1 score = 0.56, AUROC = 0.77). For DenseNet121, the ImageNet-pretrained model exhibited the best performance on the training set (accuracy = 0.57, F1 score = 0.53, AUROC = 0.66) and the best accuracy (0.62) on the test set, whereas the best test-set F1 score (0.47) and AUROC (0.72) were achieved by the RadImageNet-pretrained model. For InceptionV3, the best metrics on both the training and test sets were obtained by the RadImageNet-pretrained model (training set: accuracy = 0.74, F1 score = 0.57, AUROC = 0.77; test set: accuracy = 0.61, F1 score = 0.56, AUROC = 0.76).

Table 4 Performance comparison for different pretrained 2D deep learning models. The best evaluation score for each model is marked in bold.

Comparison among models in different scales

Fig. 5

Efficiency of the 3D/2D prediction models. The models are evaluated on two metrics: F1 score on the x-axis and AUROC on the y-axis. The size of each dot represents the approximate number of parameters in millions (M); larger dots indicate models with more parameters. The models in the legend are listed in descending order of parameter count. Dots in cold tones represent 3D models, and dots in warm tones represent 2D models. The index of each model is shown at the center of its dot.

Figure 5 illustrates model efficiency by considering the trade-off between performance (F1 score and AUROC) and model scale (number of parameters). Models plotted toward the upper right corner indicate better performance, while those toward the lower left corner indicate lower performance. 2D models with ViT structures generally have larger scales than the 3D models yet are slightly outperformed by their 3D competitors. BEiTv1 and BEiTv2 are the largest models (303 million parameters) but achieved only mid-range performance. In contrast, SqueezeNet is the lightest model (1 million parameters) yet delivers a top-level F1 score and AUROC.
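Parameter counts like those behind Fig. 5 can be reproduced for the 2D candidates with a short timm sketch; the model names below follow recent timm conventions and should be verified against the installed version (they are illustrative, not a verbatim list of the study's configurations).

```python
import timm

# Approximate parameter counts for some of the 2D candidates (cf. Table 1).
for name in ["resnet50", "densenet121", "deit3_large_patch16_224",
             "beit_large_patch16_224", "volo_d4_224"]:
    model = timm.create_model(name, pretrained=False, num_classes=2)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```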

Discussion

Deep learning approaches have shown superior capability in various medical image analysis tasks, such as lesion segmentation47, cancer detection48 and survival analysis49. Although previous work50 on deep learning-based lung cancer risk prediction suggests that this emerging technology has promising potential in both diagnostic accuracy and efficiency, it has primarily focused on model design for specific datasets and has rarely taken advantage of currently successful deep learning models from the computer vision domain. With this premise, we conducted a comprehensive evaluation and analysis of state-of-the-art deep learning models originally designed for general-purpose computer vision in the context of lung cancer risk prediction. We analyzed both 2D and 3D model designs, the impact of pretraining on general-purpose datasets versus radiological datasets, and model efficiency. Our results highlight the significance of model selection and pretraining dataset choice in achieving optimal performance. By addressing this gap in the current literature, our findings offer guidance for selecting appropriate model architectures and pretraining datasets, facilitating the development of more accurate and efficient lung cancer risk prediction tools.

A prior study51 reported closely matched preferences for 2D and 3D model architectures in deep learning-based lung cancer diagnosis. Interestingly, our results suggest that 3D models generally perform better than their 2D competitors from several perspectives. According to the performance on the cross-validation set (Fig. 3A and C), 3D models demonstrate stronger stability and higher resistance to potential bias and overfitting, despite the small gap in F1 score between the best 3D and 2D models. Furthermore, Fig. 3B and D indicate an overall leading performance of 3D models on the test set, which suggests that 3D models generalize better and have stronger overall capability. Since our cohorts are imbalanced (81 malignant and 172 benign cases), the AUROC metric is particularly informative because of its sensitivity to the true positive rate (correctly identifying malignant cases) and false positive rate (incorrectly identifying benign cases as malignant), and because it evaluates performance across thresholds. 3D models generally show higher AUROC scores, which indicates that they handle imbalanced datasets better and are less likely to produce biased results. The superior performance of 3D models highlights the importance of capturing the spatial context and continuity present in volumetric data, which leads to a better understanding of the relationships between adjacent structures and the ability to differentiate tissues that may appear similar in 2D slices but have different 3D configurations. Nevertheless, 3D models may not outperform 2D models in other tasks. For instance, Kakigi et al.52 compared thin-slice 2D fat-saturated proton density-weighted images with deep learning-based reconstruction (dDLR) to 3D fat-saturated proton density multi-planar voxel images for shoulder joint MRI and found that the 2D approach with dDLR provides superior image quality and anatomical visualization. Therefore, we have only verified that 3D models perform better in lung cancer risk prediction using CT scans; their performance in other medical tasks remains to be investigated.

The choice of pretraining dataset is a vital factor that can heavily impact prediction performance. Models pretrained on datasets from specific medical domains can improve performance in corresponding downstream diagnostic tasks44,45,50. However, in our study we found that general-purpose video/image datasets (Kinetics, ImageNet) provide a good foundation for transfer learning across different models and tasks. The specialized radiological datasets (3DSeg-8, nnUnet, and RadImageNet) show mixed results, improving performance in some cases (e.g., 3D ResNet18 and 2D InceptionV3) while decreasing it in others. The varying impact of pretraining datasets across architectures indicates that certain models are more sensitive to the domain-specific features of the pretraining dataset. Radiological datasets whose scan regions differ from the lung can negatively impact transfer learning and generalization. Although the general-purpose datasets were not originally designed for lung cancer prediction, they are still capable of providing sufficient generalization by helping models learn a broad range of features applicable to various medical tasks.

In addition to model design, model scale (i.e., the number of parameters) is another factor that can affect deep learning-based lung cancer risk prediction models. Our results align with previous findings31,46 showing that model scale is not strongly correlated with model performance. Figure 5 illustrates that 3D MobileNetV1 (12 million parameters), 3D ShuffleNetV2 (3 million parameters), and 3D SqueezeNet (1 million parameters) demonstrate higher model efficiency by achieving higher AUROC with fewer parameters. In contrast, the largest models, such as 2D BEiTv1 and BEiTv2 (> 300 M parameters), do not necessarily yield the best overall performance. This indicates potential diminishing returns at high parameter counts and highlights that model performance does not depend solely on scale. Our results also suggest that effective model design enables lightweight models to edge out heavier competitors. However, Fig. 5 also reveals that extremely light models (< 10 million parameters) may struggle to match the performance of medium- to large-sized models, indicating a potential lower bound on model size for the lung cancer risk prediction task.

The strengths of this study include a comprehensive analysis and evaluation of SOTA deep learning models that are representative and already widely used in various vision tasks. The evaluation covers multiple aspects, from prediction performance and pretraining datasets to model efficiency. Our findings can guide a more effective model selection process for lung cancer risk prediction. However, our study has three important limitations. First, the cohorts are relatively small, making it difficult to reliably evaluate model generalization in a small patient population. Second, the data distribution of the existing cohorts is imbalanced, so potential bias and overfitting are unavoidable during model training. Third, as this study was conducted on patients with a smoking history, lung cancer risk prediction performance may vary when patient characteristics change. To mitigate the impact of these limitations, further investigation in a larger, more diverse cross-institutional cohort is necessary. In conclusion, deep learning models may serve as an effective method for lung cancer risk prediction using CT scans. 3D deep learning architectures are more suitable for this task due to their outstanding prediction performance and better trade-off between model efficiency and performance. Further investigation with a larger, cross-institutional cohort is necessary to confirm these findings.