Abstract
Clinical adoption of digital pathology-based artificial intelligence models for diagnosing lung cancer has been limited, partly due to a lack of robust external validation. This review provides an overview of such tools, their performance and their external validation. We systematically searched for external validation studies in medical, engineering and grey literature databases from 1st January 2010 to 31st October 2024. 22 studies were included. Models performed various tasks, including classification of malignant versus non-malignant tissue, tumour growth pattern classification and subtyping of adeno- versus squamous cell carcinomas. Subtyping models were the most common and performed well, with average AUC values ranging from 0.746 to 0.999. Although most studies used restricted datasets, methodological issues relevant to the applicability of models in real-world settings remained, including small and/or non-representative datasets, retrospective designs and case-control studies without further real-world validation. Ultimately, more rigorous external validation of models is warranted for increased clinical adoption.
Introduction
Digital pathology refers to the analysis, management and sharing of pathology-related data within a digital environment1. The advent of digital pathology has driven the development of numerous artificial intelligence (AI) models for application on digital pathology images to aid cancer diagnosis. Such AI tools are being developed at an increasing rate every year. Lung cancer is the leading cause of cancer-related death in the UK, accounting for approximately 35,000 deaths annually2. This high mortality rate is largely a result of late-stage diagnosis. Notably, the five-year survival rate for individuals diagnosed with lung cancer at stage 1 is 65%; however, this decreases considerably to 5% for individuals diagnosed at stage 43. The implementation of national targeted lung cancer screening programmes in the UK and other high-income countries worldwide may improve patient outcomes4. Nevertheless, increased screening is likely to result in increased referrals to pathology services, and to place substantial strain on an already-burdened workforce5. AI could potentially address these workforce bottlenecks6.
The application of AI models to digitised whole slide images (WSIs) is revolutionising cancer diagnosis. Pathologists face considerable pressure from rising workloads and a need to analyse increasingly complex and vast datasets. By automating certain tasks, AI models could complement pathologists in their clinical workflows and offer scalable diagnostic support4. Importantly, AI can rapidly analyse vast datasets and may recognise patterns that are not easily discernible to the human eye7. This ability is especially pertinent in lung cancer, where early diagnosis can lead to a substantial improvement in patient outcomes8. An emerging trend in the field of AI is the development of foundation models: large-scale models trained on vast datasets that act as a foundation for a diverse range of downstream tasks9. Notably, several histopathology-based AI models have already been approved by the FDA, including Paige Prostate for facilitating the diagnosis of prostate cancer10.
Despite their potential, the clinical adoption of cancer diagnostic AI pathology tools has been extremely limited to date. This is largely attributable to a lack of robust external validation of models prior to deployment, and concerns regarding the generalisability of models to real-world clinical settings11. External validation refers to the evaluation of model performance using data from a source separate from that used to train and test the model11. A major challenge to widespread clinical adoption of these models is validating them on diverse, real-world datasets11. While AI models may perform well on internal datasets, their performance may drop considerably on external datasets that reflect the variability encountered in clinical practice. Robust external validation is important for assessing the generalisability of a model to different patient populations and is a critical step before AI models can be trusted and integrated into clinical workflows12.
Current literature on external validation is sparse and existing systematic reviews evaluating validation studies for pathology-based lung cancer diagnostic algorithms focus primarily on AI techniques and validation on internal datasets13,14,15. A review of external validation studies for these AI tools, with a focus on methodological robustness, is yet to be conducted. Our review provides an overview of models used to facilitate lung cancer diagnosis from digital pathology images and explores the current state of external validation of these models. The primary objective of this review was to assess the methodological robustness of validation studies and report model performance where possible, focusing exclusively on external, independent datasets. We chose to use a systematic scoping review approach to map the available evidence, identify gaps in the evidence, and critically appraise included studies.
Results
Database searches resulted in 4423 studies, with 440 additional studies identified through other sources, including Google Scholar and a snowballing approach (Fig. 1). After duplicates were removed, we screened 3851 titles and abstracts, and reviewed 414 full-text articles. Overall, 22 studies met the inclusion criteria, including 20 publications and two preprints (Table 1). During the screening process, we identified 239 papers describing the development and validation of pathology lung cancer detection models; notably, only around 10% of these papers described the external validation of models.
AI models and tasks
Figure 2 presents the characteristics of the included studies. 18 models facilitated the diagnosis of non-small cell lung cancer (NSCLC), focusing primarily on lung adenocarcinoma (LUAD) and/or lung squamous cell carcinoma (LUSC). Three models detected small cell lung cancer (SCLC) in addition to NSCLC. We identified one foundation model, named Virchow, which was trained for pan-cancer detection using approximately 1.5 million WSIs covering 17 tissue types16. Virchow was trained and validated on lung cancer tissue; however, the lung cancer subtypes the model was intended to detect were not specified16.
Models performed various tasks along the diagnostic pathway, most commonly subtyping (n = 16)17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32. 13 subtyping models distinguished LUAD from LUSC, whereas three models distinguished LUAD, LUSC and SCLC. Other tasks performed by AI models included classification of malignant versus non-malignant tissue (n = 14)17,19,20,21,22,26,29,30,31,32,33,34,35,36, tumour growth pattern classification (n = 2)33,36, biomarker identification (n = 2)16,19, prediction of tumour cellularity (n = 1)35, and classification of cell types (n = 1)37. We identified 14 multi-tasking models16,17,19,20,21,22,26,29,30,31,32,33,34,35,36, the majority of which (n = 10) combined subtyping and classification of malignant versus non-malignant tissue17,19,20,21,22,26,29,30,31,32.
Information regarding the intended role of the model within the diagnostic pathway, the intended clinical setting and the intended country of deployment was limited. Three authors provided details on the intended clinical setting of their models, such as country of deployment and whether the target population would be asymptomatic or symptomatic18,30,35. One author reported that their model could act as a triage tool in clinical practice16, whereas authors of 12 studies reported that their AI model was developed to aid the clinician without providing any further details17,18,19,20,21,22,23,30,32,33,35,36,37. One author reported that their study was for research purposes only and that the model was not developed specifically as a clinical tool26.
Study type
16 out of 22 studies were retrospective16,20,21,22,23,24,25,26,28,29,30,31,32,33,34,37, with retrospective case-control studies being the most commonly used study design (n = 10)16,20,21,22,26,29,30,31,33,34. We identified one prospective case-control study36, but could not identify any completed prospective cohort studies or randomised controlled trials. For five studies, it was unclear whether data were collected retrospectively or prospectively17,18,19,27,35.
Datasets
Histopathological datasets used for external validation were heterogeneous in size, with studies using from as few as 20 to as many as 2115 samples (see Table 1). Around half of the studies (n = 9) used datasets consisting of between 100 and 500 images17,19,20,24,27,28,30,35,37. As with dataset size, the number and source of datasets varied considerably. Seven studies were single-centre studies17,19,20,23,24,27,37, whereas six studies used images from multiple centres, ranging from two to four centres18,21,22,29,34,35. While most studies used images from restricted datasets from secondary care hospitals and tertiary centres16,17,18,19,20,23,24,27,28,35,36,37, three studies used a combination of both public and restricted datasets22,29,34.
Technical diversity within datasets
Over half of the studies (n = 12) used techniques to address potential variations in images that may arise due to differences in equipment or tissue processing protocols across centres. Nine studies reported using either one or a combination of the following to increase technical diversity in datasets: WSIs created with different whole slide scanners, various magnifications, slides preserved using different methods (e.g. FFPE or frozen), different tissue samples (e.g. biopsies or resections), slides prepared with various stains, and slides containing artefacts (e.g. bubbles and scratches)17,18,20,22,24,25,33,34,35. Among the 13 studies where the use of these methods was unclear, two studies simulated technical diversity through data augmentation techniques such as rotation, flipping, and varying brightness, saturation, contrast, and hue21,37. Conversely, three studies used stain normalisation to minimise variability between images18,23,27.
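As an illustration of the data augmentation approach reported by two of the included studies, the sketch below shows how colour and geometric augmentations might be applied to histopathology patches. It assumes a PyTorch/torchvision pipeline with illustrative parameter values; it is not code taken from any included study.

```python
# Minimal sketch of augmentations that simulate scanner and stain variability in
# histopathology patches; assumes torchvision is available and that `patch` is a
# PIL image of a tissue tile. Parameter values are illustrative assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # tissue has no canonical orientation
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),    # arbitrary slide placement on the scanner
    transforms.ColorJitter(
        brightness=0.2,                       # scanner illumination differences
        contrast=0.2,
        saturation=0.2,
        hue=0.05,                             # stain colour variation across laboratories
    ),
    transforms.ToTensor(),                    # convert to a tensor for model input
])

# Usage: augmented_patch = augment(patch)
```

Such augmentation only approximates inter-centre variability; it does not replace validation on slides prepared and scanned at genuinely different centres.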
Quality assessment
High or unclear risk of bias was observed for all studies in at least one of the five assessed QUADAS-AI-P domains. As depicted in Fig. 3, high risk of bias was noted for 14% of studies in the ‘Reference standard’ domain, 50% of studies in the ‘Image selection’ domain, and 86% of studies in the ‘Participant selection/study design’ domain. On the other hand, low risk of bias was noted for 18% of studies in the ‘Image selection’ domain, 23% of studies in the ’Reference standard’ domain, and 32% of studies in both the ‘Flow and timing’ and ‘Index test’ domains. Due to inadequate reporting, the risk of bias was unclear for most studies in the ‘Flow and timing’, ‘Index test’, and ‘Reference standard’ domains. Additionally, concerns regarding applicability were high for one study in the ‘Target condition’ domain, low for 95% of studies in the ‘Index test’ domain and unclear for 82% of studies in the ‘Participant selection’ domain due to insufficient reporting of participant characteristics (Fig. 3). The risk of bias ratings and concerns regarding applicability are shown for each individual study in Supplementary material 3. See supplementary material 4 for a full list of methodological concerns.
a Results of the risk of bias assessment. For each QUADAS-AI-P domain, the blue, orange and grey sections of the bar indicate the percentage of studies judged to be at low, high or unclear risk of bias, respectively. b Results of the concerns regarding applicability assessment. For each QUADAS-AI-P domain, the blue, orange and grey sections of the bar indicate the percentage of studies considered to have low, high or unclear concerns regarding applicability. QUADAS-AI-P: QUality Assessment tool of Diagnostic Accuracy Studies tailored to Artificial Intelligence and digital Pathology.
Diagnostic performance and evaluation metrics used
The area under the receiver operating characteristic curve (AUC) was reported in 17 out of 22 studies17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,34, making it the most commonly reported evaluation metric overall. Notably, only four studies reported sensitivity and/or specificity16,19,20,29. Other metrics used to evaluate models included accuracy, F1 score, precision, recall, and area under the precision-recall curve (AUPRC). Performance metrics were reported according to dataset, tissue type and/or preservation method, lung cancer subtype or unit of analysis (patch-level or slide-level). Importantly, we could not conduct a meta-analysis due to considerable heterogeneity in AI task, evaluation metrics used, unit of analysis and reporting. It was only possible to compare performance metrics for models that subtyped lung cancers, as this was the most common, most clearly defined task with the greatest consistency in reporting (Table 2). Models for subtyping lung cancers performed well, with average AUC values ranging from 0.746 (Mukashyaka et al. 2024) to 0.999 (Kanavati et al. 2021)22,25. Notably, out of the 16 studies evaluating models for subtyping lung cancers, eight provided ROC curves20,22,23,24,27,28,29,30, and eight provided measures of variability20,21,22,25,27,29,30,32.
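For readers less familiar with the metric, slide-level AUC for a binary subtyping task (e.g. LUAD versus LUSC) summarises discrimination across all possible decision thresholds. The snippet below is a minimal, hypothetical illustration using scikit-learn with made-up labels and probabilities; it is not drawn from any included study.

```python
# Hypothetical illustration: computing slide-level AUC for a binary
# LUAD-vs-LUSC subtyping model; labels and probabilities are made up.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                           # 0 = LUSC, 1 = LUAD
y_prob = np.array([0.10, 0.40, 0.80, 0.92, 0.65, 0.30, 0.71, 0.22])   # model's probability of LUAD

auc = roc_auc_score(y_true, y_prob)
print(f"Slide-level AUC: {auc:.3f}")
```

Reporting AUC alongside the underlying ROC curve and per-class counts would make such comparisons across studies, and future meta-analysis, considerably easier.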
Discussion
To our knowledge, this is the most up-to-date systematic scoping review evaluating the methodological robustness of studies externally validating AI models for diagnosing lung cancer from digital pathology images. We identified 20 publications and two preprints. Models performed various tasks to facilitate lung cancer diagnosis, with subtyping being the most common task. Models for subtyping lung cancers performed well, with average AUC values ranging from 0.746 (Mukashyaka et al. 2024) to 0.999 (Kanavati et al. 2021)22,25.
Promisingly, over 60% of studies used at least one restricted external validation dataset. Restricted datasets are advantageous over public datasets as it is easier to assess the reliability of ground truth labels and to ensure that validation is truly external38. Images sometimes overlap between online repositories, so even if models were trained and validated on separate public datasets, these datasets may not be completely independent38.
Nevertheless, we identified several methodological issues regarding the external validation process. These include failure to account for technical variation across centres and poor reporting of clinically meaningful evaluation metrics. Furthermore, only one prospective real-world validation study has been conducted to date. Other studies mostly used retrospective, non-representative datasets and case-control study designs, reflecting early-stage validation.
Notably, over 80% of studies were at high risk of bias in the ‘Participant selection/study design’ domain, primarily due to the use of non-diverse, retrospective datasets and case-control studies, which are highly susceptible to spectrum bias39. Separate recruitment of cases and controls may result in those with less extreme phenotypes (e.g. early-stage, asymptomatic individuals) being missed from datasets. Consequently, algorithms may perform inadequately on these individuals in the clinic. Spectrum bias is a particular concern for algorithms designed for screening settings, where a larger proportion of early-stage cancers will be identified compared to a symptomatic population. This is of particular importance with the introduction of lung cancer screening programmes in high-income countries worldwide4.
While retrospective validation is time- and cost-effective, models may underperform in real-world settings. Prospective studies and ongoing monitoring would be beneficial for understanding whether a model works with existing infrastructure and scanners, and on a population reflecting the target population as closely as possible. Prospective studies could range from small-scale implementation studies to larger RCTs. Encouragingly, we identified a clinical trial at the recruitment stage aiming to validate a lung cancer detection model using a prospective cohort study design (NCT05925764).
Another methodological concern was lack of diversity within the study population. Among the three studies that reported participant ethnicity, only a minority of participants were non-White. Research indicates racial disparities in lung cancer subtype and stage40. For example, Black individuals have a higher NSCLC incidence rate and are more likely to present with advanced lung cancer compared to White individuals40,41. Notably, none of the studies performed sub-group analysis by ethnicity.
Importantly, model performance is affected by sample size42. It is notable that 12 studies used fewer than 500 samples17,19,20,23,24,27,30,33,35,36,37,43, and that sample size was not reported for three studies16,21,25. Although it is encouraging that five studies used over 1000 samples17,21,25,31,34, dataset size may not necessarily reflect diversity within a dataset. The number of participants in these studies was not reported, and multiple samples may have originated from the same participant. Furthermore, around 90% of studies failed to report the proportion of subtypes and stages represented in datasets. Although two authors used data augmentation techniques to increase image variability, the level of technical diversity within datasets was concerning for half of the included studies16,19,23,26,27,28,29,30,31,32,36. Out of six single-centre studies, two studies applied stain normalisation to standardise stain colour23,27, and one study failed to report the use of any methods to increase technical variation19. This poses the risk of suboptimal performance on data from external centres using different equipment and/or tissue processing protocols. See recommendations in Box 1 below for methods to increase technical diversity within histopathological datasets.
In contrast with previous studies, clinically meaningful evaluation metrics such as sensitivity and specificity were poorly reported. Many authors failed to consider the clinical application of their models, and therefore did not determine suitable levels of sensitivity or specificity required to select optimal cutoffs from ROC curves. Moreover, sensitivity and specificity values were difficult to calculate as confusion matrices and ROC curves were rarely provided. With the increasing development of pathology AI models for cancer detection, more standardised reporting of metrics would enable future meta-analyses to determine the most effective model for a particular task.
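To illustrate what such reporting would enable: when the underlying prediction scores (or an ROC curve) are available, an operating threshold can be chosen to meet a clinically required sensitivity and the corresponding specificity reported. The sketch below uses scikit-learn with made-up scores and an assumed sensitivity requirement of 0.90; it is purely illustrative and not taken from any included study.

```python
# Hypothetical sketch: selecting an operating point on an ROC curve so that an
# assumed clinical sensitivity requirement (0.90) is met, then reporting the
# specificity achieved at that threshold. Labels and scores are made up.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])   # 1 = malignant, 0 = non-malignant
y_prob = np.array([0.05, 0.20, 0.45, 0.55, 0.90, 0.85, 0.30, 0.60, 0.70, 0.40])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

required_sensitivity = 0.90
idx = int(np.argmax(tpr >= required_sensitivity))   # first threshold meeting the requirement

print(f"Threshold: {thresholds[idx]:.2f}, "
      f"sensitivity: {tpr[idx]:.2f}, specificity: {1 - fpr[idx]:.2f}")
```

Publishing confusion matrices or raw score distributions alongside AUC would allow readers to perform exactly this kind of clinically oriented re-analysis.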
Our findings that lung cancer subtyping was the most common task, and that AUC was the most frequently reported metric, echo the results of Prabhu et al. (2022) and Davri et al. (2023), respectively13,14. Importantly, previous reviews evaluating both internal and external validation studies together highlighted issues with small datasets, lack of multicentre studies and heterogeneity in metrics used13,14,15. Our results indicate that these issues persist even when considering external validation studies alone. With regards to the quality assessment, similarly to McGenity et al.15, we found that the greatest number of studies were at high or unclear risk of bias in the ‘Participant selection/study design’ domain. While McGenity et al.15 attributed this primarily to lack of non-random, non-consecutive and unclear participant enrolment, we found that this was predominantly due to lack of diversity within the participant population. In contrast with McGenity et al.15, the quality assessment tool we used contained an additional domain titled ‘Image selection’. We identified eleven studies at high risk of bias in this domain due to lack of technical diversity within datasets17,19,20,21,23,25,26,27,30,31,33.
Our review has several strengths. Firstly, given the fast-paced nature of the field, this is the most comprehensive and up-to-date review of its kind. While previous research has been limited to models for LUAD and LUSC14, we additionally included models that facilitated the diagnosis of any lung cancer subtype. Moreover, we used a broad, comprehensive search strategy and explored both engineering and medical databases. Secondly, we adhered to PRISMA-ScR guidance and published our protocol a priori, outlining any changes in an updated version44. Thirdly, screening, data extraction and the quality assessment were independently conducted by at least two reviewers, and conflicts at the screening stage were resolved independently by a third reviewer. Nevertheless, a key limitation of this review was the inability to conduct a meta-analysis due to heterogeneity in evaluation metrics used across studies. Additionally, our quality assessment was hindered by poor reporting and lack of response to missing information requests. Out of the 22 authors contacted for missing information, only seven authors responded. Finally, it is noteworthy that preprints were included in the review and that relevant literature may have been missed as we were unable to translate two studies.
Lung cancer subtyping models may bring considerable benefits to clinical practice. Firstly, by automating certain tasks using AI, stains and other materials can be conserved for downstream tasks such as genomic analysis and treatment planning. Secondly, LUAD and LUSC require different treatment modalities45. While targeted therapies such as EGFR inhibitors and BRAF inhibitors are effective for treating LUAD, these therapies have shown limited benefit for LUSC45,46. Treatment for LUSC typically involves surgery, radiotherapy and platinum-based chemotherapy, which are most effective for early-stage disease45. Hence, timely and accurate lung cancer subtyping is critical. A delay in treatment could result in rapid disease progression, making treatment more complex and increasing the risk of side effects47.
In conclusion, this review provides an overview of the landscape of AI models for the digital pathology-based diagnosis of lung cancer and their external validation. Such tools have great potential to support clinical workflows and may be best positioned as triage tools or as add-on tools to augment pathologists. The field is evolving rapidly; however, robust clinical validation is lacking to date, raising concerns about whether model performance would be maintained in real-world clinical settings. The methodological issues identified in this review highlight the need for more rigorous external validation of pathology lung cancer detection models for increased clinical adoption. Based on these issues, we propose a set of recommendations for more robust external validation to aid the translation of AI cancer detection models that are safe, ethical and effective in clinical practice (Box 1). Further work is warranted to clarify exact requirements for a robust external validation dataset. For example, expert consensus may help to determine the required sample size, number of contributing centres, and the optimal level of geographic and demographic diversity. This review focuses on lung cancer as a use case; however, our findings are likely to apply to other cancer types as well.
Methods
Search strategy and selection criteria
This systematic scoping review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines44. The protocol was published on Open Science Framework (https://osf.io/yacju) prior to conducting the review.
We screened 50 titles and abstracts for piloting purposes and to inform our search strategy. We subsequently searched for primary research articles published between 1st January 2010 and 31st October 2024, with no language restrictions. We chose 2010 as the starting date due to advancements in deep learning methods and the increased availability of big data48. Using a combination of keywords related to ‘lung cancer’, ‘AI’, ‘validation’, ‘diagnosis’ and ‘pathology’, we systematically searched MEDLINE, Embase, Web of Science, IEEE, Engineering Village, and the Association for Computing Machinery. The full search strategy for MEDLINE can be found in Supplementary material 1. Additionally, we searched ClinicalTrials.gov for study protocols, and BioRxiv and MedRxiv for preprint articles. We extended our systematic search to IBM and to the first 200 studies in Google Scholar, ranked by relevance, for studies that met our inclusion criteria. Snowballing was used to identify studies that may have been missed during the search process.
Studies were considered for inclusion if they provided evidence on the accuracy and utility of machine learning models for analysing histopathology or cytology images to aid in early lung cancer diagnosis. We did not impose restrictions in relation to study type, as we anticipated that models would be at various stages of development and study type was one of our outcomes. Studies were excluded if they described the development of AI algorithms without any validation of their effectiveness; evaluated models unrelated to histopathology or cytology; evaluated models not designed for lung cancer diagnosis; did not use machine learning models; only validated models internally; were review articles; or were conference abstracts with insufficient information (e.g. lack of clarity on whether validation was internal or external).
Data extraction, analysis and synthesis
After removing duplicates, two authors (SA and MK or MG) independently performed title and abstract screening as well as full-text screening to identify studies that met the inclusion criteria. Any conflicts were resolved independently by a third reviewer (JO). All stages of screening were conducted using Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia). Data extraction was performed independently by at least two reviewers (SA and MK or MG) using a pre-designed data extraction template, which was included in the published protocol.
The extracted data included details related to the type of study used to validate models, AI task, model performance where available, evaluation metrics used, validation dataset size, number of datasets, technical diversity within datasets and dataset source (public or restricted). Public datasets are openly accessible to the public via online repositories, whereas restricted datasets are not openly accessible to the public and may require authorised access49. We additionally aimed to collect information on intended clinical setting, validation setting, and reporting. All authors of the included studies were contacted for missing information. Due to considerable heterogeneity in AI task, evaluation metrics used, unit of analysis and reporting, we used a narrative synthesis approach rather than a meta-analysis.
Quality assessment
We conducted a quality assessment to further investigate the methodological robustness of validation studies. We modified the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool to better reflect specifics of AI and digital pathology50. Signalling questions were adapted with input from experts in AI and digital pathology (QUADAS-AI-P, see supplementary material 2). All five domains of QUADAS-AI-P are concerned with the external validation phase. Two authors (SA and MK or MG) independently conducted a quality assessment using the QUADAS-AI-P tool.
Data availability
The data that support the findings of this study are available from the corresponding publications and from the authors upon reasonable request.
References
Kim, Y. J., Roh, E. H. & Park, S. A literature review of quality, costs, process-associated with digital pathology. J. Exerc. Rehabil. 17, 11–14 (2021).
Gourd, E. Lung cancer control in the UK hit badly by COVID-19 pandemic. Lancet Oncol. 21, 1559 (2020).
Cancer Research UK. Survival for Lung Cancer. https://www.cancerresearchuk.org/about-cancer/lung-cancer/survival (2022).
Lam, S. et al. Current and future perspectives on computed tomography screening for lung cancer: a roadmap from 2023 to 2027 from the International Association for the study of lung cancer. J. Thorac. Oncol. 19, 36–51 (2024).
The Royal College of Pathologists. Meeting Pathology Demand: Histopathology Workforce Census. https://www.rcpath.org/static/952a934d-2ec3-48c9-a8e6e00fcdca700f/Meeting-Pathology-Demand-Histopathology-Workforce-Census-2018.pdf (2018).
Vos, S. et al. Making pathologists ready for the new artificial intelligence era: changes in required competencies. Mod. Pathol. 38, 100657 (2025).
Serag, A. et al. Translational AI and deep learning in diagnostic pathology. Front. Med. 6, 185 (2019).
Ning, J. et al. Early diagnosis of lung cancer: which is the optimal choice? Aging 13, 6214–6227 (2021).
Scott, I. A. & Zuccon, G. The new paradigm in machine learning—foundation models, large language models and beyond: a primer for physicians. Intern. Med. J. 54, 705–715 (2024).
Eloy, C. et al. Artificial intelligence-assisted cancer diagnosis improves the efficiency of pathologists in prostatic biopsies. Virchows Arch. 482, 595–604 (2023).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Tsopra, R. et al. A framework for validating AI in precision medicine: considerations from the European ITFoC consortium. BMC Med. Inform. Decis. Mak. 21, 274 (2021).
Davri, A. et al. Deep learning for lung cancer diagnosis, prognosis and prediction using histological and cytological images: a systematic review. Cancers 15, 3981 (2023).
Prabhu, S., Prasad, K., Robels-Kelly, A. & Lu, X. AI-based carcinoma detection and classification using histopathological images: a systematic review. Comput. Biol. Med. 142, 105209 (2022).
McGenity, C. et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. npj Digit. Med. 7, 114 (2024).
Vorontsov, E. et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 30, 2924–2935 (2024).
Bilaloglu, S. et al. Efficient pan-cancer whole-slide image classification and outlier detection using convolutional neural networks. bioRxiv, 633123. https://doi.org/10.1101/633123 (2019).
Cao, L. et al. E2EFP-MIL: End-to-end and high-generalizability weakly supervised deep convolutional network for lung cancer classification from whole slide image. Med. Image Anal. 88, 102837 (2023).
Chen, Y. et al. A whole-slide image (WSI)-based immunohistochemical feature prediction system improves the subtyping of lung cancer. Lung Cancer 165, 18–27 (2022).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Hari, S. N. et al. Examining batch effect in histopathology as a distributionally robust optimization problem. bioRxiv. https://doi.org/10.1101/2021.09.14.460365 (2021).
Kanavati, F. et al. A deep learning model for the classification of indeterminate lung carcinoma in biopsy whole slide images. Sci. Rep. 11, 8110 (2021).
Le Page, A. L. et al. Using a convolutional neural network for classification of squamous and non-squamous non-small cell lung cancer based on diagnostic histopathology HES images. Sci. Rep. 11, 23912 (2021).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Mukashyaka, P., Sheridan, T. B., Foroughi, P. A. & Chuang, J. H. SAMPLER: unsupervised representations for rapid analysis of whole slide tissue images. eBioMedicine 99, 104908 (2024).
Noorbakhsh, J. et al. Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. Nat. Commun. 11, 6367 (2020).
Quiros, A. et al. Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unannotated pathology slides. Nat. Commun. 15, 4596 (2024).
Wang, S. et al. Deep learning of cell spatial organizations identifies clinically relevant insights in tissue images. Nat. Commun. 14, 7872 (2023).
Yang, H. et al. Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study. BMC Med. 19, 80 (2021).
Yu, K. H. et al. Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J. Am. Med. Inform. Assoc. 27, 757–769 (2020).
Sharma, R., Kumar, S., Shrivastava, A. & Bhatt, T. Optimizing knowledge transfer in sequential models: leveraging residual connections in flow transfer learning for lung cancer classification. 14th Indian Conference on Computer Vision, Graphics and Image Processing. https://doi.org/10.1145/3627631.3627663 (2024).
Borras Ferris, L. et al. A full pipeline to analyze lung histopathology images. SPIE—Progress in Biomedical Optics and Imaging. 12933. https://doi.org/10.1117/12.3006708 (2024).
Gertych, A. et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci. Rep. 9, 1483 (2019).
Kanavati, F. et al. Weakly-supervised learning for lung carcinoma classification using deep learning. Sci. Rep. 10, 9297 (2020).
Sakamoto, T. et al. A collaborative workflow between pathologists and deep learning for the evaluation of tumour cellularity in lung adenocarcinoma. Histopathology 81, 758–769 (2022).
Swiderska-Chadaj, Z. et al. A deep learning approach to assess the predominant tumor growth pattern in whole-slide images of lung adenocarcinoma. Medical Imaging: Digital Pathology. https://doi.org/10.1117/12.2549742 (2020).
Wang, S. et al. ConvPath: A software tool for lung adenocarcinoma digital pathological image analysis aided by a convolutional neural network. EBioMedicine 50, 103–110 (2019).
Sounderajah, V. et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat. Med. 27, 1663–1665 (2021).
Usher-Smith, J. A., Sharp, S. J. & Griffin, S. J. The spectrum effect in tests for risk prediction, screening, and diagnosis. BMJ 353, i3139 (2016).
Zeng, H. et al. Racial disparities in histological subtype, stage, tumor grade and cancer-specific survival in lung cancer. Transl. Lung Cancer Res. 11, 1348–1358 (2022).
Duncan, F. C. et al. Racial disparities in staging, treatment, and mortality in non-small cell lung cancer. Transl. Lung Cancer Res. 13, 76–94 (2024).
Rajput, D., Wang, W.-J. & Chen, C.-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinform. 24, 48 (2023).
Wang, S. et al. Deep learning of cell spatial organizations identifies clinically relevant insights in tissue images. Nat. Commun. 14. https://doi.org/10.1038/s41467-023-43172-8 (2023).
Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann. Intern. Med. 169, 467–473 (2018).
Wang, X., Zheng, K. & Hao, Z. In-depth analysis of immune cell landscapes reveals differences between lung adenocarcinoma and lung squamous cell carcinoma. Front. Oncol. 14, 1338634 (2024).
Lewis, W. E. et al. Efficacy of targeted inhibitors in metastatic lung squamous cell carcinoma with EGFR or ALK alterations. JTO Clin. Res. Rep. 2, 100237 (2021).
Cancer Research UK. Treatment Options for Non Small Cell Lung Cancer (NSCLC). https://www.cancerresearchuk.org/about-cancer/lung-cancer/treatment/non-small-cell-lung-cancer (2023).
Vo, V. et al. Multi-stakeholder preferences for the use of artificial intelligence in healthcare: a systematic review and thematic analysis. Soc. Sci. Med. 338, 116357 (2023).
Marée, R. Open practices and resources for collaborative digital pathology. Front. Med. 6, 255 (2019).
Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011).
Flahault, A., Cadilhac, M. & Thomas, G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J. Clin. Epidemiol. 58, 859–862 (2005).
Snell, K. I. E. et al. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. J. Clin. Epidemiol. 135, 79–89 (2021).
Cancer Research UK. Lung Cancer Incidence Statistics. https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/lung-cancer/incidence#heading-Zero (2024).
Acknowledgements
The authors would like to thank Dr Fayyaz Minhas (University of Warwick) for his help in adapting the quality assessment tool. J.O. and S.A. disclose support for the research of this work from Barts Charity (grant number: G-001522, MGU0461). O.B. declares funding from Barts Charity (grant number: G-001522). The funding source had no role in the study design, data collection, data analysis, data interpretation, or in manuscript writing. The views expressed in this manuscript are those of the authors and may not necessarily reflect those of the funding source.
Author information
Authors and Affiliations
Contributions
This study was initially conceived by J.O. and J.L.R., and further developed by R.G., O.B., D.M. and S.A.. J.O. acquired funding for this project. S.A. developed the protocol, screened articles for inclusion, extracted and synthesised data, conducted the quality assessment, interpreted the results and drafted the manuscript. M.K. and M.G. screened articles for inclusion, extracted data and conducted the quality assessment. J.L.R. assisted in the development of the protocol and assisted in the tailoring of the quality assessment tool. O.B. assisted in the development of the protocol. D.M. provided methodological support, assisted in the development of the protocol and supervised the screening process. J.O. supervised the project, assisted in the development of the protocol and resolved conflicts at the screening stage. The raw data were collected and verified by S.A., M.K. and M.G. All authors reviewed the manuscript, had access to the data presented in the manuscript, and approved the final version.
Corresponding author
Ethics declarations
Competing interests
Author J.O. has previously acted as a paid consultant for Hardian Health but declares no non-financial competing interests. All other authors declare no financial or non-financial competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Arun, S., Grosheva, M., Kosenko, M. et al. Systematic scoping review of external validation studies of AI pathology models for lung cancer diagnosis. npj Precis. Onc. 9, 166 (2025). https://doi.org/10.1038/s41698-025-00940-7