Abstract
Clinical adoption of digital pathology-based artificial intelligence models for diagnosing lung cancer has been limited, partly due to a lack of robust external validation. This review provides an overview of such tools, their performance and their external validation. We systematically searched for external validation studies in medical, engineering and grey literature databases from 1st January 2010 to 31st October 2024. 22 studies were included. Models performed various tasks, including classification of malignant versus non-malignant tissue, tumour growth pattern classification and subtyping of adeno- versus squamous cell carcinomas. Subtyping models were the most common and performed well, with average AUC values ranging from 0.746 to 0.999. Although most studies used restricted datasets, methodological issues relevant to the applicability of models in real-world settings remained, including small and/or non-representative datasets, retrospective designs and case-control studies without further real-world validation. Ultimately, more rigorous external validation of models is warranted for increased clinical adoption.
Introduction
Digital pathology refers to the analysis, management and sharing of pathology-related data within a digital environment1. The advent of digital pathology has driven the development of numerous artificial intelligence (AI) models for application on digital pathology images to aid cancer diagnosis. Such AI tools are being developed at an increasing rate every year. Lung cancer is the leading cause of cancer-related death in the UK, accounting for approximately 35,000 deaths annually2. This high mortality rate is largely a result of late-stage diagnosis. Notably, the five-year survival rate for individuals diagnosed with lung cancer at stage 1 is 65%; however, this decreases considerably to 5% for individuals diagnosed at stage 43. The implementation of national targeted lung cancer screening programmes in the UK and other high-income countries worldwide may improve patient outcomes4. Nevertheless, increased screening is likely to result in increased referrals to pathology services, and to place substantial strain on an already-burdened workforce5. AI could potentially address these workforce bottlenecks6.
The application of AI models to digitised whole slide images (WSIs) is revolutionising cancer diagnosis. Pathologists face considerable pressure from rising workloads and a need to analyse increasingly complex and vast datasets. By automating certain tasks, AI models could complement pathologists in their clinical workflows and offer scalable diagnostic support4. Importantly, AI can rapidly analyse vast datasets and may recognise patterns that are not easily discernible to the human eye7. This ability is especially pertinent in lung cancer, where early diagnosis can lead to a substantial improvement in patient outcomes8. An emerging trend in the field of AI is the development of foundation models: large-scale models trained on vast datasets that act as a foundation for a diverse range of downstream tasks9. Notably, several histopathology-based AI models have already been approved by the FDA, including Paige Prostate for facilitating the diagnosis of prostate cancer10.
Despite their potential, the clinical adoption of cancer diagnostic AI pathology tools has been extremely limited to date. This is largely attributable to a lack of robust external validation of models prior to deployment, and concerns regarding the generalisability of models to real-world clinical settings11. External validation refers to the evaluation of model performance using data from a source separate from that used to train and test the model11. A major challenge to widespread clinical adoption of these models is validating them on diverse, real-world datasets11. While AI models may perform well on internal datasets, their performance may drop considerably on external datasets that reflect the variability encountered in clinical practice. Robust external validation is important for assessing the generalisability of a model to different patient populations and is a critical step before AI models can be trusted and integrated into clinical workflows12.
Current literature on external validation is sparse and existing systematic reviews evaluating validation studies for pathology-based lung cancer diagnostic algorithms focus primarily on AI techniques and validation on internal datasets13,14,15. A review of external validation studies for these AI tools, with a focus on methodological robustness, is yet to be conducted. Our review provides an overview of models used to facilitate lung cancer diagnosis from digital pathology images and explores the current state of external validation of these models. The primary objective of this review was to assess the methodological robustness of validation studies and report model performance where possible, focusing exclusively on external, independent datasets. We chose to use a systematic scoping review approach to map the available evidence, identify gaps in the evidence, and critically appraise included studies.
Results
Database searches resulted in 4423 studies, with 440 additional studies identified through other sources, including Google Scholar and a snowballing approach (Fig. 1). After duplicates were removed, we screened 3851 titles and abstracts, and reviewed 414 full-text articles. Overall, 22 studies met the inclusion criteria, including 20 publications and two preprints (Table 1). During the screening process, we identified 239 papers describing the development and validation of pathology lung cancer detection models; notably, only around 10% of these papers described the external validation of models.
AI models and tasks
Figure 2 presents the characteristics of the included studies. 18 models facilitated the diagnosis of non-small cell lung cancer (NSCLC), focusing primarily on lung adenocarcinoma (LUAD) and/or lung squamous cell carcinoma (LUSC). Three models detected small cell lung cancer (SCLC) in addition to NSCLC. We identified one foundation model, named Virchow, which was trained for pan-cancer detection using approximately 1.5 million WSIs covering 17 tissue types16. Virchow was trained and validated on lung cancer tissue; however, the lung cancer subtypes the model was intended to detect were not specified16.
Models performed various tasks along the diagnostic pathway, most commonly subtyping (n = 16)17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32. 13 subtyping models distinguished LUAD from LUSC, whereas three models distinguished LUAD, LUSC and SCLC. Other tasks performed by AI models included classification of malignant versus non-malignant tissue (n = 14)17,19,20,21,22,26,29,30,31,32,33,34,35,36, tumour growth pattern classification (n = 2)33,36, biomarker identification (n = 2)16,19, prediction of tumour cellularity (n = 1)35, and classification of cell types (n = 1)37. We identified 14 multi-tasking models16,17,19,20,21,22,26,29,30,31,32,33,34,35,36, the majority of which (n = 10) combined subtyping and classification of malignant versus non-malignant tissue17,19,20,21,22,26,29,30,31,32.
Information regarding the intended role of the model within the diagnostic pathway, the intended clinical setting and the intended country of deployment was limited. Three authors provided details on the intended clinical setting of their models, such as country of deployment and whether the target population would be asymptomatic or symptomatic18,30,35. One author reported that their model could act as a triage tool in clinical practice16, whereas authors of 12 studies reported that their AI model was developed to aid the clinician without providing any further details17,18,19,20,21,22,23,30,32,33,35,36,37. One author reported that their study was for research purposes only and that the model was not developed specifically as a clinical tool26.
Study type
16 out of 22 studies were retrospective16,20,21,22,23,24,25,26,28,29,30,31,32,33,34,37, with retrospective case-control studies being the most commonly used study design (n = 10)16,20,21,22,26,29,30,31,33,34. We identified one prospective case-control study36, but could not identify any completed prospective cohort studies or randomised controlled trials. For five studies, it was unclear whether data were collected retrospectively or prospectively17,18,19,27,35.
Datasets
Histopathological datasets used for external validation were heterogeneous in size, with studies using from as few as 20 to as many as 2115 samples (see Table 1). Around half of the studies (n = 9) used datasets consisting of between 100 and 500 images17,19,20,24,27,28,30,35,37. As with dataset size, the number and source of datasets varied considerably. Seven studies were single-centre studies17,19,20,23,24,27,37, whereas six studies used images from multiple centres, ranging from two to four centres18,21,22,29,34,35. While most studies used images from restricted datasets from secondary care hospitals and tertiary centres16,17,18,19,20,23,24,27,28,35,36,37, three studies used a combination of both public and restricted datasets22,29,34.
Technical diversity within datasets
Over half of the studies (n = 12) used techniques to address potential variations in images that may arise due to differences in equipment or tissue processing protocols across centres. Nine studies reported using either one or a combination of the following to increase technical diversity in datasets: WSIs created with different whole slide scanners, various magnifications, slides preserved using different methods (e.g. FFPE or frozen), different tissue samples (e.g. biopsies or resections), slides prepared with various stains, and slides containing artefacts (e.g. bubbles and scratches)17,18,20,22,24,25,33,34,35. Among the 13 studies where the use of these methods was unclear, two studies simulated technical diversity through data augmentation techniques such as rotation, flipping, and varying brightness, saturation, contrast, and hue21,37. Conversely, three studies used stain normalisation to minimise variability between images18,23,27.
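As an illustration of the data augmentation approach reported by two of the included studies, the sketch below shows how colour and geometric augmentations might be applied to histopathology patches. It assumes a PyTorch/torchvision pipeline with illustrative parameter values; it is not code taken from any included study.

```python
# Minimal sketch of augmentations that simulate scanner and stain variability in
# histopathology patches; assumes torchvision is available and that `patch` is a
# PIL image of a tissue tile. Parameter values are illustrative assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # tissue has no canonical orientation
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),    # arbitrary slide placement on the scanner
    transforms.ColorJitter(
        brightness=0.2,                       # scanner illumination differences
        contrast=0.2,
        saturation=0.2,
        hue=0.05,                             # stain colour variation across laboratories
    ),
    transforms.ToTensor(),                    # convert to a tensor for model input
])

# Usage: augmented_patch = augment(patch)
```

Such augmentation only approximates inter-centre variability; it does not replace validation on slides prepared and scanned at genuinely different centres.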
Quality assessment
High or unclear risk of bias was observed for all studies in at least one of the five assessed QUADAS-AI-P domains. As depicted in Fig. 3, high risk of bias was noted for 14% of studies in the ‘Reference standard’ domain, 50% of studies in the ‘Image selection’ domain, and 86% of studies in the ‘Participant selection/study design’ domain. On the other hand, low risk of bias was noted for 18% of studies in the ‘Image selection’ domain, 23% of studies in the ’Reference standard’ domain, and 32% of studies in both the ‘Flow and timing’ and ‘Index test’ domains. Due to inadequate reporting, the risk of bias was unclear for most studies in the ‘Flow and timing’, ‘Index test’, and ‘Reference standard’ domains. Additionally, concerns regarding applicability were high for one study in the ‘Target condition’ domain, low for 95% of studies in the ‘Index test’ domain and unclear for 82% of studies in the ‘Participant selection’ domain due to insufficient reporting of participant characteristics (Fig. 3). The risk of bias ratings and concerns regarding applicability are shown for each individual study in Supplementary material 3. See supplementary material 4 for a full list of methodological concerns.
a Results of the risk of bias assessment. For each QUADAS-AI-P domain, the blue, orange and grey sections of the bar indicate the percentage of studies judged to be at low, high or unclear risk of bias, respectively. b Results of the concerns regarding applicability assessment. For each QUADAS-AI-P domain, the blue, orange and grey sections of the bar indicate the percentage of studies considered to have low, high or unclear concerns regarding applicability. QUADAS-AI-P: QUality Assessment tool of Diagnostic Accuracy Studies tailored to Artificial Intelligence and digital Pathology.
Diagnostic performance and evaluation metrics used
The area under the receiver operating characteristic curve (AUC) was reported in 17 out of 22 studies17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,34, making it the most commonly reported evaluation metric overall. Notably, only four studies reported sensitivity and/or specificity16,19,20,29. Other metrics used to evaluate models included accuracy, F1 score, precision, recall, and area under the precision-recall curve (AUPRC). Performance metrics were reported according to dataset, tissue type and/or preservation method, lung cancer subtype or unit of analysis (patch-level or slide-level). Importantly, we could not conduct a meta-analysis due to considerable heterogeneity in AI task, evaluation metrics used, unit of analysis and reporting. It was only possible to compare performance metrics for models that subtyped lung cancers, as this was the most common, most clearly defined task with the greatest consistency in reporting (Table 2). Models for subtyping lung cancers performed well, with average AUC values ranging from 0.746 (Mukashyaka et al. 2024) to 0.999 (Kanavati et al. 2021)22,25. Notably, out of the 16 studies evaluating models for subtyping lung cancers, eight provided ROC curves20,22,23,24,27,28,29,30, and eight provided measures of variability20,21,22,25,27,29,30,32.
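For readers less familiar with the metric, slide-level AUC for a binary subtyping task (e.g. LUAD versus LUSC) summarises discrimination across all possible decision thresholds. The snippet below is a minimal, hypothetical illustration using scikit-learn with made-up labels and probabilities; it is not drawn from any included study.

```python
# Hypothetical illustration: computing slide-level AUC for a binary
# LUAD-vs-LUSC subtyping model; labels and probabilities are made up.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                           # 0 = LUSC, 1 = LUAD
y_prob = np.array([0.10, 0.40, 0.80, 0.92, 0.65, 0.30, 0.71, 0.22])   # model's probability of LUAD

auc = roc_auc_score(y_true, y_prob)
print(f"Slide-level AUC: {auc:.3f}")
```

Reporting AUC alongside the underlying ROC curve and per-class counts would make such comparisons across studies, and future meta-analysis, considerably easier.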
Discussion
To our knowledge, this is the most up-to-date systematic scoping review evaluating the methodological robustness of studies externally validating AI models for diagnosing lung cancer from digital pathology images. We identified 20 publications and two preprints. Models performed various tasks to facilitate lung cancer diagnosis, with subtyping being the most common task. Models for subtyping lung cancers performed well, with average AUC values ranging from 0.746 (Mukashyaka et al. 2024) to 0.999 (Kanavati et al. 2021)22,25.
Promisingly, over 60% of studies used at least one restricted external validation dataset. Restricted datasets are advantageous over public datasets as it is easier to assess the reliability of ground truth labels and to ensure that validation is truly external38. Images sometimes overlap between online repositories, so even if models were trained and validated on separate public datasets, these datasets may not be completely independent38.
Nevertheless, we identified several methodological issues regarding the external validation process. These include failure to account for technical variation across centres and poor reporting of clinically meaningful evaluation metrics. Furthermore, only one prospective real-world validation study has been conducted to date. Other studies mostly used retrospective, non-representative datasets and case-control study designs, reflecting early-stage validation.
Notably, over 80% of studies were at high risk of bias in the ‘Participant selection/study design’ domain, primarily due to the use of non-diverse, retrospective datasets and case-control studies, which are highly susceptible to spectrum bias39. Separate recruitment of cases and controls may result in those with less extreme phenotypes (e.g. early-stage, asymptomatic individuals) being missed from datasets. Consequently, algorithms may perform inadequately on these individuals in the clinic. Spectrum bias is a particular concern for algorithms designed for screening settings, where a larger proportion of early-stage cancers will be identified compared to a symptomatic population. This is of particular importance with the introduction of lung cancer screening programmes in high-income countries worldwide4.
While retrospective validation is time- and cost-effective, models may underperform in real-world settings. Prospective studies and ongoing monitoring would be beneficial for understanding whether a model works with existing infrastructure and scanners, and on a population reflecting the target population as closely as possible. Prospective studies could range from small-scale implementation studies to larger RCTs. Encouragingly, we identified a clinical trial at the recruitment stage aiming to validate a lung cancer detection model using a prospective cohort study design (NCT05925764).
Another methodological concern was lack of diversity within the study population. Among the three studies that reported participant ethnicity, only a minority of participants were non-White. Research indicates racial disparities in lung cancer subtype and stage40. For example, Black individuals have a higher NSCLC incidence rate and are more likely to present with advanced lung cancer compared to White individuals40,41. Notably, none of the studies performed sub-group analysis by ethnicity.
Importantly, model performance is affected by sample size42. It is notable that 12 studies used fewer than 500 samples17,19,20,23,24,27,30,33,35,36,37,43, and that sample size was not reported for three studies16,21,25. Although it is encouraging that five studies used over 1000 samples17,21,25,31,34, dataset size may not necessarily reflect diversity within a dataset. The number of participants in these studies was not reported, and multiple samples may have originated from the same participant. Furthermore, around 90% of studies failed to report the proportion of subtypes and stages represented in datasets. Although two authors used data augmentation techniques to increase image variability, the level of technical diversity within datasets was concerning for half of the included studies16,19,23,26,27,28,29,30,31,32,36. Out of six single-centre studies, two studies applied stain normalisation to standardise stain colour23,27, and one study failed to report the use of any methods to increase technical variation19. This poses the risk of suboptimal performance on data from external centres using different equipment and/or tissue processing protocols. See recommendations in Box 1 below for methods to increase technical diversity within histopathological datasets.
In contrast with previous studies, clinically meaningful evaluation metrics such as sensitivity and specificity were poorly reported. Many authors failed to consider the clinical application of their models, and therefore did not determine suitable levels of sensitivity or specificity required to select optimal cutoffs from ROC curves. Moreover, sensitivity and specificity values were difficult to calculate as confusion matrices and ROC curves were rarely provided. With the increasing development of pathology AI models for cancer detection, more standardised reporting of metrics would enable future meta-analyses to determine the most effective model for a particular task.
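To illustrate what such reporting would enable: when the underlying prediction scores (or an ROC curve) are available, an operating threshold can be chosen to meet a clinically required sensitivity and the corresponding specificity reported. The sketch below uses scikit-learn with made-up scores and an assumed sensitivity requirement of 0.90; it is purely illustrative and not taken from any included study.

```python
# Hypothetical sketch: selecting an operating point on an ROC curve so that an
# assumed clinical sensitivity requirement (0.90) is met, then reporting the
# specificity achieved at that threshold. Labels and scores are made up.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])   # 1 = malignant, 0 = non-malignant
y_prob = np.array([0.05, 0.20, 0.45, 0.55, 0.90, 0.85, 0.30, 0.60, 0.70, 0.40])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

required_sensitivity = 0.90
idx = int(np.argmax(tpr >= required_sensitivity))   # first threshold meeting the requirement

print(f"Threshold: {thresholds[idx]:.2f}, "
      f"sensitivity: {tpr[idx]:.2f}, specificity: {1 - fpr[idx]:.2f}")
```

Publishing confusion matrices or raw score distributions alongside AUC would allow readers to perform exactly this kind of clinically oriented re-analysis.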
Our findings that lung cancer subtyping was the most common task, and that AUC was the most frequently reported metric, echo the results of Prabhu et al. (2022) and Davri et al. (2023), respectively13,14. Importantly, previous reviews evaluating both internal and external validation studies together highlighted issues with small datasets, lack of multicentre studies and heterogeneity in metrics used13,14,15. Our results indicate that these issues persist even when considering external validation studies alone. With regards to the quality assessment, similarly to McGenity et al.15, we found that the greatest number of studies were at high or unclear risk of bias in the ‘Participant selection/study design’ domain. While McGenity et al.15 attributed this primarily to lack of non-random, non-consecutive and unclear participant enrolment, we found that this was predominantly due to lack of diversity within the participant population. In contrast with McGenity et al.15, the quality assessment tool we used contained an additional domain titled ‘Image selection’. We identified eleven studies at high risk of bias in this domain due to lack of technical diversity within datasets17,19,20,21,23,25,26,27,30,31,33.
Our review has several strengths. Firstly, given the fast-paced nature of the field, this is the most comprehensive and up-to-date review of its kind. While previous research has been limited to models for LUAD and LUSC14, we additionally included models that facilitated the diagnosis of any lung cancer subtype. Moreover, we used a broad, comprehensive search strategy and explored both engineering and medical databases. Secondly, we adhered to PRISMA-ScR guidance and published our protocol a priori, outlining any changes in an updated version44. Thirdly, screening, data extraction and the quality assessment were independently conducted by at least two reviewers, and conflicts at the screening stage were resolved independently by a third reviewer. Nevertheless, a key limitation of this review was the inability to conduct a meta-analysis due to heterogeneity in evaluation metrics used across studies. Additionally, our quality assessment was hindered by poor reporting and lack of response to missing information requests. Out of the 22 authors contacted for missing information, only seven authors responded. Finally, it is noteworthy that preprints were included in the review and that relevant literature may have been missed as we were unable to translate two studies.
Lung cancer subtyping models may bring considerable benefits to clinical practice. Firstly, by automating certain tasks using AI, stains and other materials can be conserved for downstream tasks such as genomic analysis and treatment planning. Secondly, LUAD and LUSC require different treatment modalities45. While targeted therapies such as EGFR inhibitors and BRAF inhibitors are effective for treating LUAD, these therapies have shown limited benefit for LUSC45,46. Treatment for LUSC typically involves surgery, radiotherapy and platinum-based chemotherapy, which are most effective for early-stage disease45. Hence, timely and accurate lung cancer subtyping is critical. A delay in treatment could result in rapid disease progression, making treatment more complex and increasing the risk of side effects47.
In conclusion, this review provides an overview of the landscape of AI models for the digital pathology-based diagnosis of lung cancer and their external validation. Such tools have great potential to support clinical workflows and may be best positioned as triage tools or as add-on tools to augment pathologists. The field is evolving rapidly; however, robust clinical validation is lacking to date, raising concerns about whether model performance would be maintained in real-world clinical settings. The methodological issues identified in this review highlight the need for more rigorous external validation of pathology lung cancer detection models for increased clinical adoption. Based on these issues, we propose a set of recommendations for more robust external validation to aid the translation of AI cancer detection models that are safe, ethical and effective in clinical practice (Box 1). Further work is warranted to clarify exact requirements for a robust external validation dataset. For example, expert consensus may help to determine the required sample size, number of contributing centres, and the optimal level of geographic and demographic diversity. This review focuses on lung cancer as a use case; however, our findings are likely to apply to other cancer types as well.
Methods
Search strategy and selection criteria
This systematic scoping review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines44. The protocol was published on Open Science Framework (https://osf.io/yacju) prior to conducting the review.
We screened 50 titles and abstracts for piloting purposes and to inform our search strategy. We subsequently searched for primary research articles published between 1st January 2010 and 31st October 2024, with no language restrictions. We chose 2010 as the starting date due to advancements in deep learning methods and the increased availability of big data48. Using a combination of keywords related to ‘lung cancer’, ‘AI’, ‘validation’, ‘diagnosis’ and ‘pathology’, we systematically searched MEDLINE, Embase, Web of Science, IEEE, Engineering Village, and the Association for Computing Machinery. The full search strategy for MEDLINE can be found in Supplementary material 1. Additionally, we searched ClinicalTrials.gov for study protocols, and BioRxiv and MedRxiv for preprint articles. We extended our systematic search to IBM and to the first 200 studies in Google Scholar, ranked by relevance, for studies that met our inclusion criteria. Snowballing was used to identify studies that may have been missed during the search process.
Studies were considered for inclusion if they provided evidence on the accuracy and utility of machine learning models for analysing histopathology or cytology images to aid in early lung cancer diagnosis. We did not impose restrictions in relation to study type, as we anticipated that models would be at various stages of development and study type was one of our outcomes. Studies were excluded if they described the development of AI algorithms without any validation of their effectiveness; evaluated models unrelated to histopathology or cytology; evaluated models not designed for lung cancer diagnosis; did not use machine learning models; only validated models internally; were review articles; or were conference abstracts with insufficient information (e.g. lack of clarity on whether validation was internal or external).
Data extraction, analysis and synthesis
After removing duplicates, two authors (SA and MK or MG) independently performed title and abstract screening as well as full-text screening to identify studies that met the inclusion criteria. Any conflicts were resolved independently by a third reviewer (JO). All stages of screening were conducted using Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia). Data extraction was performed independently by at least two reviewers (SA and MK or MG) using a pre-designed data extraction template, which was included in the published protocol.
The extracted data included details related to the type of study used to validate models, AI task, model performance where available, evaluation metrics used, validation dataset size, number of datasets, technical diversity within datasets and dataset source (public or restricted). Public datasets are openly accessible to the public via online repositories, whereas restricted datasets are not openly accessible to the public and may require authorised access49. We additionally aimed to collect information on intended clinical setting, validation setting, and reporting. All authors of the included studies were contacted for missing information. Due to considerable heterogeneity in AI task, evaluation metrics used, unit of analysis and reporting, we used a narrative synthesis approach rather than a meta-analysis.
Quality assessment
We conducted a quality assessment to further investigate the methodological robustness of validation studies. We modified the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool to better reflect specifics of AI and digital pathology50. Signalling questions were adapted with input from experts in AI and digital pathology (QUADAS-AI-P, see supplementary material 2). All five domains of QUADAS-AI-P are concerned with the external validation phase. Two authors (SA and MK or MG) independently conducted a quality assessment using the QUADAS-AI-P tool.
Data availability
The data that support the findings of this study are available from the corresponding publications and from the authors upon reasonable request.
References
Kim, Y. J., Roh, E. H. & Park, S. A literature review of quality, costs, process-associated with digital pathology. J. Exerc. Rehabil. 17, 11–14 (2021).
Gourd, E. Lung cancer control in the UK hit badly by COVID-19 pandemic. Lancet Oncol. 21, 1559 (2020).
Cancer Research UK. Survival for Lung Cancer. https://www.cancerresearchuk.org/about-cancer/lung-cancer/survival (2022).
Lam, S. et al. Current and future perspectives on computed tomography screening for lung cancer: a roadmap from 2023 to 2027 from the International Association for the study of lung cancer. J. Thorac. Oncol. 19, 36–51 (2024).
The Royal College of Pathologists. Meeting Pathology Demand: Histopathology Workforce Census. https://www.rcpath.org/static/952a934d-2ec3-48c9-a8e6e00fcdca700f/Meeting-Pathology-Demand-Histopathology-Workforce-Census-2018.pdf (2018).
Vos, S. et al. Making pathologists ready for the new artificial intelligence era: changes in required competencies. Mod. Pathol. 38, 100657 (2025).
Serag, A. et al. Translational AI and deep learning in diagnostic pathology. Front. Med. 6, 185 (2019).
Ning, J. et al. Early diagnosis of lung cancer: which is the optimal choice? Aging 13, 6214–6227 (2021).
Scott, I. A. & Zuccon, G. The new paradigm in machine learning—foundation models, large language models and beyond: a primer for physicians. Intern. Med. J. 54, 705–715 (2024).
Eloy, C. et al. Artificial intelligence-assisted cancer diagnosis improves the efficiency of pathologists in prostatic biopsies. Virchows Arch. 482, 595–604 (2023).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Tsopra, R. et al. A framework for validating AI in precision medicine: considerations from the European ITFoC consortium. BMC Med. Inform. Decis. Mak. 21, 274 (2021).
Davri, A. et al. Deep learning for lung cancer diagnosis, prognosis and prediction using histological and cytological images: a systematic review. Cancers 15, 3981 (2023).
Prabhu, S., Prasad, K., Robels-Kelly, A. & Lu, X. AI-based carcinoma detection and classification using histopathological images: a systematic review. Comput. Biol. Med. 142, 105209 (2022).
McGenity, C. et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. npj Digit. Med. 7, 114 (2024).
Vorontsov, E. et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 30, 2924–2935 (2024).
Bilaloglu, S. et al. Efficient pan-cancer whole-slide image classification and outlier detection using convolutional neural networks. bioRxiv, 633123. https://doi.org/10.1101/633123 (2019).
Cao, L. et al. E2EFP-MIL: End-to-end and high-generalizability weakly supervised deep convolutional network for lung cancer classification from whole slide image. Med. Image Anal. 88, 102837 (2023).
Chen, Y. et al. A whole-slide image (WSI)-based immunohistochemical feature prediction system improves the subtyping of lung cancer. Lung Cancer 165, 18–27 (2022).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Hari, S. N. et al. Examining batch effect in histopathology as a distributionally robust optimization problem. bioRxiv. https://doi.org/10.1101/2021.09.14.460365 (2021).
Kanavati, F. et al. A deep learning model for the classification of indeterminate lung carcinoma in biopsy whole slide images. Sci. Rep. 11, 8110 (2021).
Le Page, A. L. et al. Using a convolutional neural network for classification of squamous and non-squamous non-small cell lung cancer based on diagnostic histopathology HES images. Sci. Rep. 11, 23912 (2021).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Mukashyaka, P., Sheridan, T. B., Foroughi, P. A. & Chuang, J. H. SAMPLER: unsupervised representations for rapid analysis of whole slide tissue images. eBioMedicine 99, 104908 (2024).
Noorbakhsh, J. et al. Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. Nat. Commun. 11, 6367 (2020).
Quiros, A. et al. Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unannotated pathology slides. Nat. Commun. 15, 4596 (2024).
Wang, S. et al. Deep learning of cell spatial organizations identifies clinically relevant insights in tissue images. Nat. Commun. 14, 7872 (2023).
Yang, H. et al. Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study. BMC Med. 19, 80 (2021).
Yu, K. H. et al. Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J. Am. Med. Inform. Assoc. 27, 757–769 (2020).
Sharma, R., Kumar, S., Shrivastava, A. & Bhatt, T. Optimizing knowledge transfer in sequential models: leveraging residual connections in flow transfer learning for lung cancer classification. 14th Indian Conference on Computer Vision, Graphics and Image Processing. https://doi.org/10.1145/3627631.3627663 (2024).
Borras Ferris, L. et al. A full pipeline to analyze lung histopathology images. SPIE—Progress in Biomedical Optics and Imaging. 12933. https://doi.org/10.1117/12.3006708 (2024).
Gertych, A. et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci. Rep. 9, 1483 (2019).
Kanavati, F. et al. Weakly-supervised learning for lung carcinoma classification using deep learning. Sci. Rep. 10, 9297 (2020).
Sakamoto, T. et al. A collaborative workflow between pathologists and deep learning for the evaluation of tumour cellularity in lung adenocarcinoma. Histopathology 81, 758–769 (2022).
Swiderska-Chadaj, Z. et al. A deep learning approach to assess the predominant tumor growth pattern in whole-slide images of lung adenocarcinoma. Medical Imaging: Digital Pathology. https://doi.org/10.1117/12.2549742 (2020).
Wang, S. et al. ConvPath: A software tool for lung adenocarcinoma digital pathological image analysis aided by a convolutional neural network. EBioMedicine 50, 103–110 (2019).
Sounderajah, V. et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat. Med. 27, 1663–1665 (2021).
Usher-Smith, J. A., Sharp, S. J. & Griffin, S. J. The spectrum effect in tests for risk prediction, screening, and diagnosis. BMJ 353, i3139 (2016).
Zeng, H. et al. Racial disparities in histological subtype, stage, tumor grade and cancer-specific survival in lung cancer. Transl. Lung Cancer Res. 11, 1348–1358 (2022).
Duncan, F. C. et al. Racial disparities in staging, treatment, and mortality in non-small cell lung cancer. Transl. Lung Cancer Res. 13, 76–94 (2024).
Rajput, D., Wang, W.-J. & Chen, C.-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinform. 24, 48 (2023).
Wang, S. et al. Deep learning of cell spatial organizations identifies clinically relevant insights in tissue images. Nat. Commun. 14. https://doi.org/10.1038/s41467-023-43172-8 (2023).
Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann. Intern. Med. 169, 467–473 (2018).
Wang, X., Zheng, K. & Hao, Z. In-depth analysis of immune cell landscapes reveals differences between lung adenocarcinoma and lung squamous cell carcinoma. Front. Oncol. 14, 1338634 (2024).
Lewis, W. E. et al. Efficacy of targeted inhibitors in metastatic lung squamous cell carcinoma with EGFR or ALK alterations. JTO Clin. Res. Rep. 2, 100237 (2021).
Cancer Research UK. Treatment Options for Non Small Cell Lung Cancer (NSCLC). https://www.cancerresearchuk.org/about-cancer/lung-cancer/treatment/non-small-cell-lung-cancer (2023).
Vo, V. et al. Multi-stakeholder preferences for the use of artificial intelligence in healthcare: a systematic review and thematic analysis. Soc. Sci. Med. 338, 116357 (2023).
Marée, R. Open practices and resources for collaborative digital pathology. Front. Med. 6, 255 (2019).
Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011).
Flahault, A., Cadilhac, M. & Thomas, G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J. Clin. Epidemiol. 58, 859–862 (2005).
Snell, K. I. E. et al. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. J. Clin. Epidemiol. 135, 79–89 (2021).
Cancer Research UK. Lung Cancer Incidence Statistics. https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/lung-cancer/incidence#heading-Zero (2024).
Acknowledgements
The authors would like to thank Dr Fayyaz Minhas (University of Warwick) for his help in adapting the quality assessment tool. J.O. and S.A. disclose support for the research of this work from Barts Charity (grant number: G-001522, MGU0461). O.B. declares funding from Barts Charity (grant number: G-001522). The funding source had no role in the study design, data collection, data analysis, data interpretation, or in manuscript writing. The views expressed in this manuscript are those of the authors and may not necessarily reflect those of the funding source.
Author information
Authors and Affiliations
Contributions
This study was initially conceived by J.O. and J.L.R., and further developed by R.G., O.B., D.M. and S.A.. J.O. acquired funding for this project. S.A. developed the protocol, screened articles for inclusion, extracted and synthesised data, conducted the quality assessment, interpreted the results and drafted the manuscript. M.K. and M.G. screened articles for inclusion, extracted data and conducted the quality assessment. J.L.R. assisted in the development of the protocol and assisted in the tailoring of the quality assessment tool. O.B. assisted in the development of the protocol. D.M. provided methodological support, assisted in the development of the protocol and supervised the screening process. J.O. supervised the project, assisted in the development of the protocol and resolved conflicts at the screening stage. The raw data were collected and verified by S.A., M.K. and M.G. All authors reviewed the manuscript, had access to the data presented in the manuscript, and approved the final version.
Corresponding author
Ethics declarations
Competing interests
Author J.O. has previously acted as a paid consultant for Hardian Health but declares no non-financial competing interests. All other authors declare no financial or non-financial competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Arun, S., Grosheva, M., Kosenko, M. et al. Systematic scoping review of external validation studies of AI pathology models for lung cancer diagnosis. npj Precis. Onc. 9, 166 (2025). https://doi.org/10.1038/s41698-025-00940-7