Abstract
Colorectal cancer (CRC) is the second leading cause of cancer death in the United States (US). Rural Appalachia suffers the highest CRC incidence and mortality rates. Several non-clinical, health-related factors, known as the social determinants of health (SDOH), are associated with cancer mortality. This study describes novel predictive modeling that uses demographic, clinical, and SDOH features from health records data from Appalachian community cancer centers to predict 5-year CRC survival. We trained, validated, and tested four gradient-boosted tree ensemble (XGBoost) machine learning (ML) models, each developed using a selected combination of available features. The area under the receiver operating characteristic curve was greatest in the model that included SDOH features with demographic and clinical features (0.79; P < 0.0001). Feature stratification showed rurality as the top SDOH feature. These results demonstrate that the ML model performs better when SDOH features are included, and that rurality significantly impacts CRC survival in Appalachia. The study provides preliminary indications that further data collection and evaluation of SDOH factors would strengthen our understanding of their impact on cancer survival in Appalachia and other underserved populations and improve development of strategies for care delivery.
Introduction
Cancer remains a leading cause of death in the United States (US), with colorectal cancer contributing significantly to cancer mortality. Colorectal cancer (CRC) is the second leading cause of cancer death in the US, accounting for 7.6% of all new cancer cases and 10% of cancer deaths1,2,3. Although overall CRC mortality rates have decreased by 8.2% in recent decades, CRC deaths continue to increase substantially within certain demographic groups, including African Americans, persons aged 40 to 54, and rural populations4. CRC is more prevalent in Appalachia than in any other US geographic region, with rural Appalachians suffering the highest burden of CRC deaths in the nation5,6. Rural Appalachians experience higher incidence of early onset CRC and more frequent occurrence of late-stage diagnosis and delayed treatment. Further, overall CRC mortality is 32% higher in rural Appalachia than in any other US population by region7,8,9. This worsening disparity highlights a critical need for targeted measures to identify and address factors contributing to high CRC mortality in Appalachia and similarly affected rural populations.
It is well known that non-clinical social, environmental, and lifestyle factors, referred to as the social determinants of health (SDOH), affect CRC risk, morbidity, and survival. Previous studies identify poverty, lack of education, lack of social support, along with social and geographic isolation, as SDOH factors that play an important role in CRC stage at diagnosis and survival10. Comparably, CRC disparity in rural Appalachia has been attributed to higher poverty and unemployment, geographic rurality, and high uninsured rates, as well as lower educational attainment and limited healthcare access in this region11. Despite this knowledge, little has been done to adequately quantify and address the impact of non-clinical, health related factors on cancer disparities.
The prediction of CRC mortality risk is a critical measure within cancer treatment planning and resource allocation. Risk scores can be used to identify individuals who are likely to live long enough to benefit from adjuvant treatment. Machine learning (ML) has been validated for its utility in prediction of clinical disease risk and outcomes, yet several challenges continue to limit these models' ability to predict CRC survival12,13,14. With ML methods, feature selection plays a particularly important role in improving prediction model performance15,16. In this context, cancer prediction models primarily use clinical predictors (e.g., cancer stage, tumor grade, and number of positive lymph nodes) to classify patients and assist in treatment decision making. Review of previous CRC risk prediction models demonstrated suboptimal model performance when using only clinical features17,18,19.
Many ML-based prediction models also rely on data from national public datasets, and feature availability and demographic diversity are often limited in these datasets20,21. In a pilot study, we used National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) Program data to explore bias when using common demographic and clinical features to predict 5-year CRC survival between Appalachian and non-Appalachian patients. Although bias was not detected, overall ML model performance was significantly lower in the Appalachian vs. non-Appalachian test population, and SHapley Additive exPlanations (SHAP) analysis revealed that marital status was a top feature contributing to the performance of the model22. This suggested that other factors, beyond common clinical predictors, could improve ML model predictability and demonstrate the effect of non-clinical SDOH factors on CRC survival10. Several other studies have evaluated the performance and challenges associated with ML-derived colorectal cancer prediction models using numerous clinical data features; however, evaluation of the effects of SDOH features on CRC survival prediction is largely absent in the current literature13,23,24,25,26,27.
The inclusion of SDOH data variables as ML model input features could help identify and quantify the impact of non-clinical factors contributing to high CRC mortality in Appalachia and potentially improve CRC survival prediction in rural and other disparately affected populations. This knowledge would strengthen our understanding of the impact of SDOH on cancer survival and improve the development of strategies for care delivery in Appalachia and other underserved populations. Accordingly, the purpose of the present study was to determine whether the addition of SDOH features to ML modeling improves CRC survival prediction in the Appalachian population. We hypothesized that the addition of SDOH features would improve ML-based prediction of CRC mortality and help to evaluate the social and environmental factors in Appalachia that impact CRC outcomes.
Materials and methods
Figure 1 shows the complete workflow diagram for this study. First, de-identified data was pre-processed for missing value handling and data encoding to convert variables from categorical to numerical values. Feature selection approaches were used to extract the most important features from the dataset. The aggregate dataset was divided into 70% for training, 20% for testing, and the remaining 10% of the data for model validation purposes. The gradient-boosted tree ensemble machine-learning technique was used to train, test, and validate the four models, after which the voting classifier was applied. The confusion matrix was used to assess model effectiveness. Finally, models were evaluated comparatively to determine the best model.
Dataset description
This retrospective study utilized de-identified electronic health records (EHR) data from community cancer care centers located in Appalachia: (1) St. Elizabeth Healthcare in Edgewood, KY; (2) Pikeville Medical Center in Pikeville, KY; and (3) Thompson Cancer Survival Center (TCSC) based in Knoxville, TN. Combined, the datasets included 25 attributes from 7,718 adults aged 18 years and older across the three cancer care centers. Only patients who were 18 or older and were diagnosed with malignant colon and/or rectosigmoid cancer between 2000 and 2017 were included in this study. Patients under 18 years of age were removed from the dataset. All datasets included demographic, clinical, and SDOH patient features. St. Elizabeth Healthcare and Pikeville Medical Center data was extracted from the local EHR system. Data from TCSC collected between 2000 and 2009 was captured by research staff from paper records and transcribed into .csv files. TCSC data from 2010–2017 was extracted from the local EHR system. All data was de-identified by individual healthcare sites prior to delivery, shared by secure data transfer, and stored on password-protected cloud-based storage. The use of de-identified retrospective data is classified as a non-human subject study and IRB approval is not required. The need for informed consent was deemed unnecessary according to the HIPAA Privacy Rule. To ensure ethical conduct, this study was conducted under approval by Western-Copernicus Group Institutional Review Board (WCG® IRB) protocol no. 20,223,670, consistent with 45 Code of Federal Regulations 46.102. All methods were performed in accordance with the relevant guidelines and regulations.
CRC 5-year survival vs. death metrics
The EHR data included a variable of "survival months" which was used to report mortality based upon the length of time a patient survived in months relative to their initial diagnosis of cancer of the colon and/or rectum. The gold standard 5-year cancer survival metric was applied in this study, as the ML model was developed to classify individuals as surviving < 60 months or ≥ 60 months (5 years) post-CRC diagnosis28. The ML prediction model is a binary classifier, where the "negative" class represents patients who have < 60 survival months post-CRC diagnosis, and the "positive" class represents patients who have ≥ 60 survival months reported in the EHR.
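The labeling rule described above can be illustrated with a minimal Python sketch. The function name `five_year_label` and the handling of missing values are assumptions made for illustration, not the study's actual code; only the 60-month threshold and the class definitions come from the text.

```python
from typing import Optional

# 5-year threshold in months, as stated in the text.
SURVIVAL_THRESHOLD_MONTHS = 60


def five_year_label(survival_months: Optional[float]) -> Optional[int]:
    """Map survival months to the binary target.

    Returns 1 (positive class: survived >= 60 months),
    0 (negative class: survived < 60 months),
    or None when survival months are missing (hypothetical convention).
    """
    if survival_months is None:
        return None
    return 1 if survival_months >= SURVIVAL_THRESHOLD_MONTHS else 0


# Illustrative values only, not study data.
labels = [five_year_label(m) for m in (12, 59.9, 60, 84, None)]
print(labels)  # [0, 0, 1, 1, None]
```

Note that a patient with exactly 60 survival months falls in the positive class, matching the "≥ 60 months" definition.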
Data preprocessing and feature selection
Datasets underwent preprocessing before being fitted into ML models to address missingness, data entry errors, outliers, variable labeling, and encoding. Data preprocessing involved converting all categorical data variables into numeric values, merging the three datasets from individual healthcare sites into one aggregate dataset, and data cleaning to address typographical errors and missing data. Listwise deletion was used to handle missing data.
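As a rough illustration of two of the preprocessing steps named above (listwise deletion and categorical-to-numeric encoding), the following standard-library Python sketch may help. The helper names, the record layout, and the sorted integer coding scheme are all assumptions for illustration; the study does not specify its encoding scheme.

```python
def listwise_delete(records, required):
    """Keep only records with a non-missing value for every required field.

    Missing is taken here to mean None or an empty string (an assumption).
    """
    missing = (None, "")
    return [r for r in records if all(r.get(f) not in missing for f in required)]


def encode_categorical(records, field):
    """Replace each category in `field` with an integer code.

    Categories are sorted first so the mapping is reproducible.
    Returns the encoded records and the category-to-code mapping.
    """
    categories = sorted({r[field] for r in records})
    codes = {c: i for i, c in enumerate(categories)}
    return [{**r, field: codes[r[field]]} for r in records], codes


# Hypothetical records, not study data.
records = [
    {"age": 67, "sex": "F", "rurality": "rural"},
    {"age": 54, "sex": "M", "rurality": None},       # dropped: missing rurality
    {"age": 71, "sex": "M", "rurality": "non-rural"},
]
complete = listwise_delete(records, ["age", "sex", "rurality"])
encoded, codes = encode_categorical(complete, "rurality")
print(codes)  # {'non-rural': 0, 'rural': 1}
```

Listwise deletion is simple but discards every record with any missing required field, which is one reason a dataset can shrink noticeably after cleaning.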
Feature selection was employed to eliminate redundancy, identify the most critical EHR data variables, and increase the ML model's accuracy in its prediction ability. The Extreme Gradient Boosting (XGBoost) ML algorithm complemented by SHAP plots was utilized to determine which features were most important for this investigation. There were 25 features available in the dataset. The raw data collected from EHRs included a variety of information about the patients including their demographic and clinical information related to their CRC diagnosis and select SDOH variables. The main target variable for the study was Survival months. Demographic features were Age at diagnosis, Sex, Race, Ethnicity, State of residence, County, and Zip (3-digit). The SDOH features that were available for analysis included Marital status, Geographic classification (rural/non-rural), Employment status, and Insurance status. All available features and corresponding model inclusion are shown in Table 1.
Marital status data was missing in TCSC data and was therefore handled by imputation. Studies suggest that two-stage methods (imputation method and classifier combination)29,30,31 offer optimal configuration for a particular dataset. IterativeImputer32 is a Python-based multivariate statistical method for handling missing data by means of accurate imputations, specifically when there are reasonably strong correlations between features. We used linear regression with 300 estimators and a scaled tolerance of ~2.5%. Convergence took fewer than 20 iterations. We did not find any significant differences in model accuracies between iterative robust imputation and arbitrary (fixed value for missing data) imputation. Weak correlation between imputed features or data imbalance between known and unknown data (e.g., TCSC didn't collect Marital Status for their patients) may be potential reasons for this observation. A comparative table showing model accuracies by imputation method is included as supplemental material.
Training and validation
The dataset was randomly divided into training, testing, and validation sets: about 70% for training, 20% for testing, and the remaining 10% for model validation. The ML prediction model was trained using the input features described above. For this study, we used XGBoost with hyperparameter optimization. XGBoost is a gradient-boosted tree ensemble method of ML which combines the estimates of simpler, weaker models to make predictions for a chosen target. The main principle of gradient boosting is that each new model is fitted to reduce the errors of the prior models, so predictive performance improves as the process repeats. Specifically, the Gradient Boosting classifier is employed for classification tasks33,34. Recent studies have demonstrated that gradient-boosted tree algorithms yield high accuracy for both acute and chronic prediction tasks, and that XGBoost performs better than other ML models on tabular data35. One of the benefits of using XGBoost is that it can implicitly handle a certain level of missingness in the data by accounting for missingness during the training process36. Before training the ML prediction model, hyperparameter optimization was performed on the training dataset using a 5-fold cross-validation grid search, with the area under the receiver operating characteristic (AUROC) curve evaluated for each combination of hyperparameters included in the grid. The optimized hyperparameters were the maximum tree depth, pseudo-regularization parameter, minimum sum of weights of observations, fraction of observations to subsample at each step, and fraction of features to use for building each tree, node, and level of depth in the tree.
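The 70:20:10 random split described above can be sketched in standard-library Python as follows. This is a minimal sketch under stated assumptions: the function name `split_70_20_10`, the fixed seed, and the use of a plain (non-stratified) shuffle are illustrative choices, since the text does not say how the random division was implemented.

```python
import random


def split_70_20_10(records, seed=0):
    """Randomly split records into train (70%), test (20%), validation (rest).

    Integer arithmetic is used for the cut points so the sizes are exact;
    any remainder from rounding goes to the validation set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 70 // 100
    n_test = n * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Hypothetical record IDs standing in for patient rows.
train, test, validation = split_70_20_10(list(range(1000)))
print(len(train), len(test), len(validation))  # 700 200 100
```

Because the three slices are taken from one shuffled list, no record can appear in more than one set, which is what keeps the hold-out test data unseen during training.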
ML model performance metrics
The AUROC was used as the primary metric to evaluate and compare the overall performance of the ML prediction models37. Other metrics used to evaluate the performance of the ML prediction model for 5-year survival post-diagnosis were calculated as follows based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values38,39. Sensitivity indicates the model's ability to correctly identify patients who survive at least 5 years after CRC diagnosis: Sensitivity = TP / (TP + FN).
Specificity assesses how well the model correctly identifies patients who will not survive: Specificity = TN / (TN + FP).
Precision, or Positive Predictive Value (PPV), measures the proportion of predicted positive cases (CRC survivors) that were correctly predicted by the model: PPV = TP / (TP + FP).
Conversely, the Negative Predictive Value (NPV) measures the proportion of predicted negative cases (CRC deaths) that were correctly predicted by the model: NPV = TN / (TN + FN).
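For readers who want to reproduce these four metrics from a confusion matrix, a minimal Python sketch follows. It uses the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts.

    Positive class = survived >= 5 years; negative class = died within 5 years.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true survivors correctly identified
        "specificity": tn / (tn + fp),  # true deaths correctly identified
        "ppv": tp / (tp + fp),          # predicted survivors that were correct
        "npv": tn / (tn + fn),          # predicted deaths that were correct
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are conditioned on the true class, whereas PPV and NPV are conditioned on the predicted class, so the latter two shift with the proportion of survivors in the test set.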
In addition, 95% confidence intervals (CIs) were computed for each metric as described in the Statistics section below. SHapley Additive exPlanations (SHAP) analysis was performed to evaluate the importance of each feature for generating model output40. The SHAP analysis ranks features by their importance to model predictions from top to bottom.
Impact of SDOH features on ML model performance
The merged dataset consisting of EHR data from St. Elizabeth Healthcare, Pikeville Medical Center, and TCSC (Table 2) was used to determine the additive effects of SDOH features on performance of the ML model in predicting mortality in Appalachian colorectal cancer patients. Due to the unavailability of cause of death data in one of the hospital's EHRs, the ML prediction model classified patients as surviving < 5 or ≥ 5 years irrespective of whether the cause of death was CRC-related or due to other factors. To test our hypothesis, we developed 4 ML models with the following feature inputs: Model 1: demographic and clinical features; Model 2: demographic and SDOH features; Model 3: demographic, clinical, and SDOH features; Model 4: clinical and SDOH features (Table 1). The dataset was randomly divided into training, hold-out test, and validation datasets utilizing a 70:20:10 split, and the hold-out test dataset was never exposed to the model during training. To evaluate algorithmic fairness, we compared XGBoost with two other methods, logistic regression and K-nearest neighbors. XGBoost delivered superior accuracies compared to both logistic regression and K-nearest neighbors (with number of neighbors = 5). A comparative table showing predictive model accuracies by technique for each of the four data models is included as supplemental material.
Statistical analysis
The confidence intervals (CIs) for AUROC were calculated using a bootstrapping method in which patients from the hold-out test dataset were randomly sampled with replacement and the AUROC was calculated using the data from those patients; this was repeated 1000 times41. From these bootstrapped AUROC values, the middle 95% range was selected as the 95% CI for the AUROC. As the sample size of the hold-out test dataset was sufficiently large, the CIs for other metrics were calculated using normal approximation42. Because demographic and clinical features are traditionally used for mortality/survival prediction, Model 1 (demographic + clinical features) was considered our control ML model. The difference between the metrics of Models 2, 3, and 4 as compared to Model 1 was determined using a two-sided t-test at the 95% confidence level (Fig. 2).
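The two interval methods described above can be sketched in standard-library Python. This is an illustrative sketch, not the study's code: the function names, the choice to resample as many patients as the test set contains, the redraw when a resample lacks one class, and the z value of 1.96 are all assumptions. The AUROC is computed from its pairwise-ranking definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half).

```python
import math
import random


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: the middle (1 - alpha) range of resampled AUROCs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class; draw again (assumption)
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def normal_approx_ci(p, n, z=1.96):
    """Normal-approximation (Wald) CI for a proportion such as sensitivity."""
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


# Hypothetical labels and predicted probabilities, not study data.
y = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5, 0.6, 0.3]
print(auroc(y, s), bootstrap_auroc_ci(y, s), normal_approx_ci(0.5, 100))
```

The pairwise AUROC is quadratic in the number of cases, which is fine for a sketch; a rank-based implementation would be preferable for a test set of the size used in the study.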
Impact of SDOH features on AUROC when predicting CRC in Appalachians. The machine learning algorithm performance was greatest in Model 3 (0.790). All models performed better than the baseline. Model 1: demographics + clinical features. Model 2: demographics + SDOH features. Model 3: demographics + clinical + SDOH features. Model 4: clinical + SDOH features.
Results
Patient population and characteristics
Demographics, clinical characteristics, and SDOH variables for colorectal cancer patients from each site are shown in Table 1. Of the 7,718 patients included in analysis, mean age at diagnosis was ~67 years and ~50% of the patients were female. Nearly all patients were white and non-Hispanic, and approximately 55% were married. Most patients in the St. Elizabeth dataset were non-rural, whereas the majority of patients in Pikeville and TCSC datasets were rural. Additional information regarding cancer primary site, summary stage, and grade as well as employment and health insurance are included in Table 1. Due to missing information on survival status (the target feature), 14 patient records were removed from the dataset before analysis.
ML model performance
The AUROC curves for the hold-out test datasets for the 4 ML prediction models along with the baseline are shown in Fig. 2. Overall, the ML prediction models achieved a range of classification performance for patients, with an AUROC of 0.780 (95% CI = 0.757–0.801) for Model 1 (demographic and clinical features), 0.627 (95% CI = 0.598–0.655) for Model 2 (demographic and SDOH features), 0.790 (95% CI = 0.767–0.812) for Model 3 (demographic, clinical, and SDOH features), and 0.771 (95% CI = 0.748–0.794) for Model 4 (clinical and SDOH features). The strongest performance was observed in Model 3, which was significantly greater than Model 1 (P < 0.001).
The complete list of performance metrics for the ML prediction models is shown in Table 3, and a confusion matrix is shown in Fig. 3 to provide a visual representation of the ML prediction model's output for the hold-out test groups. For all metrics, including true positive rate (TPR), Model 3 demonstrated the greatest performance of the 4 ML-based prediction models (Fig. 4).
Confusion matrix provides a visual representation of the machine learning prediction model's output for the hold-out test datasets in each model. Top left: true negative (TN); Top right: false positive (FP); Bottom left: false negative (FN); Bottom right: true positive (TP); Y-axis: true value; X-axis: predicted value. 0 = died; 1 = survived; SDOH = social determinants of health.
Feature importance
A SHAP analysis was used to evaluate the importance of individual input features in generating each model's predictions, ranked top to bottom in decreasing order of importance. The top features for Model 1 (demographic and clinical features) included age at diagnosis, cancer summary stage, year of diagnosis, and cancer grade. The importance of SDOH features for Models 2–4 varied depending on which other variables (demographic or clinical) were included in the model. In Model 2, which included demographic and SDOH variables, the most important SDOH features were rurality, employment status, and marital status. More importantly, in Model 3, which included all demographic, clinical, and SDOH variables, rurality was in the top 3 features and had a greater impact on the model's performance than traditional demographic and clinical features. Rurality was also in the top 3 features in Model 4 (Fig. 4). Among demographic variables, race and ethnicity were the lowest impact features across Models 1, 2, and 3. Primary tumor site and radiotherapy were consistently the lowest impact clinical features across all models as well.
Discussion
This study is one of the first to determine the impact of SDOH variables in predictive modeling of CRC mortality/survivorship in an underserved population with CRC disparity. We developed a gradient-boosted tree ensemble (XGBoost) ML model that demonstrated good classification performance for survival prediction in Appalachian CRC patients using local demographic and clinical EHR data, as demonstrated by an AUROC of 0.780 and TPR of 0.769 (Model 1). The addition of SDOH variables to the model resulted in a modest but significant improvement in AUROC and TPR (Model 3). Although building the best-performing model was not the primary goal of this study, we were able to develop a strong ML-based model to classify CRC patient mortality using simple demographic, clinical, and SDOH data as feature inputs. These findings align with other ML-based studies designed to assess CRC disparity43,44. High CRC incidence and mortality in rural Appalachians has been attributed to several SDOH factors, including higher poverty, uninsured rates, and unemployment, and healthcare access limited by geographic location in this region6,8,11. In the present study, we evaluated the impact of SDOH on CRC survival in Appalachia by measuring the impact of SDOH features on ML-based CRC survival prediction. We utilized a representative, locally sourced EHR dataset from Appalachian cancer care centers to demonstrate the impact of SDOH features in this population by building several ML models which included SDOH features in various combinations, with and without traditional demographic and clinical predictors. Our collective findings provide insight into the key features that contribute to ML-based prediction of CRC survival in this unique population. First, using easily accessible and traditional demographic and clinical features (see Table 1), the ML model achieved a strong performance (Model 1; Fig. 2; Table 3).
Second, the combination of demographic and SDOH variables without clinical variables had the weakest performance (Model 2); however, the combination of clinical and SDOH features without demographic variables still resulted in a robust performance (Model 4). These data, along with SHAP values (Fig. 4), clearly indicate that clinical features are highly important in CRC mortality, as expected. Most importantly, the addition of SDOH features to both demographic and clinical features resulted in the strongest performing ML model and significantly improved CRC survival prediction (Model 3). Additionally, SHAP analysis for this model indicates that rurality was in the top 3 features contributing to model performance. Further, health insurance status (6th) and employment status (9th) also were within the top features.
These latter findings indicate that non-clinical SDOH factors such as poverty, high uninsured rates, and unemployment may impact high cancer-related mortality seen in rural populations, and these SDOH factors should be considered when using ML or other methods to address disparities. These findings show that SDOH variables impact CRC patient survivorship and that addressing SDOH factors may improve outcomes and quality of life of CRC patients, especially those in medically underserved populations unequally burdened by SDOH factors like those seen in rural Appalachia. Allocation of healthcare system resources towards modifiable SDOH factors would be a viable strategy to reduce cancer deaths. For example, expanding Medicaid, providing transportation or mobile health clinics, and providing supplemental food or housing resources to unemployed cancer patients could further improve CRC survival within these and other underserved populations. Our findings align with non-ML based studies designed to assess the impact of SDOH factors on CRC survival45,46,47. Further quantification of SDOH on CRC survival may contribute to the development of adjuvant SDOH-based policies and treatment interventions that improve cancer survivorship outcomes.
Several other studies have evaluated the performance and challenges associated with ML derived CRC prediction models. Burnett et al. reviewed the strengths and weaknesses of CRC risk-prediction models that used routinely collected health registry data13. They found that tree-based models outperformed other models and were more interpretable. Rahman et al. evaluated prediction of CRC using ML algorithms in large datasets of global dietary data48. Their findings demonstrate the importance of using non-clinical data features for early detection of CRC in a large dataset that includes younger and older adults. Ting et al. developed a ML-based model to predict CRC tumor recurrence in survivors27. They identified drinking behavior as an important predictive SDOH feature which should be monitored to improve CRC treatment strategies. Our findings fill in gaps in the current literature by demonstrating the impact of non-clinical SDOH features on CRC survival in an underserved, disparately affected population.
It is also of note that this study utilized EHR datasets directly acquired through community partnerships established within the target population, while many ML-based studies commonly use public datasets. It is known that selection, representation, and evaluation biases may result when using public datasets to generate real-world evidence, therefore we utilized EHR data provided directly from Appalachian community cancer care centers to minimize these risks49,50.
Study limitations
There are limitations that should be considered. First, the dataset was missing cause of death information from one hospital. This information may have been used to better relate patient mortality to the CRC diagnosis. As a result, our analysis evaluated all-cause mortality, which may not adequately capture the direct relationship between SDOH and CRC mortality. However, this can be reevaluated in future studies. Secondly, our data processing methods did not address the effects of patient age as a continuous variable within the model, which contributes to its high feature importance and wide distribution on SHAP analysis. Regression algorithms are commonly applied to examine relationships amongst continuous variables; however, this study identified XGBoost as the best-performing algorithm, as all other variables, aside from age at diagnosis, were categorical. This can be addressed in future studies by recategorizing continuous age variables into age groups during preprocessing. Thirdly, in this study we were unable to evaluate the effects of SDOH factors related to low healthcare access and transportation, which are prevalent in rural Appalachia. A patient's mode of transportation is not commonly accounted for in EHRs. Available zip code and county data can be used to evaluate this in the future. Further, we did not obtain treatment and diagnosis dates in this dataset to measure the length of time in days between primary CRC diagnosis and initiation of treatment. This information would allow for evaluation of cancer treatment delays, which can be related to SDOH factors in disparately affected populations like rural Appalachia. Fourth, EHRs often have quality issues (e.g., missingness, misclassification, measurement error) and the data is largely unstructured (e.g., diagnostic notes, patient experience, etc.). Our dataset is small, with about 7700 records after combining three hospital datasets with the same features.
The achieved accuracy of ~0.79 is superior to that of the linear regression models for binary classification we explored as a part of this study. We believe that the accuracy can be further improved by collecting additional patient records and addressing the missing data (e.g., marital status in TCSC, transportation in St. Elizabeth, etc.). Fifth, our sample was primarily White and non-Hispanic, which mirrors the rural US population, which is 76% non-Hispanic White. The lack of racial-ethnic diversity within the sample could affect the reliability of this specific predictive model if used in more diverse, non-rural populations. As currently designed, this ML model would best be utilized only in rural populations or those with similar cancer mortality rates, similar poverty, uninsured, and unemployment rates, as well as similar racial-ethnic distributions. However, it is important to note that the ML methods applied in this study are broadly applicable. Lastly, this study was limited by the unavailability of SDOH data in specialty care EHRs. The collection of SDOH data is a more recently adopted practice which generally occurs at the primary care level. Conducting studies such as this one, which demonstrate the effects of SDOH on clinical outcomes, could lead to increased collection of SDOH data and the development of supportive SDOH policies and procedures that improve clinical outcomes related to cancer and other chronic diseases.
Conclusions
Colorectal cancer is a recognized cause of cancer-related death that unequally affects the rural Appalachian population. Although the impact of non-clinical social and environmental factors on cancer outcomes has been well described, SDOH factors are not commonly included in machine learning approaches to address CRC disparity. We used a gradient-boosting machine learning algorithm to test four different models to predict CRC survival based on EHR data which included SDOH features. Our best performing model combined demographic, clinical, and SDOH features. Feature stratification showed that rurality, insurance status, and employment status were important to model performance, with rurality being a top feature overall. Compared to previous methods, our model successfully demonstrated the impact of SDOH features in ML-based prediction of CRC survival in a dataset that well-represented the rural Appalachian population. In future research, we will gather a larger EHR dataset that covers a broader population of rural patients and includes additional SDOH variables, in order to improve model performance and more accurately quantify the effects of other SDOH features on CRC survival. Further ML-based evaluation of the impact of SDOH on cancer disparity can facilitate the establishment of best practices in community-based AI/ML research to address cancer disparities in underserved communities.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to their derivation from electronic health records, but data are available from the corresponding author on reasonable request. The code used to support the findings of the current study is not publicly available, but is available from the corresponding author on reasonable request.
References
Centers for Disease Control and Prevention. Leading Causes of Death. National Center for Health Statistics. Retrieved December 30, 2024, from https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
American Cancer Society. Colorectal Cancer Facts & Figures 2023–2025. (2023). https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/colorectal-cancer-facts-and-figures/colorectal-cancer-facts-and-figures-2023.pdf
Siegel, R. L., Giaquinto, A. N. & Jemal, A. Cancer statistics, 2024. CA: A Cancer Journal for Clinicians. 74 (1), 12–49. https://doi.org/10.3322/caac.21820 (2024).
Tan, J. Y., Yeo, Y. H., Ng, W. L., Fong, Z. V. & Brady, J. T. How have US colorectal cancer mortality trends changed in the past 20 years? Int. J. Cancer. 155 (3), 493–500. https://doi.org/10.1002/ijc.34926 (n.d.).
National Cancer Institute. Common Cancer Sites & SEER Cancer Stat Facts. Retrieved August 26, 2024, from https://seer.cancer.gov/statfacts/html/common.html
Attarabeen, O. F., Sambamoorthi, U., Larkin, K. T. & Kelly, K. M. Colon cancer worry in Appalachia. J. Community Health. 43 (1), 79–88. https://doi.org/10.1007/s10900-017-0390-z (2018).
Oelsner, W. K. et al. An emerging health disparity: early onset colorectal Cancer in Appalachia (S344). Official J. Am. Coll. Gastroenterol. | ACG 116, S148–S149 (2021).
Yao, N., Alcala, H. E., Anderson, R. & Balkrishnan, R. Cancer disparities in rural Appalachia: incidence, early detection, and survivorship. J. Rural Health. 33 (4), 375–381. https://doi.org/10.1111/jrh.12213 (2017).
Sepassi, A. et al. Rural-Urban disparities in colorectal Cancer screening, diagnosis, treatment, and survivorship care: A systematic review and Meta-Analysis. Oncologist 29 (4), e431–e446. https://doi.org/10.1093/oncolo/oyad347 (2024).
Coughlin, S. S. Social determinants of colorectal cancer risk, stage, and survival: A systematic review. Int. J. Colorectal Dis. 35 (6), 985–995. https://doi.org/10.1007/s00384-020-03585-z (2020).
Driscoll, D. L., O’Donnell, H., Patel, M. & Cattell-Gordon, D. C. Assessing and addressing the determinants of Appalachian population health: A scoping review. J. Appalach. Health. 5 (3), 85–102. https://doi.org/10.13023/jah.0503.07 (2023).
Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. Npj Digit. Med. 4 (1), 71. https://doi.org/10.1038/s41746-021-00427-2 (2021).
Burnett, B. et al. Machine learning in colorectal Cancer risk prediction from routinely collected data: A review. Diagnostics (Basel Switzerland). 13 (2). https://doi.org/10.3390/diagnostics13020301 (2023).
Alboaneen, D. et al. Predicting colorectal Cancer using machine and deep learning algorithms: challenges and opportunities. Big Data Cogn. Comput. 7 (2). https://doi.org/10.3390/bdcc7020074 (2023).
Afrash, M. R., Mirbagheri, E., Mashoufi, M. & Kazemi-Arpanahi, H. Optimizing prognostic factors of five-year survival in gastric cancer patients using feature selection techniques with machine learning algorithms: A comparative study. BMC Med. Inf. Decis. Mak. 23 (1), 54. https://doi.org/10.1186/s12911-023-02154-y (2023).
Cueto-López, N. et al. A comparative study on feature selection for a risk prediction model for colorectal cancer. Comput. Methods Programs Biomed. 177, 219–229. https://doi.org/10.1016/j.cmpb.2019.06.001 (2019).
Win, A. K., Macinnis, R. J., Hopper, J. L. & Jenkins, M. A. Risk prediction models for colorectal cancer: A review. Cancer Epidemiol. Biomarkers Prev. 21 (3), 398–410. https://doi.org/10.1158/1055-9965.EPI-11-0771 (2012).
Ester, M., Kriegel, H. P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large Spatial databases with noise. KDD 96 (34), 226–231 (1996).
Aleksandrova, K. et al. Development and validation of a lifestyle-based model for colorectal cancer risk prediction: the lifecrc score. BMC Med. 19 (1), 1. https://doi.org/10.1186/s12916-020-01826-0 (2021).
Robertson, N. M. et al. Lung and colorectal Cancer disparities in Appalachian Kentucky: Spatial analysis on the influence of education and literacy. Int. J. Environ. Res. Public Health. 20 (14). https://doi.org/10.3390/ijerph20146363 (2023).
Kennion, O., Maitland, S. & Brady, R. Machine learning as a new horizon for colorectal cancer risk prediction? A systematic review. Health Sci. Rev. 4, 100041. https://doi.org/10.1016/j.hsr.2022.100041 (2022).
Montgomery, A. et al. Evaluating Bias using gradient-boosted tree ensemble machine-learning based prediction of colorectal cancer survival in Appalachian and Non-Appalachian Patients in a national registry dataset. [Unpublished Manuscript]. (2024).
Woźniacki, A., Książek, W. & Mrowczyk, P. A novel approach for predicting the survival of colorectal Cancer patients using machine learning techniques and advanced parameter optimization methods. Cancers 16 (18). https://doi.org/10.3390/cancers16183205 (2024).
Mohammadi, G. et al. Classification and diagnostic prediction of colorectal Cancer mortality based on machine learning algorithms: A multicenter National study. Asian Pac. J. Cancer Prev. 25 (1), 333–342. https://doi.org/10.31557/APJCP.2024.25.1.333 (2024).
Zhao, D. et al. A reliable method for colorectal cancer prediction based on feature selection and support vector machine. Med. Biol. Eng. Comput. 57 (4), 901–912. https://doi.org/10.1007/s11517-018-1930-0 (2019).
Buk Cardoso, L. et al. Machine learning for predicting survival of colorectal cancer patients. Sci. Rep. 13 (1), 8874. https://doi.org/10.1038/s41598-023-35649-9 (2023).
Ting, W. C., Chang, H. R., Chang, C. C. & Lu, C. J. Developing a novel machine Learning-Based classification scheme for predicting SPCs in colorectal Cancer survivors. Appl. Sci. 10 (4). https://doi.org/10.3390/app10041355 (2020).
Welch, H. G., Schwartz, L. M. & Woloshin, S. Are increasing 5-Year survival rates evidence of success against cancer? JAMA 283 (22), 2975–2978. https://doi.org/10.1001/jama.283.22.2975 (2000).
Li, J. et al. Predicting breast cancer 5-year survival using machine learning: A systematic review. PloS One. 16 (4), e0250370. https://doi.org/10.1371/journal.pone.0250370 (2021).
Shadbahr, T. et al. The impact of imputation quality on machine learning classifiers for datasets with missing values. Commun. Med. 3 (1), 139. https://doi.org/10.1038/s43856-023-00356-z (2023).
SCORE2 risk prediction algorithms. New models to estimate 10-year risk of cardiovascular disease in Europe. Eur. Heart J. 42 (25), 2439–2454. https://doi.org/10.1093/eurheartj/ehab309 (2021).
Öznacar, T. & Güler, T. Prediction of early diagnosis in ovarian Cancer patients using machine learning approaches with Boruta and advanced feature selection. Life (Basel Switzerland). 15 (4). https://doi.org/10.3390/life15040594 (2025).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 1189–1232 (2001).
Uddin, K. M. M. et al. An ensemble machine learning-based approach to predict cervical cancer using hybrid feature selection. Neurosci. Inf. 4 (3), 100169. https://doi.org/10.1016/j.neuri.2024.100169 (2024).
Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 54 (3), 1937–1967. https://doi.org/10.1007/s10462-020-09896-5 (2021).
Aydin, Z. E. & Ozturk, Z. K. Performance analysis of XGBoost classifier with missing data. 1st Int Conf. Comput. Mach. Intell, No. (2021).
Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30 (7), 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2 (1997).
Thapa, R. et al. Machine learning differentiation of autism spectrum Sub-Classifications. J. Autism Dev. Disord. 54 (11), 4216–4231. https://doi.org/10.1007/s10803-023-06121-4 (2024).
Maharjan, J. et al. Machine learning determination of applied behavioral analysis treatment plan type. Brain Inf. 10 (1), 7. https://doi.org/10.1186/s40708-023-00186-8 (2023).
Rodríguez-Pérez, R. & Bajorath, J. Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions. J. Comput. Aided Mol. Des. 34 (10), 1013–1026. https://doi.org/10.1007/s10822-020-00314-0 (2020).
Liu, H., Li, G., Cumberland, W. G. & Wu, T. Testing statistical significance of the area under a receiving operating characteristics curve for repeated measures design with bootstrapping. J. Data Sci. 3 (3), 257–278. https://doi.org/10.6339/JDS.2005.03(3).206 (2022).
Altman, D., Machin, D., Bryant, T. & Gardner, M. Statistics with Confidence: Confidence Intervals and Statistical Guidelines (Wiley, 2013).
Waljee, A. K. et al. Artificial intelligence and machine learning for early detection and diagnosis of colorectal cancer in sub-Saharan Africa. Gut 71 (7), 1259–1265. https://doi.org/10.1136/gutjnl-2022-327211 (2022).
Galadima, H. et al. Machine learning as a tool for early detection: A focus on Late-Stage colorectal Cancer across socioeconomic spectrums. Cancers 16 (3). https://doi.org/10.3390/cancers16030540 (2024).
Bradley, C. J. et al. Ethnicity, socioeconomic status, income inequality, and colorectal cancer outcomes: evidence from the 4C2 collaboration. Cancer Causes Control. 33 (4), 533–546. https://doi.org/10.1007/s10552-021-01547-6 (2022).
Lorentsen, M. K. & Sanoff, H. K. Social determinants of health and the link to colorectal Cancer outcomes. Curr. Treat. Options Oncol. 25 (4), 453–464. https://doi.org/10.1007/s11864-024-01191-7 (2024).
Heidarnia, M. A. et al. Social determinants of health and 5-year survival of colorectal cancer. Asian Pac. J. Cancer Prev. 14 (9), 5111–5116. https://doi.org/10.7314/apjcp.2013.14.9.5111 (2013).
Abdul Rahman, H., Ottom, M. A. & Dinov, I. D. Machine learning-based colorectal cancer prediction using global dietary data. BMC Cancer. 23 (1), 144. https://doi.org/10.1186/s12885-023-10587-x (2023).
Shimron, E., Tamir, J. I., Wang, K. & Lustig, M. Implicit data crimes: Machine learning bias arising from misuse of public data. Proceedings of the National Academy of Sciences, 119(13), e2117203119. (2022).
van Giffen, B., Herhausen, D. & Fahse, T. Overcoming the pitfalls and perils of algorithms: A classification of machine learning biases and mitigation methods. J. Bus. Res. 144, 93–106. https://doi.org/10.1016/j.jbusres.2022.01.076 (2022).
Acknowledgements
All authors thank St. Elizabeth Healthcare, Pikeville Medical Center, and Thompson Cancer Survival Center for providing data to this study.
Funding
This research was, in part, funded by the National Institutes of Health (NIH) Agreement No. 1OT2OD032581. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the NIH.
Author information
Contributions
AM contributed study conceptualization and led study design. AM completed data acquisition. RV and AM contributed data analysis and interpretation. AM and FD completed manuscript design and all content writing, figures, and tables. JS, PJ, AJ, DC and AS provided theoretical support. Manuscript revision was led by AM and completed by all co-authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Montgomery, A., Vadapalli, R., Dinenno, F.A. et al. Machine learning to evaluate the effects of non-clinical social determinant features in predicting colorectal Cancer mortality in a medically underserved Appalachian population. Sci Rep 15, 25781 (2025). https://doi.org/10.1038/s41598-025-11074-y
DOI: https://doi.org/10.1038/s41598-025-11074-y