Introduction

Cancer remains a leading cause of death in the United States (US), with colorectal cancer contributing significantly to cancer mortality. Colorectal cancer (CRC) is the second leading cause of cancer death in the US, accounting for 7.6% of all new cancer cases and 10% of cancer deaths1,2,3. Although overall CRC mortality rates have decreased by 8.2% in recent decades, CRC deaths continue to increase substantially within certain demographic groups, including African Americans, persons aged 40 to 54, and rural populations4. CRC is more prevalent in Appalachia than in any other US geographic region, with rural Appalachians suffering the highest burden of CRC deaths in the nation5,6. Rural Appalachians experience higher incidence of early-onset CRC and more frequent late-stage diagnosis and delayed treatment. Further, overall CRC mortality is 32% higher in rural Appalachia than in any other US region7,8,9. This worsening disparity highlights a critical need for targeted measures to identify and address factors contributing to high CRC mortality in Appalachia and similarly affected rural populations.

It is well known that non-clinical social, environmental, and lifestyle factors, referred to as the social determinants of health (SDOH), affect CRC risk, morbidity, and survival. Previous studies identify poverty, lack of education, lack of social support, along with social and geographic isolation, as SDOH factors that play an important role in CRC stage at diagnosis and survival10. Similarly, CRC disparity in rural Appalachia has been attributed to higher poverty and unemployment, geographic rurality, and high uninsured rates, as well as lower educational attainment and limited healthcare access in this region11. Despite this knowledge, little has been done to adequately quantify and address the impact of non-clinical, health-related factors on cancer disparities.

The prediction of CRC mortality risk is a critical measure within cancer treatment planning and resource allocation. Risk scores can be used to identify individuals who are likely to live long enough to benefit from adjuvant treatment. Machine learning (ML) has been validated for its utility in prediction of clinical disease risk and outcomes, yet several challenges continue to limit these models’ ability to predict CRC survival12,13,14. With ML methods, feature selection plays a particularly important role in improving prediction model performance15,16. In this context, cancer prediction models primarily use clinical predictors (e.g., cancer stage, tumor grade, and number of positive lymph nodes) to classify patients and assist in treatment decision making. Review of previous CRC risk prediction models demonstrated suboptimal model performance when using only clinical features17,18,19.

Many ML-based prediction models also rely on data from national public datasets, and feature availability and demographic diversity are often limited in these datasets20,21. In a pilot study, we used National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) Program data to explore bias when using common demographic and clinical features to predict 5-year CRC survival between Appalachian and non-Appalachian patients. Although bias was not detected, overall ML model performance was significantly lower in the Appalachian vs. non-Appalachian test population, and SHapley Additive exPlanations (SHAP) analysis revealed that marital status was a top feature contributing to the performance of the model22. This suggested that other factors, beyond common clinical predictors, could improve ML model predictability and demonstrate the effect of non-clinical SDOH factors on CRC survival10. Several other studies have evaluated the performance and challenges associated with ML-derived colorectal cancer prediction models using numerous clinical data features; however, evaluation of the effects of SDOH features on CRC survival prediction is largely absent in the current literature13,23,24,25,26,27.

The inclusion of SDOH data variables as ML model input features could help identify and quantify the impact of non-clinical factors contributing to high CRC mortality in Appalachia and potentially improve CRC survival prediction in rural and other disparately affected populations. This knowledge would strengthen our understanding of the impact of SDOH on cancer survival and improve the development of strategies for care delivery in Appalachia and other underserved populations. Accordingly, the purpose of the present study was to determine whether the addition of SDOH features to ML modeling improves CRC survival prediction in the Appalachian population. We hypothesized that the addition of SDOH features would improve ML-based prediction of CRC mortality and help to evaluate the social and environmental factors in Appalachia that impact CRC outcomes.

Materials and methods

Figure 1 shows the complete workflow diagram for this study. First, de-identified data was pre-processed for missing value handling and data encoding to convert variables from categorical to numerical values. Feature selection approaches were used to extract the most important features from the dataset. The aggregate dataset was divided into 70% for training, 20% for testing, and the remaining 10% of the data for model validation purposes. The gradient-boosted tree ensemble machine-learning technique was used to train, test, and validate the four models, after which the voting classifier was applied. The confusion matrix was used to assess model effectiveness. Finally, models were evaluated comparatively to determine the best model.
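The 70/20/10 partition described above can be sketched as follows (a minimal illustration using scikit-learn; the function and variable names are ours, not from the study's code):

```python
from sklearn.model_selection import train_test_split

def split_70_20_10(X, y, seed=42):
    """Randomly partition data into 70% train, 20% test, 10% validation."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y)
    # The remaining 30% is split 2:1 into test (20% overall) and validation (10%).
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=1/3, random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```

Stratifying on the outcome keeps the survival/death class balance similar across the three subsets.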

Fig. 1
figure 1

Flow diagram of supervised machine-learning modeling methods as described.

Dataset description

This retrospective study utilized de-identified electronic health records (EHR) data from community cancer care centers located in Appalachia: (1) St. Elizabeth Healthcare in Edgewood, KY; (2) Pikeville Medical Center in Pikeville, KY; and (3) Thompson Cancer Survival Center (TCSC) based in Knoxville, TN. Combined, the datasets included 25 attributes from 7,718 adults across the three cancer care centers. Only patients aged 18 years and older who were diagnosed with malignant colon and/or rectosigmoid cancer between 2000 and 2017 were included in this study; patients under 18 years of age were removed from the dataset. All datasets included demographic, clinical, and SDOH patient features. St. Elizabeth Healthcare and Pikeville Medical Center data was extracted from the local EHR system. Data from TCSC collected between 2000 and 2009 was captured by research staff from paper records and transcribed into .csv files. TCSC data from 2010 to 2017 was extracted from the local EHR system. All data was de-identified by individual healthcare sites prior to delivery, shared by secure data transfer, and stored on password-protected cloud-based storage. The use of de-identified retrospective data is classified as a non-human subject study, and IRB approval is not required. The need for informed consent was deemed unnecessary according to the HIPAA Privacy Rule. To ensure ethical conduct, this study was conducted under approval by the Western-Copernicus Group Institutional Review Board (WCG® IRB), protocol no. 20,223,670, consistent with 45 Code of Federal Regulations 46.102. All methods were performed in accordance with the relevant guidelines and regulations.

CRC 5-year survival vs. death metrics

The EHR data included a variable of “survival months” which was used to report mortality based upon the length of time a patient survived, in months, relative to their initial diagnosis of cancer of the colon and/or rectum. The gold-standard 5-year cancer survival metric was applied in this study, as the ML model was developed to classify individuals as surviving < 60 months or ≥ 60 months (5 years) post-CRC diagnosis28. The ML prediction model is a binary classifier, where the “negative” class represents patients who have < 60 survival months post-CRC diagnosis, and the “positive” class represents patients who have ≥ 60 survival months reported in the EHR.
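Concretely, the binarization of the survival target reduces to a one-line rule (a sketch; the field name is hypothetical):

```python
def five_year_label(survival_months):
    """Binary 5-year survival target: 1 ("positive") if the patient survived
    at least 60 months post-diagnosis, 0 ("negative") otherwise."""
    return 1 if survival_months >= 60 else 0
```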

Data preprocessing and feature selection

Datasets underwent preprocessing, before being fit to the ML models, to address missingness, data entry errors, outliers, variable labeling, and encoding. Data preprocessing involved converting all categorical data variables into numeric values, merging the three datasets from individual healthcare sites into one aggregate dataset, and data cleaning to address typographical errors and missing data. Listwise deletion was used to handle missing data.
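These steps can be sketched with pandas (illustrative only; the column names and helper function are assumptions, not the study's code):

```python
import pandas as pd

def preprocess(site_frames):
    """Merge per-site datasets, apply listwise deletion, and encode
    categorical (object) columns as integer codes."""
    df = pd.concat(site_frames, ignore_index=True)
    df = df.dropna()  # listwise deletion: drop any row with a missing value
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df
```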

Feature selection was employed to eliminate redundancy, identify the most critical EHR data variables, and improve the ML model's predictive accuracy. The Extreme Gradient Boosting (XGBoost) ML algorithm, complemented by SHAP plots, was utilized to determine which features were most important for this investigation. There were 25 features available in the dataset. The raw data collected from EHRs included a variety of information about the patients, including demographic and clinical information related to their CRC diagnosis and select SDOH variables. The main target variable for the study was Survival months. Demographic features were Age at diagnosis, Sex, Race, Ethnicity, State of residence, County, and Zip (3-digit). The SDOH features available for analysis included Marital status, Geographic classification (rural/non-rural), Employment status, and Insurance status. All available features and corresponding model inclusion are shown in Table 1.
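A minimal sketch of tree-based feature ranking, using scikit-learn's gain-based importances as a lightweight stand-in for the XGBoost + SHAP workflow described above (the data here are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: with shuffle=False, the first two columns
# are the informative features and the rest are noise.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
```

SHAP values additionally attribute each individual prediction to its features, which is what the per-model plots in Fig. 4 display; the gain-based ranking shown here is only the simplest analogue.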

Marital status data was missing in TCSC data and was therefore handled by imputation. Studies suggest that two-stage methods (combining an imputation method with a classifier)29,30,31 offer optimal configuration for a particular dataset. IterativeImputer32 is a Python-based multivariate statistical method for handling missing data by means of accurate imputations, specifically when there are reasonably strong correlations between features. We used linear regression with 300 estimators and a scaled tolerance of ~ 2.5%; convergence required fewer than 20 iterations. We did not find any significant differences in model accuracies between iterative robust imputation and arbitrary (fixed value for missing data) imputation. Weak correlation between imputed features or data imbalance between known and unknown data (e.g., TCSC did not collect Marital status for its patients) may be potential reasons for this observation. A comparative table showing model accuracies by imputation method is included as supplemental material.
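A minimal sketch of multivariate imputation with IterativeImputer (using scikit-learn's default BayesianRidge estimator rather than the exact configuration above; the toy data are ours):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with a strong linear relation (col 2 ~ 2 * col 1) and one missing cell.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.0]])
imputer = IterativeImputer(max_iter=20, random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the columns are strongly correlated, the missing cell is imputed close to the value the linear relation predicts; with weakly correlated features the imputation degrades toward a marginal estimate, consistent with the observation above.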

Table 1 Feature inputs available in EHR datasets for machine learning prediction models.

Training and validation

The dataset was randomly divided into training (70%), testing (20%), and validation (10%) subsets. The ML prediction model was trained using the input features described above. For this study, we used XGBoost with hyperparameter optimization. XGBoost is a gradient-boosted tree ensemble method of ML which combines the estimates of simpler, weaker models to make predictions for a chosen target. The main principle of gradient boosting is to minimize the errors of the prior models, iteratively enhancing predictive performance. Specifically, the gradient boosting classifier is employed for classification tasks33,34. Recent studies have demonstrated that gradient-boosted tree algorithms yield high accuracy for both acute and chronic prediction tasks, and that XGBoost performs better than other ML models on tabular data35. One of the benefits of using XGBoost is that it can implicitly handle a certain level of missingness in the data by accounting for missingness during the training process36. Before training the ML prediction model, hyperparameter optimization was performed on the training dataset through a 5-fold cross-validation grid search, evaluating the area under the receiver operating characteristic (AUROC) curve for each combination of hyperparameters in the grid. The optimized hyperparameters were the maximum tree depth, the pseudo-regularization parameter, the minimum sum of observation weights, the fraction of observations to subsample at each step, and the fraction of features used to build each tree, node, and level of depth within a tree.

ML model performance metrics

The AUROC was used as the primary metric to evaluate and compare the overall performance of the ML prediction models37. Other metrics used to evaluate the performance of the ML prediction model for 5-year survival post-diagnosis were calculated as follows, based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values38,39. Sensitivity indicates the model's ability to accurately identify patients who survive 5 years after CRC diagnosis.

$$Sensitivity\; (True\; Positive\; Rate, TPR) =\:\frac{TP}{TP+FN}*100$$

Specificity is used to assess how well the model can accurately identify patients who will not survive.

$$Specificity\; (True\; Negative\; Rate, TNR) = \:\frac{TN}{TN+FP}*100$$

Precision, or Positive Predictive Value, measures how many positive cases (CRC survivors) were correctly predicted by the model.

$$Positive\; Predictive\; Value\; (precision, PPV) = \:\frac{TP}{TP+FP}*100$$

Conversely, the Negative Predictive Value measures how many negative cases (CRC deaths) were correctly predicted by the model.

$$Negative\; Predictive\; Value\; (NPV) = \:\frac{TN}{TN+FN}*100$$
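The four formulas above reduce to a few lines of code:

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV (as percentages) computed
    directly from confusion-matrix counts."""
    return {
        "sensitivity": 100 * tp / (tp + fn),  # TPR
        "specificity": 100 * tn / (tn + fp),  # TNR
        "ppv": 100 * tp / (tp + fp),          # precision
        "npv": 100 * tn / (tn + fn),
    }
```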

In addition, 95% confidence intervals (CIs) were computed for each metric as described in the Statistics section below. SHapley Additive exPlanations (SHAP) analysis was performed to evaluate the importance of each feature for generating model output40. The SHAP analysis ranks features by their importance to model predictions from top to bottom.

Impact of SDOH features on ML model performance

The merged dataset consisting of EHR data from St. Elizabeth Healthcare, Pikeville Medical Center, and TCSC (Table 2) was used to determine the additive effects of SDOH features on the performance of the ML model in predicting mortality in Appalachian colorectal cancer patients. Because cause-of-death data was unavailable in one of the hospitals' EHRs, the ML prediction model classified patients as surviving < 5 or ≥ 5 years irrespective of whether the cause of death was CRC-related or due to other factors. To test our hypothesis, we developed 4 ML models with the following feature inputs: Model 1: demographic and clinical features; Model 2: demographic and SDOH features; Model 3: demographic, clinical, and SDOH features; Model 4: clinical and SDOH features (Table 1). The dataset was randomly divided into training, hold-out test, and validation datasets utilizing a 70:20:10 split, and the hold-out test dataset was never exposed to the model during training. To evaluate the choice of algorithm, we compared XGBoost with two other classification methods, logistic regression and K-nearest neighbors. XGBoost delivered superior accuracies compared to both logistic regression and K-nearest neighbors (with number of neighbors = 5). A comparative table showing predictive model accuracies by classification technique for each of the four data models is included as supplemental material.
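The four feature-set models can be expressed as simple column groupings (the column names below are illustrative placeholders, not the exact EHR field names):

```python
# Hypothetical column names for each feature family.
DEMOGRAPHIC = ["age_at_diagnosis", "sex", "race", "ethnicity"]
CLINICAL = ["summary_stage", "grade", "primary_site", "year_of_diagnosis"]
SDOH = ["marital_status", "rurality", "employment_status", "insurance_status"]

MODEL_FEATURES = {
    "Model 1": DEMOGRAPHIC + CLINICAL,         # control model
    "Model 2": DEMOGRAPHIC + SDOH,
    "Model 3": DEMOGRAPHIC + CLINICAL + SDOH,  # all feature families
    "Model 4": CLINICAL + SDOH,
}
# Each model is then trained on X[cols] for its own column subset.
```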

Table 2 Patient demographics, clinical, and social determinants of health (SDOH) characteristics.

Statistical analysis

The confidence intervals (CIs) for AUROC were calculated using a bootstrapping method in which a subset of patients from the hold-out test dataset was randomly sampled with replacement and the AUROC was calculated from those patients' data, repeated 1000 times41. From these bootstrapped AUROC values, the middle 95% range was selected as the 95% CI for the AUROC. As the sample size of the hold-out test dataset was sufficiently large, the CIs for other metrics were calculated using normal approximation42. Because demographic and clinical features are traditionally used for mortality/survival prediction, Model 1 (demographic + clinical features) was considered our control ML model. The difference between the metrics of Models 2, 3, and 4 and those of Model 1 was assessed using a two-sided t-test at the 0.05 significance level (Fig. 2).
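The bootstrap procedure for the AUROC confidence interval can be sketched as (the function name is ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC: resample the hold-out test set
    with replacement n_boot times, then take the middle (1 - alpha) range."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes present
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```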

Fig. 2
figure 2

Impact of SDOH features on AUROC when predicting CRC survival in Appalachians. The machine learning algorithm performance was greatest in Model 3 (0.790). All models performed better than the baseline. Model 1 – demographics + clinical features. Model 2 – demographics + SDOH features. Model 3 – demographics + clinical + SDOH features. Model 4 – clinical + SDOH features.

Results

Patient population and characteristics

Demographics, clinical characteristics, and SDOH variables for colorectal cancer patients from each site are shown in Table 2. Of the 7,718 patients included in the analysis, mean age at diagnosis was ~ 67 years and ~ 50% of the patients were female. Nearly all patients were white and non-Hispanic, and approximately 55% were married. Most patients in the St. Elizabeth dataset were non-rural, whereas the majority of patients in the Pikeville and TCSC datasets were rural. Additional information regarding cancer primary site, summary stage, and grade, as well as employment and health insurance, is included in Table 2. Due to missing information on survival status (the target feature), 14 patient records were removed from the dataset before analysis.

ML model performance

The AUROC curves for the hold-out test datasets for 4 ML prediction models along with the baseline are shown in Fig. 2. Overall, the ML prediction models achieved a range of classification performance for patients with an AUROC of 0.780 (95% CI = 0.757–0.801) for Model 1 (demographic and clinical features), 0.627 (95% CI = 0.598–0.655) for Model 2 (demographic and SDOH features), 0.790 (95% CI = 0.767–0.812) for Model 3 (demographic, clinical, and SDOH features) and 0.771 (95% CI = 0.748–0.794) for Model 4 (clinical and SDOH features). The strongest performance was observed in Model 3, which was significantly greater than Model 1 (P < 0.001).

The complete list of performance metrics for the ML prediction models is shown in Table 3, and confusion matrices are shown in Fig. 3 to provide a visual representation of the ML prediction model's output for the hold-out test groups. For all metrics, including TPR, Model 3 demonstrated the greatest performance of the 4 ML-based prediction models (Fig. 4).

Fig. 3
figure 3

Confusion matrices provide a visual representation of the machine learning prediction model's output for the hold-out test datasets in each model. Top left: true negative (TN); top right: false positive (FP); bottom left: false negative (FN); bottom right: true positive (TP). Y-axis: true value; X-axis: predicted value. 0 = died; 1 = survived; SDOH = social determinants of health.

Feature importance

A SHAP analysis was used to evaluate the importance of individual input features in generating each model's predictions, ranked top to bottom in decreasing order of importance. The top features for Model 1 (demographic and clinical features) included age at diagnosis, cancer summary stage, year of diagnosis, and cancer grade. The importance of SDOH features for Models 2–4 varied depending on which other variables (demographic or clinical) were included in the model. In Model 2, which included demographic and SDOH variables, the most important SDOH features were rurality, employment status, and marital status. More importantly, in Model 3, which included all demographic, clinical, and SDOH variables, rurality was among the top 3 features and had a greater impact on the model's performance than traditional demographic and clinical features. Rurality was also among the top 3 features in Model 4 (Fig. 4). Among demographic variables, race and ethnicity were the lowest-impact features across Models 1, 2, and 3. Primary tumor site and radiotherapy were consistently the lowest-impact clinical features across all models as well.

Fig. 4
figure 4

SHAP feature importance plots show the importance of input features (from top to bottom) that contributed to the discriminative ability of each of the four machine learning prediction models. SHAP = SHapley Additive exPlanations. Red: high importance; Blue: low importance.

Table 3 Performance metrics for the ML prediction model: impact of SDOH features

Discussion

This study is one of the first to determine the impact of SDOH variables in predictive modeling of CRC mortality/survivorship in an underserved population with CRC disparity. We developed a gradient-boosted tree ensemble (XGBoost) ML model that demonstrated good classification performance for survival prediction in Appalachian CRC patients using local demographic and clinical EHR data, as demonstrated by an AUROC of 0.780 and TPR of 0.769 (Model 1). The addition of SDOH variables to the model resulted in a modest but significant improvement in AUROC and TPR (Model 3). Although building the best-performing model was not the primary goal of this study, we were able to develop a strong ML-based model to classify CRC patient mortality using simple demographic, clinical, and SDOH data as feature inputs. These findings align with other ML-based studies designed to assess CRC disparity43,44. High CRC incidence and mortality in rural Appalachians have been attributed to several SDOH factors, including higher poverty, uninsured rates, and unemployment, and healthcare access limited by geographic location in this region6,8,11. In the present study, we evaluated the impact of SDOH on CRC survival in Appalachia by measuring the impact of SDOH features on ML-based CRC survival prediction. We utilized a representative, locally sourced EHR dataset from Appalachian cancer care centers to demonstrate the impact of SDOH features in this population by building several ML models which included SDOH features in various combinations, with and without traditional demographic and clinical predictors. Our collective findings provide insight into the key features that contribute to ML-based prediction of CRC survival in this unique population. First, using easily accessible and traditional demographic and clinical features (see Table 1), the ML model achieved a strong performance (Model 1; Fig. 2; Table 3).
Second, the combination of demographic and SDOH variables, without clinical variables, had the weakest performance (Model 2); however, the combination of clinical and SDOH features without demographic variables still resulted in a robust performance (Model 4). These data, along with SHAP values (Fig. 4), clearly indicate that clinical features are highly important in CRC mortality, as expected. Most importantly, the addition of SDOH features to both demographic and clinical features resulted in the strongest-performing ML model and significantly improved CRC survival prediction (Model 3). Additionally, SHAP analysis for this model indicates that rurality was among the top 3 features contributing to model performance. Further, health insurance status (6th) and employment status (9th) were also within the top features.

These latter findings indicate that non-clinical SDOH factors such as poverty, high uninsured rates, and unemployment may contribute to the high cancer-related mortality seen in rural populations, and these SDOH factors should be considered when using ML or other methods to address disparities. These findings show that SDOH variables impact CRC patient survivorship and that addressing SDOH factors may improve outcomes and quality of life for CRC patients, especially those in medically underserved populations unequally burdened by SDOH factors, like those seen in rural Appalachia. Allocation of healthcare system resources towards modifiable SDOH factors would be a viable strategy to reduce cancer deaths. For example, expanding Medicaid, providing transportation or mobile health clinics, and providing supplemental food or housing resources to unemployed cancer patients could further improve CRC survival within these and other underserved populations. Our findings align with non-ML-based studies designed to assess the impact of SDOH factors on CRC survival45,46,47. Further quantification of the effects of SDOH on CRC survival may contribute to the development of adjuvant SDOH-based policies and treatment interventions that improve cancer survivorship outcomes.

Several other studies have evaluated the performance and challenges associated with ML-derived CRC prediction models. Burnett et al. reviewed the strengths and weaknesses of CRC risk-prediction models that used routinely collected health registry data13. They found that tree-based models outperformed other models and were more interpretable. Rahman et al. evaluated prediction of CRC using ML algorithms in large datasets of global dietary data48. Their findings demonstrate the importance of using non-clinical data features for early detection of CRC in a large dataset that includes younger and older adults. Ting et al. developed an ML-based model to predict CRC tumor recurrence in survivors27. They identified drinking behavior as an important predictive SDOH feature which should be monitored to improve CRC treatment strategies. Our findings fill gaps in the current literature by demonstrating the impact of non-clinical SDOH features on CRC survival in an underserved, disparately affected population.

It is also of note that this study utilized EHR datasets directly acquired through community partnerships established within the target population, while many ML-based studies commonly use public datasets. It is known that selection, representation, and evaluation biases may result when using public datasets to generate real-world evidence, therefore we utilized EHR data provided directly from Appalachian community cancer care centers to minimize these risks49,50.

Study limitations

There are limitations that should be considered. First, the dataset was missing cause-of-death information from one hospital. This information may have been used to better relate patient mortality to the CRC diagnosis. As a result, our analysis evaluated all-cause mortality, which may not adequately capture the direct relationship between SDOH and CRC mortality. However, this can be reevaluated in future studies. Second, our data processing methods did not address the effects of patient age as a continuous variable within the model, which contributes to its high feature importance and wide distribution on SHAP analysis. Regression algorithms are commonly applied to examine relationships among continuous variables; however, this study identified XGBoost as the best-performing algorithm, as all variables other than age at diagnosis were categorical. This can be addressed in future studies by recategorizing the continuous age variable into age groups during preprocessing. Third, in this study we were unable to evaluate the effects of SDOH factors related to low healthcare access and transportation, which are prevalent in rural Appalachia. Patients' mode of transportation is not commonly accounted for in EHRs. Available zip code and county data can be used to evaluate this in the future. Further, we did not obtain treatment and diagnosis dates in this dataset to measure the length of time in days between primary CRC diagnosis and initiation of treatment. This information would allow for evaluation of cancer treatment delays, which can be related to SDOH factors in disparately affected populations like rural Appalachia. Fourth, EHRs often have quality issues (e.g., missingness, misclassification, measurement error) and the data is largely unstructured (e.g., diagnostic notes, patient experience, etc.). Our dataset is relatively small, with about 7,700 records after combining three hospital datasets with the same features.
The achieved performance (AUROC of ~ 0.79) is superior to that of the linear regression models for binary classification we explored as part of this study. We believe that performance can be further improved by collecting additional patient records and addressing the missing data (e.g., marital status in TCSC, transportation in St. Elizabeth, etc.). Fifth, our sample was primarily White and non-Hispanic, which mirrors the rural US population, which is 76% non-Hispanic White. The lack of racial-ethnic diversity within the sample could affect the reliability of this specific predictive model if used in more diverse, non-rural populations. As currently designed, this ML model would best be utilized only in rural populations or those with similar cancer mortality rates; similar poverty, uninsured, and unemployment rates; and similar racial-ethnic distributions. However, it is important to note that the ML methods applied in this study are broadly applicable. Lastly, this study was limited by the unavailability of SDOH data in specialty care EHRs. The collection of SDOH data is a more recently adopted practice which generally occurs at the primary care level. Conducting studies such as this one, which demonstrate the effects of SDOH on clinical outcomes, could lead to increased collection of SDOH data and the development of supportive SDOH policies and procedures that improve clinical outcomes related to cancer and other chronic diseases.

Conclusions

Colorectal cancer is a recognized cause of cancer-related death that unequally affects the rural Appalachian population. Although the impact of non-clinical social and environmental factors on cancer outcomes has been well described, SDOH factors are not commonly included in machine learning approaches to address CRC disparity. We used a gradient-boosting machine learning algorithm to test four different models to predict CRC survival based on EHR data which included SDOH features. Our best-performing model combined demographic, clinical, and SDOH features. Feature stratification showed that rurality, insurance status, and employment status were important to model performance, with rurality being a top feature overall. Compared to previous methods, our model successfully demonstrated the impact of SDOH features in ML-based prediction of CRC survival in a dataset that well represented the rural Appalachian population. In future research, we will gather a larger EHR dataset which covers a broader population of rural patients and includes additional SDOH variables to improve model performance and more accurately quantify the effects of other SDOH features on CRC survival. Further ML-based evaluation of the impact of SDOH on cancer disparity can facilitate the establishment of best practices in community-based AI/ML research to address cancer disparities in underserved communities.