Introduction

Cancer remains a leading cause of death in the United States (US), with colorectal cancer contributing significantly to cancer mortality. Colorectal cancer (CRC) is the second leading cause of cancer death in the US, accounting for 7.6% of all new cancer cases and 10% of cancer deaths1,2,3. Although overall CRC mortality rates have decreased by 8.2% in recent decades, CRC deaths continue to increase substantially within certain demographic groups, including African Americans, persons aged 40 to 54, and rural populations4. CRC is more prevalent in Appalachia than in any other US geographic region, with rural Appalachians suffering the highest burden of CRC deaths in the nation5,6. Rural Appalachians experience higher incidence of early-onset CRC and more frequent late-stage diagnosis and delayed treatment. Further, overall CRC mortality is 32% higher in rural Appalachia than in any other US region7,8,9. This worsening disparity highlights a critical need for targeted measures to identify and address factors contributing to high CRC mortality in Appalachia and similarly affected rural populations.

It is well known that non-clinical social, environmental, and lifestyle factors, referred to as the social determinants of health (SDOH), affect CRC risk, morbidity, and survival. Previous studies identify poverty, lack of education, lack of social support, along with social and geographic isolation, as SDOH factors that play an important role in CRC stage at diagnosis and survival10. Similarly, CRC disparity in rural Appalachia has been attributed to higher poverty and unemployment, geographic rurality, and high uninsured rates, as well as lower educational attainment and limited healthcare access in this region11. Despite this knowledge, little has been done to adequately quantify and address the impact of non-clinical, health-related factors on cancer disparities.

The prediction of CRC mortality risk is a critical measure within cancer treatment planning and resource allocation. Risk scores can be used to identify individuals who are likely to live long enough to benefit from adjuvant treatment. Machine learning (ML) has been validated for its utility in prediction of clinical disease risk and outcomes, yet several challenges continue to limit these models’ ability to predict CRC survival12,13,14. With ML methods, feature selection plays a particularly important role in improving prediction model performance15,16. In this context, cancer prediction models primarily use clinical predictors (e.g., cancer stage, tumor grade, and number of positive lymph nodes) to classify patients and assist in treatment decision making. Review of previous CRC risk prediction models demonstrated suboptimal model performance when using only clinical features17,18,19.

Many ML-based prediction models also rely on data from national public datasets, and feature availability and demographic diversity are often limited in these datasets20,21. In a pilot study, we used National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) Program data to explore bias when using common demographic and clinical features to predict 5-year CRC survival between Appalachian and non-Appalachian patients. Although bias was not detected, overall ML model performance was significantly lower in the Appalachian vs. non-Appalachian test population, and SHapley Additive exPlanations (SHAP) analysis revealed that marital status was a top feature contributing to the performance of the model22. This suggested that other factors, beyond common clinical predictors, could improve ML model predictability and demonstrate the effect of non-clinical SDOH factors on CRC survival10. Several other studies have evaluated the performance and challenges associated with ML-derived colorectal cancer prediction models using numerous clinical data features; however, evaluation of the effects of SDOH features on CRC survival prediction is largely absent in the current literature13,23,24,25,26,27.

The inclusion of SDOH data variables as ML model input features could help identify and quantify the impact of non-clinical factors contributing to high CRC mortality in Appalachia and potentially improve CRC survival prediction in rural and other disparately affected populations. This knowledge would strengthen our understanding of the impact of SDOH on cancer survival and improve the development of strategies for care delivery in Appalachia and other underserved populations. Accordingly, the purpose of the present study was to determine whether the addition of SDOH features to ML modeling improves CRC survival prediction in the Appalachian population. We hypothesized that the addition of SDOH features would improve ML-based prediction of CRC mortality and help to evaluate the social and environmental factors in Appalachia that impact CRC outcomes.

Materials and methods

Figure 1 shows the complete workflow diagram for this study. First, de-identified data was pre-processed for missing value handling and data encoding to convert variables from categorical to numerical values. Feature selection approaches were used to extract the most important features from the dataset. The aggregate dataset was divided into 70% for training, 20% for testing, and the remaining 10% of the data for model validation purposes. The gradient-boosted tree ensemble machine-learning technique was used to train, test, and validate the four models, after which the voting classifier was applied. The confusion matrix was used to assess model effectiveness. Finally, models were evaluated comparatively to determine the best model.
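The 70/20/10 partition described above can be sketched as follows (a minimal illustration using scikit-learn; the function and variable names are ours, not from the study's code):

```python
from sklearn.model_selection import train_test_split

def split_70_20_10(X, y, seed=42):
    """Randomly partition data into 70% train, 20% test, 10% validation."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y)
    # The remaining 30% is split 2:1 into test (20% overall) and validation (10%).
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=1/3, random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```

Stratifying on the outcome keeps the survival/death class balance similar across the three subsets.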

Fig. 1
figure 1

Flow diagram of supervised machine-learning modeling methods as described.

Dataset description

This retrospective study utilized de-identified electronic health records (EHR) data from community cancer care centers located in Appalachia: (1) St. Elizabeth Healthcare in Edgewood, KY; (2) Pikeville Medical Center in Pikeville, KY; and (3) Thompson Cancer Survival Center (TCSC) based in Knoxville, TN. Combined, the datasets included 25 attributes from 7,718 adults across the three cancer care centers. Only patients aged 18 years and older who were diagnosed with malignant colon and/or rectosigmoid cancer between 2000 and 2017 were included in this study; patients under 18 years of age were removed from the dataset. All datasets included demographic, clinical, and SDOH patient features. St. Elizabeth Healthcare and Pikeville Medical Center data was extracted from the local EHR system. Data from TCSC collected between 2000 and 2009 was captured by research staff from paper records and transcribed into .csv files. TCSC data from 2010 to 2017 was extracted from the local EHR system. All data was de-identified by individual healthcare sites prior to delivery, shared by secure data transfer, and stored on password-protected cloud-based storage. The use of de-identified retrospective data is classified as a non-human subject study, and IRB approval is not required. The need for informed consent was deemed unnecessary according to the HIPAA Privacy Rule. To ensure ethical conduct, this study was conducted under approval by the Western-Copernicus Group Institutional Review Board (WCG® IRB), protocol no. 20,223,670, consistent with 45 Code of Federal Regulations 46.102. All methods were performed in accordance with the relevant guidelines and regulations.

CRC 5-year survival vs. death metrics

The EHR data included a variable of “survival months” which was used to report mortality based upon the length of time a patient survived, in months, relative to their initial diagnosis of cancer of the colon and/or rectum. The gold-standard 5-year cancer survival metric was applied in this study, as the ML model was developed to classify individuals as surviving < 60 months or ≥ 60 months (5 years) post-CRC diagnosis28. The ML prediction model is a binary classifier, where the “negative” class represents patients who have < 60 survival months post-CRC diagnosis, and the “positive” class represents patients who have ≥ 60 survival months reported in the EHR.
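Concretely, the binarization of the survival target reduces to a one-line rule (a sketch; the field name is hypothetical):

```python
def five_year_label(survival_months):
    """Binary 5-year survival target: 1 ("positive") if the patient survived
    at least 60 months post-diagnosis, 0 ("negative") otherwise."""
    return 1 if survival_months >= 60 else 0
```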

Data preprocessing and feature selection

Datasets underwent preprocessing, before being fit to the ML models, to address missingness, data entry errors, outliers, variable labeling, and encoding. Data preprocessing involved converting all categorical data variables into numeric values, merging the three datasets from individual healthcare sites into one aggregate dataset, and data cleaning to address typographical errors and missing data. Listwise deletion was used to handle missing data.
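These steps can be sketched with pandas (illustrative only; the column names and helper function are assumptions, not the study's code):

```python
import pandas as pd

def preprocess(site_frames):
    """Merge per-site datasets, apply listwise deletion, and encode
    categorical (object) columns as integer codes."""
    df = pd.concat(site_frames, ignore_index=True)
    df = df.dropna()  # listwise deletion: drop any row with a missing value
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df
```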

Feature selection was employed to eliminate redundancy, identify the most critical EHR data variables, and improve the ML model's predictive accuracy. The Extreme Gradient Boosting (XGBoost) ML algorithm, complemented by SHAP plots, was utilized to determine which features were most important for this investigation. There were 25 features available in the dataset. The raw data collected from EHRs included a variety of information about the patients, including demographic and clinical information related to their CRC diagnosis and select SDOH variables. The main target variable for the study was Survival months. Demographic features were Age at diagnosis, Sex, Race, Ethnicity, State of residence, County, and Zip (3-digit). The SDOH features available for analysis included Marital status, Geographic classification (rural/non-rural), Employment status, and Insurance status. All available features and corresponding model inclusion are shown in Table 1.
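A minimal sketch of tree-based feature ranking, using scikit-learn's gain-based importances as a lightweight stand-in for the XGBoost + SHAP workflow described above (the data here are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: with shuffle=False, the first two columns
# are the informative features and the rest are noise.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
```

SHAP values additionally attribute each individual prediction to its features, which is what the per-model plots in Fig. 4 display; the gain-based ranking shown here is only the simplest analogue.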

Marital status data was missing in TCSC data and was therefore handled by imputation. Studies suggest that two-stage methods (combining an imputation method with a classifier)29,30,31 offer optimal configuration for a particular dataset. IterativeImputer32 is a Python-based multivariate statistical method for handling missing data by means of accurate imputations, specifically when there are reasonably strong correlations between features. We used linear regression with 300 estimators and a scaled tolerance of ~ 2.5%; convergence required fewer than 20 iterations. We did not find any significant differences in model accuracies between iterative robust imputation and arbitrary (fixed value for missing data) imputation. Weak correlation between imputed features or data imbalance between known and unknown data (e.g., TCSC did not collect Marital status for its patients) may be potential reasons for this observation. A comparative table showing model accuracies by imputation method is included as supplemental material.
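A minimal sketch of multivariate imputation with IterativeImputer (using scikit-learn's default BayesianRidge estimator rather than the exact configuration above; the toy data are ours):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with a strong linear relation (col 2 ~ 2 * col 1) and one missing cell.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.0]])
imputer = IterativeImputer(max_iter=20, random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the columns are strongly correlated, the missing cell is imputed close to the value the linear relation predicts; with weakly correlated features the imputation degrades toward a marginal estimate, consistent with the observation above.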

Table 1 Feature inputs available in EHR datasets for machine learning prediction models.

Training and validation

The dataset was randomly divided into training (70%), testing (20%), and validation (10%) subsets. The ML prediction model was trained using the input features described above. For this study, we used XGBoost with hyperparameter optimization. XGBoost is a gradient-boosted tree ensemble method of ML which combines the estimates of simpler, weaker models to make predictions for a chosen target. The main principle of gradient boosting is to minimize the errors of the prior models, iteratively enhancing predictive performance. Specifically, the gradient boosting classifier is employed for classification tasks33,34. Recent studies have demonstrated that gradient-boosted tree algorithms yield high accuracy for both acute and chronic prediction tasks, and that XGBoost performs better than other ML models on tabular data35. One of the benefits of using XGBoost is that it can implicitly handle a certain level of missingness in the data by accounting for missingness during the training process36. Before training the ML prediction model, hyperparameter optimization was performed on the training dataset through a 5-fold cross-validation grid search, evaluating the area under the receiver operating characteristic (AUROC) curve for each combination of hyperparameters in the grid. The optimized hyperparameters were the maximum tree depth, the pseudo-regularization parameter, the minimum sum of observation weights, the fraction of observations to subsample at each step, and the fraction of features used to build each tree, node, and level of depth within a tree.

ML model performance metrics

The AUROC was used as the primary metric to evaluate and compare the overall performance of the ML prediction models37. Other metrics used to evaluate the performance of the ML prediction model for 5-year survival post-diagnosis were calculated as follows, based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values38,39. Sensitivity indicates the model's ability to accurately identify patients who survive 5 years after CRC diagnosis.

$$Sensitivity\; (True\; Positive\; Rate, TPR) =\:\frac{TP}{TP+FN}*100$$

Specificity is used to assess how well the model can accurately identify patients who will not survive.

$$Specificity\; (True\; Negative\; Rate, TNR) = \:\frac{TN}{TN+FP}*100$$

Precision, or Positive Predictive Value, measures how many positive cases (CRC survivors) were correctly predicted by the model.

$$Positive\; Predictive\; Value\; (precision, PPV) = \:\frac{TP}{TP+FP}*100$$

Conversely, the Negative Predictive Value measures how many negative cases (CRC deaths) were correctly predicted by the model.

$$Negative\; Predictive\; Value\; (NPV) = \:\frac{TN}{TN+FN}*100$$
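The four formulas above reduce to a few lines of code:

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV (as percentages) computed
    directly from confusion-matrix counts."""
    return {
        "sensitivity": 100 * tp / (tp + fn),  # TPR
        "specificity": 100 * tn / (tn + fp),  # TNR
        "ppv": 100 * tp / (tp + fp),          # precision
        "npv": 100 * tn / (tn + fn),
    }
```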

In addition, 95% confidence intervals (CIs) were computed for each metric as described in the Statistics section below. SHapley Additive exPlanations (SHAP) analysis was performed to evaluate the importance of each feature for generating model output40. The SHAP analysis ranks features by their importance to model predictions from top to bottom.

Impact of SDOH features on ML model performance

The merged dataset consisting of EHR data from St. Elizabeth Healthcare, Pikeville Medical Center, and TCSC (Table 2) was used to determine the additive effects of SDOH features on the performance of the ML model in predicting mortality in Appalachian colorectal cancer patients. Because cause-of-death data was unavailable in one of the hospitals' EHRs, the ML prediction model classified patients as surviving < 5 or ≥ 5 years irrespective of whether the cause of death was CRC-related or due to other factors. To test our hypothesis, we developed 4 ML models with the following feature inputs: Model 1: demographic and clinical features; Model 2: demographic and SDOH features; Model 3: demographic, clinical, and SDOH features; Model 4: clinical and SDOH features (Table 1). The dataset was randomly divided into training, hold-out test, and validation datasets utilizing a 70:20:10 split, and the hold-out test dataset was never exposed to the model during training. To evaluate the choice of algorithm, we compared XGBoost with two other classification methods, logistic regression and K-nearest neighbors. XGBoost delivered superior accuracies compared to both logistic regression and K-nearest neighbors (with number of neighbors = 5). A comparative table showing predictive model accuracies by classification technique for each of the four data models is included as supplemental material.
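The four feature-set models can be expressed as simple column groupings (the column names below are illustrative placeholders, not the exact EHR field names):

```python
# Hypothetical column names for each feature family.
DEMOGRAPHIC = ["age_at_diagnosis", "sex", "race", "ethnicity"]
CLINICAL = ["summary_stage", "grade", "primary_site", "year_of_diagnosis"]
SDOH = ["marital_status", "rurality", "employment_status", "insurance_status"]

MODEL_FEATURES = {
    "Model 1": DEMOGRAPHIC + CLINICAL,         # control model
    "Model 2": DEMOGRAPHIC + SDOH,
    "Model 3": DEMOGRAPHIC + CLINICAL + SDOH,  # all feature families
    "Model 4": CLINICAL + SDOH,
}
# Each model is then trained on X[cols] for its own column subset.
```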

Table 2 Patient demographics, clinical, and social determinants of health (SDOH) characteristics.

Statistical analysis

The confidence intervals (CIs) for AUROC were calculated using a bootstrapping method in which a subset of patients from the hold-out test dataset was randomly sampled with replacement and the AUROC was calculated from those patients' data, repeated 1000 times41. From these bootstrapped AUROC values, the middle 95% range was selected as the 95% CI for the AUROC. As the sample size of the hold-out test dataset was sufficiently large, the CIs for other metrics were calculated using normal approximation42. Because demographic and clinical features are traditionally used for mortality/survival prediction, Model 1 (demographic + clinical features) was considered our control ML model. The difference between the metrics of Models 2, 3, and 4 and those of Model 1 was assessed using a two-sided t-test at the 0.05 significance level (Fig. 2).
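The bootstrap procedure for the AUROC confidence interval can be sketched as (the function name is ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC: resample the hold-out test set
    with replacement n_boot times, then take the middle (1 - alpha) range."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes present
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```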

Fig. 2
figure 2

Impact of SDOH features on AUROC when predicting CRC survival in Appalachians. The machine learning algorithm performance was greatest in Model 3 (0.790). All models performed better than the baseline. Model 1 – demographics + clinical features. Model 2 – demographics + SDOH features. Model 3 – demographics + clinical + SDOH features. Model 4 – clinical + SDOH features.

Results

Patient population and characteristics

Demographics, clinical characteristics, and SDOH variables for colorectal cancer patients from each site are shown in Table 2. Of the 7,718 patients included in the analysis, mean age at diagnosis was ~ 67 years and ~ 50% of the patients were female. Nearly all patients were white and non-Hispanic, and approximately 55% were married. Most patients in the St. Elizabeth dataset were non-rural, whereas the majority of patients in the Pikeville and TCSC datasets were rural. Additional information regarding cancer primary site, summary stage, and grade, as well as employment and health insurance, is included in Table 2. Due to missing information on survival status (the target feature), 14 patient records were removed from the dataset before analysis.

ML model performance

The AUROC curves for the hold-out test datasets for 4 ML prediction models along with the baseline are shown in Fig. 2. Overall, the ML prediction models achieved a range of classification performance for patients with an AUROC of 0.780 (95% CI = 0.757–0.801) for Model 1 (demographic and clinical features), 0.627 (95% CI = 0.598–0.655) for Model 2 (demographic and SDOH features), 0.790 (95% CI = 0.767–0.812) for Model 3 (demographic, clinical, and SDOH features) and 0.771 (95% CI = 0.748–0.794) for Model 4 (clinical and SDOH features). The strongest performance was observed in Model 3, which was significantly greater than Model 1 (P < 0.001).

The complete list of performance metrics for the ML prediction models is shown in Table 3, and confusion matrices are shown in Fig. 3 to provide a visual representation of the ML prediction model's output for the hold-out test groups. For all metrics, including TPR, Model 3 demonstrated the greatest performance of the 4 ML-based prediction models (Fig. 4).

Fig. 3
figure 3

Confusion matrices provide a visual representation of the machine learning prediction model's output for the hold-out test datasets in each model. Top left: true negative (TN); top right: false positive (FP); bottom left: false negative (FN); bottom right: true positive (TP). Y-axis: true value; X-axis: predicted value. 0 = died; 1 = survived; SDOH = social determinants of health.

Feature importance

A SHAP analysis was used to evaluate the importance of individual input features in generating each model's predictions, ranked top to bottom in decreasing order of importance. The top features for Model 1 (demographic and clinical features) included age at diagnosis, cancer summary stage, year of diagnosis, and cancer grade. The importance of SDOH features for Models 2–4 varied depending on which other variables (demographic or clinical) were included in the model. In Model 2, which included demographic and SDOH variables, the most important SDOH features were rurality, employment status, and marital status. More importantly, in Model 3, which included all demographic, clinical, and SDOH variables, rurality was among the top 3 features and had a greater impact on the model's performance than traditional demographic and clinical features. Rurality was also among the top 3 features in Model 4 (Fig. 4). Among demographic variables, race and ethnicity were the lowest-impact features across Models 1, 2, and 3. Primary tumor site and radiotherapy were consistently the lowest-impact clinical features across all models as well.

Fig. 4
figure 4

SHAP feature importance plots show the importance of input features (from top to bottom) that contributed to the discriminative ability of each of the four machine learning prediction models. SHAP = SHapley Additive exPlanations. Red: high importance; Blue: low importance.

Table 3 Performance metrics for the ML prediction model: impact of SDOH features

Discussion

This study is one of the first to determine the impact of SDOH variables in predictive modeling of CRC mortality/survivorship in an underserved population with CRC disparity. We developed a gradient-boosted tree ensemble (XGBoost) ML model that demonstrated good classification performance for survival prediction in Appalachian CRC patients using local demographic and clinical EHR data, as demonstrated by an AUROC of 0.780 and TPR of 0.769 (Model 1). The addition of SDOH variables to the model resulted in a modest but significant improvement in AUROC and TPR (Model 3). Although building the best-performing model was not the primary goal of this study, we were able to develop a strong ML-based model to classify CRC patient mortality using simple demographic, clinical, and SDOH data as feature inputs. These findings align with other ML-based studies designed to assess CRC disparity43,44. High CRC incidence and mortality in rural Appalachians have been attributed to several SDOH factors, including higher poverty, uninsured rates, and unemployment, and healthcare access limited by geographic location in this region6,8,11. In the present study, we evaluated the impact of SDOH on CRC survival in Appalachia by measuring the impact of SDOH features on ML-based CRC survival prediction. We utilized a representative, locally sourced EHR dataset from Appalachian cancer care centers to demonstrate the impact of SDOH features in this population by building several ML models which included SDOH features in various combinations, with and without traditional demographic and clinical predictors. Our collective findings provide insight into the key features that contribute to ML-based prediction of CRC survival in this unique population. First, using easily accessible and traditional demographic and clinical features (see Table 1), the ML model achieved a strong performance (Model 1; Fig. 2; Table 3).
Second, the combination of demographic and SDOH variables, without clinical variables, had the weakest performance (Model 2); however, the combination of clinical and SDOH features without demographic variables still resulted in a robust performance (Model 4). These data, along with SHAP values (Fig. 4), clearly indicate that clinical features are highly important in CRC mortality, as expected. Most importantly, the addition of SDOH features to both demographic and clinical features resulted in the strongest-performing ML model and significantly improved CRC survival prediction (Model 3). Additionally, SHAP analysis for this model indicates that rurality was among the top 3 features contributing to model performance. Further, health insurance status (6th) and employment status (9th) were also within the top features.

These latter findings indicate that non-clinical SDOH factors such as poverty, high uninsured rates, and unemployment may contribute to the high cancer-related mortality seen in rural populations, and these SDOH factors should be considered when using ML or other methods to address disparities. These findings show that SDOH variables impact CRC patient survivorship and that addressing SDOH factors may improve outcomes and quality of life for CRC patients, especially those in medically underserved populations unequally burdened by SDOH factors, like those seen in rural Appalachia. Allocation of healthcare system resources towards modifiable SDOH factors would be a viable strategy to reduce cancer deaths. For example, expanding Medicaid, providing transportation or mobile health clinics, and providing supplemental food or housing resources to unemployed cancer patients could further improve CRC survival within these and other underserved populations. Our findings align with non-ML-based studies designed to assess the impact of SDOH factors on CRC survival45,46,47. Further quantification of the effects of SDOH on CRC survival may contribute to the development of adjuvant SDOH-based policies and treatment interventions that improve cancer survivorship outcomes.

Several other studies have evaluated the performance and challenges associated with ML-derived CRC prediction models. Burnett et al. reviewed the strengths and weaknesses of CRC risk-prediction models that used routinely collected health registry data13. They found that tree-based models outperformed other models and were more interpretable. Rahman et al. evaluated prediction of CRC using ML algorithms in large datasets of global dietary data48. Their findings demonstrate the importance of using non-clinical data features for early detection of CRC in a large dataset that includes younger and older adults. Ting et al. developed an ML-based model to predict CRC tumor recurrence in survivors27. They identified drinking behavior as an important predictive SDOH feature which should be monitored to improve CRC treatment strategies. Our findings fill gaps in the current literature by demonstrating the impact of non-clinical SDOH features on CRC survival in an underserved, disparately affected population.

It is also of note that this study utilized EHR datasets directly acquired through community partnerships established within the target population, while many ML-based studies commonly use public datasets. It is known that selection, representation, and evaluation biases may result when using public datasets to generate real-world evidence, therefore we utilized EHR data provided directly from Appalachian community cancer care centers to minimize these risks49,50.

Study limitations

There are limitations that should be considered. First, the dataset was missing cause-of-death information from one hospital. This information may have been used to better relate patient mortality to the CRC diagnosis. As a result, our analysis evaluated all-cause mortality, which may not adequately capture the direct relationship between SDOH and CRC mortality. However, this can be reevaluated in future studies. Second, our data processing methods did not address the effects of patient age as a continuous variable within the model, which contributes to its high feature importance and wide distribution on SHAP analysis. Regression algorithms are commonly applied to examine relationships among continuous variables; however, this study identified XGBoost as the best-performing algorithm, as all variables other than age at diagnosis were categorical. This can be addressed in future studies by recategorizing the continuous age variable into age groups during preprocessing. Third, in this study we were unable to evaluate the effects of SDOH factors related to low healthcare access and transportation, which are prevalent in rural Appalachia. Patients' mode of transportation is not commonly accounted for in EHRs. Available zip code and county data can be used to evaluate this in the future. Further, we did not obtain treatment and diagnosis dates in this dataset to measure the length of time in days between primary CRC diagnosis and initiation of treatment. This information would allow for evaluation of cancer treatment delays, which can be related to SDOH factors in disparately affected populations like rural Appalachia. Fourth, EHRs often have quality issues (e.g., missingness, misclassification, measurement error) and the data is largely unstructured (e.g., diagnostic notes, patient experience, etc.). Our dataset is relatively small, with about 7,700 records after combining three hospital datasets with the same features.
The achieved performance (AUROC of ~ 0.79) is superior to that of the linear regression models for binary classification we explored as part of this study. We believe that performance can be further improved by collecting additional patient records and addressing the missing data (e.g., marital status in TCSC, transportation in St. Elizabeth, etc.). Fifth, our sample was primarily White and non-Hispanic, which mirrors the rural US population, which is 76% non-Hispanic White. The lack of racial-ethnic diversity within the sample could affect the reliability of this specific predictive model if used in more diverse, non-rural populations. As currently designed, this ML model would best be utilized only in rural populations or those with similar cancer mortality rates; similar poverty, uninsured, and unemployment rates; and similar racial-ethnic distributions. However, it is important to note that the ML methods applied in this study are broadly applicable. Lastly, this study was limited by the unavailability of SDOH data in specialty care EHRs. The collection of SDOH data is a more recently adopted practice which generally occurs at the primary care level. Conducting studies such as this one, which demonstrate the effects of SDOH on clinical outcomes, could lead to increased collection of SDOH data and the development of supportive SDOH policies and procedures that improve clinical outcomes related to cancer and other chronic diseases.

Conclusions

Colorectal cancer is a recognized cause of cancer-related death that unequally affects the rural Appalachian population. Although the impact of non-clinical social and environmental factors on cancer outcomes has been well described, SDOH factors are not commonly included in machine learning approaches to address CRC disparity. We used a gradient-boosting machine learning algorithm to test four different models to predict CRC survival based on EHR data which included SDOH features. Our best-performing model combined demographic, clinical, and SDOH features. Feature stratification showed that rurality, insurance status, and employment status were important to model performance, with rurality being a top feature overall. Compared to previous methods, our model successfully demonstrated the impact of SDOH features in ML-based prediction of CRC survival in a dataset that well represented the rural Appalachian population. In future research, we will gather a larger EHR dataset which covers a broader population of rural patients and includes additional SDOH variables to improve model performance and more accurately quantify the effects of other SDOH features on CRC survival. Further ML-based evaluation of the impact of SDOH on cancer disparity can facilitate the establishment of best practices in community-based AI/ML research to address cancer disparities in underserved communities.