Introduction

Clinical background

On the last day of 2019, the WHO received information about 44 cases of pneumonia-like disease in Wuhan city, China1. By 5 September 2022, more than 600 million cases of SARS-CoV-2 infection had been reported across all continents, regions, and most countries, resulting in nearly 6.5 million deaths2.

COVID-19, the disease caused by infection with SARS-CoV-2, has a high mortality rate in hospitalised patients with deaths predominantly caused by respiratory failure3. It continues to this day to be a challenging global pandemic with significant morbidity and mortality4. As Knight et al.5 indicate, prognostic models that can predict outcomes among COVID-19 patients can be used to support clinical decision-making regarding hospital treatment and prioritisation. One such score is the 4C score that includes data about patient comorbidity, abnormal physiology, and inflammation using routinely measured data, bedside observations, and biochemistry tests6. While in most cases COVID-19 is a mild illness, those at highest risk of death and severe complications usually are hospitalised some time after onset7.

Pulmonary embolism (PE) is among the most severe and preventable complications of COVID-19 characterized by increased D-dimer levels and high thrombosis risk that has been repeatedly reported across different countries8. Studies suggest PE incidence rates above 15% in the ICU for COVID-19 patients and early recognition of its risk factors can help in identifying urgent treatment with anticoagulation therapy to those most in clinical need4,9. Recent international studies additionally suggest COVID-19 as a key risk factor for pulmonary embolism both in the short- and long-term9,10. Existing PE prediction models are limited in part because they were developed for non-COVID-19 patients and traditional risk factors for PE may not be as predictive. If risk models can be developed for assessing occurrence of PE in COVID-19 patients across different countries, that can be an important step forward in preventing this serious complication of COVID-19, especially given the current epidemiological situation9.

As for risk factors that contribute the most to the occurrence of mortality and pulmonary embolism in COVID-19 patients, age has been established as the dominant predictor of mortality11. Furthermore, studies have described other risk factors of COVID mortality such as cardiovascular disease, chronic respiratory disease, diabetes, hypertension, smoking, and obesity12.

Technical background

Machine learning has been applied to different COVID-19 related questions. Large amounts of patient data are being generated during the COVID-19 pandemic which can be useful for predictive modelling. Using machine learning with large amounts of complex patient data could generate accurate and patient-specific predictions and assist clinicians.

Previous research includes13 exploring in-hospital mortality with logistic regression on just 191 patients and14 have followed with multi-center validation with 299 patients for internal training and 145 patients for external validaton.15 have looked at regression-based predictions of all-cause mortality with hospital admission time as a predictor and using hazard models yet their results have also been limited due to a smaller dataset restricting generalisability. All of these studies have used a combination of demographics, comorbidities, symptoms, laboratory tests, and self-reported onset times.

In this study, we investigated how pulmonary embolism and all-cause mortality vary across subgroups of a large and international cohort. We also show how predictive certain clinical factors gathered from patients with COVID-19 can be to the respective outcomes. In studies looking at predicting thromboembolism more broadly, a defining limitation for impactful and generalisable application of machine learning methods has been a small patient sample and a lack of systematic comparison of algorithms16. Applying a diverse set of methods to one of the largest and most diverse datasets on hospitalised patients with COVID-19 can help find the best mechanism for risk prioritisation of patients in a timely way and may help reduce mortality and risk of PE in those with COVID-19.

Results

Variable distributions can be seen in Tables 123, and 4. A detailed collection of figures for variable distribution across age groups can be found in the Supplementary.

Table 1 Baseline characteristics stratified by occurrence of PE (median and IQR used for lab measurements).
Table 2 Baseline characteristics stratified by occurrence of PE (continued).
Table 3 Baseline characteristics stratified by death (median and IQR used for lab measurements).
Table 4 Baseline characteristics stratified by death (continued).

Several variables were highly correlated with PE and death (Supplementary Figures 3 and 4, Tables II and III). Multivariable logistic regression shows high association of country, age, alpha variant, and certain symptoms with PE and death (Figs. 1 and 2). Tables with p-values are included in Tables 56, and 7.

Figure 1
figure 1

Adjusted odds ratios for PE with 95% confidence intervals.

Figure 2
figure 2

Adjusted odds ratios for death with 95% confidence intervals.

Table 5 Adjusted odds ratios of features with 95% confidence intervals (only Spain and UK patients included for PE).
Table 6 Adjusted odds ratios of features with 95% confidence intervals (only Spain and UK patients included for PE).
Table 7 Adjusted odds ratios of features with 95% confidence intervals (only Spain and UK patients included for PE).

The Cox proportional hazards model without regularisation yielded a C-index of 0.71 and the forest plot shows high hazard ratios for age, certain regions of admission, and specific symptoms (Fig. 3 and Tables 8 and 9).

Figure 3
figure 3

Adjusted hazard ratios for mortality with 95% confidence intervals.

Table 8 Adjusted hazard ratios for mortality.
Table 9 Adjusted hazard ratios for mortality (continued).

The Kaplan-Meier curves for risk stratification across age, sex, and region groups show clear difference in risk with older men and those in South Asia and the Middle East with the lowest rates of survival (Fig. 4).

Figure 4
figure 4

Kaplan–Meier survival curve for COVID-19 patients stratified by a) age, b) sex, and c) region.

Tables 1011, and 12 show superior performance of the XGBoost model across all 3 test sets. Similarly, XGBoost maintains sensitive and accurate prediction of death compared to other alternative models (Table 13). The validation scores are for the combined UK and Spain set.

Table 10 Prediction model results for PE on test set with UK and Spain (F1-w is the weighted F1 score).
Table 11 Prediction model results for PE on UK test set (F1-w is the weighted F1 score).
Table 12 Prediction model results for PE on Spain test set (F1-w is the weighted F1 score).
Table 13 Death prediction model results for test set (F1-w is the weighted F1 score).

The model also maintains high predictive performance across various subgroups of the patient population stratified across sex and age (Tables 14 and 15).

To further evaluate our model, we test it on held-out test data with specific patient population subgroups including men, women, and different age groups as can be seen in Tables 14 and 15. Our model shows reliable prediction for PE and mortality in both men and women without a significant difference in performance for each group, whereas for age groups there is greater variation in results as compared to sex differences but it remains relatively consistent in predictive performance.

Table 14 Prediction model results stratified across sex and age groups for PE (F1-w is the weighted F1 score).
Table 15 Prediction model results stratified across sex and age groups for death (F1-w is the weighted F1 score).

Taking the best performing XGBoost model and applying 2 different feature importance methods, average f1-score gain across splits and Shapley values, we obtain the results seen in Figs. 567, and 8. A feature importance stratification on a held-out test set of only men and only women separately for either PE or mortality prediction is also included in Figs. 91011, and 12. As further clarification for the SHAP plot, darker colour indicates that a higher value of that feature contributes to the prediction either positively (if on the right hand side of the vertical line) or negatively (if on the left hand side of the vertical line). Higher placement of the feature vertically in the plot means it has a higher mean Shapley value and hence contributes more to correct predictions in the model.

Figure 5
figure 5

Feature importance from XGBoost PE prediction model using F1-score gain method (average contribution of each feature to predictive performance).

Figure 6
figure 6

XGBoost feature importance with SHAP for PE. The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Figure 7
figure 7

Feature importance from XGBoost mortality prediction model using F1-score gain method (average contribution of each feature to predictive performance).

Figure 8
figure 8

XGBoost feature importance with SHAP for mortality. The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Figure 9
figure 9

XGBoost feature importance with SHAP for PE (only men). The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Figure 10
figure 10

XGBoost feature importance with SHAP for PE (only women). The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Figure 11
figure 11

XGBoost feature importance with SHAP for mortality (only men). The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Figure 12
figure 12

XGBoost feature importance with SHAP for mortality (only women). The values in the legend being higher or darker colour in the plot correspond to higher values of that feature contributing to the prediction either for stronger positive prediction (more colour points for the feature on the right side of the vertical line) or stronger negative prediction of outcome otherwise.

Discussion

To our knowledge, this multi-center dataset is the largest international cohort of hospitalised COVID-19 patients available. Our analysis showed that patients with PE are older, more often male, white, from higher income countries, and are more likely to suffer from: asthma, chronic cardiac disease, chronic kidney disease, chronic neurological disease, chronic pulmonary disease, hypertension, cancer, obesity, rheumatologic conditions, or smoke.

The occurrence of pulmonary embolism in our study population was 0.7% and our results showed a significant association between confirmed PE and mortality when compared with patients without PE as has been similarly found in patients without COVID-1917.

Accordingly, our logistic regression models for PE and death showed that different age-groups experience different risks of either outcome. The age group 40-80 was at highest odds of having PE, and those >60 of dying as can be seen in the Kaplan-Meier curve in Fig. 4a. Symptomatic COVID-19 patients were almost 3 times more likely to experience PE while also being more likely to die. Within symptoms and comorbidities, shortness of breath, chest pain, obesity, and bleeding were associated with higher odds of a PE, followed by hypertension and loss of smell. The regionality of the data must be addressed in the higher odds of death in South Asia, Middle East and North Africa (MENA), and South Africa compared to Europe and Central Asia as the hospital centers in those communities have different challenges and circumstances when it comes to fighting the pandemic. Symptoms like shortness of breath, confusion, severe dehydration, and wheezing were present in COVID-19 patients with higher odds of death, and comorbidities such as malignant neoplasm, diabetes, and chronic kidney or liver disease also lead to higher risk of death. For both correlation and odds of PE and death, men were more at risk. This is shown in the Kaplan–Meier curves for survival stratified across sex in Fig. 4b.

The hazard ratios confirmed those over the age of 60 were at highest risk of death, especially those COVID-19 patients who experienced shortness of breath, severe dehydration, confusion, and had pre-existing chronic conditions. Regionality of hospital admission was once again an important risk factor for death. Interestingly, patients with PE, chest pain, asthma, or fever seemed to have lower risk associated which could be due to earlier and easier detection of these symptoms and conditions in the progression of the disease.

Seeking to combine this clinically insightful information for outcome prediction, we developed a fast prediction model with XGBoost for both PE and death in COVID-19 hospitalised patients, and tested it in different countries separately. We also showed that appropriate class weighting can help with class imbalance and even outperform ensemble resampling methods without having to sacrifice the interpretability of the model (Tables 1013). The differences between measured performance on UK and Spain test sets as evaluated by sensitivity and accuracy are due to different class-imbalance ratios and positive case distributions between the datasets. It is important to note that the metric to focus on for our purposes are the validation and test AUC which remain consistent between the two datasets at around 75% as it is the most robust metric in the face of extreme class-imbalance. Since the class-imbalance varies between the two datasets as well, other metrics like sensitivity and accuracy will be significantly impacted despite attempts to dampen it but due to only a few percentages of positive cases, the potency of our approaches can only be limited. The best-performing model for PE prediction evaluated across separate held-out UK-only data, Spain-only data, and UK and Spain data combined is XGBoost without undersampling and without rigid thresholding using robust class weighting. As for death, the XGBoost again outperformed all other models including the ensemble with XGBoost on some metrics.

Since our XGBoost model outperformed other methods, we also showed that the best method for handling class imbalance is through robust class weighting and compared it to other methods for imbalance handling like ensembles and resampling methods. Another advantage of this method is that it avoids introducing bias like in the case of resampling. Finally, XGBoost provides feature importances which was useful for explaining clinical risk prediction of the model to healthcare professionals and policy-makers.

Exploring two different interpretability methods for XGBoost, average gain across splits and Shapley values, showed that the time of dominant presence of the alpha variant, age, fever, shortness of breath, and hypertension were the key predictors for PE, followed by region of admission, sex, and chest pain. Recent work has alluded to an association between the alpha variant and occurrence of thromboembolisms in mice but further research relevant to human samples is missing18. Age was a complex non-linear predictor with different age groups corresponding to varying risks. The clear colour separation for the Shapley values for age in Fig. 8 showed how each age group has a clearly separable predictive value for mortality with older groups having higher risk but which is not the case for PE as younger age groups can be more predictive of higher PE risk. Furthermore, Shapley values analysis identified obesity, smoking, and the presence of cough as important predictors for PE whereas the default XGBoost method does not. The most predictive features for all-cause mortality were age, region of hospital admission, sex, diabetes, and shortness of breath whereas the default method highlights hypertension and obesity in addition. For mortality, higher values of region corresponded to samples from South Asia and South Africa. Comparing all of the top identified predictors across these models for all outcomes can be seen in Tables 16 and 17 where certain symptoms and comorbidities have been identified to be universally predictive risk factors right at-admission without any additional measurements having to be taken for PE and mortality risk assessment.

Table 16 Features of Significant Importance for PE and Mortality Prediction According to Different Models (For XGBoost Top 20 SHAP Value Features Were Taken as Important, and for Logistic Regression and Cox model significance was taken as \(p<\)0.005).
Table 17 Features of Significant Importance for PE and Mortality Prediction According to Different Models (For XGBoost Top 20 SHAP Value Features Were Taken as Important, and for Logistic Regression and Cox model significance was taken as p < 0.005).

The pulmonary embolism and mortality prediction model can help with management of COVID-19 as it uses standard demographics, comorbidity, and symptom data collected at admission for identifying patients most at risk of developing PE which may enable an earlier start of targeted anticoagulation therapy. Our mortality risk prediction model can also help with patient population risk assessment and prioritisation across different regions of the world.

A strength of the current study is that a combination of machine learning and traditional statistical modeling can offer a more reliable system for predictive risk forecasting. XGBoost provides at-admission prediction of both events, while odds and hazards ratios obtained from logistic regression and the Cox proportional hazards model give us an insight into stratified risk and global feature importance. We systematically compare our XGBoost model with different risk prediction algorithms. Our model also outperforms recently published results across a variety of metrics like AUROC and sensitivity despite being developed on a much larger and more heterogeneous and diverse dataset while being robust to class imbalance19. With existing scores built on non-COVID-19 data like The National Early Warning Score 2, there is insufficient information available on their reliability in the COVID-19 setting, and some have been found to underestimate mortality20. Our model is able to deploy at admission for both PE and death risk prediction and can help supplant these needs rapidly.

The study, however, has several limitations. First, almost 60% of patients who died did so in South Africa, and over 70% of PE cases were located in the UK. This may be due to limited access to d-dimer tests or CT scans. There were no mandatory diagnostic criteria in the ISARIC CRF for PE. The absence of a control group of patients without COVID-19 in this dataset prevented estimation of specificity. The patient cohort comprised of hospitalised patients with confirmed COVID-19 who had a mortality rate of 21.7%. These models are not for use in the community and could still perform differently in populations at lower risk of death and across different regions of the world. As part of future work, dependent on sufficient data, PE and death could be modelled with a comprehensive multi-state statistical framework, which incorporates the interrelations among survival, PE, and death states.

In conclusion, the set of decisions taken must include different stakeholders like patients, clinicians, hospital administrators, researchers, and data procurers so that trade-offs can be identified and context-informed decisions can be taken to address them, especially if our models could have missed harms or benefits to different groups and communities.

Methods

Data

In this work, we use data of COVID-19 patients from The International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC), a repository that standardises and secures data on COVID-19 assembled from a global cohort over 2 years of the pandemic as of January 2022. It includes so far data on over 800,000 patients from 53 countries. These data capture the global experience of the first 2 years of the pandemic21. The clinical characterisation protocol underwent ethical review by the World Health Organization Ethics Review Committee and ethics approval was obtained for each participating country and site according to local requirements. Ethics Committee approval was given by the WHO Ethics Review Committee (RPC571 and RPC572, 25 April 2013). Institutional approval was additionally obtained by participating sites including the South Central-Oxford C Research Ethics Committee in England (Ref. 13/SC/0149), the Scotland A Research Ethics Committee (Ref. 20/SS/0028) for the UK and the Human Research Ethics Committee (Medical) at the University of the Witwatersrand in South Africa as part of a national surveillance programme (M160667), which collectively represent the majority of the data. Other institutional and national approvals are in place as per local requirements. This is a secondary analysis of data collected, with appropriate local permissions, and each institution signed a Terms of Submission in which committed that they had the appropriate permissions in place. All methods were performed in accordance with the relevant guidelines and regulations.

The study population consisted of all patients with either clinically diagnosed or laboratory confirmed COVID-19 admitted to the participating hospitals. The aim of the recruiting sites was to use a consecutive sample.

The dataset contains 800,459 patients and 182 variables. The mean age of patients was 56.4 (20.9), 48.6% were male, and the majority of cases were from South Africa (54.0%) and the United Kingdom (34.1%). 65.3% of patients were discharged alive and 20.4% died. We grouped countries with fewer than 60 individuals into a single category. Out of all patients, 5450 (0.7%) experienced a pulmonary embolism, 73 experienced thromboembolism, and 143 experienced deep vein thrombosis. We define our outcome of interest as the main pulmonary embolism (PE) diagnosis for subsequent analysis. 4653 (82.1%) of the PE cases were recorded in the United Kingdom (UK) and Spain which based on our knowledge makes our study the largest study of its kind for PE to date. Due to similar data collection patterns and recording, we used data from these two countries only for PE modelling as they contain the vast majority of reported PE cases.

Data preprocessing

Since treatment information does not have reliable timestamps for most patients, the following variables were used in the analysis for PE: demographics (including age, sex, country), comorbidities (hypertension, diabetes, smoking etc.), and symptoms (coughing, fever, fatigue etc.). The presence of diagnosis during domination by the alpha variant was also included (after December 2020) due to its possible association with incidence of PE. In our modelling of PE, we used data from patients from the UK and Spain only and did not use laboratory measurements or imputation methods. 269,373 patients and 45 variables remained for PE, and 734,282 patients and 55 variables for death. Age was grouped into 5 categories (0–20, 20–40, 40–60, 60–80, 80–120) with the distributions seen in Figs. 13 and 14 below. The symptomatic variable represents any symptoms reported for a patient. For number of days from admission to event (death), we removed outliers of more than 200 days and those in the negatives, thereby removing 1342 patients.

Figure 13
figure 13

Age distribution for all patients stratified by death outcome.

Figure 14
figure 14

Age distribution for UK and Spain patients.

Prior to processing the data, for PE prediction, we held out 3 test sets of 20% of the total dataset for independent testing, one of which would only include patients from Spain, one only including patients from the UK, and another including both. For mortality prediction testing, we held out 20% of the total dataset sample. A workflow diagram for data processing and system design is shown in Fig. 15.

Figure 15
figure 15

Flowchart of framework with machine learning model to predict the risk of PE and mortality at admission.

Stratified Kaplan–Meier curves by age, sex, and region of admission were plotted using Cox proportional hazards models while machine learning methods were applied for prediction of PE or death.

Baseline and machine learning methods

The reference groups for statistical analysis for age were those under 20 years old, for region it was East Asia, and for country variable it was Norway. For Cox proportional hazards model, proportionality assumption was verified through visualisation of the survival curves and observing parallel behaviour as seen in Fig. 4. We investigated several prediction methods for PE occurrence and death, including logistic regression, Linear Discriminant Analysis (LDA), naive Bayes classifier, random forests, ADABoosting algorithms, and the high-performing extreme gradient boost machine (XGBoost)22,23. Previous studies looking at tree-based algorithms such as XGBoost have highlighted its capacity to learn the correlations between covariates well when it comes to mortality prediction in COVID-19 patients while also being somewhat interpretable24. We applied all of these methods for the purposes of a systematic comparison using 5-fold cross-validation, several hold-out test sets stratified across countries and regions, and evaluated with multiple metrics. A list of methods applied can be seen in Table 18 with details in Supplementary.

Table 18 Machine learning methods investigated.

As is often the case in disease prediction, there is class imbalance with about 1.7% of UK and Spain patients having been diagnosed with PE and around 20.4% having died in the case population. To address this, we use other metrics mentioned above in the evaluation of our models besides accuracy as it does not capture the true predictive performance of our models and we rely more on sensitivity and the F1 score. We also use a different threshold for prediction after probability estimation instead of the default 0.5 to achieve cost-sensitivity, and we apply random undersampling at a minority:majority ratio of 1:4 as has been highlighted in other work25,26. We evaluate these methods both separately and in combination to investigate the best approach for this set of prediction problems.

To address imbalance in predictions, we applied either undersampling, thresholding, or both. As for death, due to a much softer imbalance, undersampling was not necessary. We also added class-weighting to our XGBoost model using inverse proportions and compared it with the other methods to address class imbalance. All confusion matrices and parameter details for each model can be found in the Supplementary.

Furthermore, we build an ensemble that combines AdaBoosted decision trees with robust undersampling using different subsets for resampled training so as to address the imbalance and compare this cost-sensitive model with our best performing model and add further confidence in its ability to generalise in an imbalanced scenario27. We extend the ensemble learning methods by using our own XGBoost model in the ensemble structure instead of the standard AdaBoosted decision trees. The number of trees was a tunable hyperparameter listed in the Supplementary (Tables IVVI). We compare our cost-sensitive class-weighted XGBoost machine learning model with these resampling ensembles to show improved performance without the need of introducing bias like in the case of resampling while maintainting interpretability.

Model validation and evaluation

We proceed to tune our machine learning models and validate them using stratified 5-fold cross-validation with Bayesian optimisation. We repeated the optimisation procedure for 50 iterations after which we evaluated the model on the independent test set with the following metrics: AUROC, Accuracy, Weighted F1 Score, and Sensitivity. The details can be found in the Supplementary.

While existing studies referenced in the Introduction mention some approaches to feature importance estimation for COVID-19 mortality and outcome prediction as well as for other problems, rarely does one find several interpretability methods compared and contrasted in one scenario. We implemented both tree-based F-score interpretability methods as well as Shapley values analysis, logistic regression and Cox regression, and hope to draw interesting conclusions from each and their comparisons28,29. A full explanation for the Shapley values method and its details can be found in the Supplementary materials.

Role of the funding source

The funder had no role in study design, data collection, data analysis, data interpretation, writing of the report, and decision to submit the paper for publication.