Abstract
By September 2022, more than 600 million cases of SARS-CoV-2 infection had been reported globally, resulting in over 6.5 million deaths. COVID-19 mortality risk estimators, however, are often developed with small, unrepresentative samples and have methodological limitations. Pulmonary embolism (PE) is one of the most severe preventable complications of COVID-19, so it is important to develop predictive tools for PE in COVID-19 patients. Early recognition can help provide life-saving targeted anticoagulation therapy at admission. Using a dataset of more than 800,000 COVID-19 patients from an international cohort, we propose a cost-sensitive gradient-boosted machine learning model that predicts the occurrence of PE and death at admission. Logistic regression, Cox proportional hazards models, and Shapley values were used to identify key predictors for PE and death. Our prediction model had test AUROCs of 75.9% and 74.2% and sensitivities of 67.5% and 72.7% for PE and all-cause mortality, respectively, on a highly diverse held-out test set. The PE prediction model was also evaluated separately on patients in the UK and Spain, with test results of 74.5% AUROC and 63.5% sensitivity, and 78.9% AUROC and 95.7% sensitivity, respectively. Age, sex, region of admission, comorbidities (chronic cardiac and pulmonary disease, dementia, diabetes, hypertension, cancer, obesity, smoking), and symptoms (any, confusion, chest pain, fatigue, headache, fever, muscle or joint pain, shortness of breath) were the most important clinical predictors at admission. Age, overall presence of symptoms, shortness of breath, and hypertension were found to be key predictors for PE using our extreme gradient boosted model. This analysis, based on what is to date the largest global dataset for this set of problems, can inform hospital prioritisation policy and guide long-term clinical research and decision-making for COVID-19 patients globally.
Our machine learning model developed from an international cohort can serve to better regulate hospital risk prioritisation of at-risk patients.
Introduction
Clinical background
On the last day of 2019, the WHO received information about 44 cases of pneumonia-like disease in Wuhan city, China1. By 5 September 2022, more than 600 million cases of SARS-CoV-2 infection had been reported across all continents, regions, and most countries, resulting in nearly 6.5 million deaths2.
COVID-19, the disease caused by infection with SARS-CoV-2, has a high mortality rate in hospitalised patients with deaths predominantly caused by respiratory failure3. It continues to this day to be a challenging global pandemic with significant morbidity and mortality4. As Knight et al.5 indicate, prognostic models that can predict outcomes among COVID-19 patients can be used to support clinical decision-making regarding hospital treatment and prioritisation. One such score is the 4C score that includes data about patient comorbidity, abnormal physiology, and inflammation using routinely measured data, bedside observations, and biochemistry tests6. While in most cases COVID-19 is a mild illness, those at highest risk of death and severe complications usually are hospitalised some time after onset7.
Pulmonary embolism (PE) is among the most severe and preventable complications of COVID-19, characterized by increased D-dimer levels and high thrombosis risk, and has been repeatedly reported across different countries8. Studies suggest PE incidence rates above 15% in the ICU for COVID-19 patients, and early recognition of its risk factors can help direct urgent anticoagulation therapy to those most in clinical need4,9. Recent international studies additionally suggest that COVID-19 is a key risk factor for pulmonary embolism in both the short and long term9,10. Existing PE prediction models are limited in part because they were developed for non-COVID-19 patients, and traditional risk factors for PE may not be as predictive in COVID-19. If risk models can be developed for assessing the occurrence of PE in COVID-19 patients across different countries, this would be an important step forward in preventing this serious complication of COVID-19, especially given the current epidemiological situation9.
As for risk factors that contribute the most to the occurrence of mortality and pulmonary embolism in COVID-19 patients, age has been established as the dominant predictor of mortality11. Furthermore, studies have described other risk factors of COVID mortality such as cardiovascular disease, chronic respiratory disease, diabetes, hypertension, smoking, and obesity12.
Technical background
Machine learning has been applied to different COVID-19 related questions. Large amounts of patient data are being generated during the COVID-19 pandemic which can be useful for predictive modelling. Using machine learning with large amounts of complex patient data could generate accurate and patient-specific predictions and assist clinicians.
Previous research includes Zhou et al.13, who explored in-hospital mortality with logistic regression on just 191 patients, and Xie et al.14, who followed with a multi-center validation using 299 patients for internal training and 145 patients for external validation. Alaa et al.15 looked at regression-based predictions of all-cause mortality using hazard models with hospital admission time as a predictor, yet their results were also limited by a smaller dataset that restricts generalisability. All of these studies used a combination of demographics, comorbidities, symptoms, laboratory tests, and self-reported onset times.
In this study, we investigated how pulmonary embolism and all-cause mortality vary across subgroups of a large and international cohort. We also show how predictive certain clinical factors gathered from patients with COVID-19 can be to the respective outcomes. In studies looking at predicting thromboembolism more broadly, a defining limitation for impactful and generalisable application of machine learning methods has been a small patient sample and a lack of systematic comparison of algorithms16. Applying a diverse set of methods to one of the largest and most diverse datasets on hospitalised patients with COVID-19 can help find the best mechanism for risk prioritisation of patients in a timely way and may help reduce mortality and risk of PE in those with COVID-19.
Results
Variable distributions can be seen in Tables 1, 2, 3, and 4. A detailed collection of figures for variable distribution across age groups can be found in the Supplementary.
Several variables were highly correlated with PE and death (Supplementary Figures 3 and 4, Tables II and III). Multivariable logistic regression shows a strong association of country, age, alpha variant, and certain symptoms with PE and death (Figs. 1 and 2). The corresponding p-values are reported in Tables 5, 6, and 7.
The Cox proportional hazards model without regularisation yielded a C-index of 0.71 and the forest plot shows high hazard ratios for age, certain regions of admission, and specific symptoms (Fig. 3 and Tables 8 and 9).
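For readers less familiar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index in plain Python. The function name and inputs are illustrative assumptions, not the code used in the study, and tied event times are handled in a simplified way (pairs with equal times are not counted as comparable).

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the patient with the
    higher predicted risk experiences the event earlier.

    times       -- observed follow-up time per patient
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output, larger meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair must be anchored on an observed event
        for j in range(n):
            if times[i] < times[j]:  # i had the event strictly before j's time
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk.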
The Kaplan–Meier curves for risk stratification across age, sex, and region groups show clear differences in risk, with older men and those in South Asia and the Middle East having the lowest rates of survival (Fig. 4).
Tables 10, 11, and 12 show superior performance of the XGBoost model across all 3 test sets. Similarly, XGBoost maintains sensitive and accurate prediction of death compared to other alternative models (Table 13). The validation scores are for the combined UK and Spain set.
The model also maintains high predictive performance across various subgroups of the patient population stratified across sex and age (Tables 14 and 15).
To further evaluate our model, we test it on held-out test data with specific patient population subgroups including men, women, and different age groups as can be seen in Tables 14 and 15. Our model shows reliable prediction for PE and mortality in both men and women without a significant difference in performance for each group, whereas for age groups there is greater variation in results as compared to sex differences but it remains relatively consistent in predictive performance.
Taking the best-performing XGBoost model and applying 2 different feature importance methods, average gain across splits and Shapley values, we obtain the results seen in Figs. 5, 6, 7, and 8. A feature importance stratification on a held-out test set of only men and only women, separately for PE and mortality prediction, is also included in Figs. 9, 10, 11, and 12. As further clarification for the SHAP plot, darker colour indicates a higher value of that feature, which contributes to the prediction either positively (if on the right-hand side of the vertical line) or negatively (if on the left-hand side of the vertical line). Higher vertical placement of a feature in the plot means it has a higher mean Shapley value and hence contributes more to the model's predictions.
XGBoost feature importance with SHAP for PE. Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
XGBoost feature importance with SHAP for mortality. Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
XGBoost feature importance with SHAP for PE (only men). Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
XGBoost feature importance with SHAP for PE (only women). Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
XGBoost feature importance with SHAP for mortality (only men). Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
XGBoost feature importance with SHAP for mortality (only women). Higher legend values (darker colour in the plot) correspond to higher values of that feature; points to the right of the vertical line contribute to a stronger positive prediction of the outcome, and points to the left to a stronger negative prediction.
Discussion
To our knowledge, this multi-center dataset is the largest international cohort of hospitalised COVID-19 patients available. Our analysis showed that patients with PE are older, more often male, white, and from higher income countries, and are more likely to smoke or to have asthma, chronic cardiac disease, chronic kidney disease, chronic neurological disease, chronic pulmonary disease, hypertension, cancer, obesity, or rheumatologic conditions.
The occurrence of pulmonary embolism in our study population was 0.7%, and our results showed a significant association between confirmed PE and mortality when compared with patients without PE, as has similarly been found17 in patients without COVID-19.
Accordingly, our logistic regression models for PE and death showed that different age groups experience different risks of either outcome. Those aged 40–80 had the highest odds of PE, and those over 60 the highest odds of dying, as can be seen in the Kaplan–Meier curve in Fig. 4a. Symptomatic COVID-19 patients were almost 3 times more likely to experience PE while also being more likely to die. Among symptoms and comorbidities, shortness of breath, chest pain, obesity, and bleeding were associated with higher odds of PE, followed by hypertension and loss of smell. The higher odds of death in South Asia, the Middle East and North Africa (MENA), and South Africa compared to Europe and Central Asia must be interpreted in light of the regionality of the data, as the hospital centers in those communities face different challenges and circumstances in fighting the pandemic. Symptoms such as shortness of breath, confusion, severe dehydration, and wheezing were associated with higher odds of death in COVID-19 patients, and comorbidities such as malignant neoplasm, diabetes, and chronic kidney or liver disease were also associated with a higher risk of death. For both correlation and odds of PE and death, men were more at risk. This is shown in the Kaplan–Meier curves for survival stratified by sex in Fig. 4b.
The hazard ratios confirmed that those over the age of 60 were at highest risk of death, especially COVID-19 patients who experienced shortness of breath, severe dehydration, or confusion, or who had pre-existing chronic conditions. Regionality of hospital admission was once again an important risk factor for death. Interestingly, patients with PE, chest pain, asthma, or fever appeared to have a lower associated risk, which could be due to earlier and easier detection of these symptoms and conditions in the progression of the disease.
Seeking to combine this clinically insightful information for outcome prediction, we developed a fast prediction model with XGBoost for both PE and death in hospitalised COVID-19 patients, and tested it in different countries separately. We also showed that appropriate class weighting can help with class imbalance and even outperform ensemble resampling methods without sacrificing the interpretability of the model (Tables 10, 13). The differences in measured sensitivity and accuracy between the UK and Spain test sets are due to different class-imbalance ratios and positive case distributions between the datasets. The metrics to focus on for our purposes are the validation and test AUC, which remain consistent between the two datasets at around 75%, as AUC is the most robust metric in the face of extreme class imbalance. Since the class imbalance also varies between the two datasets, other metrics such as sensitivity and accuracy are significantly affected despite our attempts to dampen this effect; with only a few percent of positive cases, the potency of our approaches is necessarily limited. The best-performing model for PE prediction, evaluated across separate held-out UK-only data, Spain-only data, and combined UK and Spain data, is XGBoost with robust class weighting, without undersampling and without rigid thresholding. As for death, XGBoost again outperformed all other models, including the ensemble with XGBoost, on some metrics.
Having found that our XGBoost model outperformed other methods, we also showed that the best method for handling class imbalance is robust class weighting, comparing it with other imbalance-handling methods such as ensembles and resampling. Another advantage of this method is that it avoids the bias that resampling can introduce. Finally, XGBoost provides feature importances, which were useful for explaining the model's clinical risk predictions to healthcare professionals and policy-makers.
Exploring two different interpretability methods for XGBoost, average gain across splits and Shapley values, showed that the period of alpha variant dominance, age, fever, shortness of breath, and hypertension were the key predictors for PE, followed by region of admission, sex, and chest pain. Recent work has alluded to an association between the alpha variant and the occurrence of thromboembolism in mice, but further research in human samples is missing18. Age was a complex non-linear predictor, with different age groups corresponding to varying risks. The clear colour separation of the Shapley values for age in Fig. 8 showed that each age group has a clearly separable predictive value for mortality, with older groups having higher risk; this is not the case for PE, where younger age groups can be more predictive of higher PE risk. Furthermore, Shapley value analysis identified obesity, smoking, and the presence of cough as important predictors for PE, whereas the default XGBoost method did not. The most predictive features for all-cause mortality were age, region of hospital admission, sex, diabetes, and shortness of breath, whereas the default method additionally highlighted hypertension and obesity. For mortality, higher values of region corresponded to samples from South Asia and South Africa. A comparison of the top predictors identified across these models for all outcomes is given in Tables 16 and 17; certain symptoms and comorbidities were identified as universally predictive risk factors at admission, without any additional measurements having to be taken for PE and mortality risk assessment.
The pulmonary embolism and mortality prediction model can help with management of COVID-19 as it uses standard demographics, comorbidity, and symptom data collected at admission for identifying patients most at risk of developing PE which may enable an earlier start of targeted anticoagulation therapy. Our mortality risk prediction model can also help with patient population risk assessment and prioritisation across different regions of the world.
A strength of the current study is that a combination of machine learning and traditional statistical modeling can offer a more reliable system for predictive risk forecasting. XGBoost provides at-admission prediction of both events, while odds and hazard ratios obtained from logistic regression and the Cox proportional hazards model give insight into stratified risk and global feature importance. We systematically compare our XGBoost model with different risk prediction algorithms. Our model also outperforms recently published results across a variety of metrics such as AUROC and sensitivity, despite being developed on a much larger and more heterogeneous dataset, while being robust to class imbalance19. For existing scores built on non-COVID-19 data, such as the National Early Warning Score 2, there is insufficient information on their reliability in the COVID-19 setting, and some have been found to underestimate mortality20. Our model can be deployed at admission for both PE and death risk prediction and can help meet these needs rapidly.
The study, however, has several limitations. First, almost 60% of patients who died did so in South Africa, and over 70% of PE cases were located in the UK. This may be due to limited access to D-dimer tests or CT scans. There were no mandatory diagnostic criteria in the ISARIC CRF for PE. The absence of a control group of patients without COVID-19 in this dataset prevented estimation of specificity. The patient cohort consisted of hospitalised patients with confirmed COVID-19 who had a mortality rate of 21.7%. These models are not for use in the community and could still perform differently in populations at lower risk of death and across different regions of the world. As part of future work, dependent on sufficient data, PE and death could be modelled with a comprehensive multi-state statistical framework, which incorporates the interrelations among survival, PE, and death states.
In conclusion, the set of decisions taken must include different stakeholders like patients, clinicians, hospital administrators, researchers, and data procurers so that trade-offs can be identified and context-informed decisions can be taken to address them, especially if our models could have missed harms or benefits to different groups and communities.
Methods
Data
In this work, we use data on COVID-19 patients from the International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC), a repository that standardises and secures data on COVID-19 assembled from a global cohort over 2 years of the pandemic as of January 2022. It so far includes data on over 800,000 patients from 53 countries. These data capture the global experience of the first 2 years of the pandemic21. The clinical characterisation protocol underwent ethical review by the World Health Organization Ethics Review Committee and ethics approval was obtained for each participating country and site according to local requirements. Ethics Committee approval was given by the WHO Ethics Review Committee (RPC571 and RPC572, 25 April 2013). Institutional approval was additionally obtained by participating sites including the South Central-Oxford C Research Ethics Committee in England (Ref. 13/SC/0149), the Scotland A Research Ethics Committee (Ref. 20/SS/0028) for the UK and the Human Research Ethics Committee (Medical) at the University of the Witwatersrand in South Africa as part of a national surveillance programme (M160667), which collectively represent the majority of the data. Other institutional and national approvals are in place as per local requirements. This is a secondary analysis of data collected with appropriate local permissions, and each institution signed a Terms of Submission in which it committed that it had the appropriate permissions in place. All methods were performed in accordance with the relevant guidelines and regulations.
The study population consisted of all patients with either clinically diagnosed or laboratory confirmed COVID-19 admitted to the participating hospitals. The aim of the recruiting sites was to use a consecutive sample.
The dataset contains 800,459 patients and 182 variables. The mean age of patients was 56.4 (20.9), 48.6% were male, and the majority of cases were from South Africa (54.0%) and the United Kingdom (34.1%). 65.3% of patients were discharged alive and 20.4% died. We grouped countries with fewer than 60 individuals into a single category. Out of all patients, 5450 (0.7%) experienced a pulmonary embolism, 73 experienced thromboembolism, and 143 experienced deep vein thrombosis. We define our outcome of interest as the main pulmonary embolism (PE) diagnosis for subsequent analysis. 4653 (82.1%) of the PE cases were recorded in the United Kingdom (UK) and Spain, which, to our knowledge, makes our study the largest of its kind for PE to date. Because data collection and recording patterns were similar in these two countries, and because they contain the vast majority of reported PE cases, we used data from these two countries only for PE modelling.
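The grouping of small countries described above can be sketched as follows. This is a minimal illustration in plain Python; the function name and the label "Other" are assumptions made here, not names taken from the study's code.

```python
from collections import Counter


def group_rare_categories(values, min_count=60, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with one label.

    values      -- one category (e.g. a country name) per patient
    min_count   -- categories with fewer patients than this are pooled
    other_label -- hypothetical name for the pooled category
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]
```

Pooling rare categories in this way keeps a categorical variable from producing many levels that each carry too few patients to estimate an effect.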
Data preprocessing
Since treatment information does not have reliable timestamps for most patients, the following variables were used in the analysis for PE: demographics (including age, sex, country), comorbidities (hypertension, diabetes, smoking etc.), and symptoms (coughing, fever, fatigue etc.). Whether the diagnosis occurred during the period dominated by the alpha variant (after December 2020) was also included due to its possible association with incidence of PE. In our modelling of PE, we used data from patients from the UK and Spain only and did not use laboratory measurements or imputation methods. 269,373 patients and 45 variables remained for PE, and 734,282 patients and 55 variables for death. Age was grouped into 5 categories (0–20, 20–40, 40–60, 60–80, 80–120) with the distributions seen in Figs. 13 and 14 below. The symptomatic variable represents any symptoms reported for a patient. For the number of days from admission to event (death), we removed outliers of more than 200 days as well as negative values, thereby removing 1342 patients.
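The age grouping and the follow-up filter described above could look like the sketch below. The text does not say which group a boundary age such as 20 falls into, so the left-inclusive convention used here is an assumption, as are the function names.

```python
from bisect import bisect_right

AGE_EDGES = [20, 40, 60, 80]  # interior boundaries of the 5 groups
AGE_LABELS = ["0-20", "20-40", "40-60", "60-80", "80-120"]


def age_group(age):
    """Map an age in years to one of the 5 groups.

    Assumption: a boundary age belongs to the upper group (20 -> "20-40").
    """
    if not 0 <= age <= 120:
        raise ValueError("age outside the 0-120 range")
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def keep_follow_up(days, max_days=200):
    """True if days from admission to event is non-negative and at most max_days."""
    return 0 <= days <= max_days
```

For example, `age_group(56.4)` returns `"40-60"`, and `keep_follow_up(-3)` returns `False`.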
Prior to processing the data, for PE prediction, we held out 3 test sets of 20% of the total dataset for independent testing, one of which would only include patients from Spain, one only including patients from the UK, and another including both. For mortality prediction testing, we held out 20% of the total dataset sample. A workflow diagram for data processing and system design is shown in Fig. 15.
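As an illustration of holding out 20% of a dataset, the sketch below splits patient indices so that the proportion of positive cases is preserved in both parts. Whether the authors stratified their hold-out sets by outcome is not stated, so the stratification, the function name, and the seed are all assumptions.

```python
import random


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split indices into train and test, keeping class proportions similar.

    labels        -- outcome per patient (e.g. 1 = PE, 0 = no PE)
    test_fraction -- share of each class placed in the test set
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # random assignment within each class
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)
```

A country-specific test set, such as the UK-only or Spain-only sets, would be obtained by first restricting the patient list to that country and then applying the same split.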
Stratified Kaplan–Meier curves by age, sex, and region of admission were plotted using Cox proportional hazards models while machine learning methods were applied for prediction of PE or death.
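For reference, a Kaplan–Meier survival curve for a single stratum can be computed with the product-limit estimator, sketched below in plain Python. This is a generic textbook implementation with a hypothetical function name, not the study's plotting code.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time per patient (e.g. days from admission)
    events -- 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve
```

Calling the function once per age, sex, or region group yields one curve per stratum.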
Baseline and machine learning methods
The reference groups for statistical analysis were those under 20 years old for age, East Asia for region, and Norway for the country variable. For the Cox proportional hazards model, the proportionality assumption was verified by visualising the survival curves and observing parallel behaviour, as seen in Fig. 4. We investigated several prediction methods for PE occurrence and death, including logistic regression, Linear Discriminant Analysis (LDA), the naive Bayes classifier, random forests, AdaBoost algorithms, and the high-performing extreme gradient boosting machine (XGBoost)22,23. Previous studies of tree-based algorithms such as XGBoost have highlighted its capacity to learn the correlations between covariates well for mortality prediction in COVID-19 patients while also being somewhat interpretable24. We applied all of these methods for the purposes of a systematic comparison using 5-fold cross-validation and several hold-out test sets stratified across countries and regions, and evaluated them with multiple metrics. A list of methods applied can be seen in Table 18 with details in the Supplementary.
As is often the case in disease prediction, there is class imbalance, with about 1.7% of UK and Spain patients having been diagnosed with PE and around 20.4% of the case population having died. To address this, we do not rely on accuracy alone in the evaluation of our models, as it does not capture their true predictive performance, and instead rely more on sensitivity and the F1 score. We also use a threshold other than the default 0.5 for prediction after probability estimation to achieve cost-sensitivity, and we apply random undersampling at a minority:majority ratio of 1:4, as has been highlighted in other work25,26. We evaluate these methods both separately and in combination to investigate the best approach for this set of prediction problems.
To address imbalance in predictions, we applied either undersampling, thresholding, or both. As for death, due to a much milder imbalance, undersampling was not necessary. We also added class weighting to our XGBoost model using inverse proportions and compared it with the other methods to address class imbalance. All confusion matrices and parameter details for each model can be found in the Supplementary.
Furthermore, we build an ensemble that combines AdaBoosted decision trees with robust undersampling, using different subsets for resampled training to address the imbalance27. We compare this cost-sensitive ensemble with our best-performing model to add further confidence in that model's ability to generalise in an imbalanced scenario. We extend the ensemble learning methods by using our own XGBoost model in the ensemble structure instead of the standard AdaBoosted decision trees. The number of trees was a tunable hyperparameter listed in the Supplementary (Tables IV–VI). We compare our cost-sensitive class-weighted XGBoost machine learning model with these resampling ensembles to show improved performance without the bias that resampling can introduce, while maintaining interpretability.
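The two imbalance-handling steps described above, random undersampling to a 1:4 minority:majority ratio and a non-default decision threshold, can be sketched as follows. Function names, the seed, and the example threshold are assumptions for illustration; the thresholds actually used are not given in this section.

```python
import random


def undersample(labels, ratio=4, seed=0):
    """Keep all positives and at most `ratio` randomly chosen negatives per positive.

    Returns the sorted indices of the retained training samples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), ratio * len(pos))
    return sorted(pos + rng.sample(neg, n_keep))


def predict_with_threshold(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels at a chosen cut-off.

    Lowering the threshold below 0.5 trades specificity for sensitivity.
    """
    return [1 if p >= threshold else 0 for p in probabilities]
```

Undersampling is applied to the training data only, so that the held-out test sets keep their original class distribution.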
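Class weighting by inverse proportions can be illustrated with the short sketch below. The helper name is hypothetical. In XGBoost, a positive-to-negative weight ratio of this kind is commonly supplied through the `scale_pos_weight` parameter, although the text does not state which parameter the authors used.

```python
def inverse_proportion_weights(labels):
    """Weight each class by the inverse of its share of the data.

    Uses the common 'balanced' heuristic n / (n_classes * class_count),
    so the weights of all samples sum to n.
    """
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: n / (len(counts) * c) for y, c in counts.items()}


def positive_weight_ratio(labels):
    """Ratio of positive to negative class weight (equals n_negative / n_positive)."""
    weights = inverse_proportion_weights(labels)
    return weights[1] / weights[0]
```

With about 1.7% positive cases, as reported for PE in the UK and Spain data, the ratio is roughly 58, meaning each positive case counts about as much as 58 negative cases during training.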
Model validation and evaluation
We proceed to tune our machine learning models and validate them using stratified 5-fold cross-validation with Bayesian optimisation. We repeated the optimisation procedure for 50 iterations after which we evaluated the model on the independent test set with the following metrics: AUROC, Accuracy, Weighted F1 Score, and Sensitivity. The details can be found in the Supplementary.
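Two of the evaluation metrics named above, AUROC and sensitivity, can be computed from first principles as in the following sketch. It is written for clarity rather than speed (the AUROC loop is quadratic in the number of samples), and the function names are illustrative.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive cases")
    return tp / (tp + fn)
```

AUROC depends only on how patients are ranked, which is why it is less affected by the class ratio than accuracy or sensitivity, both of which depend on the chosen threshold.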
While existing studies referenced in the Introduction mention some approaches to feature importance estimation for COVID-19 mortality and outcome prediction as well as for other problems, rarely does one find several interpretability methods compared and contrasted in one scenario. We implemented both tree-based F-score interpretability methods as well as Shapley values analysis, logistic regression and Cox regression, and hope to draw interesting conclusions from each and their comparisons28,29. A full explanation for the Shapley values method and its details can be found in the Supplementary materials.
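To make the idea behind Shapley values concrete, the sketch below computes them exactly for a small number of features by enumerating all coalitions. This brute-force version is for illustration only: its cost grows exponentially with the number of features, and libraries such as SHAP use much faster tree-specific algorithms. The function name and the `value_fn` interface are assumptions made here.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumeration over all feature coalitions.

    value_fn   -- maps a frozenset of feature indices to the model payoff
                  obtained when only those features are 'present'
    n_features -- total number of features
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability that a coalition of this size precedes feature i
            # in a uniformly random ordering of the features.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi
```

The values always sum to the difference between the payoff with all features and the payoff with none, which is the property that lets per-feature contributions be read as shares of a single prediction.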
Role of the funding source
The funder had no role in study design, data collection, data analysis, data interpretation, writing of the report, and decision to submit the paper for publication.
Data availability
The ISARIC-WHO CCP, case report form and consent forms are openly available on the ISARIC website at https://isaric.org/research/covid-19-clinical-research-resources/clinical-characterisation-protocol-ccp/. Informed consent for data collection, sharing and/or analysis was obtained from individual participants or their representatives when required by local ethics committees. Some committees approved a waiver of consent due to the public benefit of the research and the minimal risk to participants. The data that underpin this analysis are highly detailed clinical data on individuals hospitalised with COVID-19. Due to the sensitive nature of these data and the associated privacy concerns, they are available via a governed data access mechanism following review by a data access committee. Data can be requested via the IDDO COVID-19 Data Sharing Platform (http://www.iddo.org/covid-19). The Data Access Application, Terms of Access and details of the Data Access Committee are available on the website. Briefly, the requirements for access are a request from a qualified researcher working with a legal entity that has a health and/or research remit, and a scientifically valid reason for data access which adheres to appropriate ethical principles. The full terms are at https://www.iddo.org/document/covid-19-data-access-guidelines. A small subset of sites who contributed data to this analysis have not agreed to pooled data sharing as above. In the case of requiring access to these data, please contact the corresponding author in the first instance who will look to facilitate access.
GR declares receiving a grant from the United States National Institutes of Health, R01 Grant: Emerging Zoonotic Malaria in Malaysia: Strengthening Surveillance and Evaluating Population Genetics Structure to Improve Regional Risk Prediction Tool, and travel support from the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) for an observership at the European Centre for Disease Prevention and Control (ECDC). All other authors declare no competing interests.
References
WHO. Novel coronavirus (2019-ncov): situation report, 11. (2020).
Johns Hopkins University. Covid-19 dashboard by the center for systems science and engineering (csse) (2022).
Yang, X. et al. Clinical course and outcomes of critically ill patients with sars-cov-2 pneumonia in Wuhan, China: A single-centered, retrospective, observational study. Lancet Respir. Med. 8, 475–481 (2020).
Liao, S.-C., Shao, S.-C., Chen, Y.-T., Chen, Y.-C. & Hung, M.-J. Incidence and mortality of pulmonary embolism in covid-19: A systematic review and meta-analysis. Crit. Care 24, 1–5 (2020).
Knight, S. R. et al. Prospective validation of the 4c prognostic models for adults hospitalised with covid-19 using the isaric who clinical characterisation protocol. Thorax 77, 606–615 (2021).
Jones, A. et al. External validation of the 4c mortality score among covid-19 patients admitted to hospital in Ontario, Canada: A retrospective study. Sci. Rep. 11, 1–7 (2021).
Tabata, S. et al. Clinical characteristics of covid-19 in 104 people with sars-cov-2 infection on the diamond princess cruise ship: A retrospective analysis. Lancet. Infect. Dis. 20, 1043–1050 (2020).
Susen, S. et al. Prevention of thrombotic risk in hospitalized patients with covid-19 and hemostasis monitoring. Crit. Care 24, 1–8 (2020).
Whiteley, W. & Wood, A. Risk of arterial and venous thromboses after covid-19. Lancet Infect. Dis. 22, 1093–1094 (2022).
Katsoularis, I. et al. Risks of deep vein thrombosis, pulmonary embolism, and bleeding after covid-19: nationwide self-controlled cases series and matched cohort study. bmj377 (2022).
Marcos, M. et al. Development of a severity of disease score and classification model by machine learning for hospitalized covid-19 patients. PLoS ONE 16, e0240200 (2021).
Venturini, S. et al. Classification and analysis of outcome predictors in non-critically ill covid-19 patients. Intern. Med. J. 51, 506–514 (2021).
Zhou, F. et al. Clinical course and risk factors for mortality of adult inpatients with covid-19 in Wuhan, China: A retrospective cohort study. The Lancet 395, 1054–1062 (2020).
Xie, J. et al. Development and external validation of a prognostic multivariable model on admission for hospitalized patients with covid-19. (2020).
Alaa, A., Qian, Z., Rashbass, J., Benger, J. & van der Schaar, M. Retrospective cohort study of admission timing and mortality following covid-19 infection in England. BMJ Open 10, e042712 (2020).
van de Sande, D. et al. Predicting thromboembolic complications in covid-19 icu patients using machine learning. J. Clin. Transl. Res. 6, 179 (2020).
Gómez, C. A. et al. Mortality and risk factors associated with pulmonary embolism in coronavirus disease 2019 patients: A systematic review and meta-analysis. Sci. Rep. 11, 1–13 (2021).
Law, N., Chan, J., Kelly, C., Auffermann, W. F. & Dunn, D. P. Incidence of pulmonary embolism in covid-19 infection in the ed: Ancestral, delta, omicron variants and vaccines. Emerg. Radiol. 29, 625–629 (2022).
Ikemura, K. et al. Using automated machine learning to predict the mortality of patients with covid-19: Prediction model development study. J. Med. Internet Res. 23, e23458 (2021).
Alballa, N. & Al-Turaiki, I. Machine learning approaches in covid-19 diagnosis, mortality, and severity risk prediction: A review. Inform. Med. Unlock. 24, 100564 (2021).
Akhvlediani, T. et al. Isaric clinical characterisation group. Global Outbreak Res. Harmony Not Hegemony Lancet Infect. Dis. 20, 770–772 (2020).
Kumari, R. & Srivastava, S. K. Machine learning: A review on binary classification. Int. J. Comput. Appl. 160 (2017).
Chowdhury, M. E. et al. An early warning tool for predicting mortality risk of covid-19 patients using machine learning. Cognit. Comput. 1–16 (2021).
Baqui, P. et al. Comparing covid-19 risk factors in brazil using machine learning: The importance of socioeconomic, demographic and structural factors. Sci. Rep. 11, 1–10 (2021).
Ling, C. X. & Sheng, V. S. Cost-sensitive learning and the class imbalance problem. Encycl. Mach. Learn. 2011, 231–235 (2008).
Lemaître, G., Nogueira, F. & Aridas, C. K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18, 559–563 (2017).
Liu, T.-Y. Easyensemble and feature selection for imbalance data sets. In 2009 international joint conference on bioinformatics, systems biology and intelligent computing, 517–520 (IEEE, 2009).
Lundberg, S. M. et al. From local explanations to global understanding with explainable ai for trees. Nat. Mach. Intell. 2, 56–67 (2020).
Ibrahim, L., Mesinovic, M., Yang, K.-W. & Eid, M. A. Explainable prediction of acute myocardial infarction using machine learning and shapley values. IEEE Access 8, 210410–210417 (2020).
Acknowledgements
M. Mesinovic appreciates the support of the EPSRC Center for Doctoral Training in Health Data Science (EP/S02428X/1) and the Rhodes Trust.
This work was made possible by the UK Foreign, Commonwealth and Development Office and Wellcome [215091/Z/18/Z, 222410/Z/21/Z, 225288/Z/22/Z and 220757/Z/20/Z]; the Bill & Melinda Gates Foundation [OPP1209135]; the philanthropic support of the donors to the University of Oxford’s COVID-19 Research Response Fund (0009109); CIHR Coronavirus Rapid Research Funding Opportunity OV2170359 and the coordination in Canada by Sunnybrook Research Institute; endorsement of the Irish Critical Care-Clinical Trials Group, co-ordination in Ireland by the Irish Critical Care-Clinical Trials Network at University College Dublin and funding by the Health Research Board of Ireland [CTN-2014-12]; the Rapid European COVID-19 Emergency Response research (RECOVER) [H2020 project 101003589] and European Clinical Research Alliance on Infectious Diseases (ECRAID) [965313]; the COVID clinical management team, AIIMS, Rishikesh, India; the COVID-19 Clinical Management team, Manipal Hospital Whitefield, Bengaluru, India; Cambridge NIHR Biomedical Research Centre; the dedication and hard work of the Groote Schuur Hospital Covid ICU Team and supported by the Groote Schuur nursing and University of Cape Town registrar bodies coordinated by the Division of Critical Care at the University of Cape Town; the Liverpool School of Tropical Medicine and the University of Oxford; the dedication and hard work of the Norwegian SARS-CoV-2 study team and the Research Council of Norway grant no 312780, and a philanthropic donation from Vivaldi Invest A/S owned by Jon Stephenson von Tetzchner; Imperial NIHR Biomedical Research Centre; the Comprehensive Local Research Networks (CLRNs) of which PJMO is an NIHR Senior Investigator (NIHR201385); Innovative Medicines Initiative Joint Undertaking under Grant Agreement No. 115523 COMBACTE, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies, in-kind contribution; Stiftungsfonds zur Förderung der Bekämpfung der Tuberkulose und anderer Lungenkrankheiten of the City of Vienna, Project Number: APCOV22BGM; Italian Ministry of Health “Fondi Ricerca corrente-L1P6” to IRCCS Ospedale Sacro Cuore-Don Calabria; Australian Department of Health grant (3273191); Gender Equity Strategic Fund at University of Queensland, Artificial Intelligence for Pandemics (A14PAN) at University of Queensland, the Australian Research Council Centre of Excellence for Engineered Quantum Systems (EQUS, CE170100009), the Prince Charles Hospital Foundation, Australia; grants from Instituto de Salud Carlos III, Ministerio de Ciencia, Spain; Brazil, National Council for Scientific and Technological Development Scholarship number 303953/2018-7; the Firland Foundation, Shoreline, Washington, USA; the French COVID cohort (NCT04262921) is sponsored by INSERM and is funded by the REACTing (REsearch & ACtion emergING infectious diseases) consortium and by a grant of the French Ministry of Health (PHRC n20-0424); a grant from foundation Bevordering Onderzoek Franciscus; the South Eastern Norway Health Authority and the Research Council of Norway; Institute for Clinical Research (ICR), National Institutes of Health (NIH) supported by the Ministry of Health Malaysia; preparedness work conducted by the Short Period Incidence Study of Severe Acute Respiratory Infection; the U.S. DoD Armed Forces Health Surveillance Division, Global Emerging Infectious Diseases Branch to the U.S. Naval Medical Research Unit No. TWO (NAMRU-2) (Work Unit #: P0153_21_N2). These authors would like to thank Vysnova Partners, Inc. for the management of this research project. The Lao-Oxford-Mahosot Hospital-Wellcome Trust Research Unit is funded by the Wellcome Trust.
This work uses data provided by patients and collected by the NHS as part of their care and support #DataSavesLives. The data used for this research were obtained from ISARIC4C. We are extremely grateful to the 2648 frontline NHS clinical and research staff and volunteer medical students who collected these data in challenging circumstances; and the generosity of the patients and their families for their individual contributions in these difficult times. The COVID-19 Clinical Information Network (CO-CIN) data was collated by ISARIC4C Investigators. Data and Material provision was supported by grants from: the National Institute for Health Research (NIHR; award CO-CIN-01), the Medical Research Council (MRC; grant MC_PC_19059), and by the NIHR Health Protection Research Unit (HPRU) in Emerging and Zoonotic Infections at University of Liverpool in partnership with Public Health England (PHE), (award 200907), NIHR HPRU in Respiratory Infections at Imperial College London with PHE (award 200927), Liverpool Experimental Cancer Medicine Centre (grant C18616/A25153), NIHR Biomedical Research Centre at Imperial College London (award ISBRC-1215-20013), and NIHR Clinical Research Network providing infrastructure support. We also acknowledge the support of Jeremy J Farrar and Nahoko Shindo.
Author information
Contributions
CK conceived and designed the study. The data were curated by LM and BC. Formal analysis was undertaken by MM. Development of the statistical analysis and machine learning methodologies was completed by MM, LC, and CK. Project administration was done by CK, LC, LM, and BC. Software was developed by MM, and validated by MM and CK. Supervision was provided by CK and LC. Visualisations, writing, and editing were done by MM. Resources, clinical or otherwise, were provided by LM, PO, XW, GR, KP, and FG. LM, PO, XW, GR, KP, and FG also undertook the acquisition, analysis, and interpretation of the data. All authors subsequently critically edited the report. The corresponding author and CK had full access to all data. MM and CK accessed and verified the data and results. MM and CK had final responsibility for the decision to submit for publication. All authors have revised, edited, reviewed, and approved all versions of the manuscript. The full list of consortium members is included at the end of the Supplementary material.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mesinovic, M., Wong, X.C., Rajahram, G.S. et al. At-admission prediction of mortality and pulmonary embolism in an international cohort of hospitalised patients with COVID-19 using statistical and machine learning methods. Sci Rep 14, 16387 (2024). https://doi.org/10.1038/s41598-024-63212-7
DOI: https://doi.org/10.1038/s41598-024-63212-7