Introduction

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), emerged as a global pandemic in February 20201. The clinical course of COVID-19 ranges from mild to severe2. Although most cases are relatively mild, a significant number of patients still develop severe disease despite advances in vaccines3,4 and therapies5,6. Therefore, identifying patients at risk of developing severe disease, who require additional medical resources, is crucial.

Various patient-related factors are associated with severe COVID-19, including obesity7, advanced age, diabetes mellitus8, hyperuricemia9, and Krebs von den Lungen-6 (KL-6) levels10. White blood cell (WBC) count, C-reactive protein (CRP) levels, blood urea nitrogen (BUN) levels, and lactate dehydrogenase (LDH) levels have been reported to be higher in patients with severe COVID-1911,12,13,14,15. However, relying on a single biomarker may not be sufficient to predict severe disease because these biomarkers reflect different pathological conditions. The usefulness of biomarker combinations, such as the LDH/albumin (Alb) ratio or the neutrophil/lymphocyte ratio (NLR), has been reported16,17,18. Nevertheless, determining the optimal combination of biomarkers and patient background factors to create accurate predictive models for COVID-19 severity remains a challenge. Furthermore, for practical use in clinical settings, it is essential that the model can be applied easily through composite indices, such as the estimated glomerular filtration rate (eGFR)19 or the Fibrosis-4 (FIB-4) index20, which combine multiple factors.

Predictive models have been established using conventional statistical methods, such as linear and logistic regression21,22,23; however, these approaches are limited in their ability to predict outcomes arising from complex interactions among numerous factors. Machine learning (ML), a subset of artificial intelligence (AI) that can predict complex and nonlinear outcomes more effectively than classical analysis, has been used to create predictive models of COVID-19 severity24,25.

ML represents a significant innovation in data science that can provide high predictive accuracy; however, the underlying calculations are often unclear, making translation into clinical practice difficult. Overfitting is another significant challenge associated with ML: some models show high performance on the initial training data but lose performance during validation26. Overfitting results from complex model designs constructed using limited data, which makes such models less generalizable27. To address these issues, explainable ML approaches based on deep learning, with a mesh-like structure that reduces overfitting, have been developed28,29. Moreover, we aimed to develop a simple and accurate predictive model for COVID-19 severity that is easy to use in clinical settings. Because the number of possible combinations of factors and operators is enormous, we combined reinforcement learning with explainable ML to develop simple models.

Materials and methods

Study design and settings

This retrospective cohort study used data from the Japan COVID-19 Task Force database collected between February 2020 and October 2022. The Japan COVID-19 Task Force collected clinical information on patients aged ≥ 18 years diagnosed with COVID-19 via polymerase chain reaction (PCR) or antigen testing at four institutions in Japan. Among the 3,424 patients identified, 123 were excluded because they lacked clinical data (N = 18), did not have positive PCR or antigen test results (N = 5), or were asymptomatic (N = 100). Data from the remaining 3,301 patients were analyzed. The patients were divided into discovery and validation cohorts. The discovery cohort comprised patients with estimated disease onset before October 1, 2020 (N = 1,023), whereas the validation cohort comprised the remaining patients (N = 2,278; Fig. 1). This study was approved by the Ethics Committee of Keio University School of Medicine (ID: 20200061) and adhered to the 1964 Declaration of Helsinki and its subsequent amendments. Written or oral informed consent was obtained from all the patients.

Fig. 1 Flowchart of the patient identification and selection process. The 3,301 included patients were divided into the discovery and validation cohorts.

Clinical data

The following patient data were obtained from the electronic case record form: age, sex, body mass index (BMI), length of hospital stay (days), comorbidities, clinical signs and symptoms, laboratory findings, and post-hospitalization complications. All laboratory tests were performed within 48 h of the initial visit or admission based on the clinical care needs of the patients. A critical outcome was defined as the requirement for oxygen supplementation (via high-flow oxygen therapy), invasive or noninvasive mechanical ventilation, extracorporeal membrane oxygenation, or death4,30.

Extraction of key features via ML

Predictive models were constructed to distinguish patients with and without critical outcomes and to determine the features relevant to critical outcomes, and the importance of each feature was evaluated (Supplementary Table 1). Two baseline predictive models were constructed: a point-wise linear (PWL) model27 (PyTorch 1.5.1, Python 3.7.4) and a logistic regression model (scikit-learn v0.24.2, Python 3.7.4). The output values of the PWL model were expressed as weighted sums of the input features, similar to a logistic regression model; however, in contrast to the logistic regression approach, the PWL weights were calculated as nonlinear functions of the features using deep neural networks. This approach assigned a different mathematical weight to each feature in the final algorithm.
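To make the PWL structure concrete, the following is a minimal sketch of how such a model could be implemented in PyTorch under the description above (a neural network produces sample-specific weights, and the prediction is the sigmoid of the resulting weighted sum); the module names and layer sizes are illustrative assumptions and do not reproduce the published architecture.

```python
# Minimal PWL sketch: a neural network maps each encoded feature vector x to
# sample-specific weights w(x); the output is sigmoid(w(x)·x + b), i.e., a
# per-patient linear model. Layer sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class PointWiseLinear(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),  # one weight per feature, per sample
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_net(x)                  # shape: (batch, n_features)
        logit = (w * x).sum(dim=1) + self.bias  # weighted sum, as in logistic regression
        return torch.sigmoid(logit)             # probability of a critical outcome

# Example: predicted probabilities for four patients with 87 encoded features
model = PointWiseLinear(n_features=87)
probabilities = model(torch.randn(4, 87))
```

Because the weights w(x) are explicit for every patient, the contribution of each feature can be read off directly, which is what keeps this type of model interpretable despite the underlying deep network.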

The feature variables were classified as binary (e.g., sex and pregnancy), multicategorical (e.g., blood type), or quantitative (e.g., Alb and BUN). All binary features were assigned a value of 1 or − 1. One-hot encoding was used to convert multicategorical variables with K categories into K binary variables; each variable was assigned a value of 1 if the sample belonged to the corresponding category and − 1 otherwise. Each continuous quantitative variable was normalized by subtracting the mean value from the original value and then dividing the result by the standard deviation (SD). Missing data were assigned a value of zero because zeros do not change the output in the weighted-sum layers of the deep learning model. A total of 87 feature variables were generated.
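As an illustration of this encoding scheme, the sketch below applies the rules described above (binary features coded as 1/−1, one-hot categories coded as 1/−1, z-score normalization for continuous variables, and zeros for missing values) to a hypothetical data frame; the column names are placeholders rather than the actual case-record fields.

```python
# Illustrative feature encoding following the rules described above
# (column names are hypothetical placeholders).
import pandas as pd

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    # Binary variable: 1 / -1, missing -> 0
    out["sex_male"] = df["sex"].map({"male": 1, "female": -1}).fillna(0)

    # Multicategorical variable: one-hot with 1 for the matching category, -1 otherwise
    for cat in ["A", "B", "O", "AB"]:
        col = (df["blood_type"] == cat).astype(int) * 2 - 1
        col[df["blood_type"].isna()] = 0        # missing -> 0
        out[f"blood_type_{cat}"] = col

    # Continuous variables: z-score normalization, missing -> 0
    for name in ["albumin", "bun", "ldh"]:
        z = (df[name] - df[name].mean()) / df[name].std()
        out[name] = z.fillna(0)

    return out
```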

The predictive performance of each model was calculated using the area under the receiver operating characteristic (ROC) curve (AUC) and evaluated using 10-fold double cross-validation (10-fold DCV) to avoid dependence on a specific test set. The probability threshold for calculating the specificity and sensitivity of each model was determined by locating the point on the ROC curve nearest to the point (0, 1). The predictive performance of each model in the validation cohort was calculated using the model fitted on the discovery cohort dataset with the hyperparameter values of the fold that yielded the best test-set AUC during the 10-fold DCV. The best hyperparameter values are presented in Supplementary Tables 2 and 3.
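The threshold rule described here (the ROC point closest to the top-left corner) can be expressed compactly with scikit-learn; the sketch below uses toy labels and scores as placeholders for the actual model outputs.

```python
# Choose the probability threshold at the ROC point closest to (0, 1),
# i.e., the point that jointly balances sensitivity and specificity.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def closest_to_topleft_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)   # Euclidean distance to (0, 1)
    i = int(np.argmin(dist))
    return thresholds[i], 1 - fpr[i], tpr[i]    # threshold, specificity, sensitivity

# Toy example standing in for outcome labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])
auc = roc_auc_score(y_true, y_score)
thr, spec, sens = closest_to_topleft_threshold(y_true, y_score)
print(f"AUC={auc:.3f}, threshold={thr:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```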

Extracting important features via ML and constructing predictive factors

A simple predictive model was constructed using the important features extracted from the deep learning model. The model specifications are shown in Fig. 2A. The logistic regression model is based on two predictive factors. Predictive factors were defined as factors that could distinguish critical outcomes using important features and basic mathematical operations, such as addition and the square root function. Table 1 shows the performance of the PWL and logistic regression models. The PWL model outperformed the logistic regression model in terms of estimated performance in all patient cohorts; however, their performances did not differ significantly, considering their error margins. The importance scores for each feature in both models are provided in Supplementary Tables 4 and 5. Features with a relative score of > 0.01 in the PWL model were selected, as the PWL model demonstrated superior estimated performance and the ability to consider nonlinear interactions. A flowchart of the analysis is shown in Fig. 2B. Ultimately, 41 features were extracted. Given the 2.66 × 10¹² possible patterns for this simple predictive model (consisting of two predictive factors built from the 41 important features and 11 basic mathematical operations), reinforcement learning was used to generate a simple predictive model with high predictive performance. The classification boundary of this model was set at the threshold determined by locating the point on the ROC curve nearest to the point (0, 1). Further details of this process are provided in the Supplementary Material (Supplementary Figs. 1, 2). The model with a high AUC in the discovery cohort was then applied to the validation cohort.
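The reinforcement-learning search itself is not reproduced here; as a rough illustration of what one candidate in this search space looks like and how it can be scored, the sketch below builds two predictive factors from randomly chosen features and elementary operations, fits a two-factor logistic regression, and ranks candidates by AUC on toy data. The operator subset, the random sampling, and the data are all assumptions standing in for the actual reinforcement-learning procedure.

```python
# Stand-in for the search over candidate simple models: each candidate combines
# two features with an elementary operation into a predictive factor, two factors
# feed a logistic regression, and candidates are ranked by AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

OPS = {  # illustrative subset of the elementary operations
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / (b + 1e-15),
    "sqrt_abs_mul": lambda a, b: np.sqrt(np.abs(a * b)),
}

def random_factor(X):
    i, j = rng.choice(X.shape[1], size=2, replace=False)
    op = rng.choice(list(OPS))
    return OPS[op](X[:, i], X[:, j]), (str(op), int(i), int(j))

def score_candidate(X, y):
    f1, spec1 = random_factor(X)
    f2, spec2 = random_factor(X)
    F = np.column_stack([f1, f2])
    clf = LogisticRegression(max_iter=1000).fit(F, y)
    auc = roc_auc_score(y, clf.predict_proba(F)[:, 1])
    return auc, (spec1, spec2)

# Toy data standing in for the 41 extracted features
X = rng.normal(size=(200, 41))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
best_auc, best_factors = max(score_candidate(X, y) for _ in range(100))
print(f"best AUC: {best_auc:.3f}, factors: {best_factors}")
```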

Fig. 2 Specifications of the simple predictive model for coronavirus disease 2019 severity and a flowchart of the analysis conducted in this study. (A) The simple predictive model for predicting critical COVID-19 was defined as a logistic regression model consisting of two predictive factors. A predictive factor was defined as a factor that could be used to distinguish critical illness based on extracted important features and basic mathematical operations. (B) The models were constructed in three steps. First, the clinical data were encoded to generate 87 features according to a previously described approach. Second, predictive models for COVID-19 severity were generated using our PWL model to extract important features; 41 features with a relative score of > 0.01 were extracted. Third, the simple model was constructed from these 41 features and basic mathematical operations, ultimately incorporating a combination of up to four features. COVID-19: coronavirus disease 2019, PWL: point-wise linear.

Table 1 Prediction performance of the two models.

Statistical analysis

For the baseline variables, categorical variables are expressed as frequencies and proportions, whereas continuous variables are expressed as means with SDs. Data were compared using the chi-square test for categorical variables and Student's t-test for continuous variables.
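For reference, a hedged example of these comparisons using SciPy is shown below; the counts and distributions are fabricated placeholders rather than study data.

```python
# Group comparisons as described above: chi-square test for a categorical
# variable and Student's t-test for a continuous variable (placeholder data).
import numpy as np
from scipy import stats

# Categorical variable (e.g., sex) as a 2x2 contingency table:
# rows = non-critical / critical, columns = male / female
contingency = np.array([[300, 150],
                        [120,  40]])
chi2, p_categorical, dof, _ = stats.chi2_contingency(contingency)

# Continuous variable (e.g., age), compared between the two outcome groups
age_noncritical = np.random.default_rng(0).normal(52, 15, 500)
age_critical = np.random.default_rng(1).normal(65, 13, 150)
t_stat, p_continuous = stats.ttest_ind(age_noncritical, age_critical)

print(f"chi-square p = {p_categorical:.3g}, t-test p = {p_continuous:.3g}")
```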

Results

Baseline characteristics of the patients

The baseline characteristics of the patients are summarized in Table 2. Among the 1,023 patients in the discovery cohort, the mean ages were 52.0 and 65.4 years, and 37.8% and 24.5% were women in the non-critical and critical outcome groups, respectively. Patients in the critical outcome group were older and more frequently male. A comparison of clinical characteristics revealed that BMI and the prevalence of hypertension, diabetes, cardiovascular disease, cancer, chronic obstructive pulmonary disease (COPD), hyperuricemia, and chronic kidney disease (CKD) were significantly higher in the critical outcome group. Among the 2,278 patients in the validation cohort, the mean ages were 56.1 and 63.5 years, and 32.9% and 25.3% were women in the non-critical and critical outcome groups, respectively. As in the discovery cohort, patients in the critical outcome group were older and more frequently male, and BMI and the prevalence of hypertension, diabetes, COPD, and CKD were significantly higher. Unlike the discovery cohort, the validation cohort showed a significant difference in smoking history but no differences in the prevalence of cardiovascular disease, cancer, or hyperuricemia.

Table 2 Baseline characteristics of the patients.

Construction of a simple predictive model for COVID-19 severity using the extracted features

Using the 41 features, predictive factors were defined using reinforcement learning. The final model was defined as a two-dimensional logistic regression model combining two predictive factors. The relationship between model performance and the features used, for models with AUC values of ≥ 0.80 in the discovery cohort identified by reinforcement learning, is illustrated in Fig. 3. Models with an AUC of ≥ 0.905 incorporated only Alb, LDH, age, and neutrophil count, whereas models with an AUC of ≥ 0.800 collectively incorporated all 41 features. This finding suggests that Alb and LDH levels, age, and neutrophil count are the most critical factors for achieving a high AUC.

Fig. 3 Changes in the number of features used in the simple predictive models. This figure shows the relationship between the performance of the simple predictive models and the features used. The x-axis represents the "area under the receiver operating characteristic curve (AUC) ≥ 0.80" threshold, which summarizes the number of unique features incorporated into models with an AUC value of ≥ 0.80 in the discovery cohort. From left to right, the x-axis covers a wider range of AUC values; thus, the number of simplified models increases. AUC: area under the receiver operating characteristic curve.

The three simple predictive models with the highest AUC values for the discovery cohort are shown in Fig. 4A, and the corresponding results for the validation cohort are shown in Fig. 4B. The top-performing model achieved AUC values of 0.906 in the discovery cohort (sensitivity, 0.842; specificity, 0.811; positive likelihood ratio, 4.456; negative likelihood ratio, 0.195) and 0.861 in the validation cohort (sensitivity, 0.804; specificity, 0.675; positive likelihood ratio, 2.477; negative likelihood ratio, 0.290). This model uses only two predictive factors, defined as follows:

Fig. 4 The three simple predictive models with the highest area under the receiver operating characteristic curve (AUC) values in the discovery cohort. The orange and blue points correspond to patients with critical and noncritical outcomes, respectively. The orange and blue areas show the regions classified as critical and noncritical outcomes, respectively, for each model. The dashed line represents the classification boundary obtained by setting the threshold at the point on the ROC curve nearest to (0, 1). (A) AUC values and plots for the discovery cohort. (B) AUC values and plots for the validation cohort. AUC: area under the receiver operating characteristic curve.

$$\text{Prediction factor 1} = -\sqrt{\text{LDH}} \times \left(\text{Neutrophil fraction}\right)$$
$$\text{Prediction factor 2} = \frac{\text{Albumin}}{\left(\log_{10}\left(\text{Age} + 10^{-15}\right)\right)^{2}}$$

Patients were predicted to experience critical COVID-19 when the following condition was met:

$$\text{Prediction factor 2} \le -770.6 \times \left(\text{Prediction factor 1}\right) - 362.9$$
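For illustration, this decision rule can be transcribed directly into code; the sketch below is a literal translation of the two prediction factors and the boundary above. The input units must match those of the cohort data used to fit the boundary coefficients, and the example values are hypothetical.

```python
# Literal transcription of the top-performing model. Inputs must use the same
# units as the original cohort data; the example values below are hypothetical.
import math

def predict_critical(ldh: float, neutrophil_fraction: float,
                     albumin: float, age: float) -> bool:
    factor1 = -math.sqrt(ldh) * neutrophil_fraction
    factor2 = albumin / (math.log10(age + 1e-15) ** 2)
    # Critical outcome predicted when factor2 <= -770.6 * factor1 - 362.9
    return factor2 <= -770.6 * factor1 - 362.9

# Hypothetical example call (not a validated clinical interpretation)
print(predict_critical(ldh=450.0, neutrophil_fraction=85.0, albumin=2.8, age=70.0))
```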

The second-highest-performing model had AUC values of 0.905 in the discovery cohort (sensitivity, 0.835; specificity, 0.818; positive likelihood ratio, 4.583; negative likelihood ratio, 0.202) and 0.862 in the validation cohort (sensitivity, 0.785; specificity, 0.694; positive likelihood ratio, 2.563; negative likelihood ratio, 0.310). It also incorporates only two predictive factors, defined as follows:

$$\text{Prediction factor 1} = -\sqrt{\text{LDH}} \times \left(\text{Neutrophil fraction}\right)$$
$$\text{Prediction factor 2} = \frac{\sqrt{\text{Albumin}}}{\log_{10}\left(\text{Age} + 10^{-15}\right)}$$

Patients were predicted to experience critical COVID-19 when the following condition was met:

$$\text{Prediction factor 1} \le -1569.6 \times \left(\text{Prediction factor 2}\right) + 402.8$$

The third-highest-performing model had AUC values of 0.904 in the discovery cohort (sensitivity, 0.791; specificity, 0.838; positive likelihood ratio, 4.891; negative likelihood ratio, 0.249) and 0.856 in the validation cohort (sensitivity, 0.763; specificity, 0.714; positive likelihood ratio, 2.668; negative likelihood ratio, 0.332). It also incorporates only two predictive factors, defined as follows:

$$\text{Prediction factor 1} = \left(\text{Neutrophil fraction}\right)^{6} \times \left(\text{Weight}\right)^{2}$$
$$\text{Prediction factor 2} = \sqrt{\left|\text{LDH} \times \text{Age}\right|}$$

Patients were predicted to experience critical COVID-19 when the following condition was met:

$$\text{Prediction factor 1} \ge -3.95 \times 10^{13} \times \left(\text{Prediction factor 2}\right) + 6.40 \times 10^{15}$$

Discussion

This study developed an explainable ML-based predictive model that utilized the baseline characteristics and blood test findings of patients hospitalized with COVID-19. Critical cases were predicted using three models, with AUC values of 0.9055, 0.9045, and 0.9041. Age, LDH and Alb levels, and neutrophil count were the most significant factors. Reproducibility was confirmed using a validation cohort from a different time period. This is the first study to establish an explainable ML-based predictive model for COVID-19 severity that avoids the risk of overfitting and has high reproducibility.

Two baseline predictive models were created to classify patients with and without critical COVID-19 outcomes. The PWL model predicts the output using all the features and assigns an importance score to each feature. Conventional explainable deep learning models can generate importance scores; however, an importance score does not necessarily reveal the mechanism underlying the development of critical illness (i.e., how each feature contributes to its onset). The main advantage of our approach is the construction of ML models of the second type, specifically two-variable linear models in which each predictive factor combines at most two patient features using elementary mathematical operations. These models elucidate the collaborative interactions between features in a highly interpretable manner. However, the total number of potential combinations for these models is vast: (7 × 4 × 7 × N × 7 × N)² ≈ 2,000,000 × N⁴ for N candidate features. To manage this complexity, the number of candidate features was initially limited to 50. Features were selected based on the importance scores assigned by the PWL model. A reinforcement learning method was then used to explore different model combinations to achieve high accuracy.
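Expanding the product clarifies where the approximate figure comes from:

$$\left(7 \times 4 \times 7 \times N \times 7 \times N\right)^{2} = \left(1372\,N^{2}\right)^{2} = 1{,}882{,}384\,N^{4} \approx 2 \times 10^{6}\,N^{4}$$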

The PWL model was designed to avoid overfitting to the training data because it uses a unified architecture wherein each network layer and its associated neurons are interconnected in a mesh-like form31. Moreover, our final models were two-variable linear models, each incorporating a limited number of features and mathematical operations. Such models have a lower risk of overfitting because their parameter count is small relative to the number of patients. We developed simple prediction models using the important features extracted from the PWL model via reinforcement learning. The best simple prediction model achieved AUC values of 0.906 and 0.861 in the discovery and validation cohorts, respectively (Fig. 4A, B). These results surpass the performance of a machine learning-based severity prediction model that incorporated comorbidities and laboratory data (AUC value of 0.786)32. In addition, a severity prediction model incorporating six factors extracted using a statistical approach achieved AUC values of 0.86 and 0.83 in its derivation and validation populations, respectively33. In contrast, the simple prediction model developed in this study achieves high predictive performance using only four features, suggesting that high accuracy can be achieved while minimizing the number of required predictors. Notably, our model achieves comparable or superior predictive accuracy despite using fewer factors. These findings suggest that, in severity prediction, appropriately modeling the interactions among factors may contribute more to improving predictive accuracy than merely increasing the number of factors.

A major strength of this study is its high reproducibility in a validation cohort of patients from different waves of the COVID-19 pandemic in Japan. Variations were observed in the clinical characteristics of hospitalized patients with COVID-19 during different periods, as documented in our previous study34. These variations arise from a variety of factors, such as the viral strain, medical environment, and laboratory systems35. The high predictive power of our model in both cohorts suggests its significant clinical utility. However, the AUC did not consistently surpass that of previous similar AI-based models32,36; this may be attributable to the comprehensive analysis of a large patient group over a long period. Another strength is the simplicity of our approach, which focuses on common indices, such as patient demographics, LDH, and BUN. Many AI-based predictive models incorporate radiographic and computed tomography indices37; in contrast, our models are straightforward and cost-effective. Our previous studies highlighted differences between patients with COVID-19 in Japan and those in Western countries in genetic characteristics, such as polymorphisms and blood groups38; comorbidities, such as obesity7; and clinical characteristics, such as disease severity39. Consequently, predictive models for COVID-19 severity developed specifically for Japanese patients are particularly important. Although several predictive models have been established in Japan, most are based primarily on single-center data33,40, with only one ML-based study incorporating multicenter data41. In practice, models developed using patient data from a single institution often show diminished performance in external validation32. Because the model developed in this study was constructed using multicenter data from Japan, it is less likely to suffer performance degradation when applied to new Japanese patient populations. Accordingly, a significant strength of our study is the development of a predictive model for COVID-19 severity in Japanese patients using comprehensive multicenter data.

Age, neutrophil count, and LDH and Alb levels were selected as factors in our predictive models. Age is a significant prognostic factor for COVID-1942,43, and the fatality rate increases with age44. A meta-analysis examining the sole effect of age reported that age increased the risk of hospitalization, in-hospital mortality, and COVID-19-related mortality45. This association is partly attributed to the increasing prevalence, with age, of comorbidities that exacerbate disease severity; however, other factors have also been reported. Older adults often experience impaired clearance of dead cells, persistent inflammation, and heightened secretion of inflammatory molecules, which contribute to the exacerbation of the cytokine storm, a typical characteristic of COVID-1946. Indeed, an association between the NLRP3 inflammasome and COVID-19 severity has been reported: NLRP3 activity is associated with the cytokine storm that drives severe disease47, and NLRP3 is overactivated in older patients48. Age is a known risk factor for various diseases other than COVID-19 and may be related to metabolism and treatment resistance as well as to comorbidities and abnormal immune responses. With aging, the immune system weakens and immune regulation fails, which may lead to severe respiratory failure in COVID-19. Other factors, such as age-related malnutrition, sarcopenia, changes in the respiratory system, and hormonal changes, have also been suspected49; these factors are thought to interact and exacerbate disease severity.

COVID-19 causes cytokine storms, and increased neutrophil counts can be attributed to hyperinflammatory conditions; this increase may be exacerbated by concomitant bacterial infections. Patients with severe COVID-19 had higher neutrophil counts at admission than those with mild or moderate disease50, and an elevated neutrophil count has been reported to be an indicator of severity51. Severe cases of COVID-19 are associated with high expression of neutrophil-associated cytokines, and neutrophilia has been identified as a predictor of poor prognosis52. Several mechanisms contribute to this, including higher levels of low-density neutrophils (LDNs) and the formation of neutrophil extracellular traps (NETs)53,54. LDN levels are particularly elevated in patients with severe COVID-1955 and are associated with NET formation, which contributes to lung injury and vascular occlusion56. Elevated neutrophil activation promotes NET formation, which activates the coagulation cascade and contributes to microangiopathy and thrombosis57,58. In a proteomic analysis, this was associated with increases in proteins involved in metabolism, immunosuppression, and pattern recognition in patients with severe disease59. These neutrophil-related mechanisms are thought to contribute to disease severity.

LDH is an intracellular enzyme found in various organs that converts pyruvate to lactate. LDH is commonly used as a marker of tissue damage in various diseases and is associated with lung damage and pneumonia. LDH is the terminal enzyme of anaerobic glycolysis, and elevated LDH levels indicate oxygen deficiency. LDH is associated with disease severity in COVID-1960 and has been identified as a factor associated with the extent of COVID-19 pneumonia on computed tomography61. In a meta-analysis, LDH levels were higher in patients who died than in those who survived and in patients with severe disease than in those with non-severe disease62. LDH levels increase in response to tissue damage and promote lactate synthesis, which increases the levels of cells such as macrophages and dendritic cells while suppressing natural killer cells and cytotoxic T lymphocytes63, potentially exacerbating COVID-19 severity.

Hypoalbuminemia is a well-known predictor of disease severity, and a previous meta-analysis reported that it was associated with a poor prognosis64. The mechanism underlying the association between COVID-19 and hypoalbuminemia remains unknown; however, inflammation is thought to increase vascular permeability and shorten the half-life of Alb. Another study reported an inverse relationship between serum Alb levels and venous thromboembolism (VTE) in critically ill patients65, and serum Alb levels were also associated with the incidence of VTE in patients with COVID-1966. These findings underscore the complex interplay between hypoalbuminemia and neutrophil-driven hypercoagulability, which potentially exacerbates disease severity.

This study has several limitations. First, only internal validation was performed; external validation, particularly in patient populations from other regions, has not been conducted. Ethnic differences in the hospitalization and mortality rates of COVID-19 are known67; therefore, large-scale studies should be conducted in other countries to confirm the reproducibility of these results. Second, the study period was relatively short. The severity of COVID-19 has been reported to vary by strain: the Omicron variant causes milder COVID-19 than earlier strains68,69 but is more contagious70. Therefore, COVID-19 research should take the prevalent strain into account. Because the most recent Omicron strains have not been sufficiently investigated, validation studies at different time points are warranted. Furthermore, various treatments for COVID-19 have emerged over time71, necessitating further studies to evaluate these new approaches. Third, data on vaccination were not available. Vaccination was reported to reduce the severity of COVID-19 in a systematic review72, and similar effects have been shown in Japan73, where vaccination is recommended by national policy. Therefore, the applicability of our predictive model to vaccinated patients should be examined.

In this multicenter study, simple and well-structured predictive models for COVID-19 severity in Japanese patients were established using explainable ML. These results may aid in patient management and the selection of effective therapeutic interventions.