Introduction

Nearly four years after the World Health Organization (WHO) first recognized the emergence of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), the novel coronavirus that causes COVID-19, individuals around the world are still impacted by its devastating consequences1. As the virus transitions into an endemic phase2, at which point the infection rate will persist indefinitely3,4, consistent monitoring of cases and disease transmission rates is vital to prevent further outbreaks5,6. Vaccines have demonstrated effectiveness in protecting against severe disease, hospitalization, and death7; however, breakthrough infections are inevitable because acquired immunity wanes over time4. Immunity is further undermined by viral mutations, as observed in vaccinated individuals contracting Omicron subvariants8. Variants of SARS-CoV-2 continue to emerge in countries around the world9 and are becoming increasingly resistant to neutralization10. Further, because the signs and symptoms of COVID-19 can affect several organ systems, clinical presentation is often heterogeneous and severity of symptoms can vary greatly2,11. Variable presentations of disease states may hinder current and future efforts to control transmission if infections are not detected early on.

Large-scale, high-volume clinical testing can effectively reduce the spread of disease and promote earlier, targeted intervention12. To facilitate and boost testing, advancements can be made in the speed, accuracy, and scalability of testing options. The gold standard for molecular testing, nasopharyngeal swab (NPS) reverse transcription-quantitative polymerase chain reaction (RT-qPCR)13, requires specialized instrumentation and cannot provide immediate results14,15,16. By comparison, rapid antigen tests (RATs) offer a portable14 and easily scalable17 option for prompt point-of-care screening18,19,20, but at the cost of lower detection rates (overall pooled sensitivity of 68.4%) and reliance on additional resources to confirm a diagnosis17,21,22. The impacts of false negative tests can be drastic. Besides allowing infected people to spread disease, false negatives can depress testing rates by eroding trust in health care systems23 and disincentivizing potential close contacts from getting tested when they might have otherwise done so24.

Although both testing options have been well integrated into some societies, inequitable access is an ongoing problem in developing countries. Although low- and lower-middle-income countries make up 76.3% of the global population, only 36.9% of all tests worldwide were used in these countries25. The WHO’s Access to COVID-19 Tools Accelerator (ACT-A) set its testing target at 1 test per 1000 people per day26, but in the 12 months preceding January 6, 2023, this target was missed predominantly by low- and lower-middle-income countries25. These regions have also particularly struggled with related plastic waste management and environmental contamination27. RT-PCR testing alone has created an alarming quantity of plastic waste in the past four years28, much of which is non-biodegradable29, creating a large and lasting environmental impact.

Consequently, there has been a remarkable upsurge in research to develop novel approaches for COVID-19 testing, some of which address concerns related to accessibility, implementation, and reliance on single-use materials. Notably, wearable sensing technology has been proposed as a viable alternative. Advancements in materials engineering have permitted low-profile, cost-effective sensing platforms conducive to large-scale use30. Triaxial accelerometer-based sensors have demonstrated potential to analyze changes in physiological patterns, specifically cough31,32. Additionally, wearable sensors interfacing with the skin can capture targeted clinical measures at high levels of accuracy (e.g., heart rate, respiration rate, body temperature, and percent oxygen saturation) as well as provide insight into the quality of health over time (e.g., sleep quality33, physical activity34,35, and disease diagnosis and treatment36). The ability of consumer-grade wearables, including smartwatches37,38,39 and smart rings40,41, to detect and monitor probable COVID-19 infection and other cardiorespiratory illnesses has been previously investigated. These systems often operate on continuous data, which can provide important insights related to exposure tracking and long-term effects of illness39,42,43,44,45. While promising, single-user consumer technologies are not ubiquitous in all societies46, which can limit their usability in point-of-care clinical testing. Furthermore, the robustness of continuous data collection is still unclear, as continuous collection is known to generate large volumes of noisy, heterogeneous data47,48 and is affected by wear-time compliance49.

To address these challenges, our team previously investigated the use of shareable wearable sensor systems to measure physiological symptoms of illness within a rapid “snapshot” of data50. Our novel methodology paired clinically meaningful physiological signal features with an activity-based movement sequence spanning less than two minutes to quantify the probability of a cardiorespiratory infection. We have since adopted a more advanced, comprehensive, and compact sensor platform, capable of capturing multimodal measurements including physical activity, cardiorespiratory function, and percent oxygen saturation. It also measures dermal temperature, giving it a distinct advantage over other commercial wearable devices40. The goal of the present study was to evaluate the feasibility of collecting “snapshot” data with this updated sensor. We demonstrated this technology’s potential for detecting cardiorespiratory illnesses, in this case COVID-19, via real-world deployment of this sensor in India during the COVID-19 pandemic.

The technology shown in this work has the potential to be implemented as a rapid and reusable screening tool, proactively warning users presenting with cardiorespiratory symptoms of their infection. Shared diagnostic tools may effectively reduce reliance on single-use testing materials, offering an alternative option for preliminary screening at scale in the case of future pandemics or for long-term community-based monitoring of disease states. As an intended outcome of this work, we hope to provide supporting evidence for future applications of multimodal wearable sensing systems to address gaps in the implementation of current diagnostic and monitoring technologies, specifically using only a snapshot of physiologically relevant data. Innovative applications could become impactful in addressing global health disparities, especially in remote and under-resourced communities.

Results

Metadata

Data collection spanned four site locations over approximately nine months (June 2021 to April 2022), including 85 days of active data collection. We intended to recruit 400 participants and enrolled 532 individuals in the study. However, some samples were dropped at the initial signal segmentation stage due to poor signal quality (defined by noise that obstructed identification of individual activities), incorrect performance of the activity sequence (both confirmed by visual inspection of the signal), or download failure. Furthermore, to remove confounding longitudinal effects of illness, we only considered individuals in the COVID cohort who were within 14 days of a positive RT-qPCR diagnosis. The remaining 467 subjects (295 COVID and 172 NON-COVID) were used in model development and testing. Each subject contributed one data sample to the final feature matrix. The corresponding metadata for those samples is reported in Table 1. Table 2 shows the percentages of feature availability for each sensor modality.

Table 1 Metadata table
Table 2 Feature availability

Hyperparameters

The best hyperparameters resulting from the internal cross-validation search are summarized over all model splits in Supplementary Table 1. We provide here the average and standard deviation of each: maximum tree depth = 6.07 (Std. 2.36); minimum child weight = 4.39 (Std. 2.23); learning rate = 0.26 (Std. 0.17); gamma = 2.0 (Std. 1.43). Trends in the effect of hyperparameter value on F1-score can be visualized in Supplementary Fig. 1.
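As an illustration of how such a summary can be produced, the following minimal sketch aggregates the best setting selected in each split into a mean and standard deviation per hyperparameter. The per-split values and the function name are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical best settings from five outer splits. These numbers are
# illustrative only and are not taken from Supplementary Table 1.
BEST_PER_SPLIT = [
    {"max_depth": 4, "min_child_weight": 3, "learning_rate": 0.10, "gamma": 1.0},
    {"max_depth": 6, "min_child_weight": 5, "learning_rate": 0.30, "gamma": 2.0},
    {"max_depth": 8, "min_child_weight": 7, "learning_rate": 0.50, "gamma": 4.0},
    {"max_depth": 6, "min_child_weight": 2, "learning_rate": 0.20, "gamma": 0.5},
    {"max_depth": 6, "min_child_weight": 4, "learning_rate": 0.25, "gamma": 2.5},
]


def summarize_hyperparameters(best_per_split):
    """Return {name: (mean, sample standard deviation)} across splits."""
    summary = {}
    for name in best_per_split[0]:
        values = [split[name] for split in best_per_split]
        summary[name] = (mean(values), stdev(values))
    return summary


SUMMARY = summarize_hyperparameters(BEST_PER_SPLIT)
for name, (avg, std) in SUMMARY.items():
    print(f"{name} = {avg:.2f} (Std. {std:.2f})")
```

A large standard deviation relative to the mean, as reported for gamma and the learning rate, suggests that the selected value varied considerably from split to split.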

Model performance

The results of the trained XGBoost algorithm to classify COVID from NON-COVID individuals in the held-out test set are shown in Fig. 1. The average confusion matrix for a given test set and the distributions of the F1-score, recall, and precision are shown for the 100 model iterations in Fig. 1a. The mean F1-score was 0.80 (95% CI = [0.79, 0.81]), with the best performing model run achieving an F1-score of 0.88 (see decision tree structure in Supplementary Fig. 2). The predicted probability of COVID-19 can be visualized in the distributions in Fig. 1b. Predicted values near one indicate higher model certainty for a COVID classification, and values near zero indicate higher certainty for a NON-COVID classification. As shown, true positive and true negative predictions were classified with higher certainty than false positive and false negative classifications, which were distributed more centrally around predictive values of 0.5, i.e., low certainty. Figure 1c displays the model performance (AUC) of the full set of physiological signals compared to each physiological signal used independently, in identical model pipelines and evaluated on the same test splits. A full report of the pairwise ROC curve comparisons for statistically significant differences (using methodology as reported in DeLong et al.51) is given in Supplementary Table 2. Importantly, the model using all physiological features has a statistically significantly higher AUC than models using any single physiological signal alone.
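For readers less familiar with these metrics, the sketch below shows how precision, recall, and F1-score follow from confusion-matrix counts, and how a mean over repeated model runs can be paired with a 95% confidence interval. The normal-approximation interval is an assumption made for illustration, since the text does not state how the reported interval was derived, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_ci95(values):
    """Mean and normal-approximation 95% CI of the mean (an assumption)."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, (m - half_width, m + half_width)


# Hypothetical counts for one held-out test set (illustrative only).
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
```

With many repetitions the interval on the mean becomes narrow even when individual runs vary, which is why a mean F1-score can carry a tight interval while the best and worst runs differ noticeably.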

Fig. 1: XGBoost to Classify COVID and NON-COVID.

a Average confusion matrix over the 100 model iterations, as well as the mean and standard deviation of performance metrics (F1-score, recall, and precision). b Predicted probability confidence of COVID, where values near 1 correspond to high confidence of COVID classification, color mapped to the model predictions; FP = False positive; FN = False negative; TP = True positive; TN = True negative. c ROC curves displayed for subsets of physiological signal features. The label “All” corresponds to the full model comprising all physiological sensor features. Faded lines represent the standard deviation of model performance. * indicates that this model was statistically significantly different from all other models using DeLong’s method. Boxplot shows the interquartile range (IQR) of the AUC scores obtained from the cross validation, with whiskers extending to 1.5 times the IQR.
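The AUC values compared in panel c can be understood as the probability that a randomly chosen COVID sample receives a higher predicted probability than a randomly chosen NON-COVID sample. The sketch below computes the AUC in this pairwise form; it is a didactic stand-in under that definition and does not implement DeLong's significance test. The function name and the example scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: 1 = COVID, 0 = NON-COVID.
example_auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

In the hypothetical example, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.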

Feature Importance

XGBoost feature importance and SHAP (SHapley Additive exPlanation) values were obtained to examine the feature importance during training and model classification, respectively (Fig. 2). XGBoost assigns higher feature importance to features which heavily contributed to the construction of model decision trees during training. SHAP values are calculated by considering all possible combinations of features and how each particular feature contributes to the prediction52. Visually, SHAP values can elucidate the relative value of features (i.e., high or low values) that the model considered important when classifying each sample as COVID or NON-COVID. In both cases, the ranking was obtained by calculating the absolute mean of each feature’s importance value across the 100 validation tests. All sensor modalities had at least one feature considered important, with the top three most important features stemming from temperature, cough, and lung sounds. Notably, features from the photoplethysmography (PPG) data were found to not be critical to model performance, likely due to the inconsistent availability of this signal in the extracted data. Supplementary Table 3 shows the percentage of data available for the 467 samples used in each of the features deemed important by the model.
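To make the idea of "considering all possible combinations of features" concrete, the following sketch computes exact Shapley values by brute force for a toy model. It assumes that an absent feature is replaced by a fixed baseline value, which is one common convention; practical SHAP implementations for tree models use far more efficient algorithms. The toy model, the baseline, and the function name are hypothetical and unrelated to the trained classifier.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition take their baseline value (an assumption).
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def coalition_value(members):
        mixed = [x[i] if i in members else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                gain = coalition_value(members | {i}) - coalition_value(members)
                phi[i] += weight * gain
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


PHI = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The values sum to the difference between the prediction at the sample and the prediction at the baseline, and the interaction term is shared equally between the two features that produce it.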

Fig. 2: Feature Importance Separating COVID and NON-COVID.

a Average feature importance across the 100 model runs, ranked in descending order based on the XGBoost decision tree model during training. Larger values on the x-axis correspond to a higher feature importance. b Top 10 features defined by SHAP values in descending order for model prediction. The SHAP values represent the marginal contribution of the feature for the model prediction. Larger SHAP values along the x-axis push the prediction towards a classification of the positive case (i.e., COVID), whereas smaller SHAP values push the prediction towards a classification of the negative case (i.e., NON-COVID). Each dot represents a single data sample from the 100 test sets, with the color corresponding to feature value (e.g., dark purple dots for the top feature correspond to a high temperature range). In both (a) and (b), features are color-coded to highlight the different sensor modalities and arrows represent the period in the activity segmentation of those features. Supplementary Table 3 shows the percentage of data availability for each of these features.

Discussion

This study was a first step in determining if physiological changes of cardiorespiratory infection can be detected by a commercial-grade wearable sensor using only a snapshot of data. Real-world deployment of this technology while detecting probable COVID-19 in India advances our team’s preliminary research50, demonstrating the feasibility and potential clinical implications of translating snapshot data into a rapid screening tool for cardiorespiratory illnesses. Our results demonstrated the primary model encompassing all sensor modalities and activities performed more accurately than any single sensor modality feature set used alone. Thus, multimodal sensor fusion across a variety of movement-based activities may be beneficial to observing broad physiological manifestations of cardiorespiratory illness, which in turn can be quantified and classified by advanced machine-learning algorithms. In this proof-of-concept implementation, we were able to use XGBoost to classify a diverse set of individuals with or without COVID-19 infection with an average F1-score of 0.80 (recall, precision = 80%) and an AUC of 0.81.

We calculated SHAP values to reveal the underlying significance of physiological features during model classification. With the exception of percent oxygen saturation, all sensing modalities appeared among the most important ranked features during both training and classification. The top three features were chest temperature range, the spread (IQR) of cough accelerometer signals, and the regularity of the lung sounds (entropy) in the frequency domain. Specifically, greater change in temperature, lower cough signal variability, and less regularity in the lung sound frequency domain pushed the models towards a positive prediction of COVID. A possible reason for the trend identified in temperature is that infections can alter the natural fluctuations in temperature53 that might occur during and after a short aerobic exercise54. Immune responses in infected individuals increase metabolic rate, making these responses energetically costly55 and limiting the body’s ability to regulate temperature efficiently during periods of acute illness. Additionally, Mason et al. (2022) reported that including temperature as a feature increased their AUC score by 4.9% when using continuous monitoring with Oura rings for COVID detection40. There is also a plausible explanation for why COVID-19 positive cases show less variability while coughing. In our dataset, nearly 50% of participants reported cough as a symptom. Thus, in this group, it is more likely that a natural cough was triggered by central pattern generators, which elicit rhythmic vibrations, than in individuals forcing a cough56. Regarding the power spectral entropy of lung sounds, lower spectral entropy is associated with a more irregular distribution of power across observed frequencies, whereas higher values are characteristic of a flat or consistent distribution of power57. High-frequency lung sound activity was observed in individuals with respiratory illness, whereas healthy individuals showed less high-frequency respiratory content (thus resulting in a flat signal), which is consistent with literature on this topic58. In all SHAP value cases, results should be interpreted cautiously, as a trend observed to support a classification does not necessarily mean the prediction was correct.
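The three feature types discussed above (range, interquartile range, and power spectral entropy) can be sketched as follows. This is a simplified illustration rather than the study's extraction pipeline: the naive discrete Fourier transform, the removal of the mean, the normalization of entropy to the interval from 0 to 1, and all function names are assumptions made for the example.

```python
import cmath
import math
from statistics import quantiles


def value_range(x):
    """Difference between the largest and smallest value (e.g., temperature)."""
    return max(x) - min(x)


def interquartile_range(x):
    """Spread between the first and third quartiles."""
    q1, _, q3 = quantiles(x, n=4)
    return q3 - q1


def power_spectrum(x):
    """Power at positive frequencies from a naive DFT of the centered signal."""
    n = len(x)
    mu = sum(x) / n
    centered = [v - mu for v in x]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1].

    Values near 0 mean power is concentrated in few frequencies; values
    near 1 mean power is spread evenly (a flat spectrum).
    """
    power = power_spectrum(x)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0
    entropy = 0.0
    for p in power:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy / math.log(len(power))
```

A pure tone concentrates its power in a single frequency bin and yields an entropy near zero, whereas a signal with a flat spectrum yields an entropy near one, matching the interpretation of low and high values given in the text.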

To further evaluate the effectiveness of the model, we re-ran it using only the top 10 features identified by SHAP values, which yielded a superior AUC (0.85) and a similar F1 score of 0.81 compared to the full feature set model. This underscores the importance of feature selection in improving model efficacy and suggests that a well-chosen set of features can robustly classify respiratory illnesses, potentially simplifying the model without sacrificing accuracy. This information provides valuable insights for future studies, especially in exploring these features across a more diverse population of cardiorespiratory illnesses.
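A minimal sketch of this kind of feature selection is given below: features are ranked by the mean absolute SHAP value across samples and the top k are retained for a reduced model. The feature names and SHAP values are hypothetical, and the exact ranking procedure used in the study may differ.

```python
def mean_abs_importance(shap_rows, feature_names):
    """Mean absolute SHAP value per feature across all samples."""
    n_samples = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }


def top_k_features(shap_rows, feature_names, k):
    """Names of the k features with the largest mean absolute SHAP value."""
    scores = mean_abs_importance(shap_rows, feature_names)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature names and SHAP values (one row per sample).
NAMES = ["temp_range", "cough_iqr", "lung_entropy", "hr_mean"]
ROWS = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.7, 0.4, -0.1, 0.05],
]
SELECTED = top_k_features(ROWS, NAMES, k=2)
```

Taking the absolute value before averaging matters because a feature can push some samples towards COVID and others towards NON-COVID; a signed average would let those contributions cancel.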

As COVID-19 continues to evolve, rapid testing methods are necessary to reduce transmission rates, especially in institutions with high-risk individuals and/or high transmission rates that require repeatable and proactive monitoring. Previous research has shown the practicality of using wearable sensors in this context. Radin et al. (2020) tracked infection via consumer wearable technology in real-time at the population level while predicting future cases59. Work from Mishra et al. (2020) as well as from Bogu and Snyder (2021) similarly demonstrated the use of smartwatch data to detect COVID-19 before the onset of symptoms60,61. In both instances, models relied on continuous, individual-level changes in heart rate and step count. While longitudinal data is useful for tracking alterations in physiology, it often requires several days of recorded baseline activity on a personal device, which can be limiting in certain applications. Thus, we designed our activity sequence to capture a quick snapshot of physiological data during a brief baseline activity (i.e., seated rest) and after a period of exertion (i.e., fast walking) to elicit observable physiological change and extract innovative features to use for model training. The novelty of our study design can be observed when compared to Walter et al. (2024), who achieved a sensitivity of 0.47 using continuous data and the given features of the ANNE™ One sensor system for early detection of COVID in home environments62. Our relatively greater performance emphasizes the potential for rapid, snapshot testing and engineered, physiologically relevant features. Similar to the comparisons made in longitudinal data, we normalized several features by individual-level baseline measures to account for relative changes in signal characteristics. By doing so, we can capture some level of individuality within a general-use model. While we evaluated participants at a single time point, the same sensor approach could be applied to track cardiorespiratory illness at scale over time, simply by repeating the sequence at the desired number of time points per individual to render new model outcomes and insights.

When using this technology in practice, it is important to consider the tradeoff between prioritizing sensitivity and prioritizing specificity in testing for the presence of COVID-19. As sensitivity and specificity generally trade off against each other63, one must evaluate the consequences of false positives compared to those of false negatives when designing or choosing tests. These considerations may change as the disease spreads over time. At the beginning of the COVID-19 outbreak, testing methods with a higher sensitivity at the cost of lower specificity were more desirable to encourage isolation and prevent the spread of disease64,65. False negatives were a greater risk from a public health perspective, whereas false positives would merely lead to unnecessary testing and quarantine66. As COVID-19 reaches an endemic stage, false positive results have become a proportionally greater concern67. The costs of quarantining, including further isolation, mental anguish, and accidental viral exposure, have become more significant. False positives amongst healthcare workers especially create additional burden on already strained healthcare systems68. Thus, it has become acceptable to utilize testing tools that sacrifice sensitivity but are less invasive and encourage testing. In the context of wearable technology, noninvasive screening tools could promote even greater testing rates and reduce transmission. For the scope of this manuscript, we treated false positives and false negatives equally using the balanced F1-score. In future iterations of this technology, the practical consequences of each should be considered. Perhaps the algorithm could even be re-trained to favor one over the other, depending upon the context of sensor use and the user preference.
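One simple way to illustrate this tradeoff, without re-training, is to move the decision threshold applied to the predicted probability. The sketch below is an illustration of that idea only and is not a procedure used in this study; the labels, probabilities, and function name are hypothetical.

```python
def sensitivity_specificity(labels, probabilities, threshold):
    """Sensitivity and specificity when predicting COVID if p >= threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probabilities):
        predicted_positive = p >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions: 1 = COVID, 0 = NON-COVID.
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
PROBS = [0.9, 0.7, 0.55, 0.3, 0.6, 0.4, 0.2, 0.1]

balanced = sensitivity_specificity(LABELS, PROBS, threshold=0.5)
cautious = sensitivity_specificity(LABELS, PROBS, threshold=0.25)
```

Lowering the threshold in this toy example raises sensitivity from 0.75 to 1.0 while specificity falls from 0.75 to 0.5, which mirrors the early-outbreak preference for catching more true cases at the cost of more false positives.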

Although the methodological approach is promising, there are several limitations in diagnostic reliability. First, the current iteration of this methodology does not directly measure viral material and cannot offer a definitive diagnosis. Second, a major limitation to the study is that we cannot confirm the level to which our model can separate COVID-19 from similar cardiorespiratory illnesses, due to a lack of metadata available to the clinical team and challenges in recruiting individuals with other symptoms or cardiorespiratory illnesses. Future work should evaluate large datasets differentiating between other types of cardiorespiratory illnesses, such as COVID-19 versus influenza, and should include presentations of illness that are more representative of what is observed in the community (i.e., asymptomatic infected individuals).

Furthermore, despite balancing positive and negative cases in the training data, the model still had a slight preference for classifying samples as COVID-19 positive. This may be due to the internal cross-validation method retaining the original ratio of more positive cases in the test group. Thus, the internal hyperparameter values selected may have been influenced by classifying correctly for this ratio. This method is standard for practical applications of machine-learned models, which learn to classify positive cases with respect to the true prevalence in the community; however, our dataset was not representative of the true community prevalence.
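The interaction described here can be sketched as follows: a stratified split keeps the original class ratio in the held-out group, while the training portion is balanced afterwards. Random undersampling of the majority class is assumed purely for illustration, since the balancing method is not specified in this section, and the 20% test fraction, the seed, and the function names are likewise hypothetical.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Split sample indices so each class keeps its share in the test set."""
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


def undersample_to_balance(indices, labels, rng):
    """Randomly drop majority-class samples until classes are equal in size."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Cohort sizes from the study: 295 COVID (1) and 172 NON-COVID (0).
LABELS = [1] * 295 + [0] * 172
rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
train_idx, test_idx = stratified_split(LABELS, test_fraction=0.2, rng=rng)
balanced_train = undersample_to_balance(train_idx, LABELS, rng)
```

In this sketch the balanced training set contains equal numbers of each class, while the test set remains roughly 63% positive, so any metric computed on it rewards predictions that lean positive.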

COVID-19 is currently transitioning into endemic stability, but despite this shift, widespread infection and associated complications persist. Therefore, researchers must continue developing and enhancing environmentally friendly, large-scale screening tools, including wearable sensors. The expansion and adoption of this technology remains crucial, especially in light of warnings that pandemics are likely to occur more frequently in the future69,70. Our study demonstrated the potential of wearable sensors, particularly when using snapshot sensor data. If appropriately improved upon, this technology could facilitate early screening and community monitoring in future pandemics. In this proof-of-concept study, we trained on data collected during different viral dominance periods; however, it is unknown whether this was an advantage to the model or perhaps a limiting factor in the model’s ability to learn various strains. Transfer learning may be promising for future iterations of this work.

In summary, we have demonstrated the potential use of wearable sensing technology to screen for and monitor cardiorespiratory illnesses, specifically COVID-19. Current conventional testing methods have drawbacks concerning accessibility, implementation, and reliance on single-use materials. The alternative methodology proposed here is easily implementable for use in remote and under-resourced communities and could make early diagnostic testing more accessible. Rapid, reusable, and easy-to-scale diagnostic technologies may additionally lower environmental impact. As COVID-19 continues to evolve and global pandemics become increasingly common, innovative alternatives for testing are necessary to limit the spread of disease around the world. We see this work as providing evidence for the feasibility of wearable sensor systems in screening for and monitoring cardiorespiratory illnesses, especially as this form of technology continues to advance. Furthermore, future iterations of this work should attempt to include individuals infected with other cardiorespiratory diseases in accordance with the prevalence of the infection of interest to further demonstrate the potential of this technology.

Methods

Sensing device

The ANNE™ One sensor system (Sibel Health; Niles, IL, USA) is a United States Food and Drug Administration (FDA)-cleared, clinical-grade sensing platform capable of monitoring both full-body motion signals and physiological measures30. The system consists of two soft, flexible sensors: one anatomically positioned at the chest to measure tri-axial acceleration, electrocardiography (ECG), heart rate, respiratory rate, and proximal skin temperature, and the other positioned on the finger to measure photoplethysmography (PPG) for SpO2 and distal skin temperature (Fig. 3). The sensors are time-synchronized and connect to a tablet via Bluetooth for guided use and data storage. Acceleration was collected at 200 Hz in the direction of the x- and y-axes and 1600 Hz in the direction of the z-axis (sagittal plane). ECG, PPG, and skin temperature were recorded at 512 Hz, 128 Hz, and 1 Hz, respectively. The resulting sensor data was pre-processed with respect to the physiological signal of interest (see Feature Generation below).
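Because the streams are time-synchronized but sampled at different rates, a time window identified in one stream must be converted to a different index range in each of the others. The sketch below illustrates this conversion using the sampling rates stated above; the dictionary keys, the function names, and the assumption that every stream starts at time zero are hypothetical.

```python
# Sampling rates in Hz, as stated in the text.
SAMPLING_HZ = {
    "acc_xy": 200,
    "acc_z": 1600,
    "ecg": 512,
    "ppg": 128,
    "temperature": 1,
}


def window_to_indices(start_s, end_s, fs):
    """Convert a time window in seconds to (start, end) sample indices.

    Assumes the stream starts at t = 0 and is uniformly sampled at fs Hz.
    """
    return int(round(start_s * fs)), int(round(end_s * fs))


def samples_in_recording(duration_s):
    """Expected number of samples per stream for a recording of given length."""
    return {name: int(round(duration_s * fs)) for name, fs in SAMPLING_HZ.items()}


# A two-minute snapshot and a hypothetical 30 s walking bout starting at 30 s.
SNAPSHOT_SAMPLES = samples_in_recording(120)
ECG_WALK = window_to_indices(30, 60, SAMPLING_HZ["ecg"])
```

For a two-minute snapshot this gives 192,000 z-axis acceleration samples but only 120 temperature samples, which illustrates why features are extracted per modality rather than from a single merged array.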

Fig. 3: ANNE™ One Sensor System and Application.

The ANNE™ One Sensor System is composed of two individually packaged wireless sensors as illustrated in (a). The anatomical placement of each sensor, the limb sensor (left) and chest sensor (right), is demonstrated in (b). Soft, flexible sensors are adhered to the skin with hydrogel adhesives. The form factor of the limb sensor reinforces the photodiode and LED to the index finger.

Participants

This research was made possible in collaboration between Shirley Ryan AbilityLab (USA) and Bionic Yantra (India) through the United States-India Science and Technology Endowment Fund (USISTEF), supporting research for innovative solutions to address challenges posed by COVID-19. According to the FIND SARS-CoV-2 test tracker, India has been amongst the list of countries short of reaching the ACT-A testing target25. Thus, this collaboration was realized as an opportunity to implement novel testing technology in a region which may practically benefit from its use.

Characteristics of participants can be found in Table 1 (see Results). Individuals between 18 and 85 years of age were recruited from a sample of convenience across heterogeneous test sites as either experiencing (COVID) or not experiencing (NON-COVID) COVID-19 infection, confirmed via a RAT and/or RT-qPCR test. The COVID cohort consisted of COVID-19 positive individuals at COVID-19 inpatient or outpatient facilities at the time of data recording. The NON-COVID cohort consisted of members of the community who did not test positive for COVID-19. Individuals with implanted pacemakers or defibrillators and individuals pregnant at the time of consent and study involvement were excluded from this study.

All participants provided written and/or verbal consent prior to their participation in this research study. The study was approved by the Institutional Ethics Committee (IEC) of the Indian Institute of Public Health-Delhi (IIPH-D) and the S2J Independent Ethics Committee (S2J IEC). The trial was conducted as per the Indian Council of Medical Research Guidelines for Biomedical Research on Human subjects, including standards of the Declaration of Helsinki (Brazil, 2013) and other applicable guidelines. This study is a registered clinical trial with the NIH initially released April 15, 2022 under the registry name: “International Validation of Wearable Sensor to Monitor COVID-19 Like Signs and Symptoms” and identifier: NCT05334680.

Data collection

Data was collected in an uncontrolled, real-world clinical environment across multiple sites in India. The primary clinical research team trained external clinical staff on the data collection protocol including sensor application, the activity sequence, and data transfer. Initial data quality checks and routine site visits were conducted to verify protocol adherence and data integrity. Institutionally required personal protective equipment (PPE) was worn during all study procedures. Participants were asked to remain masked throughout the session. Sensor locations on the skin were cleaned with a medical wipe and sanitized before and after every sensor application. Sensors were systematically rotated to allow a complete sanitation cycle and were adhered to the skin as shown in Fig. 3. Clinical staff placed one sensor on the suprasternal notch and one on the tip of the index finger. Sensors placed on the chest were adhered using medical adhesive.

An overview of the data collection protocol can be visualized in Fig. 4. The activity sequence was designed to leverage changes in physiology from baseline measurements to measurements during recovery from exertion, emulating stress or walking tests that are commonly used to evaluate cardiorespiratory function71,72. Thus, the activity sequence aims to 1) address the capability of a rapid protocol to elicit measurable physiological change that can be picked up by a commercial-grade wearable sensor, and 2) determine if the physiological features captured are sensitive enough to detect differences in individuals with and without cardiorespiratory illness.

Fig. 4: Overview of Rapid Snapshot Approach to Classify Probability of COVID-19.

A short and structured sequence of movement-based activities is performed while wearing the ANNE™ One sensor system. Periods of rest and deep breathing are captured prior to and after gait to observe both baseline measurements and elicited responses to increased exertion. Raw sensor signals are translated into physiological measurements, including gait, respiratory rate, heart rate, cough, core and peripheral temperature, and SpO2. Time series and frequency domain features of each sensing modality are extracted and transferred as input to a machine learning classifier to predict probability of COVID-19 infection.

Under supervision by clinical staff, all participants began seated and were asked to refrain from talking during a short and simple sequence of standardized activities. Normal breathing was collected while seated before and immediately after 30 s of walking, in which participants were instructed to walk at a fast yet safe pace. Participants were then instructed to take five voluntary deep breaths while seated, followed by another 30 s period of fast yet safe walking, and then five voluntary deep breaths while seated. Participants then performed five consecutive voluntary coughs while seated. The entire activity sequence is approximately two minutes long. Between each activity, three consecutive hand-taps were performed onto the chest sensor as a reference for activity segmentation. Each participant performed one trial of the activity sequence.
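To illustrate how the hand-tap markers could support segmentation, the sketch below locates groups of three closely spaced spikes in an acceleration trace. It is a simplified, hypothetical detector: the threshold, the maximum gap between taps, the refractory period, and the function name are all assumptions, and in the study segmentation points were confirmed by visual inspection.

```python
def find_tap_markers(acc, fs, threshold, max_gap_s=0.6, refractory_s=0.1):
    """Return sample indices where a run of three consecutive taps begins.

    acc: acceleration magnitudes for one axis.
    fs: sampling rate in Hz.
    threshold: absolute amplitude that counts as a tap (hypothetical value).
    """
    # Step 1: find tap onsets, merging samples that belong to the same spike.
    taps = []
    last_above = -10 ** 9
    for i, value in enumerate(acc):
        if abs(value) >= threshold:
            if i - last_above >= refractory_s * fs:
                taps.append(i)
            last_above = i

    # Step 2: keep only groups of three taps separated by short gaps.
    markers = []
    k = 0
    max_gap = max_gap_s * fs
    while k + 2 < len(taps):
        first, second, third = taps[k:k + 3]
        if second - first <= max_gap and third - second <= max_gap:
            markers.append(first)
            k += 3
        else:
            k += 1
    return markers


# Synthetic 10 s trace at 100 Hz: two triple taps and one isolated spike.
FS = 100
signal = [0.0] * 1000
for idx in (100, 130, 160, 400, 600, 630, 660):
    signal[idx] = 5.0
MARKERS = find_tap_markers(signal, FS, threshold=2.0)
```

Requiring three taps rather than one makes the marker less likely to be confused with an isolated bump or a single cough, which is one reason a repeated pattern is convenient as a segmentation reference.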

Feature generation

Real-time bio-signals were streamed from the sensors via Bluetooth to the ANNE™ Sync application. Data from the sensor system were uploaded directly to a cloud-based server for use in the feature extraction pipeline. Oxygen saturation (SpO2), central body temperature, and peripheral body temperature were used directly as measured by the ANNE™ system’s proprietary software (Sibel Health, Inc., Niles, IL, USA), including only measurements with the given signal quality index above 0. To extract more information from the other data streams, all other physiological measures were derived from the raw sensor stream using published methods, as detailed in the sections below. The acceleration signal was used to segment the sequence into the activities described above using custom code that required visual inspection and confirmation of the start and end points of each activity. The time points at which the physiological signals were analyzed and extracted relative to the movement-based activities are shown in Fig. 5. The design of the activity sequence makes it possible to account for physiological differences before and after exertion; many of the features were therefore normalized to baseline (i.e., resting) measures. A summary of the features input to the model across all sensing modalities can be found in Supplementary Table 4.

Fig. 5: Extraction of Physiological Signals and Features from Rapid Snapshot Protocol.

a A complete sequence of the rapid snapshot activity protocol is shown for the time series acceleration signal (in black). Directly below, colored bars representing physiological signals are aligned in time with the activities from which they were extracted and are stacked by the sensing modality from which they were derived: ACC acceleration, ECG electrocardiography, CT central temperature, PT peripheral temperature, PPG photoplethysmography. b–d present supporting diagrams of the signal processing and/or feature extraction domains for heart rate, respiration, and cough, respectively. c also shows the chest sensor placement and the gravity vector with respect to the participant. Feature extraction processes for all physiological signals are detailed in their respective sections.

Estimation of R-R intervals

R-R intervals were extracted from the raw ECG time series signal using the Pan-Tompkins algorithm, which applies a series of filters to reduce noise and accentuate the peak of the R wave in the QRS complex73. In cases of inadequate detection by the Pan-Tompkins algorithm alone and after visual confirmation, a maximal overlap discrete wavelet transform (symlet wavelet filter, MATLAB R2022b) was applied to confirm the available output of the Pan-Tompkins algorithm and impute missing time points. If this alternative could not detect reliable QRS complexes, the heart rate signal was dropped for that sample. The final array of detected R peaks was passed into the hrv-analysis Python module to extract features related to heart rate variability74. Features that are not recommended for durations at or below 30 seconds75 were excluded from the full set of hrv-analysis features. To additionally measure the change in heart rate, the post-exertion measurement was divided by the baseline measurement, thus normalizing the change across subjects.
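As a minimal illustration of the final two steps, the sketch below derives R-R intervals from a list of R-peak times, computes a few short-window statistics, and forms the post-exertion to baseline ratio. It uses plain NumPy as a stand-in for the hrv-analysis module, and the function names and the particular statistics (mean heart rate, SDNN, RMSSD) are illustrative assumptions rather than the feature set used in the study.

```python
import numpy as np


def rr_intervals_ms(r_peak_times_s):
    """Successive R-R intervals (ms) from R-peak times given in seconds."""
    return np.diff(np.asarray(r_peak_times_s, dtype=float)) * 1000.0


def short_window_hrv(rr_ms):
    """Illustrative short-window statistics (not the study's feature list)."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    successive_diffs = np.diff(rr_ms)
    return {
        "mean_hr_bpm": 60000.0 / rr_ms.mean(),
        "sdnn_ms": rr_ms.std(ddof=1),
        "rmssd_ms": np.sqrt(np.mean(successive_diffs ** 2)),
    }


def exertion_ratio(post_value, baseline_value):
    """Post-exertion measurement divided by the baseline measurement."""
    return post_value / baseline_value


# Hypothetical R-peak times: one beat per second at rest, two per second after walking.
baseline = short_window_hrv(rr_intervals_ms(np.arange(0.0, 10.0, 1.0)))
post = short_window_hrv(rr_intervals_ms(np.arange(0.0, 10.0, 0.5)))
hr_ratio = exertion_ratio(post["mean_hr_bpm"], baseline["mean_hr_bpm"])
```

Dividing by the baseline makes the feature unitless, so a participant with a high resting heart rate is compared on relative change rather than absolute value.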

Respiration rate

To measure respiration rates, the angular motion due to breathing was reconstructed by tracking the rotation of the gravity vector in the accelerometer signal76. Tri-axial acceleration signals were downsampled (to 200 Hz) and filtered (2nd-order Butterworth low-pass filter at 1 Hz). The normalized vector of acceleration at each time point, \({a}_{t}\), was used to calculate the axis of rotation, \({r}_{t}\), between two consecutive measurements:

$$r_{t} = a_{t} \times a_{t-1} \tag{1}$$

Each axis of rotation was weighted using a Hamming window function. The resulting rotation angle, \({\varphi }_{t}\), calculated as

$$\varphi_{t} = \sin^{-1}\left(\left(a_{t} \times r_{t}\right) \times a_{t}\right) \tag{2}$$

was filtered (8th-order Butterworth band-pass filter at 0.1 and 0.8 Hz) and differentiated to obtain the angular rate. The power spectral density of the angular rate was estimated using Welch’s method, and the respiration rate in breaths per minute (BPM) was taken as the dominant signal frequency (see Fig. 5c). Frequency values were evaluated within physiologically relevant bounds: 10 to 32 BPM for baseline respiration77,78 and 10 to 60 BPM for respiration after walking79. Frequency values with a power of at least 50% of the dominant signal frequency were included in calculations of the maximum signal power, average frequency, and sum of the signal power. To measure the difference in respiration rate before and after walking, we calculated the ratio of the two values, dividing the post-exertion measurement by the baseline measurement.
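The spectral step can be sketched as follows, starting from an already reconstructed rotation-angle signal. This is an approximation under stated assumptions, not the study's code: the function name is hypothetical, zero-phase filtering (`sosfiltfilt`) and a single Welch segment spanning the whole recording are choices made here, and the synthetic sinusoids merely stand in for real breathing signals.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, welch


def respiration_rate_bpm(angle, fs, bpm_bounds=(10.0, 32.0)):
    """Dominant breathing frequency (breaths per minute) of a rotation-angle signal."""
    # Order 4 in a band-pass design gives an 8th-order filter; second-order
    # sections keep the design stable at these very low cut-off frequencies.
    sos = butter(4, [0.1, 0.8], btype="bandpass", fs=fs, output="sos")
    # Assumption: zero-phase (forward-backward) filtering.
    filtered = sosfiltfilt(sos, np.asarray(angle, dtype=float))
    angular_rate = np.gradient(filtered) * fs  # finite-difference derivative
    # Assumption: one Welch segment over the full recording.
    freqs, psd = welch(angular_rate, fs=fs, nperseg=len(angular_rate))
    bpm = freqs * 60.0
    in_band = (bpm >= bpm_bounds[0]) & (bpm <= bpm_bounds[1])
    return float(bpm[in_band][np.argmax(psd[in_band])])


# Synthetic example: 18 BPM at baseline, 30 BPM after walking.
fs = 200.0
t = np.arange(0.0, 60.0, 1.0 / fs)
baseline_bpm = respiration_rate_bpm(0.05 * np.sin(2 * np.pi * 0.3 * t), fs)
post_bpm = respiration_rate_bpm(0.05 * np.sin(2 * np.pi * 0.5 * t), fs, bpm_bounds=(10.0, 60.0))
respiration_ratio = post_bpm / baseline_bpm
```

Restricting the peak search to the physiological band prevents low-frequency drift or filter transients from being reported as the respiration rate.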

Cough signal properties

Cough accelerometer signals contain high-frequency oscillating peaks due to varying motions of the chest during a cough: breathing in, compression of muscles, and quick expulsion of air80. Each axis of the tri-axial accelerometer data was interpolated to the sampling frequency of the z-axis (1600 Hz) and filtered (5th-order, high-pass Butterworth filter at 40 Hz). The normalized acceleration vector at each time point was calculated. Welch’s method was applied to the entire signal duration to segment individual coughs using a custom sliding window technique (0.2 s with a 50% overlap) based on a threshold relative to the mean signal power (see Fig. 5d). To compare cough data between subjects, cough signals were normalized via a linear scaling method based on the accelerometer signal range during seated baseline respiration. Features were extracted from both the time and frequency domains of each segmented cough signal. The mean and standard deviation of feature values across all detected coughs for a given individual (expected count of 5) were used in the final model.
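The windowed thresholding idea can be illustrated with the sketch below. It is a simplified stand-in for the custom technique: it uses the mean-square power of each 0.2 s window instead of Welch spectral estimates, and the function name and the threshold factor `rel_threshold` are assumptions introduced for illustration.

```python
import numpy as np


def segment_bursts(signal, fs, win_s=0.2, overlap=0.5, rel_threshold=1.0):
    """Return (start, end) sample indices of high-power bursts.

    A window is 'active' when its mean-square power exceeds
    rel_threshold times the mean power over all windows (assumed rule).
    """
    x = np.asarray(signal, dtype=float)
    win = int(round(win_s * fs))
    hop = max(1, int(round(win * (1.0 - overlap))))
    starts = np.arange(0, len(x) - win + 1, hop)
    power = np.array([np.mean(x[s:s + win] ** 2) for s in starts])
    active = power > rel_threshold * power.mean()

    segments = []
    seg_start = None
    for i, is_active in enumerate(active):
        if is_active and seg_start is None:
            seg_start = starts[i]  # burst begins
        elif not is_active and seg_start is not None:
            segments.append((int(seg_start), int(starts[i - 1] + win)))
            seg_start = None  # burst ended at the previous window
    if seg_start is not None:
        segments.append((int(seg_start), int(starts[-1] + win)))
    return segments


# Synthetic example: five 0.3 s bursts of 100 Hz oscillation in a 10 s recording.
fs = 1600
x = np.zeros(10 * fs)
onsets_s = (1.0, 2.5, 4.0, 5.5, 7.0)
n = np.arange(int(0.3 * fs))
for onset in onsets_s:
    start = int(onset * fs)
    x[start:start + len(n)] = np.sin(2 * np.pi * 100 * n / fs)
cough_segments = segment_bursts(x, fs)
```

Because the threshold is relative to the recording's own mean power, the same rule adapts to participants whose coughs differ in absolute amplitude.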

Estimating lung sounds

High-frequency lung sounds (i.e., crackles, wheezes), often captured via lung auscultation, can provide important digital biomarkers of respiratory illness and severity81. Vibratory movements of the lung have been correlated with diagnosis of COVID-19 infection82,83. We therefore used the upper observable frequency range of our sensing device to characterize high-frequency content during the deep breathing activity. Accelerometer data were filtered (8th-order, high-pass Butterworth filter at 100 Hz) and the power spectral density was estimated via Welch’s method. The dominant frequency, statistical moments of the power spectrum, and the number of detected frequency peaks (with at least 50% of the dominant frequency power) were included as features in the model.
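A minimal sketch of this feature extraction is given below, assuming a single-channel acceleration signal. The function name, the one-second Welch segment length, zero-phase filtering, and the use of the spectral centroid as an example of a spectral moment are all assumptions made here for illustration.

```python
import numpy as np
from scipy.signal import butter, find_peaks, sosfiltfilt, welch


def high_frequency_features(accel, fs, cutoff_hz=100.0, rel_height=0.5):
    """Spectral descriptors of the high-frequency content of one signal."""
    # 8th-order high-pass Butterworth filter in second-order sections.
    sos = butter(8, cutoff_hz, btype="highpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, np.asarray(accel, dtype=float))
    # Assumption: one-second segments, i.e. roughly 1 Hz resolution.
    freqs, psd = welch(filtered, fs=fs, nperseg=min(len(filtered), int(fs)))
    # Count local maxima with at least rel_height of the dominant power.
    peaks, _ = find_peaks(psd, height=rel_height * psd.max())
    return {
        "dominant_freq_hz": float(freqs[np.argmax(psd)]),
        "n_peaks": int(len(peaks)),
        "spectral_centroid_hz": float(np.sum(freqs * psd) / np.sum(psd)),
    }


# Synthetic example: two strong tones and one weak tone above the cut-off.
fs = 1600
t = np.arange(0.0, 5.0, 1.0 / fs)
x = (np.sin(2 * np.pi * 150 * t)
     + 0.9 * np.sin(2 * np.pi * 300 * t)
     + 0.3 * np.sin(2 * np.pi * 400 * t))
lung_features = high_frequency_features(x, fs)
```

In this synthetic case the 400 Hz tone carries about 9% of the dominant power, so it falls below the 50% criterion and is not counted as a peak.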

Estimation of walking cadence

To estimate each subject’s walking cadence, the L2-norm of the acceleration was calculated and the fast Fourier transform was applied to find the dominant frequency between a lower and upper frequency bound of 0.7 Hz and 3.5 Hz, respectively. The dominant frequency was taken to be the stepping frequency, i.e., cadence. Statistical moments and entropy of the power spectral density were extracted as input to the model. To quantify post-gait activity relative to the exertion level of gait activity, mean heart rate, frequency and power of the respiration rate, and lung sound features during deep breathing were normalized by the dominant gait frequency. Gait features were also compared between the first and second walking pass to determine any effects of fatigue on the walking dynamics.
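The cadence estimate can be sketched in a few lines. The function name, the 50 Hz sampling rate, and the synthetic signal are hypothetical; only the L2-norm, the Fourier transform, and the 0.7 to 3.5 Hz search band come from the description above.

```python
import numpy as np


def walking_cadence_hz(accel_xyz, fs, f_low=0.7, f_high=3.5):
    """Dominant frequency of the acceleration magnitude within the gait band."""
    a = np.asarray(accel_xyz, dtype=float)
    magnitude = np.linalg.norm(a, axis=1)  # L2-norm across the three axes
    spectrum = np.abs(np.fft.rfft(magnitude))
    freqs = np.fft.rfftfreq(len(magnitude), d=1.0 / fs)
    band = (freqs >= f_low) & (freqs <= f_high)  # also excludes the DC term
    return float(freqs[band][np.argmax(spectrum[band])])


# Synthetic example: gravity on the z-axis plus a 1.8 Hz stepping oscillation.
fs = 50.0
t = np.arange(0.0, 30.0, 1.0 / fs)
z = 1.0 + 0.3 * np.sin(2 * np.pi * 1.8 * t)
accel = np.column_stack([np.zeros_like(t), np.zeros_like(t), z])
cadence_hz = walking_cadence_hz(accel, fs)

# Normalizing a post-gait measure by cadence (hypothetical heart rate value).
mean_hr_post_bpm = 110.0
hr_per_cadence = mean_hr_post_bpm / cadence_hz
```

Using the magnitude rather than a single axis makes the estimate insensitive to how the sensor is oriented on the chest.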

Model architecture and evaluation

All sensor features and select metadata (age, sex) were provided as input to a machine-learning pipeline in Python. The pipeline architecture was designed to make maximal use of all data samples while also implementing external validation via split-sample validation. An overview of the pipeline is shown in Fig. 6. Training, validation, and testing splits prevented sample-wise data leakage across trained and evaluated partitions of the dataset and allowed us to evaluate how well the model generalizes to unseen data. The robustness of the model was further validated with an additional cross-validation method described in Supplementary Table 5.

Fig. 6: Machine Learning Cross-Validation Pipeline.

This figure provides a visual overview of the model pipeline architecture. a The entire feature dataset was randomly split into train-test sets (k = 1:100, 80–20 split) using a stratified shuffle split method (S.S.S.). The original ratio of COVID (C+) to NON-COVID (C-) samples is maintained within each train and test set. b An example of the internal modeling structure for each outer loop split. Training data from a single split (e.g., k = 3) was input to an internal hyperparameter tuning process, also employing S.S.S. cross-validation with n = 10 folds. Preprocessing steps, feature selection, and the classifier are shown vertically in gray. Note that in the model pipeline, the step “Balance Class Labels” handles imbalance in the training data using the Synthetic Minority Over-sampling Technique (SMOTE). The best parameters and the training data were used to refit a model with the same modeling steps. c The final metrics are reported as an average over all 100 outer loop splits.

A Stratified Shuffle Split technique was used to split the data into training and testing subsets (k = 100 folds) while preserving the ratio of COVID to NON-COVID cases in our sample84. Training data was further partitioned into training and validation subsets within a nested cross-validation procedure (Stratified Shuffle Split method, k = 10 folds) for the purpose of tuning hyperparameters. Optimal values of the internal model parameters were determined by a randomized search process84,85, evaluating each internal fold on 100 randomly selected combinations of model hyperparameters. Within this nested optimization and training procedure, the feature data were provided as input to a series of standardized pipeline processes:

  • Imputation of missing feature values using the k-Nearest Neighbors technique84,86 (reasons for missing data are given in Results, and the percentage of feature availability per sensor modality is presented in Table 2).

  • Normalization of features using the RobustScaler method, which is robust to outliers84.

  • Removal of outliers via the Isolation Forest algorithm84,87.

  • Removal of highly correlated features (method = ‘Pearson’, threshold = 0.95)88 to reduce feature dimensionality.

  • Handling of imbalanced training data via the Synthetic Minority Over-sampling Technique (SMOTE)89,90.

  • Feature selection via SelectKBest (n = 50), which scores each feature against the target variable84, to reduce feature dimensionality.

As the final step in the pipeline, we used an extreme gradient boosting decision tree algorithm (XGBoost) in a binary supervised learning task to classify COVID from NON-COVID cases91. We defined a test positivity cut-off of 0.5 for the model’s output probability, as this is the standard default in binary classification tasks. Model parameters specific to gradient tree algorithms and known to affect model fit were included in the hyperparameter optimization process: maximum tree depth, minimum child weight, learning rate, and gamma (Supplementary Table 1). The hyperparameters selected were those that resulted in the highest average F1-score (the harmonic mean of precision and recall) across the ten-fold cross-validation. The best performing model from the internal cross-validation procedure was fit to the corresponding training data using the selected hyperparameters and evaluated on the held-out test set to observe model generalization errors. For each of the train-test split iterations, the model was independently evaluated on recall, precision, F1-score, and Area under the Receiver Operating Characteristic Curve (AUC). The distribution of performance scores, optimized hyperparameter values, and feature importance across the 100 train-test splits are reported.
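The nested structure described above can be sketched with scikit-learn alone. This is a simplified illustration under several assumptions, not the study's pipeline: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost (so the XGBoost-specific minimum child weight and gamma are not tuned), the outlier removal, correlation filtering, and SMOTE steps are omitted because they need resampling-aware pipeline tooling, and the function name, search grid, and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import KNNImputer
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def nested_evaluation(X, y, n_outer=100, n_inner=10, n_iter=100, k_best=50, seed=0):
    """Average test metrics over stratified outer splits with inner tuning."""
    outer = StratifiedShuffleSplit(n_splits=n_outer, test_size=0.2, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        pipe = Pipeline([
            ("impute", KNNImputer()),
            ("scale", RobustScaler()),
            ("select", SelectKBest(f_classif, k=min(k_best, X.shape[1]))),
            # Stand-in for XGBoost (assumption made for this sketch).
            ("clf", GradientBoostingClassifier(random_state=seed)),
        ])
        search = RandomizedSearchCV(
            pipe,
            param_distributions={  # hypothetical search grid
                "clf__max_depth": [2, 3, 4, 5],
                "clf__learning_rate": [0.01, 0.05, 0.1, 0.3],
            },
            n_iter=n_iter,
            scoring="f1",
            cv=StratifiedShuffleSplit(n_splits=n_inner, test_size=0.2, random_state=seed),
            random_state=seed,
        )
        # Tuning and refitting see only the training partition of this split.
        search.fit(X[train_idx], y[train_idx])
        proba = search.predict_proba(X[test_idx])[:, 1]
        pred = (proba >= 0.5).astype(int)  # positivity cut-off of 0.5
        scores.append({
            "recall": recall_score(y[test_idx], pred, zero_division=0),
            "precision": precision_score(y[test_idx], pred, zero_division=0),
            "f1": f1_score(y[test_idx], pred, zero_division=0),
            "auc": roc_auc_score(y[test_idx], proba),
        })
    return {name: float(np.mean([s[name] for s in scores])) for name in scores[0]}


# Small synthetic demonstration with a few missing values.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan
demo_metrics = nested_evaluation(X_demo, y_demo, n_outer=2, n_inner=2, n_iter=2, k_best=3)
```

Placing every preprocessing step inside the pipeline means imputation, scaling, and feature selection are refit on each training partition, which is what prevents information from the validation and test partitions leaking into the fitted model.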