Introduction

In recent years, computer vision research has made significant strides in various medical applications1,2. AI analysis of medical videos has been used for patient monitoring, including clinical mobilization3, gait analysis4,5, evaluation of pediatric head injuries resulting from falls6, assessment of Parkinsonian hand movements and disease severity7,8, and detection of cognitive impairment9. Moreover, medical video technology shows promise in aiding medical staff with diverse tasks, including hand hygiene detection10 and surgical assessment through video analysis11,12.

Despite these advancements, one important area that remains largely unexplored is the use of medical video AI for patient triage. During triage and emergency department (ED) encounters, physicians determine patient disposition (such as hospital admission or discharge) based on the presenting problem, vital signs, expected clinical course, patient resources, and their observations of the patient’s condition, often referred to as “clinical gestalt”13,14,15. Notably, one study demonstrated that ED physicians can achieve an 80% accuracy rate in predicting disposition based on a 30-s observation of a patient, complemented by routinely available triage information such as vital signs, mode of arrival (e.g., ambulance), and chief complaints13.

While physician judgment is valuable, it can be limited in a busy ED where physicians may not be available at the time of triage. Existing systems like the Emergency Severity Index (ESI) offer insights into patient acuity but are not designed to predict patient disposition at triage. An AI-based predictive model could provide a more consistent, objective, cost-efficient, and scalable solution to optimize patient flow and resource allocation in real-time.

We hypothesize that mobile phone video AI algorithms can similarly extract valuable insights from patients’ clinical appearances to predict their disposition, effectively mimicking clinical gestalt. Successful implementation of such a triage algorithm could mitigate the challenges of overcrowded emergency rooms, particularly during peak viral seasons and pandemics16. Moreover, it could enable healthcare systems to allocate finite resources more efficiently and, in the future, prioritize first available appointments for urgent patient telehealth visits17. Managing hospital capacity and staffing effectively is a significant challenge in healthcare systems. Imbalances between bed availability and patient demand, along with staffing requirements, can impact hospital access, wait times, care quality, and overall satisfaction for both patients and staff18. When bed supply exceeds demand, it results in unnecessary costs due to underutilized resources. On the other hand, when demand surpasses available beds, it leads to longer wait times—especially for ED patients awaiting admission—lower care quality, higher risks of errors, and decreased satisfaction for both patients and staff, and in extreme cases diversion of patients to other hospitals. Predicting disposition early could help balance these issues, improving patient flow and resource use.

Studies have shown that machine learning models using data collected at triage and from electronic health records (EHR) can effectively predict patient disposition from the ED, including hospitalization and the need for critical care19,20,21,22,23,24,25,26,27. Various algorithms, such as logistic regression, random forests, gradient boosting, and deep neural networks, have been employed, with gradient boosting and deep neural networks often yielding better performance19,27.

However, many of these models rely on historical patient data from the EHR, such as past ED visits or hospitalizations20,23,24, or past diagnosis19,20,21,23,24,26. This dependence on historical data poses challenges, particularly for patients new to a health system or those transitioning between systems with incomplete or limited data. Additionally, the use of EHR-based models may limit applicability in developing countries where EHR systems are not yet widely available.

Some existing models also incorporate the ESI, a 5-point triage scoring system commonly used in EDs28,29, as a predictor22,23,24,27. While ESI scores are useful, they still require manual input from triage staff based on a patient’s presenting symptoms, vital signs, and clinical judgment regarding severity and resource needs. This reliance on human input limits the potential of algorithms to ease the triage burden.

The aim of this study is to assess the predictive performance of a multimodal AI algorithm for patient disposition from the ED, using only a short mobile phone video clip capturing the patient’s clinical appearance (Fig. 1) and a limited set of clinical data (age, sex, vital signs, pain level, and chief complaint), without relying on historical patient data from the EHR. We compare the performance of our multimodal model (Fig. 2) to a reference model based on logistic regression using ESI triage scores, as well as to ablated versions of our model that utilize only video data or only triage data. This is the first step towards our long-term goal of developing a video-based AI triage tool in pre-hospital settings that could be adopted without the need for EHR integration.

Fig. 1: Patient video recording and processing.

(a) Video recording was conducted using secured mobile devices and lasted ~5 min, with the camera at the base of the patient’s bed and the patient as upright as tolerated. Video recording did not interfere with patient care and was paused if the clinical team needed to interact with the patient. Videos were spot-checked regularly to ensure adherence to the study protocol. The figure was created with BioRender.com. (b) The video recording, without its audio track, was then processed via the ImageBind video processing pipeline, which uniformly samples up to five 2-s clips at 1 frame per second and takes three spatial crops (left, middle, right) per clip. These fifteen spatially cropped clips are passed into the vision encoder, and the resulting output is the mean of the clips’ 1024-dimensional representations. (The subject in the figure is a co-author, not a patient.)

Fig. 2: Schematic representation of the predictive model.

We illustrate our multi-modal late fusion method. 1) Collect triage data and a short video of the patient. 2) Pre-process the tabular data and encode the chief complaint and video with ImageBind pre-trained text and vision encoders, to obtain embeddings for each data modality (tabular, text, video). 3) Independently train random forest classifiers for each data modality. 4) Fuse the predictions of each of the trained Random Forests to get the final model prediction for patient disposition.

Results

Patient characteristics

We approached a total of 843 adult patients and enrolled 723 of them at the Stanford Health Care ED between August 2021 and September 2022. In the enrolled patients, the median age was 52 years (interquartile range [IQR] 33–76), and 51% were female (Table 1). Nearly all patients (over 95%) received a triage ESI score of 2 or 3, and 40.9% were subsequently admitted to the hospital (inpatient or hospital observation). Pain level data were missing for 97 patients, temperature was missing for two patients, and one patient was missing all vital signs. The most frequently reported chief complaints during triage were abdominal pain (16.5%), followed by chest pain (8.0%), shortness of breath (5.5%), and dizziness (3.9%) (Table 1).

Table 1 Characteristics of enrolled patients

Model performance

The model using video data alone predicted hospitalization better than the model using triage clinical data alone on several metrics, including AUROC (0.693 vs 0.678), PPV (0.563 vs 0.529), and specificity (0.658 vs 0.587). The two models were comparable on AUPRC (0.608 vs 0.603) and NPV (0.721 vs 0.716), while sensitivity was higher for the triage-data-only model (0.632 vs 0.658).

Combining both video and triage data resulted in the highest performance, achieving an AUROC of 0.714 (95% CI: 0.709–0.719) and an AUPRC of 0.642 (95% CI: 0.636–0.649). Additional performance details are shown in Fig. 3, Fig. 4, and Supplementary Table 1. Notably, all three models outperformed the reference ESI model on every metric except specificity.

Fig. 3: Comparison of receiver operating characteristic curve and precision-recall curve.

Comparison of (a) receiver operating characteristic curves and (b) precision-recall curves for the baseline ESI model, triage data-only model, video-only model, and late fusion video and triage data model. Each uni-modal model outperforms the baseline (p < 0.01), and the late fusion multi-modal model outperforms the uni-modal ones (p < 0.01). Square brackets indicate 95% confidence intervals.

Fig. 4: Comparison of sensitivity, specificity, PPV, and NPV of different models.

Comparison of (a) sensitivity, (b) specificity, (c) PPV, and (d) NPV (with 95% confidence intervals) for the baseline ESI model, triage data-only model, video-only model, and late fusion video and triage data model. The video-based models (video-only and late fusion) have better PPV (p < 0.01). The late fusion video and triage data model achieves the best sensitivity (p < 0.01 compared to ESI and video-only models; p < 0.05 compared to triage data-only model) and NPV (p < 0.01). The ESI baseline model has the best specificity (p < 0.01 compared to triage data-only and late fusion models; p < 0.02 compared to video-only model).

Furthermore, we conducted a Shapley value30 analysis (Supplementary Fig. 1) to determine the contribution of adding triage data and video to the overall model performance. A Shapley value quantifies the contribution of a data type as its average marginal contribution over all possible subsets of the other data types. We found that video data contributes an increase of 0.111 and triage data an increase of 0.096 in average model AUROC.
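To make the averaging over subsets concrete, the following is a minimal sketch of an exact Shapley computation for a small set of data modalities. The AUROC values in the `value` table, including the 0.50 assigned to the empty set, are illustrative assumptions and not the study's figures, so the output is not expected to reproduce the reported contributions.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly the players in `s` precede `p`
                # in a uniformly random ordering of all players.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


# Hypothetical AUROC for every subset of modalities (illustrative only).
value = {
    frozenset(): 0.50,                     # assumed chance-level baseline
    frozenset({"video"}): 0.70,
    frozenset({"triage"}): 0.68,
    frozenset({"video", "triage"}): 0.72,
}

phi = shapley_values(["video", "triage"], value)
print(phi)
```

By construction the two contributions sum to the gap between the full model and the empty-set baseline, which is why the choice of baseline matters when interpreting such values.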

Discussion

Physicians instinctively use visual cues from their interactions with patients to assess the severity of illness. This is commonly referred to as “the eye-ball test” or “the foot of the bed test”, which involves observing a patient from the foot of their hospital bed. Similarly, using visual AI, hospital admissions can be predicted from the video signal alone when a patient is asked to perform simple tasks. Our findings suggest that mobile phone video data contains valuable information for hospitalization prediction, with an AUROC of 0.693. Interestingly, triage data alone, which included an assessment by a medical professional using medical equipment (such as a pulse oximeter and a sphygmomanometer), had an AUROC of 0.678. When video AI is combined with triage data, the AUROC increases to 0.714. Our results suggest that the short video clips may capture the “clinical gestalt” that the AI algorithm detects and leverages.

Our study demonstrates the potential of utilizing short video clips of patients to predict their ED disposition. It is unexpected that the video-only model performed better than the triage-data-only model, considering the prominence of biometric data, such as vital signs, in clinical risk assessment, as indicated by previous studies15,19. One possible explanation is that video data implicitly encodes certain biometric parameters, such as respiratory rate and heart rate31,32, and also includes markers of patient distress, alterations in breathing patterns, and uncomfortable movements.

Other machine learning models have been developed to predict hospitalization from the emergency department19,20,21,22,23,24,25,26,27, but none have leveraged patient video data. With the widespread availability of smart mobile devices equipped with cameras, video data may become increasingly accessible to both patients and providers in healthcare settings. Unlike many models that rely on historical EHR data, our multimodal AI algorithm uses only routinely available triage data (age, sex, vital signs, pain level, and chief complaints). As a result, this predictive model does not require integration with existing EHR systems, making it easier to deploy. It could also be useful in countries or regions where EHR systems are not available. Unlike studies such as Raita et al.19 or Hong et al.24, we did not include the mode of ED arrival as a covariate, despite it being a known strong predictor of hospital admission33, because our goal was to develop an algorithm that could facilitate patient disposition decisions before hospital arrival.

This study has several limitations. Our outcome variable was admission decisions made by ED physicians for each patient. ED physicians may vary in their admission decisions based on factors such as patient severity of illness, patient resources (e.g., availability of a caregiver at home, homelessness, or access to transportation), practice style variation, and subjective risk assessment of likely clinical deterioration34,35,36. Such variability inherently sets an upper bound on how well an AI algorithm can perform in predicting hospitalization. Additionally, the videos were sourced from a single academic institution, resulting in a relatively small dataset from an AI training perspective. We anticipate that predictive performance will improve with larger and more diverse datasets from various healthcare settings. Next, the patient population in this study has higher acuity (only 3.6% with ESI 4 or 5), and we limited enrollment to English- or Spanish-speaking patients as well as those able to provide informed consent, which may not generalize to other ED populations. The admission rate of our enrolled cohort was 40.9%, which is slightly higher than the overall admission rate at the Stanford ED of 33.1%. This discrepancy is likely due to some low-acuity patients being treated and discharged quickly for simple issues (e.g., suture removal) before our research assistants could obtain consent. Apart from this, our cohort remains representative of the overall Stanford ED population, with a similar ESI distribution, where the majority of patients were categorized as ESI 2 and 3.

Furthermore, our algorithm did not incorporate audio or additional EHR data elements, such as past medical history, prior hospitalizations, or other social determinants of health (e.g., health literacy, homelessness), which may further enhance performance24. Our experiment represents a scenario where patients are new to the healthcare system, providing a baseline for algorithm performance using only routinely collected triage information (i.e., vital signs, pain level, and chief complaints). Lastly, we did not directly compare physician gestalt with our AI algorithm given the added study design complexity and coordination with more than 90 attending physicians, but it would be a valuable direction for future research.

Since routine recording of patient clinical appearance is not standard practice, this study uncovered several logistical and ethical challenges. Logistically, the research team had to secure approval from the institution’s information technology office to ensure all recording devices and the infrastructure for secure video data storage complied with institutional standards. In addition to obtaining IRB approval, we also sought review from the hospital’s privacy office to ensure full compliance with institutional policies. Early in the study, we learned the importance of avoiding the unintentional recording of hospital staff in the patient’s room to protect their privacy. To address this, our research assistants received extensive training not only on how to operate the recording devices but also on minimizing disruptions to clinical workflows.

Ethically, a key concern is the potential for video recordings to capture sensitive situations, such as patient distress or unexpected clinical events. Obtaining informed consent from patients is critical, as they must understand how their video data will be used and the risks of sharing such sensitive information. Patients are also given the option to request the removal of their data from the study after their ED encounter, recognizing that the stress of the situation may lead to second thoughts. Future studies should address these challenges by establishing clear protocols that prioritize patient privacy and safety, while adhering to institutional and regulatory standards.

Despite these challenges, advances in video AI research offer promising opportunities for future triage workflows. In a future scenario, patients could check in at kiosks equipped with mobile devices, where a brief video recording, combined with vital signs measurements, could aid in triage. Alternatively, patients could remotely provide similar information using their own mobile devices, potentially streamlining the triage process and diverting low-risk patients to telemedicine encounters for final disposition. Increasing appropriate healthcare access is particularly critical during periods of overcrowding, disease outbreaks, and in underserved areas both nationally and internationally. Video AI innovations have the potential to alleviate healthcare capacity strains—delivering the right care, at the right time, and in the right place for patients.

Methods

Data acquisition—video and triage data

We enrolled adult patients (aged 18 years and older), who were English and/or Spanish speaking, undergoing evaluation at the Stanford Health Care ED between August 2021 and September 2022. Stanford Health Care ED is a suburban tertiary academic medical center in Palo Alto, California, with an annual patient volume of over 110,000. The distribution of patients by ESI is as follows: ESI 1—0.7%, ESI 2—27.6%, ESI 3—60.8%, ESI 4—10.2%, and ESI 5—0.8%. The overall admission rate for adult patients is 33.1%. The ED is staffed by 93 attending physicians, 60 resident physicians, 15 advanced practice providers, and 190 registered nurses. Patients were excluded if they presented with psychiatric emergencies, major trauma, altered mental status (e.g., significant delirium), clinical deterioration requiring immediate intervention (such as intubation), or were otherwise unable to provide consent. Due to privacy concerns related to video recording, we excluded patients who were placed in the hallway. However, these cases were uncommon, as our facility, which opened in 2019, has expanded capacity to accommodate a high volume of patients. Prior to participation, all patients provided informed consent for video recording and study participation.

Research assistants (RAs) received 10 h of training, which included observing patient interactions by faculty members to ensure video fidelity and ethical patient engagement. RAs collaborated with ED attending physicians and charge nurses to identify eligible patients during morning, afternoon, and evening shifts on weekdays and weekends throughout the study period. On average, an RA enrolled 2–4 eligible patients per shift.

Video recording was conducted using secured mobile devices (iPhone 12, Apple Inc) and lasted ~5 min. The camera was positioned at the base of the patient’s bed, with the patient in an upright position as tolerated (Fig. 1). Video recording did not interfere with patient care and was paused if clinical staff needed to interact with the patient. Videos were regularly spot-checked to ensure adherence to study protocol.

During video recording, patients were verbally instructed to perform seven tasks (Supplementary Table 2) designed to capture their clinical appearance and behaviors. The tasks were designed to be simple for patients to perform and intended to mimic a range of observations typically made during a clinical encounter. These tasks included assessment of general appearance (looking into the camera at the beginning and end), respiration (taking a single deep breath; counting after taking a deep breath), orientation (answering three simple questions: “What is your name?”, “Where are you now?”, and “What is today’s date?”), and standardized movements (performing a modified finger-to-nose test and covering the left eye with the right hand). Although audio was recorded, it was not included in our models.

We extracted 14 clinical data elements from the electronic health record (EHR): age (in years), sex as recorded in the EHR (female or male), self-reported race and ethnicity group (Asian, Black or African American, Hispanic or Latino, Native American or Pacific Islander, Non-Hispanic White, Other), and primary reasons for the ED visit (free text). Additionally, six triage vital signs were collected: body temperature, heart rate (beats per minute), respiratory rate (respirations per minute), pulse oximetry (peripheral hemoglobin oxygen saturation %), and systolic and diastolic blood pressure (mmHg). The nurse-generated triage score was recorded using the ESI28,29, where 1 indicates the most urgent and 5 the least urgent. We also collected the initial pain level reported at triage (on a scale of 0 = no pain to 10 = severe pain) and the final disposition (discharge or hospital admission).

The primary reasons for ED visits, known as chief complaints (e.g., chest pain, shortness of breath, dizziness), were entered as free text by nursing staff. The ESI is a commonly used triage tool that reflects the triage nurse’s assessment of the severity of the patient’s presenting illness and downstream resource utilization28. Hospital admission, our dependent variable, included patients admitted to the ED observation unit or an inpatient clinical service (e.g., medicine, cardiology, neurology), as well as those transferred to another hospital.

All patient data was securely stored on encrypted devices, following security best practices. The study team completed a data risk assessment and privacy review through Stanford Health Care. The research protocol was approved by the Stanford University Institutional Review Board (IRB protocol #61666).

Data processing—video and triage data

Our tabular data include nine elements: the patient’s age and sex, six triage vital signs (body temperature, heart rate, respiratory rate, oxygen saturation, and systolic and diastolic blood pressure), and triage pain level. We use these nine data elements to create a 9-dimensional vector representation of the data. Missing feature values were imputed using the mean value from the training set, with the exception of pain level, which was imputed with zero. Our assumption is that if the triage nurse did not record pain level, the patient was likely not experiencing pain.
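The imputation rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the feature names, the use of `None` for missing values, and the helper names `fit_imputation_values` and `impute` are all hypothetical. The key point it shows is that fill values are computed on the training rows only and then reused for held-out rows.

```python
def fit_imputation_values(train_rows, zero_fill=("pain",)):
    """Compute per-feature fill values from the training rows only."""
    fill = {}
    for name in train_rows[0].keys():
        if name in zero_fill:
            fill[name] = 0.0  # a missing pain level is treated as "no pain"
            continue
        observed = [row[name] for row in train_rows if row[name] is not None]
        fill[name] = sum(observed) / len(observed)  # training-set mean
    return fill


def impute(rows, fill):
    """Return copies of the rows with None replaced by the fitted values."""
    return [
        {name: (fill[name] if value is None else value)
         for name, value in row.items()}
        for row in rows
    ]


# Toy rows with hypothetical feature names; None marks a missing value.
train = [
    {"heart_rate": 80.0, "temperature": 37.0, "pain": 4.0},
    {"heart_rate": 100.0, "temperature": None, "pain": None},
]
test = [{"heart_rate": None, "temperature": 38.0, "pain": None}]

fill = fit_imputation_values(train)
train_imputed = impute(train, fill)
test_imputed = impute(test, fill)  # reuses training statistics, no leakage
print(test_imputed)
```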

We used frozen pre-trained ImageBind37 text and vision encoders to encode the patient chief complaint free text and the patient videos. ImageBind is a method that learns a shared embedding space for six different data modalities (text, image/video, audio, depth, thermal, inertial measurement unit) through contrastive learning38,39,40,41. These multi-modal contrastive trained models have been shown to perform well on many modality-specific downstream tasks, even out-performing their uni-modal trained counterparts42.

Specifically, the free text is tokenized into a byte-pair encoding and encoded using a transformer-based architecture to obtain a 1024-dimensional embedding. The ImageBind video processing pipeline involves uniformly sampling up to five 2-s clips at 1 frame per second and taking three spatial crops (left, middle, right) per clip. Then, these fifteen spatially cropped clips are passed into the vision encoder and the resulting output is the mean of the clips’ 1024-dimensional representations (Fig. 1).
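The clip-and-crop arithmetic can be illustrated with a short sketch. It is not the ImageBind implementation: the even spacing of clip start times is an assumed sampling scheme, the function names are hypothetical, and a 4-dimensional toy vector stands in for each 1024-dimensional encoder output.

```python
def clip_start_times(duration_s, clip_len_s=2.0, max_clips=5):
    """Evenly spaced clip start times covering the video (assumed scheme)."""
    last_start = max(duration_s - clip_len_s, 0.0)
    if last_start == 0.0:
        return [0.0]  # video too short for more than one clip
    step = last_start / (max_clips - 1)
    return [i * step for i in range(max_clips)]


def mean_pool(embeddings):
    """Average a list of equal-length embedding vectors element-wise."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[j] for vec in embeddings) / n for j in range(dim)]


CROPS = ("left", "middle", "right")

starts = clip_start_times(duration_s=30.0)
# Five clips times three spatial crops gives fifteen views of the video.
views = [(start, crop) for start in starts for crop in CROPS]

# Stand-in for the vision encoder: one toy 4-dimensional vector per view.
toy_embeddings = [[float(i)] * 4 for i in range(len(views))]
video_embedding = mean_pool(toy_embeddings)
print(len(views), video_embedding)
```

Averaging the per-view vectors yields a single fixed-length representation per video regardless of how many clips were actually sampled, which is what allows a standard classifier to consume it.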

Modeling

We developed a late fusion model (Fig. 2) to classify whether a patient should be admitted or discharged from the ED. Late fusion is a technique that aggregates the predictions of multiple models to make a final classification. In contrast, early fusion models concatenate embeddings and train a single classifier43. We chose late fusion for its flexibility in handling varying availability of input data.

Since the individual classifiers are trained independently, this design allows for flexibility in the training set, as it does not require all data modalities to be present for each patient. At inference time, the model can easily manage missing data modalities by excluding their contribution in the final weighted average. In contrast, an early fusion model would require the input size of the concatenated embeddings to remain consistent during both training and inference. We also found experimentally that late fusion outperformed early fusion (Supplementary Tables 3, 4). In our approach, we trained independent Random Forest44 classifiers for each data modality: triage tabular data, triage chief complaint text, and patient video. A random forest classifier is a classical machine-learning algorithm that combines the outputs of multiple decision trees. Each tree is built by learning the optimal decision rules to classify the data based on its features. We also experimented with other classifiers such as XGBoost and AdaBoost, but found that Random Forest models outperformed the others (Supplementary Tables 3–5).

For the video data, the model was trained on a single task. At inference time, we combined the predicted probabilities from the video data with triage data using the following weighted average:

$$\hat{y}=\alpha\,\hat{y}_{\text{video}}+(1-\alpha)\,\hat{y}_{\text{triage}} \tag{1}$$
$$\hat{y}_{\text{triage}}=0.5\,\hat{y}_{\text{tabular}}+0.5\,\hat{y}_{\text{chief complaint}} \tag{2}$$

Here in Eqs. (1) and (2), \(\hat{y}\) represents the multimodal model’s predicted probability of admission, and \({\hat{y}}_{{modality}}\) denotes the modality-specific model’s prediction, where modality is either video or triage data (tabular or chief complaint free text). The scalar mixing coefficient α lies between 0 and 1.
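Eqs. (1) and (2) translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the handling of a missing modality (dropping it and renormalizing the remaining weights) is one plausible reading of "excluding their contribution", stated here as an assumption.

```python
def fuse_triage(p_tabular=None, p_chief_complaint=None):
    """Eq. (2): equal-weight average of the available triage predictions."""
    available = [p for p in (p_tabular, p_chief_complaint) if p is not None]
    if not available:
        return None  # no triage information at all
    return sum(available) / len(available)


def late_fusion(p_video=None, p_triage=None, alpha=0.5):
    """Eq. (1); weights are renormalized if a modality is missing (assumed)."""
    weighted = [(alpha, p_video), (1.0 - alpha, p_triage)]
    present = [(w, p) for w, p in weighted if p is not None]
    total_weight = sum(w for w, _ in present)
    if total_weight == 0.0:
        raise ValueError("no modality with non-zero weight is available")
    return sum(w * p for w, p in present) / total_weight


# Toy predicted admission probabilities from the three per-modality models.
p_triage = fuse_triage(p_tabular=0.6, p_chief_complaint=0.2)
p_all = late_fusion(p_video=0.8, p_triage=p_triage)   # both modalities
p_no_video = late_fusion(p_video=None, p_triage=p_triage)  # video missing
print(p_triage, p_all, p_no_video)
```

With both modalities present and `alpha=0.5`, the function reduces exactly to Eq. (1); the renormalization only matters when an input is absent.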

Our sensitivity analysis compared single-task models trained on the different video segments, in each of which patients performed a different task (Supplementary Tables 3–5). The orientation segment had an AUROC and an AUPRC significantly greater than most other segments, so we selected it for use in the algorithm.

Experimental details

We compared the predictive performance of our multi-modal model to a reference model that uses logistic regression on the ESI triage severity scores19, as well as ablated versions of our model using video-data only and triage-data only.

The following metrics were used to assess the performance of the models:

  • Area Under the Receiver Operating Characteristic Curve (AUROC): This metric quantifies the overall ability of the model to discriminate between positive (i.e., hospitalization) and negative (i.e., discharge) cases across all possible threshold values. A higher AUROC indicates better model performance.

  • Area Under the Precision-Recall Curve (AUPRC): This metric emphasizes the model’s performance on the positive class, particularly useful for imbalanced datasets. A higher AUPRC indicates better precision and recall trade-off.

  • Sensitivity (Recall): The proportion of actual positives correctly identified by the model (true positive rate). High sensitivity means the model effectively identifies positive cases.

  • Specificity: The proportion of actual negatives correctly identified by the model (true negative rate). High specificity indicates the model effectively identifies negative cases.

  • Positive Predictive Value (PPV or Precision): The proportion of positive predictions that are truly positive. High PPV indicates that few of the predicted positives are false positives.

  • Negative Predictive Value (NPV): The proportion of negative predictions that are truly negative. High NPV indicates that few of the predicted negatives are false negatives.
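The threshold-based metrics above follow from the four confusion-matrix counts, and AUROC can be computed as the fraction of positive-negative pairs that the model ranks correctly. The sketch below is a minimal illustration on toy data; the function names are hypothetical, the `>=` comparison at the threshold is an assumption, and AUPRC is omitted for brevity.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(y_true, y_score, threshold):
    """Sensitivity, specificity, PPV and NPV at a given decision threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0  # assumed tie handling
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def auroc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels (1 = admitted, 0 = discharged) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]

metrics = threshold_metrics(y_true, y_score, threshold=0.4)
area = auroc(y_true, y_score)
print(metrics, area)
```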

For all models, we performed nested stratified threefold cross-validation, repeated over ten random seeds, to conduct our experiments and hyperparameter tuning. We stratified the folds by the patient to prevent data leakage between training and testing. Nested cross-validation involves performing cross-validation on the training set for each outer cross-validation fold and tuning hyperparameters to maximize the average performance on the nested validation sets. Because the outer validation set of each fold is never used for tuning, this common technique reduces the optimistic bias that would arise from tuning hyperparameters directly on that set.
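The structure of nested stratified cross-validation can be sketched with index bookkeeping alone. This is a simplified illustration, not the study's pipeline: the round-robin fold assignment, the function names, and the toy label vector are assumptions, and the model fitting and hyperparameter search that would run inside each split are left as comments.

```python
import random


def stratified_folds(labels, k, seed):
    """Assign each index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices out round-robin
    return folds


def nested_splits(labels, k=3, seed=0):
    """Yield (inner_splits, outer_train, outer_test) for each outer fold."""
    outer = stratified_folds(labels, k, seed)
    for f, test_idx in enumerate(outer):
        train_idx = [i for g, fold in enumerate(outer) if g != f for i in fold]
        train_labels = [labels[i] for i in train_idx]
        # The inner folds are built from the outer training set only.
        inner = stratified_folds(train_labels, k, seed + 1)
        inner_splits = []
        for h, val_pos in enumerate(inner):
            fit_pos = [p for g, fold in enumerate(inner) if g != h for p in fold]
            inner_splits.append((
                [train_idx[p] for p in fit_pos],   # fit candidate models here
                [train_idx[p] for p in val_pos],   # score hyperparameters here
            ))
        # After tuning: refit on train_idx, report performance on test_idx.
        yield inner_splits, train_idx, test_idx


# Toy cohort: 6 admitted (1) and 9 discharged (0) patients.
labels = [1] * 6 + [0] * 9
splits = list(nested_splits(labels, k=3, seed=0))
print(len(splits), [len(test) for _, _, test in splits])
```

Repeating the procedure over several values of `seed` corresponds to the repeated cross-validation described above.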

For the logistic regression baseline, we selected the inverse of regularization strength from the following values: {0.1, 0.5, 1, 2, 5, 10}. For each Random Forest classifier in our multi-modal model, we selected the number of estimators from {10, 50, 100} and the maximum tree depth from {1, 2, 3, None}. The “None” option allows nodes to be expanded until all leaves are pure or until all leaves contain fewer than two samples. We stratified our dataset by the admission rate, ensuring even distribution between the training and validation sets within each fold.

Given the limited size of our data, we opted for repeated k-fold cross-validation to obtain a more reliable estimate of model performance, choosing a lower value of k = 3 for a better estimate of model generalization.

For simplicity, our experiments used a mixing coefficient \(\alpha =0.5\). We report the results of our multi-modal late fusion model with a binary classification threshold of 0.4. This threshold was selected through a grid search of values between 0.1 and 0.9 (at intervals of 0.1), identifying the value that maximized the average Youden’s index across all training folds.
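The threshold search can be sketched as a small grid search over Youden's index (sensitivity + specificity − 1), averaged across training folds. The fold data below are toy values, the function names are hypothetical, and the `>=` comparison at the threshold is an assumption; each fold must contain both classes for the index to be defined.

```python
def youden_index(y_true, y_score, threshold):
    """Youden's J = sensitivity + specificity - 1 at a given threshold."""
    tp = fn = tn = fp = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0  # assumed tie handling
        if truth == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def select_threshold(folds, grid=None):
    """Pick the grid value with the highest mean Youden's J across folds.

    `folds` is a list of (y_true, y_score) pairs, one per training fold.
    """
    if grid is None:
        grid = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

    def mean_j(threshold):
        return sum(youden_index(y, s, threshold) for y, s in folds) / len(folds)

    return max(grid, key=mean_j)


# Toy training folds: labels (1 = admitted) and fused predicted probabilities.
folds = [
    ([1, 1, 0, 0], [0.80, 0.45, 0.35, 0.10]),
    ([1, 1, 0, 0], [0.90, 0.42, 0.38, 0.20]),
]
best = select_threshold(folds)
print(best)
```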