Main

Aging is a complex, multifaceted process involving molecular, cellular and organ-level changes that ultimately impact whole-organism health and survival1. Understanding how these changes contribute to increased disease susceptibility is essential for developing interventions that extend healthspan2,3. Biological age (BA), a measure of accumulated biological damage relative to an average individual of the same chronological age (CA), has emerged as a key metric for assessing age-related disease risk1. BA can diverge from CA, providing a valuable indicator of aging trajectories and health outcomes4.

Initially, BA estimation relied on the measurement of DNA methylation and transcription patterns2,5, but recent advancements have expanded aging clocks to incorporate imaging and multi-omics data, improving the accuracy and comprehensiveness of BA predictions6,7,8. For example, mass spectrometry and antibody-based proteomics and metabolomics have enabled large-scale serum analyses, generating valuable resources for aging research9. Furthermore, various medical imaging and electronic health record (EHR) modalities have provided assessments of organ functional aging and linked them to health and disease10,11,12,13,14. These innovations highlight the variability in aging across organs and their differing responses to external factors such as lifestyle or medications, paving the way for personalized anti-aging strategies15,16.

Growing interest in aging mechanisms and interventions has spurred the development of aging clocks, molecular markers that predict BA more precisely than CA, which measures only the passage of time. Unlike CA, which is static, BA reflects the efficiency of biological functions using genomic, epigenetic, clinical and functional markers17,18. Genomic markers are fixed at birth, whereas epigenetic markers, such as DNA methylation and histone modifications, change with age4,19,20.

In theory, individuals of the same CA should exhibit similar rates of functional decline. However, genetic and environmental factors influence cellular, tissue and organ aging, making some individuals age more quickly or slowly biologically compared to their CA. This discrepancy, quantified as the difference between predicted BA and CA, is known as the age gap21. Studies have shown that an increased age gap is associated with accelerated aging and heightened disease risk and mortality10,11,12,13,14. For example, individuals with increased brain age gaps often exhibit systemic aging features, such as sensory-motor decline and older appearance22. Accelerated aging is particularly evident in individuals with chronic diseases, suggesting that disease burden further drives biological aging23. Conversely, accelerated biological aging may also serve as a strong determining factor in shaping disease risk before onset, as demonstrated through incident disease and multimorbidity analyses24. By developing reliable BA measures, aging clocks hold promise for extending healthspan and improving quality of life, a crucial goal as human life expectancy continues to rise25.

Despite substantial progress in adult aging clocks, our understanding of a full life cycle clock, particularly during infancy and childhood, and its impact on health and disease remains limited15,19,20,26. In pediatrics, rapid physiological changes in children represent a scripted developmental progression rather than accumulated biological aging damage. This makes the current BA definition ill-posed in the pediatric context, in which the term ‘physiological maturity clock’ may be more appropriate. Furthermore, studying the link between clock deviations and their clinical implications would have great potential in pediatric care. For example, a desirable goal is to calculate ‘physiological maturity’ deviations relative to peers, distinguishing precocious maturation or puberty from growth or developmental delay and malnutrition. This information may provide a useful clinical interpretation and facilitate pediatric care and interventions in the setting of pediatric growth charts or preventive screening programs.

This study introduces LifeClock, a full life cycle biological clock leveraging 24.6 million EHRs, including laboratory test data, to predict BA across all life stages and assess its association with disease risk and survival outcomes. Physicians traditionally focus on EHR indicators/laboratory values that exceed reference ranges, yet normal values also contain valuable insights. Integrating longitudinal data—regardless of whether values are normal or abnormal—can help identify individual-specific setpoints and fluctuations, improving disease risk assessment and the detection of critical aging transitions25. Although deep learning models have the capacity to extract such information, previous studies have largely focused on specific diseases within narrow age ranges27,28,29. Previous studies on BA models derived from routine clinical labs and vitals (for example, PhenoAge, Klemera-Doubal and DOSI) and EHR were largely focused on single-visit or cross-sectional studies, making it difficult to capture longitudinal trajectories to achieve good predictive performance or clinical interpretability.

To further advance precision aging health research and clinical applications, virtual representations of individual patients30 were generated using massive (24,633,025) longitudinal EHR data through EHRFormer, a transformer-based model. This approach enabled high-granularity modeling of aging processes, enhancing our understanding of the interplay between biological aging and disease risk, allowing us to identify distinct clinical patterns correlated with age, stratifying individuals into unique clusters with varying disease trajectories.

Using unsupervised learning, we trained EHRFormer to extract features from vast patient data spanning birth and childhood to adulthood and the geriatric phase, to provide more accurate BA estimates than CA. The model integrates EHR data that reflect the functioning of multiple organ systems—including blood, immune, liver and kidney—while also accounting for sex differences in aging patterns. Model performance was evaluated by comparing predicted BA to CA using R², the Pearson correlation coefficient (PCC) and mean absolute error (MAE), demonstrating high accuracy, particularly in younger individuals with more uniform developmental trajectories. By focusing on biological aging, our study also identifies age gaps—notable divergences between BA and CA—as critical biomarkers for disease risk prediction and patient stratification. This is especially relevant in cases of accelerated aging, which correlate with increased disease risk across both younger and older populations. Our study presents a novel framework for studying aging and age-related diseases across the full life cycle, leveraging widely available and cost-effective EHR data to advance precision medicine in aging research.
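The three evaluation metrics named above can be illustrated with a minimal sketch. It assumes BA is treated as the prediction of CA; the toy ages and the function names are hypothetical and are not taken from the study.

```python
import math

def mae(ca, ba):
    """Mean absolute error between chronological and predicted ages."""
    return sum(abs(c - b) for c, b in zip(ca, ba)) / len(ca)

def pearson(ca, ba):
    """Pearson correlation coefficient (PCC)."""
    n = len(ca)
    mc, mb = sum(ca) / n, sum(ba) / n
    cov = sum((c - mc) * (b - mb) for c, b in zip(ca, ba))
    var_c = sum((c - mc) ** 2 for c in ca)
    var_b = sum((b - mb) ** 2 for b in ba)
    return cov / math.sqrt(var_c * var_b)

def r_squared(ca, ba):
    """Coefficient of determination, treating BA as the prediction of CA."""
    mc = sum(ca) / len(ca)
    ss_res = sum((c - b) ** 2 for c, b in zip(ca, ba))
    ss_tot = sum((c - mc) ** 2 for c in ca)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data in years; not study data.
ca = [5.0, 12.0, 30.0, 45.0, 70.0]
ba = [6.0, 11.0, 33.0, 44.0, 68.0]
print(mae(ca, ba), pearson(ca, ba), r_squared(ca, ba))
```

MAE is expressed in years and penalizes every error equally, whereas R² and PCC describe how well the predictions track CA across the whole age range, which is why the three are reported together.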

Results

A blood test-based biological clock in a full life cycle using longitudinal EHRs

To construct a virtual representation of human health from rich, longitudinal EHRs, we first had to overcome inherent data challenges such as heterogeneity, missing values and cohort-specific batch effects. To address this, we developed a foundation model, EHRFormer (Fig. 1a), using data from multiple cohorts (Table 1 and Extended Data Fig. 10), starting with 184 carefully selected clinical indicators (Supplementary Table 1) from the China Healthy Aging Investigation (CHAI). The model’s architecture incorporates several key strategies: an input–output dual stochastic masking strategy to capture complex feature interactions while imputing missing data (Fig. 1b), and a cohort-agnostic adversarial training model to eliminate batch effects, ensuring the representations are robust and generalizable (Fig. 1c). Furthermore, an autoregressive training approach was used to ensure each visit’s representation captures an individual’s evolving health trajectory by learning from past and present records to predict the future (Fig. 1a,e,f).
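One plausible reading of the input–output dual stochastic masking idea is sketched below. The scheme, the function name `dual_mask`, the masking probabilities and the toy visit are all assumptions made for illustration; the actual procedure used by EHRFormer is not specified in this section.

```python
import random

MISSING = None  # marker for a test that was not ordered at this visit

def dual_mask(values, p_in=0.3, p_out=0.3, rng=random):
    """Return (model_input, loss_positions) for one visit.

    Assumed scheme: an input mask hides some observed values from the
    model, and an independent output mask chooses which observed values
    are scored during reconstruction. Truly missing values are never scored.
    """
    model_input, loss_positions = [], []
    for i, v in enumerate(values):
        observed = v is not MISSING
        hide = observed and rng.random() < p_in
        model_input.append(MISSING if (hide or not observed) else v)
        if observed and rng.random() < p_out:
            loss_positions.append(i)
    return model_input, loss_positions

# Hypothetical visit with six indicators, two of them not measured.
rng = random.Random(0)
visit = [4.2, MISSING, 71.0, 138.0, MISSING, 5.6]
model_input, loss_positions = dual_mask(visit, rng=rng)
print(model_input, loss_positions)
```

Under this reading, the model sometimes has to reconstruct a value it was not shown, which encourages it to learn interactions between indicators, while unmeasured indicators never contribute to the loss.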

Fig. 1: EHRFormer architecture and applications for longitudinal EHR data analysis.

a, Computational framework for generating comprehensive digital representations from sequential EHR data using EHRFormer. Multiple self-supervised pretraining tasks (reconstruction, cohort discrimination, missing data discrimination and next-visit prediction) optimize the model for EHR feature extraction. b, Input–output mask reconstruction mechanism with attention-based encoding for handling masked and missing values. c, Bias-mitigation strategies employing discriminators to address missing data patterns and cohort-specific variations. d, Biological developmental clock for developmental processes (<18 years) and biological aging clock for aging processes (>18 years) across the human lifespan. e, Disease risk stratification and temporal development patterns derived from patient clustering, illustrating differential risk trajectories. f, Clinical application diagram depicting first occurrence disease diagnosis and future disease prediction based on longitudinal patient visit data using EHRFormer.

Table 1 Demographic characteristics of the study cohorts

Using EHRFormer, we generated digital representations from each visit of healthy individuals and developed a task-specific regression model to predict CA, with the predicted CA values serving as BA estimates (Fig. 1d). This BA clock demonstrated strong overall performance in the internal validation cohort, achieving a low MAE, high R² and high PCC when compared with CA, indicating that laboratory tests alone can reliably estimate CA (Fig. 2a). Our analysis revealed two distinct aging patterns: a pediatric phase (birth to 18 years) and an adult phase (18 years onward), which were characterized by markedly different profiles of laboratory markers (Fig. 2a,b). Consequently, we trained separate, specialized models for each phase, which substantially improved prediction accuracy (Fig. 2c,e).

Fig. 2: Overall BA prediction model, specific models and associated features for predictions on the pediatric development clock and adult aging clock.

a, The strong correlation between CA and BA predicted by the EHRFormer-based age model. Each dot represents one EHR data point. b, Heatmap showing the average values of multiple clinical parameters (z-score normalized across the population) clustered by their age-related trajectories, with data aggregated at one-year intervals across the entire study population. c, Correlation between CA and BA in the pediatric clock predicted by the EHRFormer-based age model. Each dot represents one EHR data point under 18 years of age. d, SHAP values of the top 20 contributors in the BA prediction for EHR data under 18 years of age. e, Correlation between CA and BA in the adult clock predicted by the EHRFormer-based age model. Each dot represents one EHR data point over 18 years of age. f, SHAP values of the top 20 contributors in the BA prediction for EHR data over 18 years of age.

Visual explanations using ‘Shapley additive explanations’ (SHAP) identified the key contributors for each clock. The pediatric clock was primarily driven by low aspartate aminotransferase (AST), high creatinine (crea) and high total protein (TP) levels (Fig. 2d). In contrast, the adult clock’s most influential features were high urea, low albumin (ALB) and high red cell distribution width (RDW) (Fig. 2f), with the top 20 markers being almost entirely different between the two clocks. The model’s performance was consistent across sexes (Extended Data Fig. 1a,c,e,g), though feature contributions varied slightly (Extended Data Fig. 1b,d,f,h). Importantly, EHRFormer’s predictive power was validated in the external UK Biobank cohort, where it achieved an MAE of 4.14 (Extended Data Fig. 7a). Key aging biomarkers such as urea, ALB and RDW were identified as top contributors in both the CHAI and UK Biobank cohorts, demonstrating their cross-cohort stability (Extended Data Fig. 7b).

LifeClock predicts current and future disease risks in both children and adults

We applied our EHRFormer-derived representations for dimensionality reduction using principal component analysis (PCA) and uniform manifold approximation and projection (UMAP), followed by Leiden clustering analysis29. Our results revealed that, among healthy individuals, different CA groups could be clearly clustered, indicating that EHR data contain age-related information (Extended Data Fig. 2j). Furthermore, data from different hospitals or cohorts were evenly distributed across clusters, particularly when separating those under 18 and over 18 years of age, indicating the successful elimination of batch effects (Extended Data Fig. 2).

Given the well-established link between BA and disease risks31, we performed dimensionality reduction followed by a Leiden clustering analysis on the entire CHAI dataset and examined whether individuals with higher age differences were more likely to develop diseases. We also explored potential associations between the clusters and diseases (Fig. 3 and Supplementary Tables 2 and 3). Our aging model, built on the EHRFormer framework and trained on healthy individuals, computed an age difference for each individual in the CHAI dataset by quantifying deviations from the individual’s BA relative to same-CA peers through analysis of EHR profiles (Methods). A total of 64 Leiden clusters were obtained from all EHR representations (Fig. 3b). We classified adult EHRs into two categories, average-aged (age difference within ±1 s.d.) and over-aged (age difference > 3 s.d.), and then calculated the prevalence and incidence proportions of different diseases within each cluster. Our results showed that, for most diseases, a markedly higher disease prevalence proportion was present in over-aged individuals when compared to average-aged individuals within the same cluster, which was further increased in the future (Fig. 3c,d). In addition, some diseases within certain clusters, such as hypoglycemia, may exhibit a higher incidence proportion in the future in the over-aged individuals, even though these over-aged individuals may not have demonstrated a higher prevalence proportion (Fig. 3e). In summary, these results suggest that the EHRFormer-based aging model may not only demonstrate the present health status but also indicate future disease risk based on current EHR profiles.
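The two-category labeling described above can be sketched as follows. This assumes that the age difference is BA minus CA and that it is standardized against a reference distribution of age differences; the reference values, the function name and the "unclassified" label are hypothetical.

```python
import statistics

def classify_age_difference(diff, ref_mean, ref_sd):
    """Label one visit by its standardized age difference (BA - CA)."""
    z = (diff - ref_mean) / ref_sd
    if abs(z) <= 1.0:
        return "average-aged"
    if z > 3.0:
        return "over-aged"
    return "unclassified"  # falls in neither of the two categories

# Hypothetical reference age differences in years; not study data.
reference = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
ref_mean = statistics.mean(reference)
ref_sd = statistics.pstdev(reference)

labels = [classify_age_difference(d, ref_mean, ref_sd) for d in (0.5, 2.0, 6.0)]
print(labels)
```

Visits whose standardized difference lies between 1 and 3 s.d., or below −1 s.d., belong to neither group in this sketch, which keeps the comparison between clearly typical and clearly over-aged records.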

Fig. 3: Clusters generated based on EHRFormer-representations are informative of current and future health status.

a, UMAP projection of patient visits colored by age (yellow to purple: young to old). Each point represents one EHR data point (one patient visit). b, Leiden cluster distribution of the UMAP in a. c–e, The prevalence proportion of pulmonary heart disease (c), hirsutism (d) and hypoglycemia (e) in the average-aged group (left), the current over-aged group (middle) and the future over-aged group (right). f, Heatmap showing the adjusted log2(HR) of 42 diseases in each of the 14 clusters of <12-year-old pediatric EHR data. g, Heatmap showing the log2(HR) of 58 common diseases in each of the 50 clusters using >18-year-old adult EHR data. Color from blue to red indicates the log2(HR) in a specific cluster truncated at a maximum absolute value of 2.

Because clusters can serve as indicators of future disease risks, and the EHR representations for children (<18 years old, clusters 1–14) were well separated from those for adults (>18 years old, clusters 15–64), we analyzed the disease risks separately within the children (0–20 years old) and adult (>20 years old) clusters. For each identified cluster, we used Cox proportional hazards models for incidence calculations, using the cluster assigned at each patient’s first clinical visit as a baseline predictor for the remainder of the study population, and applying multivariate adjustment for age, sex, hospital, smoking and alcohol history to minimize potential confounding from demographic factors and institutional variations. Sex was included as a covariate rather than used for stratification to maintain statistical power across all clusters. In clusters 1–14, we calculated adjusted log2 hazard ratios (HRs) for incidence between ages 12 and 20 years (the childhood maturation period) using EHR data from individuals <12 years of age (before that period), and observed that individuals within different clusters exhibited distinct tendencies to develop specific pediatric disease conditions. For instance, we found that individuals in cluster 14 had 15.36 times and 11.07 times higher risk of developing pituitary hyperfunction and obesity, respectively; individuals in cluster 12 had a 10.13 times higher risk of developing hernia; individuals in cluster 3 had 4.71 times higher risk of developing viral meningitis; and individuals in cluster 8 had 4.95 times higher risk of developing precocious puberty. In contrast, individuals in cluster 10 had 3.57 times higher risk of developing developmental growth delay (Fig. 3f).
Furthermore, analysis of developmental clock-derived age differences in children <18 years showed significant BA deceleration in growth-inhibiting conditions (delayed puberty, growth hormone deficiency and developmental delay) compared to healthy controls. Conversely, growth-promoting disorders (precocious puberty, gigantism and overgrowth syndromes) exhibited marked BA acceleration, demonstrating that our developmental clock captures physiologically meaningful growth variations (Extended Data Fig. 4).
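To relate the fold-risks quoted for the pediatric clusters to the log2 scale used in the Fig. 3 heatmaps, the short sketch below converts a hazard ratio to log2(HR) and applies the truncation at an absolute value of 2 mentioned in the figure legend. It assumes the quoted fold-risks are hazard ratios; the function name is hypothetical.

```python
import math

def truncated_log2_hr(hr, limit=2.0):
    """log2 of a hazard ratio, clipped to [-limit, +limit] for display."""
    value = math.log2(hr)
    return max(-limit, min(limit, value))

# Fold-risks quoted in the text for pediatric clusters (assumed to be HRs).
quoted = {
    "cluster 14, pituitary hyperfunction": 15.36,
    "cluster 12, hernia": 10.13,
    "cluster 10, developmental growth delay": 3.57,
}
for name, hr in quoted.items():
    print(f"{name}: log2(HR) = {math.log2(hr):.2f}, "
          f"plotted as {truncated_log2_hr(hr):.2f}")
```

Because any HR above 4 saturates the color scale under this truncation, the heatmap shows which cluster and disease pairs carry elevated risk but not how large the largest effects are; those are given numerically in the text.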

In parallel, within clusters 15–64, individuals in cluster 20 had a more than 30 times increased risk of vascular-related disorders, including hypotension (9.03 times) and renal failure (37.70 times) (Fig. 3g). Similarly, diabetes showed increased HRs for incidence by 3.75, 3.59 and 3.00 times in clusters 16, 52 and 20, respectively (Fig. 3g). These findings demonstrate that our model can effectively identify individuals at high risk of developing diseases based on their longitudinal EHR data. To further interpret these high-risk clusters, we examined their underlying clinical profiles. For instance, in the pediatric cohort, cluster 5, which showed a higher incidence of appendicitis, ulcerative colitis and other immune-related diseases (Fig. 3f), was correspondingly characterized by elevated immune and inflammatory markers, including interleukin-6 (IL-6), IL-8, IL-10, white blood cell count (WBC) and C-reactive protein (CRP) (Extended Data Fig. 3a). Similarly, in the adult cohort, cluster 44 was associated with a substantially higher incidence of cardiopulmonary diseases (Fig. 3g) and was defined by a corresponding clinical signature of elevated cardiac troponin T (cTnT) and serum potassium, alongside lower oxygen saturation (saO2) (Extended Data Fig. 3b).

Fine-tuning EHRFormer for individual disease risk predictions

The success of the EHRFormer-based BA prediction model in indicating current disease diagnosis and future disease predictions suggests that EHR data may contain information beyond aging, such as overall health status and disease progression. We therefore speculated that our EHRFormer model could be fine-tuned with disease labels for disease risk prediction. This approach would enhance the model’s ability to diagnose first occurrence disease status and predict future disease, enabling a quantitative assessment of its predictive capabilities (Fig. 1f). For each predicted disease, we stratified the population into high-, middle- and low-risk cohorts based on model-generated probability scores, then quantified cumulative risk profiles across age groups. This stratification approach enables age-specific risk assessment and potentially facilitates the identification of critical intervention windows within the disease trajectory (Fig. 1e).

We found that our EHRFormer-based disease prediction model demonstrated strong current diagnostic performance across multiple diseases. Specifically, it achieved a high prediction accuracy in cardiovascular diseases (atrial fibrillation area under the curve (AUC) = 0.95, coronary artery disease (CAD) AUC = 0.98, hypertension AUC = 0.95, ischemic stroke AUC = 0.97), neurological disorders (multiple sclerosis AUC = 0.96, Parkinson’s AUC = 0.94) and systemic conditions (osteoporosis AUC = 0.96, rheumatoid arthritis AUC = 0.96, diabetes AUC = 0.98) (Fig. 4a and Supplementary Table 4). Additionally, the model effectively predicted future risks of these diseases (AUC ≥ 0.8) (Fig. 4b and Supplementary Table 4). To further assess its capability for long-term risk stratification, we specifically evaluated its performance on five-year and ten-year incidence-prediction tasks. The model maintained strong predictive power, achieving AUCs ranging from 0.80 to 0.90 for five-year incidence (Extended Data Fig. 5a) and 0.81 to 0.91 for ten-year incidence across various diseases (Extended Data Fig. 5b). For comparison, we also evaluated EHRFormer against other models such as Recurrent Neural Network (RNN) and XGBoost. RNN, similar to EHRFormer, accepts sequential data and follows the autoregressive paradigm, but lacks an attention mechanism. XGBoost can also handle sequential data, yet it does not operate under the autoregressive framework. EHRFormer also demonstrated a superior performance compared to XGBoost and RNN in current disease diagnosis tasks across nine diseases. For example, in atrial fibrillation diagnosis, EHRFormer achieved an area under the receiver operating characteristic curve (AUROC) of 0.962, while XGBoost achieved 0.899 and RNN 0.907. In diabetes future prediction, EHRFormer’s AUROC was 0.911, versus 0.837 for XGBoost and 0.876 for RNN (Supplementary Table 5). 
To further validate its predictive abilities, we tested the model on an external validation cohort consisting of 219,485 longitudinal clinical visits from 86,257 individuals collected in an independent cohort (Table 1). We fine-tuned the model using each individual hospital’s EHR data in CHAI-Training and observed consistently good predictive performance in the CHAI-External cohort #5 (Extended Data Fig. 6). Our model also demonstrated robust performance when evaluated on UK Biobank EHRs, comparable to results observed in the CHAI-External cohort, highlighting its high generalizability and consistent effectiveness across diverse populations and healthcare institutions (Extended Data Figs. 7c,d and 8 and Supplementary Table 6). Benchmarking against baseline models and analyzing the correlation between the number of visits and prediction accuracy showed that EHRFormer markedly improved current and future disease prediction performance by integrating the whole life cycle of aging and disease information (Fig. 4c,d).
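The AUC and AUROC values reported in this section summarize how well predicted probabilities rank individuals with a disease above those without it. A minimal sketch of that definition is given below; the labels, scores and function name are hypothetical, and this is not the evaluation code used in the study.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = disease) and model probabilities; not study data.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.80, 0.90]
print(auroc(labels, scores))
```

The pairwise loop is quadratic in the number of visits and is meant only to make the definition explicit; a rank-based implementation would be preferable at the scale of the cohorts described here.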

Fig. 4: Performance of the EHRFormer-based disease predicting model and accumulated risk analysis in the CHAI-Internal cohort.

a, ROC curves of the EHRFormer-based disease predicting model in diagnosing different diseases. b, ROC curves of the EHRFormer-based disease prediction model in predicting future risk of different diseases. c, AUC changes along with visit number in diagnosing different diseases. d, AUC changes along with visit number in predicting different diseases in the future. e–j, Accumulated risk for various diseases based on a predictive model that categorized individuals into high-, middle- and low-risk groups for each given age under 20 years. As development progresses, the gap in accumulated risk between the high-, middle- and low-risk groups becomes more pronounced. k–s, Accumulated risk for various diseases based on a predictive model that categorized individuals into high-, middle- and low-risk groups for each given age over 40 years. As aging progresses, the gap in accumulated risk between the high-, middle- and low-risk groups becomes more pronounced.

We also applied the model for future disease risk predictions in both pediatric populations (using EHR data before 12 years of age) and adult populations (using EHR data >18 years of age). Using EHR data collected before the age of 12 years, we predicted future common pediatric disease risks, achieving AUCs ranging from 0.70 to 0.96. Similarly, using EHR data from individuals over 18 years, we predicted future adult diseases with comparable accuracy (Extended Data Fig. 9). Furthermore, we stratified individuals under 10 years old into three risk-level groups based on their predicted probabilities: the highest one-third as the high-risk group, the middle one-third as the medium-risk group and the bottom one-third as the low-risk group. Cumulative incidence plots provide a useful visual tool for comparing disease incidence over time among these groups, revealing large differences in future disease risk for various conditions, including obesity, meningitis, epilepsy, systemic lupus erythematosus (SLE), asthma and juvenile arthritis (Fig. 4e–j). Similarly, we applied the same stratification approach to individuals over 40 years of age, dividing them into three risk-level groups based on predicted probabilities. The cumulative incidence curves for these groups demonstrated substantial differences in future disease risk for atrial fibrillation, coronary artery disease, diabetes, hypertension, ischemic stroke, multiple sclerosis, osteoporosis, Parkinson’s disease and rheumatoid arthritis after age 40 (Fig. 4k–s). These results suggest that risk stratification based on early-life pediatric EHR data and early-adulthood EHR data can effectively reveal differential long-term disease risks.
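The one-third split described above can be sketched as a rank-based assignment. The probabilities and the function name are hypothetical, and the handling of ties and of group sizes not divisible by three is an assumption rather than the study's rule.

```python
def stratify_by_tertile(probabilities):
    """Assign each individual to a high/medium/low risk group by rank."""
    n = len(probabilities)
    # Indices sorted from highest to lowest predicted probability.
    order = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    groups = [None] * n
    for rank, idx in enumerate(order):
        if rank < n / 3:
            groups[idx] = "high"
        elif rank < 2 * n / 3:
            groups[idx] = "medium"
        else:
            groups[idx] = "low"
    return groups

# Hypothetical predicted probabilities; not study data.
probs = [0.05, 0.90, 0.40, 0.10, 0.75, 0.30]
groups = stratify_by_tertile(probs)
print(groups)
```

Because the groups are defined by rank rather than by a fixed probability threshold, each contains roughly the same number of individuals, so differences between the cumulative incidence curves reflect relative rather than absolute predicted risk.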

Discussion

This study highlights the potential of EHRFormer as a powerful tool for predicting BA across the full life cycle, providing novel insights into aging processes and their association with disease risks30,32,33. By leveraging a large longitudinal cohort of EHR data, our results reveal distinct biological aging clocks in the pediatric and adult phases and demonstrate how deviations from CA—captured as differences between it and predicted BA—are linked to disease susceptibility. These insights offer a unique opportunity to enhance our understanding of aging across the lifespan29,34.

Building on this foundation, our initial finding of a strong correlation between BA and CA using EHR data (Fig. 2a,c,e) led us to discover age-correlated changes in 184 clinical laboratory test results, vital sign indicators and basic metadata (Fig. 3b,d,e). These features were subsequently subject to a clustering analysis similar to that in single-cell analysis methods (Fig. 3a). The resultant 64 clusters displayed distinct age characteristics and disease features, which were then subjected to aging assessment and disease risk predictions (Figs. 3 and 4).

The ability to deconstruct a heterogeneous patient population into these clinically meaningful subgroups via unsupervised clustering is a key finding of our study, moving beyond simple disease labels. For example, our analysis revealed that cluster 5 was associated with a high risk for immune-related diseases such as appendicitis in the pediatric cohort and was characterized by elevated inflammatory markers such as IL-6 and CRP (Extended Data Fig. 3a). Given that IL-6 and CRP are canonical biomarkers of systemic inflammation cited in countless studies on pediatric inflammatory conditions35, our interpretation that cluster 5 represents a state of ‘heightened pediatric immune activity or dysregulation’ is strongly supported. Similarly, in the adult cohort, cluster 44, which predicted a high incidence of cardiopulmonary diseases, was defined by elevated cardiac troponin T and lower oxygen saturation (Extended Data Fig. 3b), identifying a subpopulation with subclinical or overt cardiorespiratory stress. Furthermore, clusters like cluster 20, with its strong association with renal failure and diabetes, likely represent a well-described metabolic syndrome or vasculopathy phenotype36. In the pediatric population, clusters successfully stratified individuals along a spectrum of endocrine and developmental trajectories, capturing conditions from precocious puberty (cluster 8) to developmental delay (cluster 10), reflecting known endocrine feedback loops that govern growth37. This mechanistic interpretation of clusters transforms them from abstract groupings into actionable clinical phenotypes that reflect underlying biological states. Notably, although core metabolic markers such as glucose and HbA1c were not top global predictors in our SHAP analysis, their predictive importance was still considerable, likely because our full life cycle model captures metabolic health through a complex interplay of correlated longitudinal markers rather than single-point indicators.

Our work also fits into a broader landscape of foundation models developed for healthcare. Models such as OMICmAge38 rely on specialized and costly multi-omics data for aging clock construction. Other models like COMET29 leverage EHR data through supervised pretraining to enhance the analysis of separate omics datasets. In contrast, EHRFormer demonstrates strong predictive performance using only widely available, low-cost routine laboratory tests and EHR data, enhancing its potential for broad clinical translation, and employs large-scale self-supervised pretraining directly on longitudinal EHRs to learn deep, clinically relevant patient representations without the need for labeled data. Although models such as MILTON39 excel at integrating unstructured clinical text with structured EHR data, the unique strength of EHRFormer lies in its autoregressive architecture, specifically designed to capture the temporal dynamics and long-range dependencies within an individual’s full life cycle and project the information onto a latent space, facilitating the modeling of aging and age-related disease trajectories (Fig. 5). Therefore, EHRFormer carves a unique niche by focusing on deriving actionable, longitudinal health insights directly from routine clinical data.

Fig. 5: A latent space approach for modeling a full life cycle biological clock using the EHRFormer architecture.

a, The model takes high-dimensional, longitudinal EHR data as input, including structured clinical records, laboratory test results and vital signs. b, EHRFormer, a transformer-based foundation model, processes these data and projects them onto a low-dimensional latent space. This creates a comprehensive and robust ‘digital representation’ of an individual’s health status over time. This learned representation is then leveraged for two key downstream applications. c, Modeling biological aging across the full human life cycle, from infancy and childhood development to adult and geriatric aging. d, Predicting individual disease trajectories, enabling risk stratification, and identifying patient subgroups with distinct clinical paths.

Despite its strengths, our model has limitations, including the observational nature of our datasets and potential biases inherent in longitudinal cohorts. Nevertheless, our study underscores the effectiveness of EHRFormer as a virtual representation technology capable of capturing critical health information and providing a novel framework for leveraging widely available EHR data28. The strong predictive performance of our representation-based aging clock highlights its potential for future applications in aging research40,41,42,43,44. Although traditional aging clocks estimate BA based on specific biomarkers, EHRFormer extends this capability by integrating diverse data sources, offering a dynamic and holistic approach to aging analysis10,45,46. By continuously updating with new information, EHRFormer transforms aging clocks from static estimators into adaptive, real-time systems47,48,49,50,51. Looking ahead, incorporating wearable devices, cloud medical records and environmental sensors can enable aging clocks to use the most current data, improving their adaptability and accuracy32,52,53,54. The EHRFormer-based clock establishes a robust framework for advancing personalized healthcare strategies, promoting healthy aging, facilitating timely interventions and mitigating aging-related decline.

The findings from this study suggest that our full lifespan aging clock, EHRFormer, offers greater accuracy in predicting disease risk compared to CA alone. The integration of longitudinal EHR data into biological aging models holds the potential to revolutionize our understanding of aging and its relationship with disease. These insights can drive the development of more precise aging biomarkers, enable prompt disease detection, and guide personalized treatments tailored to unique aging trajectories in diverse populations.

Methods

Study populations

The China Health Aging Investigation (CHAI), as a project of the International Consortium of Digital Twin in Medicine30, is an ongoing study using EHRs to predict patients’ BA and assess individual disease risks10,55,56,57. Data for this study were sourced from several hospitals in the CHAI project. Cohort #1 (The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China), cohort #2 (The Second Affiliated Hospital of Wenzhou Medical University, Wenzhou, China), cohort #4 (Dazhou People Hospital, Sichuan, China) and cohort #5 (Nanfang Hospital, Southern Medical University, Guangzhou, China and the PLA General Hospital, Beijing, China) are major tertiary hospitals offering comprehensive adult services, whereas cohort #3 (Women and Children’s Center of the PLA General Hospital and Women and Children’s Center of the Second Affiliated Hospital of Wenzhou Medical University, China) comprises major regional referral hospitals with primary services focused on women and children’s health and diseases. Our analysis included 24,633,025 longitudinal clinical visits from the EHR data of 9,680,764 patients. Additionally, longitudinal EHR data from cohort #5 and the UK Biobank were utilized as two external validation cohorts. Data were collected on biological sex. Ethics Committee approvals were obtained in all institutions. The study was registered at ClinicalTrials.gov (NCT06791486). The work was conducted in compliance with the Chinese CDC policy on reportable infectious diseases and the Chinese Health and Quarantine Law, in compliance with patient privacy regulations in China, and was adherent to the tenets of the Declaration of Helsinki. For the purposes of training our biological clock, ‘healthy’ individuals were defined as participants who had no recorded disease diagnoses within their EHRs at the time of their clinical visits. This approach was important for establishing a baseline model of a normal pediatric development clock and an adult aging clock, against which BA deviations in individuals with specific diseases could be precisely assessed.

Data representation

We structured the EHR data as chronological sequences of clinical visits for each patient. Each patient’s longitudinal clinical record is represented as a time-ordered sequence \(S=\{({X}_{0},\,{T}_{0}),\,({X}_{1},\,{T}_{1}),\,\ldots ,\,({X}_{L},\,{T}_{L})\}\), where Xi denotes the vector of clinical variables (including continuous and categorical laboratory test results and clinical measurements) collected at the ith visit, Ti represents the time elapsed (in days) since the initial visit, with T0 = 0 by definition, and L is the number of visits for this patient. The continuous clinical variables were quantized according to the formula \(D(x)=\lfloor (x-{X}_{min})/({X}_{max}-{X}_{min})\times {d}_{{\rm{c}}{\rm{o}}{\rm{n}}{\rm{t}}}\rfloor\), where \(\lfloor \cdot \rfloor\) represents the floor function, Xmax and Xmin are the maximum and minimum values of feature x, respectively, and dcont is the number of discrete bins. This discretization resulted in integer values between 0 and dcont, with values exceeding the defined range truncated to the maximum boundary and missing values encoded as −1. This discretization strategy preserved the distributional characteristics of the original variables while enabling a unified representation of patient data. At each visit, features Xi were represented as a concatenation of categorical variables and discretized continuous variables: \({X}_{i}=[{X}_{{\rm{c}}{\rm{a}}{\rm{t}}};\,{X}_{{\rm{c}}{\rm{o}}{\rm{n}}{\rm{t}}}],\) where \({X}_{{\rm{c}}{\rm{a}}{\rm{t}}}\in {{\mathbb{N}}}^{L\times {N}_{{\rm{c}}{\rm{a}}{\rm{t}}}}\) and \({X}_{{\rm{c}}{\rm{o}}{\rm{n}}{\rm{t}}}\in {{\mathbb{N}}}^{L\times {N}_{{\rm{c}}{\rm{o}}{\rm{n}}{\rm{t}}}}\), with L denoting the number of clinical visits and Ncat and Ncont denoting the numbers of categorical and continuous features, respectively.
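The quantization and concatenation steps can be illustrated with a minimal Python sketch. It assumes the offset is taken from the feature minimum, so that bins fall between 0 and dcont as the text describes; clamping values below the minimum to bin 0 is also an assumption, and the function names `quantize` and `encode_visit` are hypothetical, not taken from the authors' code.

```python
import math

def quantize(value, x_min, x_max, d_cont):
    """Map a continuous value to an integer bin in [0, d_cont]; missing -> -1."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return -1  # missing-value code described in the text
    # Clamp out-of-range values to the boundaries before binning
    # (clamping at the lower boundary is an assumption).
    clipped = min(max(value, x_min), x_max)
    return math.floor((clipped - x_min) / (x_max - x_min) * d_cont)

def encode_visit(categorical, continuous, ranges, d_cont):
    """Build X_i = [X_cat; X_cont] for one visit (hypothetical helper)."""
    binned = [quantize(v, lo, hi, d_cont) for v, (lo, hi) in zip(continuous, ranges)]
    return list(categorical) + binned

visit = encode_visit([1, 0], [5.0, None, 12.0], [(0.0, 10.0)] * 3, d_cont=100)
print(visit)  # [1, 0, 50, -1, 100]
```

In this example the in-range value lands in bin 50, the missing value is coded as −1, and the out-of-range value is truncated to the top bin.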

EHRFormer architecture

EHRFormer is an encoder–decoder style transformer architecture specifically designed to process longitudinal EHR data. The model comprises three key components: an examination encoder, a temporal embedding and task-specific decoder heads.

EHRFormer architecture and examination encoder

After preprocessing each patient’s longitudinal EHR data through discretization and concatenation into a unified feature representation as \({X}_{i}=[{X}_{{\rm{c}}{\rm{a}}{\rm{t}}};\,{X}_{{\rm{c}}{\rm{o}}{\rm{n}}{\rm{t}}}]\), we implemented a visit-level encoding framework. Similar to BERT’s58 embedding approach, our embedding layer employed a dual representation strategy: discretized feature values were encoded using shared token embeddings to represent their magnitude, and separate type embeddings were assigned to each variable position to denote the specific clinical feature category. This complementary embedding method allowed the model to simultaneously capture both the value distributions and the semantic meaning of different clinical measurements. A designated special vector was reserved for missing examinations, enabling the model to differentiate between absent tests and actual clinical observations. To capture complex interdependencies between clinical variables, we applied a transformer-based architecture that processed these embedded features through multiple self-attention layers. This encoding process can be formalized as \({E}_{{\rm{v}}{\rm{i}}{\rm{s}}{\rm{i}}{\rm{t}}}=\text{Encoder}(\text{Embed}({X}_{i}))\), where Encoder is a Transformer encoder that generates a contextualized representation for each clinical visit.
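The dual embedding strategy can be pictured with a small, framework-free Python sketch. It assumes that the value (token) embedding and the type embedding are summed, as in BERT; the text does not state how the two are combined. The lookup tables here are random placeholders, and `make_table` and `embed_visit` are hypothetical names.

```python
import random

def make_table(n_rows, dim, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_rows)]

def embed_visit(visit, token_table, type_table, missing_vector):
    """Embed one visit: value embedding + type embedding per feature position."""
    embedded = []
    for position, value in enumerate(visit):
        # -1 marks a missing examination and maps to the reserved vector.
        value_vec = missing_vector if value == -1 else token_table[value]
        type_vec = type_table[position]  # identifies which clinical feature this is
        # Summation is an assumption borrowed from BERT-style embeddings.
        embedded.append([v + t for v, t in zip(value_vec, type_vec)])
    return embedded

rng = random.Random(0)
token_table = make_table(n_rows=101, dim=4, rng=rng)  # shared across features
type_table = make_table(n_rows=3, dim=4, rng=rng)     # one row per feature position
tokens = embed_visit([50, -1, 100], token_table, type_table, [0.0] * 4)
print(len(tokens), len(tokens[0]))  # 3 4
```

The resulting per-feature vectors are what a transformer encoder would then contextualize into Evisit.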

EHRFormer architecture, temporal embedding and decoder

To model disease progression and capture the longitudinal nature of patient trajectories, we implemented a temporal embedding that encodes the relative time between visits. From the examination encoder output, we retrieved a visit-level embedding, Evisit. We used days elapsed since the initial visit as a linear positional embedding, TimeEmbed(T), to enable the architecture to learn time-dependent patterns in longitudinal EHR data. To create a longitudinal patient-level representation, we passed visit embeddings Evisit augmented with time information through a Transformer decoder: Epatient = Decoder(Evisit + TimeEmbed(T)), where causal masking ensures unidirectional information flow in this autoregressive process.
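As an illustration of the two ingredients named here, the sketch below builds a causal attention mask and a linear time embedding in plain Python. It assumes that "linear" means an affine map of the elapsed days onto the embedding dimension; the weights are placeholders and the function names are hypothetical.

```python
def causal_mask(n_visits):
    """mask[i][j] is True when visit i may attend to visit j (j <= i)."""
    return [[j <= i for j in range(n_visits)] for i in range(n_visits)]

def linear_time_embed(days, weight, bias):
    """Affine map of elapsed days to an embedding vector (assumed form)."""
    return [w * days + b for w, b in zip(weight, bias)]

def add_time(visit_embeddings, days_since_first, weight, bias):
    """Compute E_visit + TimeEmbed(T) for every visit of one patient."""
    return [
        [e + t for e, t in zip(emb, linear_time_embed(days, weight, bias))]
        for emb, days in zip(visit_embeddings, days_since_first)
    ]

print(causal_mask(3))
# [[True, False, False], [True, True, False], [True, True, True]]
```

Each row of the mask shows that a visit can see itself and earlier visits only, which is what keeps the patient-level representation at visit i free of future information.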

EHRFormer architecture and task-specific decoders

Following the established patient-level longitudinal representation Epatient, we designed a task-specific decoder with separate pathways for discrete outcomes (for example, diagnosis prediction) and continuous measurements (for example, biomarker estimation and BA prediction). Each pathway applies a projection layer followed by ReLU activation, formalized as \({y}_{i}={\rm{R}}{\rm{e}}{\rm{L}}{\rm{U}}({W}_{i}^{{\rm{T}}}{E}_{{\rm{p}}{\rm{a}}{\rm{t}}{\rm{i}}{\rm{e}}{\rm{n}}{\rm{t}},\,i})\), where Epatient, i represents a patient’s digital representation derived from first to ith visit. Importantly, causal masking prevents information from future visits from influencing predictions at the ith visit, ensuring fairness by restricting the model to only information available in real clinical scenarios. This architecture also enables simultaneous handling of diverse clinical prediction tasks while facilitating knowledge transfer between related objectives through jointly optimized parameters.

Training procedures

Our training procedure consisted of two stages: pretraining and fine-tuning. During pretraining, we employed self-supervised learning on unlabeled longitudinal EHR data to develop robust clinical representations. The subsequent fine-tuning stage adapted these representations for specific prediction tasks. This approach leverages generalizable patterns from large-scale unlabeled data before specializing for downstream applications. Both stages utilize specialized loss functions and incorporate strategies to mitigate dataset-specific biases.

Controlling for missingness and cohort bias through adversarial methods

Missing values in EHRs can lead to incomplete or biased digital representations, as models may inadvertently learn to rely on missingness patterns rather than the true clinical meaning of the feature values. Drawing inspiration from adversarial learning in other domains, we implemented a missingness discriminator output head, which is trained to determine whether a specific feature value is missing or not. We implemented a gradient reversal layer (GRL) between the feature encoder and the missingness discriminator. During backpropagation, the GRL inverts the gradient, compelling the feature encoder to produce representations that are independent of the missingness status. This forces the encoder to focus on encoding the inherent clinical importance of the features rather than being influenced by whether a value is present or absent. The loss function for the missingness discrimination task is defined as \({{\mathcal{L}}}_{{\rm{m}}{\rm{i}}{\rm{s}}{\rm{s}}{\rm{i}}{\rm{n}}{\rm{g}}}=-\frac{1}{M}{\sum }_{j=1}^{M}{\sum }_{m=0}^{1}z_{j,\,m}{\rm{l}}{\rm{o}}{\rm{g}}{\hat{z}}_{j,\,m}\), where \(z_{j,\,m}\) is a binary indicator denoting whether the feature value of sample j has a missing status m (0 for present, 1 for missing), \(\hat{{z}}_{j,\,m}\) is the predicted probability, and M is the number of samples with potential missing values. Because the discriminator minimizes this loss while the GRL reverses its gradient for the encoder, the model is encouraged to learn missing-invariant representations. This approach enables the creation of more robust digital representations that can better generalize across datasets with different missing value patterns, ultimately improving the accuracy and reliability of clinical outcome predictions.
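The gradient reversal step can be written compactly. The following is a sketch of the usual GRL formulation rather than a formula stated in this section; the scaling factor \(\lambda\), the learning rate \(\eta\), the plain gradient-descent update, and the parameter symbols \({\theta }_{{\rm{enc}}}\) and \({\theta }_{{\rm{disc}}}\) are assumed notation. With h denoting the encoder output passed to the discriminator, the layer acts as the identity in the forward pass and negates the gradient in the backward pass:

$$R_{\lambda }(h)=h,\qquad \frac{\partial R_{\lambda }}{\partial h}=-\lambda I.$$

Under these assumptions, the two sets of parameters are updated in opposite directions with respect to the same loss:

$${\theta }_{{\rm{disc}}}\leftarrow {\theta }_{{\rm{disc}}}-\eta \frac{\partial {{\mathcal{L}}}_{{\rm{missing}}}}{\partial {\theta }_{{\rm{disc}}}},\qquad {\theta }_{{\rm{enc}}}\leftarrow {\theta }_{{\rm{enc}}}+\eta \lambda \frac{\partial {{\mathcal{L}}}_{{\rm{missing}}}}{\partial {\theta }_{{\rm{enc}}}}.$$

The discriminator therefore improves at detecting missingness, while the encoder is pushed toward representations from which missingness cannot be recovered.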

Cohort bias, also known as a batch effect, is a substantial challenge in multi-center data studies. Clinical data collected from different hospitals often exhibit systematic variations due to differences in patient demographics, practice patterns and measurement protocols, potentially leading to biased models. Similar to the missingness discriminator, we designed a cohort discriminator that aims to identify the cohort label of each sample, while the encoder is forced to suppress cohort-specific information. The cohort classification loss is formulated as \({{\mathcal{L}}}_{{\rm{c}}{\rm{o}}{\rm{h}}{\rm{o}}{\rm{r}}{\rm{t}}}=-\frac{1}{N}{\sum }_{i=1}^{N}{\sum }_{d=1}^{D}{y}_{i,\,d}{\rm{l}}{\rm{o}}{\rm{g}}(\,\hat{{y}}_{i,\,d})\), where \({y}_{i,\,d}\) is a binary indicator of whether sample i belongs to domain d, \(\hat{{y}}_{i,\,d}\) is the predicted probability, N is the number of samples, and D is the number of domains (clinical cohorts). This approach encourages the model to learn cohort-invariant representations that generalize across healthcare settings while maintaining predictive performance for clinical outcomes.

Pretraining step

We employed a self-supervised pretraining approach with multiple complementary objectives to enable our model to learn comprehensive representations of EHR data. We randomly masked 50% of the valid test results in the current examination as input, and trained the model to predict the 50% masked values in the current examination and the next examination. The masked language modeling loss function is defined as \({{\mathcal{L}}}_{{\rm{M}}{\rm{L}}{\rm{M}}}=\frac{1}{|{\mathcal{M}}|}{\sum }_{i=0}^{N}{\sum }_{j\in {{\mathcal{M}}}_{i}}{{\mathcal{L}}}_{{\rm{M}}{\rm{S}}{\rm{E}}}(\hat{{v}}_{i,\,j},{v}_{i,\,j})\), where \({{\mathcal{M}}}_{i}\) is the set of masked indices in examination event si and the next examination after si, \(|{\mathcal{M}}|\) is the total number of masked tokens across all examination events, vi, j is the true value of the jth test in examination si and the next examination after si, and \(\hat{{v}}_{i,\,j}\) is the predicted value. To quantify uncertainty in the clinical data, we incorporated a variational framework with evidence lower bound (ELBO) maximization as \({{\mathcal{L}}}_{{\rm{E}}{\rm{L}}{\rm{B}}{\rm{O}}}={E}_{{q}_{\phi }}[{\rm{l}}{\rm{o}}{\rm{g}}{p}_{\theta }(x|z)]-{D}_{{\rm{K}}{\rm{L}}}({q}_{\phi }(z|x)||p(z))\), where p(z) is the prior over the latent variable, balancing reconstruction fidelity against latent space regularization. Additionally, we incorporated the adversarial losses \({{\mathcal{L}}}_{{\rm{c}}{\rm{o}}{\rm{h}}{\rm{o}}{\rm{r}}{\rm{t}}}\) and \({{\mathcal{L}}}_{{\rm{m}}{\rm{i}}{\rm{s}}{\rm{s}}{\rm{i}}{\rm{n}}{\rm{g}}}\) to promote cohort-invariant and missing-invariant representations. Finally, for the age regression task, we trained the model to predict patients’ ages at each examination event using only clinical measurements (with all age-related information explicitly removed from inputs) to assess biological aging patterns, using the mean squared error (MSE) loss function defined as \({{\mathcal{L}}}_{{\rm{a}}{\rm{g}}{\rm{e}}}=\frac{1}{N}{\sum }_{i=0}^{N}{(\hat{{a}}_{i}-{a}_{i})}^{2}\), where \(\hat{{a}}_{i}\) is the predicted age at the examination event si, and ai is the true age. Only healthy individuals were included in the loss calculation for the age-prediction task, enabling EHRFormer to construct a biological clock reflecting normal aging patterns. This approach allows precise assessment of BA deviations between diseased individuals and their healthy peers in subsequent CA–BA differential analyses. The final pretraining objective combined these components with appropriate weighting coefficients: \({{\mathcal{L}}}_{{\rm{p}}{\rm{r}}{\rm{e}}{\rm{t}}{\rm{r}}{\rm{a}}{\rm{i}}{\rm{n}}}={\alpha }_{1}{{\mathcal{L}}}_{{\rm{M}}{\rm{L}}{\rm{M}}}+{\alpha}_{2}{{\mathcal{L}}}_{{\rm{E}}{\rm{L}}{\rm{B}}{\rm{O}}}-{\alpha }_{3}{{\mathcal{L}}}_{{\rm{c}}{\rm{o}}{\rm{h}}{\rm{o}}{\rm{r}}{\rm{t}}}-{\alpha }_{4}{{\mathcal{L}}}_{{\rm{m}}{\rm{i}}{\rm{s}}{\rm{s}}{\rm{i}}{\rm{n}}{\rm{g}}}+{\alpha }_{5}{{\mathcal{L}}}_{{\rm{a}}{\rm{g}}{\rm{e}}}\), where the negative sign reflects the gradient reversal mechanism.
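Two of the pretraining terms lend themselves to a short illustration: the MSE restricted to masked positions and the age loss restricted to healthy individuals. The sketch below is a plain-Python approximation for a single examination and a small batch; the function names and the convention of returning 0.0 when nothing contributes are assumptions.

```python
def masked_mse(predicted, target, masked_indices):
    """Mean squared error over the masked test positions only."""
    if not masked_indices:
        return 0.0  # assumed convention when nothing is masked
    total = sum((predicted[j] - target[j]) ** 2 for j in masked_indices)
    return total / len(masked_indices)

def healthy_age_mse(predicted_ages, true_ages, is_healthy):
    """Age-regression MSE computed on healthy individuals only."""
    errors = [
        (pred - true) ** 2
        for pred, true, healthy in zip(predicted_ages, true_ages, is_healthy)
        if healthy
    ]
    return sum(errors) / len(errors) if errors else 0.0

print(masked_mse([1.0, 2.0, 3.0], [1.0, 4.0, 0.0], masked_indices=[1]))  # 4.0
print(healthy_age_mse([30.0, 50.0], [32.0, 40.0], [True, False]))        # 4.0
```

In both calls only the selected entries contribute: the unmasked position with a large error and the non-healthy individual are ignored.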

Fine-tuning step for disease state prediction tasks

We implemented three distinct disease prediction tasks that reflect different clinical scenarios: first occurrence disease diagnosis, future disease prediction and fixed-time-window future prediction.

For first occurrence disease diagnosis, we trained the model to identify the first occurrence of specific diseases, excluding subsequent visits after initial diagnosis to capture true onset patterns rather than disease management. Formally, for a patient with a longitudinal sequence S with length L, and where li, d represents whether this patient was diagnosed as positive for disease d at the ith visit, the label ci, d of the first occurrence diagnosis task is defined as

$$\begin{array}{l}{c}_{i,\,d}=\left\{\begin{array}{l}1,\,{\mathrm{if}}\,{l}_{i,\,d}=1\,{\mathrm{and}}\,{l}_{j,\,d}=0\,{\mathrm{for}}\,{\mathrm{all}}\,j < i\\ 0,\,{\mathrm{if}}\,{{l}_{i,\,d}=0\,{\mathrm{and}}\,l}_{j,\,d}=0\,{\mathrm{for}}\,{\mathrm{all}}\,j\in \{0,\,1,\,\ldots ,\,L\}.\end{array}\right.\end{array}$$
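The labeling rule above can be sketched for a single patient and a single disease as follows. Visits that satisfy neither case (for example, visits before or after the onset in a patient who is eventually diagnosed) are returned as `None`; treating them as excluded is an assumption consistent with the text, and the function name is hypothetical.

```python
def first_occurrence_labels(diagnoses):
    """diagnoses: list of 0/1 flags l_{i,d}, one per visit, in time order."""
    never_positive = not any(diagnoses)
    labels = []
    seen_positive = False
    for flag in diagnoses:
        if flag == 1 and not seen_positive:
            labels.append(1)      # first visit with a positive diagnosis
        elif flag == 0 and never_positive:
            labels.append(0)      # patient is never diagnosed with d
        else:
            labels.append(None)   # undefined by the formula: excluded (assumption)
        seen_positive = seen_positive or flag == 1
    return labels

print(first_occurrence_labels([0, 0, 1, 1]))  # [None, None, 1, None]
print(first_occurrence_labels([0, 0, 0]))     # [0, 0, 0]
```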

For the future disease prediction task, we developed a labeling strategy to identify patients at risk before disease manifestation, using each visit as a dynamic baseline for prediction. Formally, for a patient with longitudinal sequence S with length L, the label fi, d of the future prediction task for disease d at the ith visit is defined as

$$\begin{array}{l}{f}_{i,\,d}=\left\{\begin{array}{l}1,\,\mathrm{if}\,l_{j,\,d}=0\,\mathrm{for}\,j\le i\,\mathrm{and}\,{\rm{\exists }}\,k > i\,\mathrm{such}\,\mathrm{that}\,{l}_{k,\,d}=1\\ 0,\,\mathrm{if}\,{l}_{i,\,d}=0\,\mathrm{and}\,l_{j,\,d}=0\,\mathrm{for}\,\mathrm{all}\,j\in \{0,\,1,\,\ldots ,\,L\}.\end{array}\right.\end{array}$$

The third prediction task assesses t-year disease incidence. This is achieved by predicting over a fixed look-ahead window (t = 5 or 10 years) from each potential per-visit baseline. To ensure the validity of our labels, we implemented rigorous censoring for any observation with insufficient follow-up time. Formally, for a patient with recorded age A(i), the rolling t-year window prediction label \({w}_{i,\,d}^{t}\) of disease d at visit i is defined as

$$\begin{array}{l}{w}_{i,\,d}^{t}=\left\{\begin{array}{l}1,\,\mathrm{if}\,l_{j,\,d}=0\,\mathrm{for}\,j\le i\,\mathrm{and}\,{\rm{\exists }}\,k > i\,\mathrm{such}\,\mathrm{that}\,{l}_{k,\,d}=1\,\mathrm{and}\,A(k)-A(i)\le t\\ 0,\,\mathrm{if}\,{{l}_{i,\,d}=0\,\mathrm{and}\,l}_{j,\,d}=0\,\mathrm{for}\,\mathrm{all}\,j\in \{0,\,1,\,\ldots ,\,L\}\,\mathrm{and}\,A(L)-A(i)\ge t.\end{array}\right.\end{array}$$
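A plain-Python sketch of the windowed labeling rule, for one patient and one disease, is given below. Visits that are censored or otherwise fall outside both cases are returned as `None`, which is an assumed encoding; the function name is hypothetical. Dropping the two age conditions gives the open-ended future prediction label defined earlier.

```python
def window_labels(diagnoses, ages, horizon_years):
    """Rolling t-year labels w^t_{i,d}; ages[i] is the recorded age A(i)."""
    never_positive = not any(diagnoses)
    last_age = ages[-1] if ages else None
    labels = []
    for i in range(len(diagnoses)):
        clean_so_far = not any(diagnoses[: i + 1])  # l_{j,d} = 0 for all j <= i
        onset_in_window = any(
            diagnoses[k] == 1 and ages[k] - ages[i] <= horizon_years
            for k in range(i + 1, len(diagnoses))
        )
        if clean_so_far and onset_in_window:
            labels.append(1)
        elif never_positive and last_age - ages[i] >= horizon_years:
            labels.append(0)      # disease-free with at least t years of follow-up
        else:
            labels.append(None)   # censored or undefined (assumed encoding)
    return labels

print(window_labels([0, 0, 1], [50, 58, 61], horizon_years=5))  # [None, 1, None]
print(window_labels([0, 0, 0], [40, 43, 46], horizon_years=5))  # [0, None, None]
```

In the first example the onset at age 61 lies outside the 5-year window of the visit at age 50 but inside that of the visit at age 58. In the second, only the first visit has enough follow-up to count as a negative.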

The loss function for each task is \({\mathcal{L}}=\frac{1}{N}{\sum}_{i=0}^{N}{\sum }_{d=1}^{D}{{\mathcal{L}}}_{{\rm{B}}{\rm{C}}{\rm{E}}}({\hat{y}}_{{i},\,{d}},\,{y}_{{i},\,{d}})\), where \(\hat{{y}}_{i,\,d}\) is the predicted probability of disease d on one of the above three labels, D is the total number of diseases considered, and \({{\mathcal{L}}}_{{\rm{B}}{\rm{C}}{\rm{E}}}\) is the binary cross-entropy loss.
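For concreteness, the loss above can be evaluated in a few lines of plain Python. This sketch sums the binary cross-entropy over diseases and averages over the N visits, following the formula; the clipping constant `eps` is an added numerical safeguard and the function names are hypothetical.

```python
import math

def bce(probability, label, eps=1e-12):
    """Binary cross-entropy for one prediction; eps guards against log(0)."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def multi_disease_bce(predictions, labels):
    """predictions, labels: one list per visit, one entry per disease."""
    n_visits = len(predictions)
    total = sum(
        bce(p, y)
        for visit_preds, visit_labels in zip(predictions, labels)
        for p, y in zip(visit_preds, visit_labels)
    )
    return total / n_visits

loss = multi_disease_bce([[0.5, 0.5]], [[1, 0]])
print(round(loss, 4))  # 1.3863
```

With a single visit, two diseases and uninformative predictions of 0.5, the loss equals 2 ln 2.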

Implementation details

We implemented our EHRFormer architecture using a combination of transformer models. Specifically, we utilized a 24-layer transformer encoder with a hidden dimension of 1,024 as the examination encoder to process individual examination events, and a 12-layer autoregressive transformer decoder with a hidden dimension of 768 as the temporal encoder to capture longitudinal patterns across the sequence of examinations. This design leverages the attention capabilities of the multi-headed self-attention mechanism for understanding relationships between clinical measurements within each examination, while employing the causal masked attention mechanism to model the temporal progression of patient health.

The model was implemented using PyTorch and trained using a two-stage approach. For the pretraining phase, we trained the model for 200 epochs using the Adam optimizer with a learning rate of \({10}^{-3}\) and a weight decay of \({10}^{-6}\). The subsequent fine-tuning phase for the downstream tasks was conducted for 100 epochs using the Adam optimizer with a reduced learning rate of \({10}^{-4}\), while maintaining the same weight decay of \({10}^{-6}\).

For both pretraining and fine-tuning steps, we utilized subsets (CHAI-Training and CHAI-Tuning) from the CHAI-Main dataset. Internal validation results were reported using CHAI-Internal, and external validation was conducted using two independent cohorts: CHAI-External-1 and UKB-External. To ensure methodological rigor, we implemented a patient-level non-overlapping partitioning strategy, randomly dividing the CHAI-Main dataset in an 8:1:1 ratio to generate the CHAI-Training, CHAI-Tuning and CHAI-Internal subsets, respectively. The healthy participants in CHAI-Main constituted the CHAI-Healthy Controls cohort used for BA calculation and age difference analysis. The UKB-External dataset comprised all available samples from the UK Biobank cohort.
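A patient-level, non-overlapping 8:1:1 partition can be sketched as follows. The split is made over unique patient identifiers so that all visits of one patient fall in the same subset; the seed, the rounding rule and the function name `split_patients` are assumptions for illustration.

```python
import random

def split_patients(patient_ids, seed=0):
    """Randomly assign unique patients to training, tuning and internal sets (8:1:1)."""
    unique_ids = sorted(set(patient_ids))  # one entry per patient, not per visit
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_total = len(unique_ids)
    n_train = n_total * 8 // 10
    n_tune = n_total // 10
    return {
        "CHAI-Training": set(unique_ids[:n_train]),
        "CHAI-Tuning": set(unique_ids[n_train:n_train + n_tune]),
        "CHAI-Internal": set(unique_ids[n_train + n_tune:]),
    }

# Visit-level records would then be routed by looking up each visit's patient ID.
parts = split_patients(range(100))
print({name: len(ids) for name, ids in parts.items()})
# {'CHAI-Training': 80, 'CHAI-Tuning': 10, 'CHAI-Internal': 10}
```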

Age difference calculation

To quantify biological aging deviations, we calculated standardized age differences for each individual using our aging model. First, we predicted BA Ab using the pretrained EHRFormer model on healthy participants in CHAI-Healthy Controls. We then modeled the nonlinear relationship between predicted BA Ab and CA Ac using locally weighted scatterplot smoothing (LOWESS) with a bandwidth parameter of 2/3 via the statsmodels Python package (version 0.14.4) using EHR data from healthy individuals. The resulting function f(Ac) represents the expected BA for a given CA based on healthy population trends. For each individual i, we calculated the raw age difference as \({{\varDelta}}_{i}={A}_{{b},\,{i}}-f({A}_{{c},\,{i}})\), representing a deviation from healthy peers with the same CA. Finally, we computed standardized age differences as \({z}_{i}={{\varDelta}}_{i}/\sigma\), where σ represents the s.d. of raw age differences within the model.
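The last two steps of this calculation can be sketched in plain Python. The sketch takes the healthy-population trend f as a callable `expected_ba`, which here is a hypothetical stand-in for the fitted LOWESS curve, and it assumes that σ is the population standard deviation of the raw differences in the supplied sample.

```python
import statistics

def standardized_age_differences(predicted_ba, chronological_age, expected_ba):
    """Return raw differences A_b - f(A_c) and their standardized values."""
    raw = [ab - expected_ba(ac) for ab, ac in zip(predicted_ba, chronological_age)]
    sigma = statistics.pstdev(raw)  # population s.d. (assumed choice)
    z_scores = [delta / sigma for delta in raw]
    return raw, z_scores

# Toy trend: expected BA equals CA (a placeholder for the LOWESS fit).
raw, z = standardized_age_differences([32, 38, 50], [30, 40, 50], lambda ac: ac)
print(raw)                        # [2, -2, 0]
print([round(v, 3) for v in z])   # [1.225, -1.225, 0.0]
```

A positive value indicates an individual whose predicted BA exceeds that of healthy peers of the same CA, and a negative value the reverse.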

Visualization of latent space and disease risk analysis

Visualization and clustering of EHRFormer-derived latent vectors were performed by first extracting the laboratory and vital sign features, followed by PCA with 50 components. The resulting embeddings were processed using a neighbor graph approach (15 neighbors, Euclidean metric) and visualized with UMAP (parameters: min_dist=0.3, spread=1.0, 2 components, spectral initialization). Cluster identification was performed using the Leiden community detection algorithm, revealing distinct patient groups that correspond predominantly to pediatric and adult populations. For disease visualization, prevalence and incidence proportions were calculated per cluster. Prevalence was defined as the proportion of individuals with pre-existing disease at baseline (first hospital encounter). Incidence was calculated as the proportion of initially disease-free individuals who developed the condition during the follow-up period (five years from first admission). Each data point was colored according to its corresponding cluster-specific disease prevalence or incidence proportion, providing a visual representation of disease burden across identified patient subgroups. PCA, UMAP and projection visualizations were constructed using the Scanpy59 Python package (version 1.10.4).
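The per-cluster prevalence and incidence proportions defined above can be computed as in the following plain-Python sketch. The record layout (a dict with `cluster`, `prevalent` and `incident` fields) and the function name are hypothetical; `incident` is assumed to have already been restricted to the five-year follow-up window.

```python
from collections import defaultdict

def cluster_proportions(records):
    """Prevalence at baseline and incidence among initially disease-free patients."""
    counts = defaultdict(lambda: {"n": 0, "prevalent": 0, "at_risk": 0, "incident": 0})
    for record in records:
        c = counts[record["cluster"]]
        c["n"] += 1
        if record["prevalent"]:
            c["prevalent"] += 1
        else:
            c["at_risk"] += 1  # disease-free at first encounter
            if record["incident"]:
                c["incident"] += 1
    return {
        cluster: {
            "prevalence": c["prevalent"] / c["n"],
            "incidence": c["incident"] / c["at_risk"] if c["at_risk"] else None,
        }
        for cluster, c in counts.items()
    }

records = [
    {"cluster": "A", "prevalent": True, "incident": False},
    {"cluster": "A", "prevalent": False, "incident": True},
    {"cluster": "A", "prevalent": False, "incident": False},
    {"cluster": "A", "prevalent": False, "incident": False},
]
print(cluster_proportions(records)["A"]["prevalence"])  # 0.25
```

Each point in the embedding would then be colored by the proportion of the cluster it belongs to.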

Disease–cluster associations were quantified using adjusted log2HRs, calculated for each cluster based on the cluster of each patient at their first clinical visit in reference to the remainder of the study population using Cox proportional hazards models. These models incorporated multivariate adjustment for patient demographics (age and sex), smoking, alcohol history and hospital to minimize potential confounding. These associations were visualized using a heatmap with log2HR values truncated at a maximum of 2 to enhance interpretability while preserving meaningful signal contrast. HRs were calculated using the lifelines Python package (version 0.30.0).

Statistical analysis

We evaluated the performance of regression models for continuous value predictions using MAE, R² and PCC. Binary classification models were evaluated using receiver operating characteristic (ROC) curves showing sensitivity versus 1–specificity, with the AUC reported along with 95% confidence intervals. AUCs were calculated using the scikit-learn package (version 1.6.1). Cumulative incidence curves for deciles of disease risk score were calculated using KaplanMeierFitter from the lifelines Python package (version 0.30.0). We plotted cumulative events against each visit age on the x axis. Incidence rates for subsequent records after each given visit age are shown on the y axis.
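The three regression metrics can be written out in plain Python as below, assuming the usual definitions of MAE (mean absolute error), R² (coefficient of determination) and PCC (Pearson correlation coefficient). This is an illustrative sketch, not the evaluation code used in the study.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson(y_true, y_pred):
    """Pearson correlation coefficient."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

ages = [1, 2, 3, 4]
shifted = [2, 3, 4, 5]  # predictions off by a constant one year
print(mae(ages, shifted), round(r_squared(ages, shifted), 3), round(pearson(ages, shifted), 3))
# 1.0 0.2 1.0
```

The example shows why the metrics are reported together: a constant offset leaves the correlation perfect while lowering R² and raising MAE.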

Reporting Summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.