- Methodology
- Open access
A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores
BioData Mining volume 18, Article number: 44 (2025)
Abstract
Genotype to phenotype prediction is a central problem in biology and medicine. Machine learning is a natural tool to address this problem. However, a person’s genotype is usually represented by a few million single-nucleotide polymorphisms and most datasets only have a few thousand patients. Thus, this problem typically has many more predictors than the number of samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine the efficacy of a compact genotype representation, which employs a limited number of predictors, in predicting a person’s phenotype through the application of machine learning. We characterized a person’s genotype using chromosome-scale length variation, a measure that is computed as the average value of reported log R ratios across a portion of a chromosome. We computed these numbers from data collected by the NIH All of Us program. We used the AutoML function (h2o.ai) in binary classification mode to identify the best models to differentiate between male/female, Black/white, white/Asian, and Black/Asian. We also used the AutoML function in regression mode to predict the height of people based on their age and genotype. Our results showed that we could effectively classify a person, using only information from chromosomes 1–22, as Male/Female (AUC = 0.9988 ± 0.0001), White/Black (AUC = 0.970 ± 0.002), Asian/White (AUC = 0.877 ± 0.002), and Black/Asian (AUC = 0.966 ± 0.002). This approach also effectively predicted height. In conclusion, we have shown that this compact representation of a person’s genotype, along with machine learning, can effectively predict a person’s phenotype.
Introduction
Genotype to phenotype prediction is a central problem in biology and medicine [1]. The challenge is to understand how a person’s genome translates into their observable traits. Traditionally this problem has been tackled by examining how individual genes and their resulting proteins function, and how changes in these genes affect the proteins and ultimately a person’s observable traits. However, this approach has limitations, as it does not consider how combinations of genetic variations can work together to influence traits [2].
Machine learning is a natural tool to address this problem. Machine learning algorithms are very good at finding complex patterns that differentiate between two groups. However, this genotype to phenotype problem is a “large-p, small-n”, or \(p \gg n\), problem, where p is the number of predictors and n is the number of samples (patients) [3]. The human genome contains \(3 \times 10^{9}\) base pairs, and each one could be considered a predictor. Even if we only consider the predictors that differ in humans, the single nucleotide polymorphisms (SNPs), there are still \(10^{6}\) relevant SNPs [4]. Depending on the phenotype, or trait, being studied and the dataset available, a typical n is \(10^{3}\) to \(10^{4}\), the number of human patients with that trait that contribute genetic data to a dataset.
In this paper, we outline a different approach to represent a person’s genotype by a much smaller number of predictors, p. We compute the average value of the log R ratio, a parameter that is measured for each SNP location in microarray chips. We average these log R ratio values across large portions of each chromosome. This log R ratio can be thought of as the copy number at each SNP location and the number we compute is a measure of the length of the corresponding chromosomal region. We have previously applied this method to predicting a number of different diseases and conditions [5,6,7,8].
Background
Shortly after the first drafts of the human genome were published [9, 10] attention turned to studying differences between humans. The most convenient way to characterize the differences between humans was to catalog all the single nucleotide polymorphisms (SNPs) found by sequencing a diverse collection of humans [11,12,13,14]. Today, more than a million specific sites within the 3 billion base pair human genome are known to contain significant differences across the human species. Experimental techniques exist that can rapidly and inexpensively characterize these SNPs [15].
The availability of this new technique led to the popularization of genome wide association studies (GWAS) [16,17,18,19]. The goal of GWAS is to identify any significant genetic difference that is associated with a disease or phenotype. The experimental design of a GWAS is straightforward. First, two groups are identified: one that exhibits the trait being studied (e.g., a particular form of breast cancer) and a second that does not (the control group). All members of both groups are genotyped—the values of all the SNPs are identified. A statistical analysis is then performed to determine whether any SNP value occurs significantly more in the disease group than the control group. This analysis treats each SNP independently of all others.
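The per-SNP test described above can be illustrated with a minimal sketch. It assumes a Pearson chi-square test (1 degree of freedom) on a 2x2 table of allele counts, which is one common choice and not necessarily the test used in any cited study. The function name and the example counts are hypothetical.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are alternate and reference alleles.
    Returns (chi-square statistic, p-value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for two SNPs, each tested independently.
snp_counts = {
    "snp_1": (60, 40, 40, 60),   # allele enriched in cases
    "snp_2": (50, 50, 50, 50),   # no difference between groups
}
results = {name: allele_chi_square(*counts) for name, counts in snp_counts.items()}
```

Because every SNP is tested separately, a real analysis also has to correct for the very large number of tests performed.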
The success of GWAS led to the advent of polygenic risk scores [20,21,22]. Although many human diseases exist that can be attributed to a single mutation in the inherited genome, most of these are rather rare. Most common diseases—cancers, heart disease, and different forms of mental disease—cannot be attributed to any single mutation. The goal of a polygenic risk score is to predict whether a person will develop these complex diseases based upon their inherited genetic profile, as determined by SNP genotyping. Polygenic risk scores have been computed for many different conditions [23, 24]. The computation generally selects a number of different SNPs to include in the score and then selects a weighting function. A score is computed based on these SNP values for each person. Based upon this score, one can compute a receiver operating characteristic curve (ROC curve) and an associated area under the curve (AUC).
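As a minimal sketch of the score computation, the example below assumes the simplest weighting function: a linear sum of per-SNP weights multiplied by allele dosages (0, 1, or 2 copies of the effect allele). The weights, dosages, and names are hypothetical and are not taken from any published score.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages for one person.

    dosages: number of copies (0, 1, or 2) of the effect allele at each selected SNP.
    weights: one weight per selected SNP (assumed here to be linear effect sizes).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical weights for three selected SNPs.
weights = [0.12, -0.05, 0.30]

# Hypothetical genotypes for two people.
people = {
    "person_a": [0, 1, 2],
    "person_b": [2, 2, 0],
}

scores = {name: polygenic_score(d, weights) for name, d in people.items()}
```

Ranking people by this score and comparing the ranking against known case/control labels is what produces the ROC curve and its AUC.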
Machine learning has been applied to different aspects of the problem of determining polygenic risk scores. For instance, [25] used gradient boosted regression trees to select the optimal weight of SNPs to include in the score. To include non-linear effects between SNPs, [26] used XGBoost on a limited number of SNPs to compute a polygenic risk score and saw a substantial increase in effectiveness across a number of different traits. Also, [27] used a deep neural network and showed that it outperformed other machine learning algorithms and widely used algorithms when predicting breast cancer. In addition, machine learning algorithms have been applied to better understand the link between particular genes and traits, using methods like transcriptome-wide association studies [28,29,30] and other derivatives of that method [31].
One particular trait that is widely used to benchmark polygenic risk scores is height. A person’s adult height is known to be inherited and influenced by many different genetic loci [32, 33]. Height is also easily measured and available for everyone in a dataset. Polygenic risk scores for height have been computed using increasingly large sample sizes, from 130k in 2010 [33], 250k in 2014 [34], 700k in 2018 [35] to 5.4 million in 2022 [36]. As sample sizes have increased, the effect size for polygenic scores has also increased.
A fundamental challenge in applying statistical and machine learning methods to genomic data arises from the dimensionality of the problem. The human genome, as represented by common SNPs, contains a vast number of features (on the order of millions of SNPs). In contrast, even the largest GWAS and PRS studies typically involve sample sizes (i.e., the number of individuals) that are orders of magnitude smaller. This high-dimensionality poses a significant problem. Traditional machine learning algorithms are often prone to overfitting in such situations, meaning they learn the noise in the training data rather than the true underlying relationships. This can lead to poor generalization performance on new, unseen data. Therefore, methods that can effectively reduce dimensionality or regularize the model are crucial for successful application of machine learning to polygenic risk score prediction.
Methods
Dataset: NIH All of Us
We use data from the NIH All of Us research program [37]. The goal of All of Us is to compile a database that characterizes one million US residents who represent the diverse population of the US. The characterization includes demographic, medical and genetic information for each person who has volunteered to participate in this program. The data is anonymized and only available for analysis through Jupyter notebooks running on Google Cloud, known as the Researcher Workbench. Only summary data can be downloaded from the Researcher Workbench.
The All of Us Research Program obtains consent from all participants. Participants view videos describing the research program, what information will be gathered from each participant, and how the information will be used. Participants must sign a consent form to join the research program, and a second form to contribute DNA to the research program. In addition, a participant can opt out of the research program at any time, and their information will be removed from the dataset. Since the research presented in this paper only uses anonymized data, it is not considered human subject research and does not require approval by an Institutional Review Board.
The All of Us dataset is releasing its genetic data in phases [37]. The initial release, known as controlled tier V6, included DNA microarray data for 165,127 participants. The microarray data in controlled tier V6 contains measurements on 1,814,517 genetic variants for each of the 165,127 participants. The microarray data was collected with Illumina Global Diversity Arrays. This specific array is designed to provide optimal cross-population imputation coverage and enables the development of polygenic risk scores.
Data preprocessing
The goal of this work is to test how well one can predict a phenotype (height or self-declared race) from a person’s genetics. This requires three distinct steps:
1. Compute a set of numbers that represent the person’s genome. This is the novel aspect of this work. We compute a set of numbers that represent the mean log (copy number) over a chromosome or a major portion of a chromosome. We call these numbers chromosome-scale length variation.
2. Identify the phenotype of interest for each person. This is either a single categorical variable (self-identified race) or a number (height). We create a dataframe where this value is in one column and the set of numbers produced in step 1 is in the other columns.
3. Using standard machine learning protocols, test how well machine learning algorithms can predict the phenotype of interest (step 2) from the set of numbers that represent the person’s genome (step 1). The standard machine learning protocols include splitting the dataframe from step 2 into a test set and a training set, training the machine learning algorithm, then testing the resultant algorithm on the test set.
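The splitting part of step 3 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code run on the Researcher Workbench. The function name, the seed, and the 80% default are assumptions made for the example.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Randomly split rows into a training set and a test set.

    rows: one entry per person (phenotype plus genome-derived predictors).
    Returns (train_rows, test_rows); every row lands in exactly one of them.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataframe stand-in: each row is (person_id, phenotype, predictors).
rows = [(i, "White" if i % 2 else "Asian", [0.01 * i, -0.02 * i]) for i in range(10)]
train_rows, test_rows = split_train_test(rows)
```

The model is fitted on `train_rows` only, and `test_rows` is held back so that the reported performance reflects people the model has never seen.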
Computation of chromosome-scale length variation (CSLV)
We extracted microarray genetic information for each participant by analyzing the data stored in the Hail matrix.
In the All of Us dataset, genetic information is stored in a Hail matrix, a scalable and flexible framework for genetic analysis [38]. The Hail matrix organizes genetic data into multiple dimensions. This allows for efficient querying, filtering, and transformation operations on large-scale genetic datasets.
The computation of CSLV values was conducted within a cloud based Jupyter notebook environment using the Python programming language on the designated workbench. Figure 1 depicts a Hail matrix, which is usually structured with four dimensions.
Fig. 1. An example of a small section of a Hail matrix, a data structure that contains key genetic information. We use the average of the LRR (log R ratio) over portions of each chromosome to characterize a person’s genetics. This has the advantage of combining many measurements into a single continuous number, but the disadvantage of losing information about the state of individual SNPs.
In the Hail matrix, the column fields represent individual participants in the study, allowing for the identification of specific individuals within the dataset. The row fields, on the other hand, contain constant information that applies to entire rows of entries. In this table, the row field represents the locus, which refers to the specific location on a chromosome where a genetic marker is situated. This field can be used to efficiently query or manipulate subsets of the rows based on their genomic location or other annotations. Lastly, the entry fields are indexed by both row and column and encompass various attributes such as Genotype (GT), Illumina GenCall Confidence (IGC) Score, Raw X and Y intensities as scanned from the original genotyping array, normalized X and Y intensities, normalized R value, normalized Theta value, Log R ratio, and B allele frequency (BAF).
This study investigated the incorporation of structural variations (insertions, deletions, translocations, and copy number variations) into a machine learning (ML) model for improved understanding of individual chromosomal profiles. Structural variations are known to cause slight modifications in overall chromosomal length. To achieve this, we focused on extracting log R ratio (LRR) values from patient entry fields at specific loci, excluding other data points. LRR values represent the logarithm of the observed signal intensity ratio, reflecting the copy number status (dosage) of genetic material at a given locus. By calculating the average LRR across a chromosome or a chromosomal segment, we obtained the nominal length, known as chromosome-scale length variation (CSLV). A CSLV value of 0 indicates two copies at the locus, while positive values signify duplications and negative values represent deletions.
To begin our analysis, we filtered the genetic data for each chromosome and stored it in separate Hail Matrix tables within our workbench. We then calculated the average LRR values within all segments of the chromosome along each column of entries, where each column corresponds to a specific participant. The resulting average values were stored as new column annotations in a new Hail Matrix table. We subsequently analyzed the column fields of this new table, which hold the patient IDs and the average LRR value for each chromosome or chromosome segment. We did not impute any data. We only included patient IDs that included complete data.
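The averaging step can be sketched in plain Python. The example stands in for the Hail computation and does not use the Hail API. It assumes that "complete data" means a participant has no missing LRR value on the chromosome being processed, which is one reading of the text. The participant IDs and LRR values are hypothetical.

```python
import math


def mean_lrr_per_participant(lrr_by_participant):
    """Average the LRR values of one chromosome for each participant.

    lrr_by_participant: {participant_id: [LRR at locus 1, LRR at locus 2, ...]}
    Participants with any missing value (None or NaN) are dropped, because
    no imputation is performed.
    """
    averages = {}
    for participant_id, values in lrr_by_participant.items():
        incomplete = not values or any(
            v is None or math.isnan(v) for v in values
        )
        if incomplete:
            continue
        averages[participant_id] = sum(values) / len(values)
    return averages


# Hypothetical LRR values at three loci of one chromosome.
chromosome_lrr = {
    "p1": [0.10, -0.10, 0.00],
    "p2": [0.20, None, 0.05],     # missing value: participant is excluded
    "p3": [0.10, 0.30, 0.20],
}
averages = mean_lrr_per_participant(chromosome_lrr)
```

Each surviving participant ends up with one number per chromosome, which corresponds to the two-column table (patient ID, average LRR) described in the text.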
To facilitate further analysis and reduce computational load, we converted the column fields table into a Pandas DataFrame format. This format offers greater flexibility for data manipulation. The resulting DataFrame consists of 165,127 rows, representing each participant, and two columns: patient ID and average LRR value for the analyzed chromosome. These steps were repeated 22 times to calculate the average LRR values for each chromosome. Figure 2 presents a histogram that illustrates the distribution of relative chromosome lengths obtained from DNA samples in the All of Us dataset, specifically for chromosomes 1, 7, 13, and 19. A value of “0” represents the nominal average chromosome length. This visualization provides insights into the variations in chromosome lengths within the dataset.
Fig. 2. Histograms of chromosome-scale length variation values measured for 10,000 people in the NIH All of Us dataset across four different chromosomes (Chromosomes 1, 7, 13, and 19). A value of zero indicates that person has a chromosome of the nominal length. A value of 0.1 indicates the person has a chromosome slightly longer than the average. These histograms indicate that this measurement extracted from genetic data varies across the population.
The choice of one CSLV number per chromosome is arbitrary. We performed some tests and found that splitting the chromosomes into quarters and computing a chromosome-scale length variation number for each quarter of a chromosome (88 numbers) performed significantly better for machine learning classification tasks. This choice is not optimal and finding the optimal choice of predictors is a task for future work. The rest of this paper will use 88 numbers to characterize each participant’s genome.
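A minimal sketch of the quarter-based features follows. The text does not say how the quarters are delimited, so the example assumes each quarter holds an equal number of SNPs in genomic order; splitting by base-pair position would be an equally plausible reading. Feature names follow the "chr6_2" pattern used in the Fig. 5 caption, and the LRR values are hypothetical.

```python
def segment_means(lrr_values, chromosome, n_segments=4):
    """Mean LRR of each consecutive segment of one chromosome.

    lrr_values: LRR values for one participant, ordered by genomic position.
    Returns {"chr<chromosome>_<segment>": mean LRR}, with segments numbered from 1.
    """
    n = len(lrr_values)
    if n < n_segments:
        raise ValueError("need at least one value per segment")
    features = {}
    for k in range(n_segments):
        start = k * n // n_segments
        stop = (k + 1) * n // n_segments
        segment = lrr_values[start:stop]
        features[f"chr{chromosome}_{k + 1}"] = sum(segment) / len(segment)
    return features


# Hypothetical LRR values at eight ordered loci on chromosome 6.
chr6_features = segment_means([0.0, 0.2, 0.1, 0.3, -0.1, -0.3, 0.4, 0.0], 6)

# Four features per autosome across chromosomes 1 to 22 gives 88 predictors.
all_names = [
    name
    for chromosome in range(1, 23)
    for name in segment_means([0.0] * 8, chromosome)
]
```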
Machine learning algorithms
We used the H2O AI platform (H2O.ai, Inc, Mountain View, CA) in conjunction with R within the NIH All of Us Researcher Workbench (a Google Cloud Jupyter notebook) to train, test, and validate various machine learning models. (H2O is licensed under the Apache License, Version 2.0.) We are not partial to any particular model algorithm. We are searching for the algorithm that provides the best AUC.
We used H2O’s AutoML function. This function explored various machine learning algorithms and assessed different hyperparameters for each algorithm. The AutoML function evaluates models based on gradient boosting machine (GBM), distributed random forest (DRF), deep learning, logistic regression, generalized linear models (GLM) and ensemble models built from combinations of these five types of models. The AutoML function is given a time budget (in seconds), which it uses to test different types of models and optimize their hyperparameters. When the time expires, it returns the best model, along with a scoreboard of how well the other models performed. An example scoreboard is shown in Table 1.
We ran AutoML with cross-validation. Cross-validation randomly splits the dataset into 10 different subsets. Nine of these subsets are combined, then split 80%/20% to create a training/testing group. Once a model is developed on this training/testing group, it is validated with the 10% of the data that was not included in the training/testing group. The process is repeated ten times to measure repeatability.
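The splitting scheme in this paragraph can be sketched with index lists alone. The example is one reading of the description and does not reproduce H2O's internal implementation. The function name and seed are hypothetical.

```python
import random


def cross_validation_splits(n_samples, n_folds=10, train_fraction=0.8, seed=0):
    """Yield (train, test, validation) index lists, one triple per fold.

    Each fold in turn is held out for validation (about 10% of the data).
    The remaining nine folds are pooled and split 80%/20% into training
    and testing indices.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]

    for held_out in range(n_folds):
        validation = folds[held_out]
        pooled = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        rng.shuffle(pooled)
        cut = round(len(pooled) * train_fraction)
        yield pooled[:cut], pooled[cut:], validation


splits = list(cross_validation_splits(1000))
```

With 1,000 samples, each of the ten repetitions uses 720 samples for training, 180 for testing, and 100 for validation, and no sample appears in more than one of the three groups.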
We conducted multiple runs with consistent hyperparameters, timeframes, and workspace configurations to assess the effectiveness of these 88 chromosome-scale length variation values as predictors of known genetic factors, including an individual’s sex at birth, self-declared race, and height.
Binary classification of sex and self-declared race
For classification problems using sex at birth, we used the results of this survey question asked of participants: “What was your biological sex assigned at birth?” Possible answers were “Female”, “Male”, “Intersex”, “None of these describe me”. We selected only those who answered “Female” or “Male” to the question. We then set up a binary classification experiment to distinguish between these two categories. The dataset we used had 62,090 participants labeled as “Male” and 97,689 labeled as “Female”.
For classification problems using a person’s race, we used the results of this survey question asked of participants: “Which categories describe you? Select all that apply. Note, you may select more than one group.” Possible answers were: “American Indian or Alaska Native”, “Asian”, “Black, African American, or African”, “Hispanic, Latino, or Spanish”, “Middle Eastern or North African”, “Native Hawaiian or other Pacific Islander”, “White”, “None of these fully describe me”, and “Prefer not to answer”. We selected only those people who selected “Black, African American or African”, “Asian”, or “White”. In the dataset we used, there were 32,426 people who selected “Black, African American or African”, 89,767 who selected “White” and 5,163 who selected “Asian”. We set up three binary classification experiments to differentiate between Black/White, Black/Asian, and White/Asian.
In each case of binary classification, we randomly split the dataset into an 80% training dataset and a 20% test dataset. We reran these binary classification experiments multiple times with different random splits to quantify the variation in the measured AUC.
Regression to predict height
To predict height, we combined the 88 CSLV numbers with the parameter “Body height” (measured in centimeters and reported with four significant digits) and the parameter “current age”. Race and sex at birth were available parameters but were not included in the machine learning models. The models were built only with age and the 88 CSLV parameters. We included age because there are well known age cohort effects on height [39, 40]. Older people grew up with different diets and nutrition than younger people and are shorter on average.
We filtered the dataset to remove anyone less than 21 years of age, to include only fully grown people. This dataset had 161,820 people. We randomly split the dataset into an 80% training set (129,456 people) and a 20% test set (32,364 people). Then, we used the H2O AutoML function to train a regression model on the training set that could best predict the “Body height” based only on the person’s age and the person’s 88 CSLV parameters. We ran the AutoML function for 15 min on a cloud analysis environment that had 4 CPUs and 15 GB of RAM.
The AutoML function produced a machine learning model. We applied the model to 32,364 people in the test data, resulting in a predicted height for each of these people.
We assessed the statistical significance of the results by repeated measurements using different sub populations drawn from the overall All of Us population. Statistical tests were performed in R. We computed the 95% confidence intervals using the R command t.test. Normality was first confirmed with the Shapiro test.
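For a one-sample case, the interval reported by R's `t.test` is the mean plus or minus a critical t value times the standard error. The sketch below reproduces that arithmetic in Python. The AUC values are hypothetical placeholders, not results from this study, the small table of two-sided 95% critical values covers only a few sample sizes, and the Shapiro normality check is omitted.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRITICAL_95 = {2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}


def confidence_interval_95(values):
    """Return (mean, half_width) of the 95% confidence interval of the mean."""
    n = len(values)
    degrees_of_freedom = n - 1
    if degrees_of_freedom not in T_CRITICAL_95:
        raise ValueError("no tabulated critical value for this sample size")
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, T_CRITICAL_95[degrees_of_freedom] * standard_error


# Hypothetical AUC values from five repeated runs with different random splits.
auc_runs = [0.970, 0.972, 0.968, 0.971, 0.969]
mean_auc, half_width = confidence_interval_95(auc_runs)
```

A result would then be reported as the mean plus or minus the half width.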
Results
We initially computed the 88 chromosome-scale length variation values, four from each of Chromosomes 1 through 22, for all people available in the All of Us controlled tier V6 dataset. (We did not use any information from the X and Y chromosomes.) Our aim was to determine whether these 88 values could effectively predict three phenotypes: sex at birth, race, and height. (Race is self-reported.)
Classification of race and sex
To evaluate the predictive power of the chromosome scale length variation data, we conducted machine learning experiments to see how well an algorithm could differentiate between various groups of people. The effectiveness of this differentiation was quantified by measuring the area under the curve (AUC) of the receiver operating characteristic curve. We conducted four separate experiments, each repeated at least five times with different randomizations. The four experiments were to distinguish people who were (A) male from female, (B) Black from white, (C) Asian from white, (D) Asian from Black. The results are shown in Fig. 3 and in Table 2.
Fig. 3. We set up four binary classification experiments. The receiver operating characteristic (ROC) curves shown here characterize the classification process. Specifically, an area under the curve (AUC) of 1.0 is perfect classification and an AUC of 0.5 indicates random classification. As shown in (A), we could almost perfectly classify whether a person was born as a male or female. We could classify whether a person considered themselves white or Black (B) or Asian or Black (D) about equally well. We could classify whether a person considered themselves Asian or white (C) with less precision.
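The AUC has a direct interpretation: it is the probability that a randomly chosen member of the positive class receives a higher score than a randomly chosen member of the negative class, with ties counted as one half. The sketch below computes it that way. It is an illustration of the metric, not the H2O implementation, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1 or 0) and model scores.

    Counts, over all positive/negative pairs, how often the positive example
    has the higher score. Ties contribute one half. This pairwise form is
    quadratic in the sample size, so it suits small illustrations only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four people.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```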
Prediction of height
We created a dataset that contained the predicted height (as described in the Methods section) along with the actual measured height for all 32,364 people in the test set. We took this dataset and grouped the results into 50 different groups (each with about 647 people) based on a ranking from the predicted height. (The first group contained the 647 people with the shortest predicted height, while the last group contained the 647 people predicted to be the tallest.) We then computed the mean height of the people in each of the 50 groups, expecting the mean height to increase from the first group to the last group. Results are shown in Fig. 4.
Fig. 4. This figure depicts the results from a machine learning experiment to predict height based only on age and 88 parameters derived from genetic data for 32,364 people in the NIH All of Us dataset. We assign each person to a group. The first group contains the 647 people with the shortest predicted height. The last group contains the 647 people with the tallest predicted height. Then, we compute the average of the actual height of the 647 people in each group and plot the average height of the group vs. the group number. The two distinct levels of actual height are due to the machine learning algorithm recognizing the difference between men and women.
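The grouping procedure can be sketched as follows: rank people by predicted height, cut the ranking into equal-sized groups, and average the measured height within each group. The function name and the four-person example are hypothetical. How the leftover people are distributed when the sample size is not a multiple of the group count is an assumption of this sketch.

```python
def mean_actual_by_predicted_rank(predicted, actual, n_groups=50):
    """Mean measured value within each group of the predicted-value ranking.

    predicted, actual: parallel lists, one entry per person.
    Returns a list of n_groups means, ordered from the lowest-predicted
    group to the highest-predicted group.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    means = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        means.append(sum(actual[i] for i in members) / len(members))
    return means


# Hypothetical predicted and measured heights (cm) for four people, two groups.
group_means = mean_actual_by_predicted_rank(
    predicted=[170.0, 160.0, 180.0, 165.0],
    actual=[172.0, 158.0, 181.0, 166.0],
    n_groups=2,
)

# With 32,364 people and 50 groups, each group holds about 647 people.
people_per_group = 32364 // 50
```

If the model carries real signal, the group means rise from the first group to the last.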
Discussion
In this paper, we have shown that chromosome scale length variation (CSLV) data, in combination with machine learning techniques, can be used to accurately predict a person’s sex at birth, self-declared race, and height. Since this data is largely independent of specific SNP values, combining this approach with the classic SNP based polygenic risk score approach should lead to more accurate predictions.
One interesting finding from our study is that the machine learning model was able to infer male/female from the CSLV data and use that in the model for height, even though we only included data from chromosomes 1–22. This should not be possible (sex is only determined by the X/Y chromosomes) and suggests that significant crosstalk exists within the data. (For instance, the log R data values reported on chromosomes 1–22 are influenced by whether the Y chromosome is present, probably due to the design of specific probes on the microarray chip.)
The range of predicted heights varied over 18 cm, from about 160 cm (for the shortest 2%) to 178 cm (for the tallest 2%) using 130k people (Fig. 4). This is comparable to [35], which used 700k people, but less than [36], which found a range of 23 cm using 5.4 million people. Detailed comparison to other polygenic predictions for height is difficult for several reasons. First, many works quote the \(R^{2}\) value, or equivalently the percentage of variance explained, as the metric quantifying the accuracy of their prediction [32, 36, 41]. But the percentage of variance explained is computed based on an assumption of linearity, which is not valid for highly non-linear models like the ones we use. Second, the variation in height of a dataset depends on the distribution of ancestries within the dataset. The UK Biobank has a much more uniform population compared to the NIH All of Us dataset. It is not clear whether the accuracy of prediction on one dataset would be the same on another dataset of different makeup.
The classification of self-declared race experiments showed that the Asian-white classification was significantly worse than the Asian-Black and white-Black classifications. This is consistent with the Out of Africa hypothesis, which implies that white and Asian people are genetically closer to one another than to Black (African) people [4]. Because they are genetically closer, they are more difficult to differentiate.
These findings have important consequences. First, a person’s self-identified race can be predicted with high accuracy using this genetic data. Second, it is crucial to remember that analyses like polygenic risk scores and Genome-Wide Association Studies (GWAS) only show correlations; they do not definitively prove cause-and-effect relationships. In medicine, strong genetic correlations with diseases are often interpreted as causal, because many diseases have a clear genetic basis. This same reasoning is sometimes applied to social traits like IQ, education level, or income, suggesting these traits are also genetically determined. However, our results raise a critical concern. These analyses might simply be rediscovering well-established correlations between social traits and race, correlations that are primarily due to long-standing societal factors. The genetic data, in this case, might be indirectly reflecting social history, rather than revealing direct genetic causes of complex social outcomes.
The primary limitation of this approach is that it is not easily interpretable. Although interpretable machine learning techniques exist, our computation of CSLV values groups large portions of a chromosome together, obscuring the origin of differences. Figure 5 shows the variable importance and SHAP plots for the Asian-white classification. The classification model is based on many features across the genome. It is not dependent on one, or even a small number of features. This approach might provide better predictive power of a particular phenotype, but it is not as useful to develop new treatments or build understanding of what is causing a disease or trait.
Fig. 5. This figure presents the variable importance plot (top) and SHAP contribution (bottom) for the classification of each participant’s self-declared race based upon their genetic profile. This classification problem built a model to classify a person’s self-declared race as either “Asian” or “White” based on the CSLV values. The CSLV values were computed from 4 segments of each chromosome. The label “chr6_2” indicates that value was computed from the second segment of Chromosome 6. The corresponding receiver operating characteristic curve is shown in Fig. 3c. This figure indicates that the prediction does not rely on a few segments of the genome but is instead drawn from many different segments that all have similar importance to the prediction.
Based on the results presented here, we have identified several areas of future work that would address weaknesses. First, we need to do external validation. We have validated models internally with NIH All of Us data and we have previously applied these methods to UK Biobank data [5, 42, 43]. However, we have not yet developed a model on one dataset and tested that model on a separate dataset. Second, we need to combine these predictions with SNP based methods to understand how much CSLV increases the accuracy of predictions. Finally, we need to optimize the selection of CSLV parameters to provide the best predictive power.
Conclusion
In conclusion, we have shown that this approach can effectively predict demographic traits of a person (their phenotype) from their genotype. This approach offers a promising avenue for personalized medicine by providing a compact and informative representation of a person’s genome that may be able to identify people at high risk for specific diseases, optimize treatment decisions, and eventually advance our understanding of complex traits.
Data availability
No datasets were generated or analysed during the current study.
Abbreviations
- AUC: Area under the curve
- CNV: Copy number variation
- CSLV: Chromosome-scale length variation
- GBM: Gradient boosting machine
- ROC: Receiver operating characteristic
- SNP: Single nucleotide polymorphism
References
Lehner B. Genotype to phenotype: lessons from model organisms for human genetics. Nat Rev Genet. 2013. https://doi.org/10.1038/nrg3404.
Medvedev A, Sharma SM, Tsatsorin E, Nabieva E, Yarotsky D. Human genotype-to-phenotype predictions: boosting accuracy with nonlinear models. PLoS ONE. 2022;17. https://doi.org/10.1371/journal.pone.0273293.
Liao JG, Chin KV. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics. 2007;23. https://doi.org/10.1093/bioinformatics/btm287.
Gibbs RA, Boerwinkle E, Doddapaneni H, Han Y, Korchina V, Kovar C, et al. A global reference for human genetic variation. Nature. 2015;526:68–74. https://doi.org/10.1038/nature15393.
Ko C, Brody JP. Evaluation of a genetic risk score computed using human chromosomal-scale length variation to predict breast cancer. Hum Genomics. 2023;17. https://doi.org/10.1186/s40246-023-00482-8.
Toh C, Brody JP. Evaluation of a genetic risk score for severity of COVID-19 using human chromosomal-scale length variation. Hum Genomics. 2020;14. https://doi.org/10.1186/s40246-020-00288-y.
Toh C, Brody JP. A genetic risk score using human chromosomal-scale length variation can predict schizophrenia. Sci Rep. 2021;11:1–10. https://doi.org/10.1038/s41598-021-97983-0.
Toh C, Brody JP. Genetic risk score for ovarian cancer based on chromosomal-scale length variation. BioData Min. 2021;14. https://doi.org/10.1186/s13040-021-00253-y.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science. 2001;291. https://doi.org/10.1126/science.1058040.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409. https://doi.org/10.1038/35057062.
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, et al. A general approach to single-nucleotide polymorphism discovery. Nat Genet. 1999;23. https://doi.org/10.1038/70570.
Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409. https://doi.org/10.1038/35057149.
Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. The discovery of single-nucleotide polymorphisms—And inferences about human demographic history. Am J Hum Genet. 2001;69. https://doi.org/10.1086/324521.
Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000;407. https://doi.org/10.1038/35035083.
Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, et al. High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 2006;16. https://doi.org/10.1101/gr.5402306.
Ikegawa S. A short history of the genome-wide association study: where we were and where we are going. Genomics Inform. 2012;10. https://doi.org/10.5808/gi.2012.10.4.220.
Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet. 2017;101:5–22. https://doi.org/10.1016/j.ajhg.2017.06.005.
Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90:7–24. https://doi.org/10.1016/j.ajhg.2011.11.029.
Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023;110:179–94. https://doi.org/10.1016/J.AJHG.2022.12.011.
Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018;19:581–90. https://doi.org/10.1038/s41576-018-0018-x.
Lambert SA, Abraham G, Inouye M. Towards clinical utility of polygenic risk scores. Hum Mol Genet. 2019. https://doi.org/10.1093/hmg/ddz187.
Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12:44. https://doi.org/10.1186/s13073-020-00742-5.
Jia G, Lu Y, Wen W, Long J, Liu Y, Tao R, et al. Evaluating the utility of polygenic risk scores in identifying high-risk individuals for eight common cancers. JNCI Cancer Spectr. 2020;4. https://doi.org/10.1093/JNCICS/PKAA021.
Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50:1219–24. https://doi.org/10.1038/s41588-018-0183-z.
Paré G, Mao S, Deng WQ. A machine-learning heuristic to improve gene score prediction of polygenic traits. Sci Rep. 2017;7:12665. https://doi.org/10.1038/s41598-017-13056-1.
Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, et al. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun Biol. 2022;5. https://doi.org/10.1038/S42003-022-03812-Z.
Badré A, Zhang L, Muchero W, Reynolds JC, Pan C. Deep neural network improves the estimation of polygenic risk scores for breast cancer. J Hum Genet. 2020;66(4). https://doi.org/10.1038/s10038-020-00832-7.
Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47. https://doi.org/10.1038/ng.3367.
Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet. 2019;51. https://doi.org/10.1038/s41588-019-0385-z.
Mai J, Lu M, Gao Q, Zeng J, Xiao J. Transcriptome-wide association studies: recent advances in methods, applications and available databases. Commun Biol. 2023. https://doi.org/10.1038/s42003-023-05279-y.
Cao C, Kwok D, Edie S, Li Q, Ding B, Kossinna P, et al. KTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Brief Bioinform. 2021. https://doi.org/10.1093/bib/bbaa270.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42. https://doi.org/10.1038/ng.608.
Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467. https://doi.org/10.1038/nature09410.
Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46. https://doi.org/10.1038/ng.3097.
Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in ~ 700 000 individuals of European ancestry. Hum Mol Genet. 2018;27. https://doi.org/10.1093/hmg/ddy271.
Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610. https://doi.org/10.1038/s41586-022-05275-y.
Bick AG, Metcalf GA, Mayo KR, Lichtenstein L, Rura S, Carroll RJ, et al. Genomic data in the All of Us Research Program. Nature. 2024. https://doi.org/10.1038/s41586-023-06957-x.
Hail Team. Hail [Internet]. [cited 25 Feb 2024]. Available: https://github.com/hail-is/hail
Cole TJ. The secular trend in human physical growth: A biological view. Econ Hum Biol. 2003;1. https://doi.org/10.1016/S1570-677X(02)00033-3.
Arntsen SH, Borch KB, Wilsgaard T, Njølstad I, Hansen AH. Time trends in body height according to educational level. A descriptive study from the Tromsø study 1979–2016. PLoS ONE. 2023;18. https://doi.org/10.1371/journal.pone.0279965.
Lello L, Avery SG, Tellier L, Vazquez AI, de los Campos G, Hsu SDH. Accurate genomic prediction of human height. Genetics. 2018;210. https://doi.org/10.1534/genetics.118.301267.
Toh C, Brody JP. A genetic risk score using human chromosomal-scale length variation can predict schizophrenia. Sci Rep. 2021;11. https://doi.org/10.1038/s41598-021-97983-0.
Toh C, Brody JP. Human chromosomal-scale length variation and severity of COVID-19 infection using the UK biobank dataset. Hum Genomics. 2020. https://doi.org/10.1101/2020.07.06.20147637.
Acknowledgements
We gratefully acknowledge All of Us participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program for making available the participant data examined in this study.
Funding
No external funding supported this project.
Author information
Contributions
Y.F. computed the CSLV statistics and composed the figures. J.B. wrote the initial draft of the manuscript text. Y.F. and J.B. edited the manuscript through several revisions. All authors reviewed and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fatapour, Y., Brody, J.P. A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores. BioData Mining 18, 44 (2025). https://doi.org/10.1186/s13040-025-00459-4
DOI: https://doi.org/10.1186/s13040-025-00459-4