- Research
- Open access
- Published:
SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations
Genome Biology volume 26, Article number: 356 (2025)
Abstract
Background
Inclusion of individuals with diverse or admixed genetic ancestries is crucial to discover novel findings that may be missed by genomics analyses rooted solely in European population.
Results
Here, we present an analysis framework, SPAmix, which is scalable to a large-scale biobank data analysis including hundreds of thousands of admixed individuals and is universally applicable to various types of complex traits including quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits. Since no alternative model is fitted, SPAmix primarily focuses on association p values. For each genetic variant, SPAmix uses genotype data and genetic principal components to estimate individual-specific allele frequency, which is subsequently used to calibrate p values via a retrospective analysis. A hybrid strategy including saddlepoint approximation (SPA) can greatly increase the accuracy to analyze rare genetic variants, especially if the phenotypic distribution is unbalanced or extremely unbalanced. We also propose SPAmixlocal to incorporate local ancestry to calculate ancestry-specific p values. To maximize the statistical powers, SPAmixCCT is proposed to combine the p values of SPAmix and SPAmixlocal via Cauchy combination.
Conclusions
The SPAmix-based approaches are more accurate than Tractor to address phenotypic variance heterogeneity among ancestries when analyzing quantitative traits and to address an unbalanced case–control ratio when analyzing binary traits. SPAmixCCT is an optimal unified approach for various cross-ancestry genetic architectures. Extensive simulation studies and real data analyses of 369,314 UK Biobank individuals from multiple ancestries demonstrated that SPAmix is scalable and can discover novel hits while controlling type I error rates well.
Background
Large-scale biobank data integrate in-depth genetic information and electronic health records (EHR) with extensive health information, enabling the response variable in genome-wide association studies (GWASs) to extend beyond quantitative and binary traits [1]. As a result, complex traits with intricate structures are increasingly attracting attention in large-scale GWAS researches [2,3,4,5,6,7,8]. Statistical models in GWAS have evolved to include more sophisticated models such as the Cox proportional hazards regression model to fit a time-to-event trait [6, 9,10,11,12].
Recently, SPACox was proposed as a general analysis framework to accurately and efficiently conduct GWAS for complex traits with intricate structures [6]. Using an empirical saddlepoint approximation (SPA) to approximate the null distribution of score statistics [13,14,15,16,17,18,19], SPACox is computationally efficient and remains powerful while controlling type I error rates for common variants (minor allele frequency, MAF > 5%), low-frequency variants (5% > MAF > 1%), and rare variants (MAF < 1%) [6]. Although demonstrated using time-to-event traits, SPACox is of high potential to be extended. For example, trajGWAS also employs the empirical SPA used in SPACox to analyze longitudinal trait [20]. However, SPACox was designed to analyze a homogeneous population and its performance in multiple subpopulations or admixed populations remains unclear.
Population stratification is a significant confounder in GWAS, and large biobanks may include a considerable number of individuals from heterogeneous ancestries, such as those in admixed populations [21,22,23]. The admixture can occur due to various reasons, such as migration, colonization, and intermarriage between individuals from different populations [24]. Incorporating SNP-derived principal components (PCs) as covariates in regular regression models can help control for confounding but still could result in inflation or deflation, for example, in the case of phenotypic heterogeneity (such as phenotypic variance heterogeneity for quantitative traits and baseline hazard function heterogeneity for time-to-event traits) among genetic ancestries [25]. Due to the concerns over population structure, these individuals are routinely excluded from analysis.
Recently, Tractor was proposed to enable the inclusion of unrelated admixed individuals in GWAS [26]. In addition, Tractor can incorporate local ancestry to conduct population-specific association studies, which can enhance power compared to regular approaches in the case of the effect size heterogeneity [27]. Although powerful, Tractor is computationally expensive because it requires fitting a full model for each genetic variant, which limited its application to a large-scale biobank dataset with hundreds of thousands of individuals. In addition, Tractor package does not support complex traits with intricate structures (e.g., time-to-event and longitudinal traits). If the effect size remains the same for different ancestries, Tractor is less powerful than regular approaches [27]. An alternative strategy is to divide all individuals into multiple ancestry groups and then conduct a meta-analysis. However, the division criteria are arbitrary and may result in the removal of individuals in transitional regions.
Retrospective association analysis treats the genotype of a genetic variant as a random variable and conducts association analysis conditional on the phenotype and covariates. Compared to prospective association analysis, the retrospective association analysis demonstrates less susceptible to power loss, exhibits greater robustness to control type I errors, and maintains statistical power, especially when dealing with trait model misspecification [28]. For example, LT-Fam uses a retrospective association score statistic to address family-biased case–control ascertainment [29]. For individuals with diverse or admixed genetic ancestries, phenotypes could be affected by latent ancestry-specific environment factors or lifestyles, potentially leading to model misspecification. Consequently, a retrospective association analysis may offer a more effective solution.
In this study, we propose SPAmix, a scalable and accurate retrospective analysis framework for a large-scale genome-wide analysis while accounting for population stratification in admixed populations. Since no alternative model is fitted, SPAmix primarily focuses on association p values. As a score test approach, SPAmix only requires fitting a null model which is the same for all genetic variants across the genome-wide analysis. And thus, the computational burden is much lower than the approaches based on Wald test or likelihood ratio test, especially for a large-scale biobank data analysis. The statistical model can be freely chosen to fit the trait of interest and only residuals obtained from the null model fitting are required for SPAmix. In this paper, we used binary traits, quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits as examples to demonstrate the performance of SPAmix. For each genetic variant, SPAmix estimates individual-specific allele frequencies using genotype data and SNP-derived PCs. And then, SPAmix treats genotype as a random variable and uses a hybrid strategy including both normal distribution approximation and SPA to calibrate p values. In addition, SPAmix can be extended to incorporate local ancestry if available and SPAmixCCT can serve as an optimal unified approach for various cross-ancestry genetic architectures, leveraging the advantages of the Cauchy combination (CCT) [27, 30].
We evaluated SPAmix using extensive simulation studies in which individuals from multiple ancestries and various trait types were included. SPAmix can always accurately control type I error rates and maintain power in various settings in terms of genotypes (common, low-frequency, or rare variants) and phenotypes (phenotypic distribution is balanced or unbalanced). However, in the case of phenotypic variance heterogeneity among ancestries, regular linear regression approaches are invalid. And similarly, regular Cox regression approaches cannot control type one error rates if the baseline hazard functions for distinct ancestries are different. If the local ancestry is incorporated, SPAmixCCT is more powerful than Tractor and can serve as an optimal unified approach regardless of effect size homogeneity and heterogeneity.
We applied SPAmix to analyze 10 time-to-event traits and 11 longitudinal traits using 369,314 unrelated individuals from all ancestries in UK Biobank data. Compared to the analysis limited to 308,328 (~ 83%) unrelated White British (WB) individuals, a total of 377 loci were additionally identified at a significance level of 5 × 10−8. The results highlight the importance of the ancestry diversities in GWAS.
Results
An overview of SPAmix
SPAmix is designed for conducting GWASs in a study cohort including multiple subpopulations and/or admixed populations. It is comprised of two steps (see Fig. 1). In step 1, SPAmix requires fitting a null model to calculate model residuals, in which confounding factors such as age, sex, SNP-derived principal components (PCs), and other confounders are incorporated as covariates. The null model specification and the corresponding model residuals can vary depending on the type of trait. In the online Methods section, we demonstrated regression models to fit binary, quantitative, time-to-event, ordinal, and longitudinal traits, and the corresponding model residuals.
Overview of SPAmix approach. SPAmix is comprised of two steps. In step 1, null model fittings are required to calculate model residuals. In step 2, SPAmix approach associates the traits of interest to single genetic variant by approximating the null distribution of score statistics. Based on a hybrid strategy using normal distribution approximation and saddlepoint approximation, SPAmix is scalable to analyze large-scale biobank data while being accurate to analyze rare genetic variants even if the phenotypic distribution is unbalanced. The estimation of individual-level allele frequency can well characterize the ancestry difference
In step 2, SPAmix approach associates the traits of interest to single genetic variant by approximating the null distribution of score statistics \(S=\sum_{i=1}^{n}{G}_{i}{R}_{i}\), where \(n\) is the number of individuals, and \({G}_{i}\) and \({R}_{i},i\le n\) are the genotypes and model residuals, respectively. To characterize the diversity of allele frequencies (AFs) of this genetic variant for different ancestries, we assume that individuals come from different populations with varying AF. Instead, linear and logistic regressions were used to estimate individual-specific AFs \(\widehat{{\varvec{q}}}={\left({\widehat{q}}_{1},\dots ,{\widehat{q}}_{n}\right)}^{T}\), in which raw genotypes are outcomes and SNP-derived principal components are model predictors. SPAmix uses a hybrid strategy including both normal distribution approximation and SPA to approximate the distribution of score statistics under the null hypothesis. The approximation relies on the assumption that genotype \({G}_{i}\) follows a binomial distribution Binom(2, \({\widehat{q}}_{i}\)). In addition, model residuals are categorized to outliers and non-outliers based on interquartile range, and a partial normal distribution is used to reduce computational burden. SPAmix supports simultaneously analyzing multiple traits whose types are the same or different, which can avoid repeated calculation of the AF estimation.
Association analysis in the UK Biobank data
We applied SPAmix, SPACox, and trajGWAS to analyze 10 time-to-event traits and 11 longitudinal traits in UK Biobank data. A total of 369,314 unrelated individuals from multiple ancestries were included in genome-wide analyses (SPAmix, SPACoxALL, and trajGWASALL). For SPACox and trajGWAS, we additionally analyzed 308,328 (~ 83%) unrelated White British (WB) individuals (SPACoxWB and trajGWASWB). We also evaluated SPAmixnoSPA approach in which only normal distribution approximation is used to calibrate p values. The exact sample size in analysis, phecode, and event rates in WB and non-WB for the 10 time-to-event traits can be found in Additional file 1: Table S1. In this paper, we defined WB based on Field ID of 22,006, which is highly consistent to the genetic ancestry based on the genetically derived principal components. We define non-WB group as all participants from diverse ancestries (except WB) or admixed populations in the UKB. Note that non-WB is not a monolith ancestry group. The 11 longitudinal traits were extracted from primary care data in a similar way as in previous studies [31]. The evaluation of SPACoxWB is to highlight the importance of ancestry diversities in GWAS. We did not consider more complicated study design, for example, using SPACox to analyze separate populations and then conducting meta-analysis. Although this design can increase power, it is still expected to be less powerful than SPAmix as subjects in transitional regions would be excluded.
SPAmix is scalable to analyze large-scale biobank data
We selected Alzheimer’s disease (event rate = 0.2%) and essential hypertension (event rate = 27.1%) as two time-to-event trait examples (sample size = 338,044) to evaluate computation time of SPAmix for low and high event rates, respectively. All analyses were conduct on a CPU model of Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz. Memory usage for each job was less than 4 GB. In addition to SPAmix and SPACox, we also evaluated an R package gwasurvivr in which Wald test was used to calibrate p values. As the package gwasurvivr does not support BGEN format, we converted the genotype data to plink format. It is expected that reading text-based formats (such as VCF format) is slower than reading binary format (such as plink and BGEN formats). We analyzed 10,000 genetic variants randomly selected in chromosome 1, recorded the computation time, and then projected it to all chromosomes including 18,583,853 genetic variants (Fig. 2, Additional file 1: Table S2). In Additional file 1: Supplementary notes, we demonstrated that the computational time scaled linearly (Additional file 1: Fig. S61).
Computation efficiency of SPAmix, SPACox, and Wald test (gwasurvivr). A total of 338,044 UK Biobank unrelated individuals were included in analysis. The CPU hours was recorded based on 10,000 genetic variants randomly selected in chromosome 1 and then projected to a genome-wide analysis including 18,583,853 genetic variants. SPAmix, SPACox, and gwasurvivr fit time-to-event traits using Cox regression models including age, sex, and the top 10 SNP-derived PCs as covariates. All analyses were conduct on a CPU model of Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz. Since SPACox and gwasurvivr does not support analyzing multiple traits simultanously, we sum up the CPU hours for AD and hypertension to get the CPU hours for “AD & Hypertension”
For GWAS of Alzheimer’s disease and essential hypertension, gwasurvivr took ~ 6354 and 6229 CPU hours, respectively. Meanwhile, SPAmix only took 501 and 553 CPU hours, which were 12.7 and 11.3 times faster than gwasurvivr. Furthermore, if these two traits were analyzed simultaneously, it only took 574 CPU hours, which was much lower than the sum of the computation time of gwasurvivr and slightly lower than that of SPACox. This indicates that reading genotype and calculating individual-specific allele frequency take a high proportion of the computation time, which can be reduced if multiple traits were analyzed simultaneously.
As a score test approach, SPAmix is much faster than tools developed using Wald test or likelihood ratio test, especially when analyzing multiple traits. The high computational efficiency ensures that SPAmix is scalable to a large-scale GWAS including hundreds of thousands of subjects.
SPAmix is accurate to analyze multi-way admixed populations
We chose four time-to-event traits including type 2 diabetes, essential hypertension, glaucoma, and asthma as examples for demonstration. Manhattan plots and quantile–quantile (QQ) plots are shown in Fig. 3, and QQ plots stratified by minor allele frequency in WB are shown in Fig. 4. For all these four traits, SPAmix and SPACoxWB can well control type I error rates, and SPAmix was more powerful than SPACoxWB. This is expected as SPAmix incorporated individuals from all ancestries and SPACoxWB removed ~ 17% non-white British individuals.
Manhattan plots and Quantile–quantile (QQ) plots for GWAS of four time-to-event traits. A total of 338,044 unrelated individuals from multiple ancestries were included in genome-wide analyses (SPAmix and SPACoxALL). For SPACoxWB, 281,299 (~ 83%) unrelated White British (WB) individuals were analyzed
Quantile–quantile (QQ) plots for GWAS of four time-to-event traits in which genetic variants were grouped based on minor allele frequency (MAF) in white British. A total of 338,044 unrelated individuals from multiple ancestries were included in genome-wide analyses (SPAmix and SPACoxALL). For SPACoxWB, 282,590 (~ 83%) unrelated White British (WB) individuals were analyzed
The accuracy of SPACoxALL highly depends on the diversity of event rates for different ancestries. For SPACoxALL, the high diversity of event rates resulted in a large number of false discoveries across the whole genome. An example is the type 2 diabetes for which the event rate in WB is ~ 6.55% and the event rates in non-WB is ~ 9.69%. QQ plots stratified by allele frequencies demonstrated that SPACoxALL identified spurious discoveries of low-frequency and rare variants (Fig. 4). For Alzheimer’s disease and multiple sclerosis, the diversity of event rates is moderate and SPACoxALL gave similar analysis results as SPAmix.
SPAmixnoSPA approach approximates the null distribution of the score statistics only using normal distribution approximation, and its performance depends on whether the phenotypic distribution is balanced or unbalanced. Of the four time-to-event traits, SPAmixnoSPA performed comparably as SPAmix only for type 2 diabetes (event rate is 7.1%). For the other 3 traits (event rate < 0.42%), SPAmixnoSPA questionably identified more associations than SPAmix and the findings mostly corresponded to low-frequency and rare variants (MAF < 0.05). The results were consistent with previous studies, which validated that SPA is necessary to control type I error rates if event rate is low, especially for low-frequency and rare variants [6, 15, 32]. The corresponding Manhattan plots, QQ plots, and QQ plots stratified by MAF for the remaining 6 time-to-event traits can be found in Additional file 1: Fig. S1-S3.
The analysis results for 11 longitudinal traits are similar as in time-to-event data analysis (Additional file 1: Fig. S4-S9). For example, in the analysis of BMI, spurious positive findings of low-frequency and rare variants were observed when using trajGWASALL, and trajGWASWB is less powerful than SPAmix (Additional file 1: Fig. S9k). The results indicate a unified advance of SPAmix over regular methods for different trait types of interest.
Generally speaking, the real data analysis in UK Biobank data suggests that SPAmix is more powerful than SPACoxWB and trajGWASWB, which is expected as SPAmix allows to include all individuals. Meanwhile, SPACoxALL and trajGWASALL are only valid when the trait distributions are close to each other for different ancestries. This is because the empirical SPA methods used in SPACox and trajGWAS heavily rely on the assumption that all model residuals follow an identical distribution [6]. In addition, SPA is also required to control type one error rates, especially for low-frequency and rare variants. The above conclusions are consistent with the simulation studies demonstrated in the next sections.
SPAmix identified 377 additional significant loci by including non-WB individuals
We clustered genetic variants within 200-kb region as one locus. At a genome-wide significance level \(\alpha\ =\) 5 \(\times\) 10−8, SPAmix totally identified 419 and 1500 loci for the 10 time-to-event traits and 11 longitudinal traits (testing for \({\beta }_{g}\)), of which, 113 and 264 loci were not identified by SPACoxWB and trajGWASWB, respectively. On the contrary, only 27 (27/113 = 23.9%) and 86 (86/264 = 32.6%) loci were identified by SPACoxWB and trajGWASWB but were not identified by SPAmix. If we set the significance level at \(\alpha =\) 5 \(\times\) 10−8/21, SPAmix identified 281 and 1050 loci for time-to-event traits and longitudinal traits, of which, 73 and 155 loci were not identified by SPACoxWB and trajGWASWB, respectively. On the contrary, only 11 (11/73 = 15.1%) and 48 (48/155 = 31.0%) loci were identified by SPACoxWB and trajGWASWB but were not identified by SPAmix. The complete list of the identified loci, including the top significant genetic variants (whose p value is the smallest in a locus) and the association results, are shown in Supplementary Tables S3. In this section, 3 loci were highlighted to exemplify the advantage of SPAmix over regular approaches as below.
For type 2 diabetes, SPAmix identified SNP rs71317817 (p value = 7.03 \(\times\) 10−9), which is an intronic variant in UBE2E2 that encodes ubiquitin conjugating enzyme E2. A GWAS in Japanese population has also identified the susceptibility loci for type 2 diabetes [33]. Within 10-Mb region centered on this SNP, this SNP exerts cis-eQTL effects on UBE2E2 in different tissues, especially in DM-related tissues such as the pancreas (p value = 2.03 \(\times\) 10−11) [34]. We then queried the CAUSALdb [35], a database which incorporates over 3000 public full GWAS summary data and identified credible sets of causality by uniformly processed fine-mapping. The SNP is the most possible causal variation in its linkage disequilibrium (LD) block with the highest posterior probability for diabetes mellitus. In addition, the SNP is located in a region overlapped with some epigenomics signals such as transcription factor binding and open chromatin [36].
For asthma, SPAmix identified SNP rs4833095, a missense variant in TLR1 which encodes Toll Like Receptor 1. Ferreira et al. identified this SNP as a risk variant associated with the asthma with hay fever phenotype [37]. In addition, Chang et al. found that SNP rs4833095 was associated with asthma with a p value 7 × 10−6 through a whole genome sequencing study [38]. The above two studies indicate that this SNP could be a variant associated with asthma, and larger sample sizes may be required to achieve the genome-wide significance. In addition, we found that SNP rs4833095 exerted cis-eQTL effects on TLR1 in different tissues, which showed the strongest effect in blood with a p value 2.63 \(\times\) 10−284. CAUSALdb also suggests that this SNP is speculated the most possible causal variant of asthma as well as allergic disease in its corresponding region [35].
For glaucoma, SPAmix identified SNP rs10262524 (p value = 4.86 \(\times\) 10−8), an intergenic variant between CAV2 gene and CAV1 gene. Gharahkhani et al. previously identified rs10262524 as open-angle glaucoma risk variant (p value = 3 \(\times\) 10−14) by a large-scale GWAS [39]. The SNP exerts cis-eQTL effects on CAV2 and CAV1 in many tissue types, and CAV2 shows the strongest eQTL effect size (p value = 3.77 \(\times\) 10−71). In addition, rs9665610 is the most possible causal variation in its linkage disequilibrium (LD) block with the highest posterior probability for glaucoma and related phenotypes. The location of rs10262524 overlaps with transcription factor binding sites and open chromatin regions, suggesting an underlying epigenetic regulatory mechanism. Other CAV1-CAV2 loci associated with glaucoma include rs4236601, a SNP in LD with the SNP rs10262524 (r2 = 0.96) [40,41,42,43,44,45].
Association analysis in the ALL of Us Data
To validate SPAmix in diverse populations, we performed genome-wide association studies (GWAS) for time-to-event hypertension in the All of Us cohort, analyzing both multi-ancestry (sample size = 188,941, event rate = 40.4%) and the European-ancestry subgroups (sample size = 112,080, event rate = 41.6%). After excluding variants with missing p values or MAF < 0.01, SPAmix identified 18 significant SNPs at a genome-wide significance level \(\alpha =\) 5 \(\times\) 10−8 in the multi-ancestry analysis, a 38% increase compared to the European-only subgroup (13 significant SNPs), as shown in Tables S4 and S5. Manhattan plots (Additional file 1: Fig. S10) demonstrated the power enhancement through incorporation of multi-ancestry individuals.
To demonstrate the validity of SPAmix in AllofUs data analyses, we highlight key loci showing consistent biological relevance. The KCNK3 locus (rs12476527) was significantly associated with hypertension in both multi-ancestry and European-ancestry analyses, supporting previous reports linking KCNK3 to hypertension [46]. Similarly, SPAmix replicated the known hypertension signal at rs1799945 in HFE across both multi-ancestry and European analyses, consistent with established GWAS evidence [47, 48].
Beyond replications, we also highlight biologically plausible associations uniquely identified by multi-ancestry analysis: (1) SNP rs167479 in RGL3 (p value = 4.12 \(\times\) 10−8),whose hypertension association has independent validation, [49, 50] and (2) rs10786736 (p value = 4.45 \(\times\) 10−8), a 3' UTR variant in the CNNM2 and NT5C2 genes on chromosome 10 with hypertension associations reported specifically in East Asian populations [51].
Simulation studies
To assess the performance of SPAmix in terms of type I error rates and powers in admixed population analyses, we simulated genotypes and time-to-event traits of \(n=\text{10,000}\) subjects, mimicking an admixed population of European (EUR) and East Asian (EAS). Other types of traits were simulated in the following sections.
For each genetic variant, we simulated genotypes using ancestry vectors and allele frequency \(\left({q}^{EUR},{q}^{EAS}\right)\) downloaded from the 1000 Genomes Project [52]. Depending on the difference of MAFs (i.e., DiffMAF = \({q}^{EUR}-{q}^{EAS}\)) and the minimal MAF value (i.e., minMAF = \(\text{min}\left({q}^{EUR},{q}^{EAS}\right)\)) in populations EUR and EAS, genetic variants were categorized into 15 groups (Additional file 1: Fig. S11). Two scenarios were used to simulate time-to-event traits. In scenario 1, the event rates in EUR and EAS were the same; and in scenario 2, the event rate in EUR was higher than that in EAS. In each scenario, we simulated traits with low event rates (ERlow), moderate event rates (ERmod), and high event rates (ERhigh). More details about the data simulation can be found in the Data simulation subsection of the online Methods section.
Type I error rates
The empirical type I error rates for the two scenarios based on 109 association tests at a genome-wide significance level 5 \(\times\) 10−8 are presented in Fig. 5a and S12, and Tables S6 and S7. The QQ plots for randomly selected 106 tests are shown in Additional file 1: Fig. S13 and S14.
Empirical type I error rates (a) and powers (b) of SPAmix, SPAmixnoSPA, and SPACox methods at a significance level 5 \(\times\) 10−8. Significance level is 5 \(\times\) 10−8 to evaluate type I error rates and powers of SPAmix, SPAmixnoSPA, and SPACox, and empirical significance levels are used for SPAmixnoSPA|emp and SPACox|emp. In type I error simulations, we conducted 109 tests. In power simulations, we conducted 10.4 tests, and the genetic effect is positively defined as \(\gamma =-2{\text{log}}_{10}\text{MAF}\). We simulated an admixed population of EUR and EAS with a total sample size n = 10,000. Time-to-event traits were simulated in scenario 2, that is, the event rates in EUR are higher than that in EAS. We considered three pairs of event rates (0.1, 0.01) (low event rate, ERlow), (0.3, 0.05) (moderate event rate, ERmod), and (0.5, 0.2) (high event rate, ERhigh) in EUR and EAS, respectively. From left to right, we considered five scenrios of MAF difference (DiffMAF) between EUR and EAS, from left to right, we considered three scenarios of minimal MAF in EUR and EAS (minMAF). More details about the grouping process can be seen in the legend of Additional file1: Fig. S11
If event rates in EUR and EAS were the same (i.e., in scenario 1), SPAmix and SPACox generally performed well and can control type I error rates under all settings of MAFs and event rates. In a limited number of settings, SPACox produced slightly deflated type I error rates (Additional file 1: Table S6). For example, if DiffMAF < < 0, 0.01 \(\le\) MinMAF \(<\) 0.05 (i.e., minMAFmod), and event rates in both EUR and EAS are 0.01 (i.e., ERlow), the empirical type I error rates corresponding to SPAmix and SPACox were 4.7 \(\times\) 10−8 (0.94 \(\alpha\)) and 2.1 \(\times\) 10−8 (0.42 \(\alpha\)), respectively. Meanwhile, for low event rates or low-frequency variants, SPAmixnoSPA cannot control type I error rates, which results in false positive discoveries (Additional file 1: Fig. S12). For example, if DiffMAF ~ 0, MinMAF \(<0.01\) (minMAFlow), and event rates in both EUR and EAS are 0.01 (ERlow), the empirical type I error rate of SPAmixnoSPA was 0.0003 (> 6000 \(\alpha\)). The result suggests that SPA approaches overperform normal distribution approximation in terms of type I error rates control, especially for low event rates and rare variants. This conclusion is consistent with previous studies and the real data analysis, which demonstrates the necessity to use SPA to approximate the null distribution of the score statistics [6, 20, 32, 53, 54].
If event rates in EUR and EAS were different (i.e., in scenario 2), SPAmix can still calibrate p values accurately. Meanwhile, despite incorporating ancestry PCs to fit the null model, SPACox cannot control type I error rates, which resulted in largely deflated or inflated type I error rates depending on DiffMAF and event rates. If DiffMAF < < 0 or DiffMAF < 0, type I error rates were deflated; and if DiffMAF > 0 or DiffMAF > > 0, type I error rates were inflated. The degree of inflation or deflation was greater if the minimal of \({q}^{EUR}\) and \({q}^{EAS}\) (MinMAF) was closer to 0 (Fig. 5a). For example, if DiffMAF > 0, \(\text{min}\left({q}^{EUR},{q}^{EAS}\right)<0.01\) (i.e., minMAFlow), and event rates in EUR and EAS are 0.1 and 0.01 (i.e., ERlow), respectively, the empirical type I error rate of SPACox was 4.1 \(\times\) 10−6 (82.4 \(\alpha\)). This demonstrated that SPACox cannot control type I error rates if the event rates are different across ancestries. In addition, similar to scenario 1, using only normal distribution produced inflated type I error rates. The QQ plots for randomly selected 106 tests (Additional file 1: Fig. S14) showed that only SPAmix produced well-calibrated p values, which further validated the accuracy of SPAmix over SPACox and normal distribution approximation in terms of type I error rates control in the presence of variation of event rates and MAFs across ancestries. In addition, SPAmix well controlled type I error rates when analyzing a sample consisting of 90% EUR individuals and 10% EAS individuals, as illustrated in Additional file 1: Fig. S15.
Empirical power
The empirical powers of SPAmix, SPAmixnoSPA, and SPACox were evaluated in two scenarios based on 104 tests at a significance level \(\alpha =5\times {10}^{-8}\) under the alternative model with positive genetic effects (Fig. 5b and S16) and negative genetic effects (Additional file 1: Fig. S17 and S18), respectively. The genetic effect sizes were the same for different ancestries in this section. More discussion about the effect size heterogeneity can be found in the next sections. The QQ plots were shown in Additional file 1: Fig. S19-S22. If event rates in EUR and EAS were the same (i.e., in scenario 1), SPAmix and SPACox were similarly powerful in all cases, whereas SPACoxnoSPA yield spurious higher powers due to the inflated type I error rates.
In scenario 2, that is, the event rates in EUR and EAS were different, if DiffMAF < < 0 and DiffMAF < 0, SPAmix were more powerful than SPACox. For instance, if genetic effects are positive, DiffMAF < < 0, \(0.01\le\) minMAF \(<0.05\) (i.e., minMAFmod), and event rates in EUR and EAS are 0.3 and 0.05 (i.e., ERmod), respectively, the empirical power of SPAmix (21.02%), which was nearly threefold higher than SPACox (7.25%). This is due to the deflation of SPACox as observed in the type I error rates simulation studies. Under settings of DiffMAF ~ 0, DiffMAF > 0, and DiffMAF > > 0, SPACox produced spurious higher powers as shown in Fig. 5b. To fairly evaluate powers, we also used test-specific empirical significance levels that yield type I error rate \(\alpha =5\times {10}^{-8}\), as shown in Fig. 5b and Additional file1: Fig. S16. Using the empirical significance levels, SPAmixnoSPA (denoted as SPAmixnoSPA|emp) and SPACox (denoted as SPACox|emp) were similarly powerful as SPAmix in all cases.
The empirical powers for negative genetic effects (Additional file 1: Fig. S17 and S18) and QQ plots (Additional file 1: Fig. S21 and S22) display a similar trend as those for positive genetic effects, demonstrating that the above discussion about the powers is also valid. Moreover, the empirical powers when the genetic effects sizes are heterogeneous across ancestry groups are shown in Additional file 1: Fig. S23-S28, which also demonstrated that SPAmix maintained power in such cases. Generally speaking, only if event rates in different populations are nearly identical, SPACox is still accurate. Otherwise, using SPACox can result in a large number of false positive discoveries or power loss. Meanwhile, SPAmix is an efficient and accurate approach in all scenarios. Although we did not evaluate other types of traits, the conclusions are expected to be similar.
Simulation of a three-way admixed population
To validate that SPAmix method is not limited to a two-way admixed scenario, we also simulated genotypes and time-to-event phenotypes of \(n=\text{10,000}\) subjects to mimic a three-way admixed population of European (EUR), African (AFR), and East Asian (EAS). More details about the simulation can be found in the online Method section. The results are consistent to the previous simulation studies. The QQ plots (Additional file 1: Fig. S29 and S30) demonstrate that SPAmix can still well control type I error rates while remaining powerful, and SPACox displayed inflation or deflation depending on the simulation scenarios. Since ancestry proportions were characterized based on individual-specific allele frequency estimation, SPAmix can be straightforwardly applied to analyze populations with more than three ancestries, which is also validated in the UK Biobank real data analysis.
SPAmix can control type I error rates in the case of variance heterogeneity and baseline hazard function heterogeneity
To address potential confounding arising from population stratification, SNP-derived principal components (PCs) were regularly incorporated as covariates within regression models, including linear regression to analyze quantitative traits and Cox PH regression to analyze time-to-event traits. However, this strategy still cannot adjust for population stratification if the population-related heterogeneity cannot be well characterized in the regression model. For example, a basic assumption of the linear regression is the phenotypic variance homogeneity. If the variances corresponding to distinct ancestries are different, the regular linear regression is invalid, even if the SNP-derived PCs were incorporated. Similarly, Cox PH regression implicitly assumes that baseline hazard functions for all individuals are the same, which could also be violated in an admixed population.
We conducted simulation studies (more details can be seen in the online Methods section) to compare regular linear regression and SPAmix-based methods. To mimic the variance heterogeneity in an admixed population, we simulated quantitative traits in which the phenotypic variances in EUR and EAS are different. Additional file 1: Fig. S31 and S32 demonstrated the empirical type I error rates and the QQ plots under the null hypothesis. SPAmix-based approaches can well control type I error rates in all scenarios. Meanwhile, although incorporating SNP-derived PCs as covariates, the regular linear regression methods including Wald test, likelihood ratio test, SAIGE [32], REGENIE [17], and fastGWA [55] were still invalid in the case of variance heterogeneity. Even in the case where the variance heterogeneity is limited (standard derivations in EUR and EAS are 1 and 0.8, respectively), the inflated or deflated type I error rates still occurred. The variance heterogeneity issue has been discussed in previous studies [25].
For time-to-event trait analysis, we also conducted simulation studies to evaluate SPAmix-based methods, SPACox, Wald test, and GATE in the case of heterogeneous baseline hazard functions corresponding to ancestries EUR and EAS (more details can be seen in the online Methods section). The empirical type I error rates and QQ plots are presented in Additional file 1: Fig. S33 and S34. SPACox, Wald test, and GATE had inflated or deflated type one error rates. Meanwhile, SPAmix always controlled type I error rates well in all scenarios, which demonstrates the accuracy and robustness of SPAmix over other regular methods in the case of baseline hazard function heterogeneity in an admixed population analysis.
The above failure of the regular linear regression and Cox regression is due to the model misspecification resulted from the diverse phenotypic distribution corresponding to the distinct ancestries, many of which cannot be well characterized via incorporating PCs. Surprisingly, as a retrospective approach, SPAmix can well address this issue, which suggests that it can perform as a more robust solution in the case of other latent heterogeneities when analyzing complex phenotypes.
In addition, the retrospective idea of SPAmix also contributes to a more robust performance if the phenotypic distribution is extremely unbalanced. To demonstrate it, we simulated an ordinal trait with four groups (more details can be seen in the online Methods section). The empirical type I error rates are presented in Additional file 1: Fig. S35. If the ratio of the sample sizes in these four groups is 100:1:1:1, the regular methods can still control type one error rates. However, if the ratio is 1000:1:1:1, the regular methods had inflated type I error rates. Meanwhile, SPAmix can control type I error rates well in all scenarios. Although the simulation scenario seems too extreme to be available in real data analysis, it still can demonstrate the advantage of SPAmix in terms of robustness and the accuracy.
SPAmix framework can account for ancestry-specific environmental factors
The empirical type I error rates of SPAmix in the cases where environmental factors are ancestry-specific are presented in Additional file 1: Fig. S36 and S37 (more details can be seen in the Additional file 1: Supplementary notes). The results showed that SPAmix can still control type I error rates and maintain powers regardless of incorporating the interactions between environmental factors and genetic-derived PCs or not, while SPACox had inflated or deflated type I error rates.
SPAmix and SAIGE show comparable type I error rates and powers
We conducted simulation studies to evaluate SPAmix and SAIGE in binary trait analysis in terms of type I error rates and powers. The empirical type I error rates based on 105 association tests at a significance level 5 \(\times\) 10−4 and QQ plots are presented in Additional file 1: Fig. S38 and S39. The results demonstrated that both SPAmix and SAIGE can well control type I error rates. The empirical powers based on 1000 tests at a significance level \(\alpha =5\times {10}^{-8}\) under the alternative model are presented in Additional file 1: Fig. S40, which showed that SPAmix and SAIGE were similarly powerful in all cases. Therefore, SPAmix and SAIGE showed comparable type I error rates and powers when analyzing admixed populations.
SPAmix framework can utilize local ancestry and Cauchy combination to maximize power for various cross-ancestry genetic architectures
Tractor proposed a framework to use local ancestry to analyze admixed population, which can enhance power in the case of effect size heterogeneity. SPAmix framework can also straightforwardly incorporate local ancestry (more details can be seen in the online Methods section). We refer to this as SPAmixlocal to distinguish it from the original SPAmix in which the effect sizes for various ancestries were assumed to be equal. In addition, we proposed to use Cauchy combination to calculate one p value (SPAmixCCT) using p values from SPAmix and SPAmixlocal (see online Methods). Extensive simulations were conducted to evaluate SPAmix, SPAmixlocal, SPAmixCCT, and Tractor for binary and quantitative phenotypes.
Under the null hypothesis that there are no genetic effects of two ancestries, SPAmix, SPAmixlocal, and SPAmixCCT can always control type I error rates. However, for binary phenotypes, Tractor had inflated type I error rates if the case–control ratio was unbalanced, i.e., low prevalence (see Fig. 6). This is because Tractor package follows a regular logistic regression to fit binary traits, which does not incorporate SPA and cannot control type I error rates in case of unbalanced phenotypic distribution [32]. For quantitative phenotypes, both SPAmix-based approaches and Tractor can well control type I error rates (see Additional file 1: Fig. S41). The result suggested that incorporating SPA increased the accuracy, especially for unbalanced phenotypic distribution.
Quantile–quantile (QQ) plots of − log10(p) for SPAmix, SPAmixlocal (ance1), SPAmixlocal (ance2), SPAmixCCT, Tractor (ance1), and Tractor (ance2) for binary trait analysis under the null hypothesis. SPAmixlocal (ance1) and Tractor (ance1) are to test for \({\gamma }^{\left(1\right)}=0\), and SPAmixlocal (ance2) and Tractor (ance2) are to test for \({\gamma }^{\left(2\right)}=0\). We simulated a two-way admixed population with a total sample size n = 10,000 and binary phenotypes. We consider two disease prevalences and two MAFs in ancestry 1 from up to bottom and four MAFs in ancestry 2 from left to right. In each case, we conducted 106 tests
The empirical powers were evaluated for a binary trait with a moderate disease prevalence of 0.2, under which both SPAmix-based approaches and Tractor can control type I error rates well. If the genetic effect sizes of different ancestries were equal, SPAmix was always more powerful than Tractor and SPAmixlocal (see Fig. 7a). To mimic the genetic effect size heterogeneity, we fixed the genetic effect size of ancestry 1, i.e., \({\gamma }_{ance1}\), and increased genetic effect size of ancestry 2, i.e., \({\gamma }_{ance2}\). In general, SPAmixlocal demonstrated similar power as Tractor, and its superiority over SPAmix depends on cross-ancestry genetic architectures (see Fig. 7b, S42, and S43). For example, if the MAF in ancestry 2 was high and the corresponding effect size was close to 0, the power of SPAmix was close to 0 but SPAmixlocal and Tractor can still identify the genetic variants. The empirical powers for quantitative traits were consistent and demonstrated in Additional file 1: Fig. S44-S47.
Empirical powers of SPAmix, SPAmixlocal (ance1), SPAmixlocal (ance2), SPAmixCCT, Tractor (ance1), and Tractor (ance2) at a significance level 5 \(\times\) 10−8 for binary trait analysis under the scenario of genetic effect size homogeneity (a) and genetic effect size heterogeneity (b). SPAmixlocal (ance1) and Tractor (ance1) are to test for \({\gamma }^{\left(1\right)}=0\), and SPAmixlocal (ance2) and Tractor (ance2) are to test for \({\gamma }^{\left(2\right)}=0\). We simulated a two-way admixed population with a total sample size n = 10,000. The disease prevalence of the simulated binary phenotypes is 0.2. We considered two MAFs in ancestry 1 from up to bottom and four MAFs in ancestry 2 from left to right. In panel a, the true genetic effect size of ancestries 1 and 2 are the same. In panel b, we fixed the true genetic effect size of ancestry 1 at 0.5 and increased that for ancestry 2. In each case, we conducted 103 tests
The above power comparisons between Tractor and regular approaches have been comprehensively discussed elsewhere [27]. In this paper, we proposed SPAmixCCT which uses Cauchy combination to maximize powers. The simulation results suggested that SPAmixCCT was always close to the most powerful methods for various cross-ancestry genetic architectures, which indicates that SPAmixCCT can be an optimal unified approach (see Fig. 7 and Additional file 1: S42-S47).
SPAmixlocal and SPAmixCCT can control type I errors in the case of variance heterogeneity
We conducted simulation studies to evaluate SPAmixlocal, SPAmixCCT, and Tractor when analyzing quantitative traits in the scenarios of variance heterogeneity. Under the null hypothesis, QQ plots demonstrated that SPAmixlocal and SPAmixCCT can well control type I error rates in all scenarios (see Additional file 1: Fig. S47). However, Tractor cannot control type I error rates in the scenario of variance heterogeneity. When testing for the genetic effect of the ancestry with larger variance, Tractor results in inflated type I error rates. And when testing for the genetic effect of the ancestry with smaller variance, Tractor results in deflated type I error rates. The degree of inflation or deflation is positively correlated with the degree of the variance heterogeneity. The results well demonstrated the superiority of SPAmix framework over Tractor in terms of accurate type I error control for analysis of quantitative traits with variance heterogeneity in an admixed population.
UK Biobank data analysis validated that SPAmix overperformed SPACox
We selected 10,000 individuals in UK Biobank data to mimic simulation studies. More details about the individual selection can be found in the online Methods section. The QQ plots (Additional file 1: Fig. S49) demonstrated that SPAmix successfully identified a large number of variants while well controlling for type I error rates. In contrast, SPACox resulted in anticonservative or conservative p values depending on the correlation between event rate and MAF across ancestries, leading to a large number of false positive discoveries or loss of power. The results clearly demonstrated that SPACox cannot control type I error rates when analyzing a population with both genetic and phenotypic heterogeneity. The trend of inflation or deflation of p-values is completely consistent to the simulation studies. In contrast, SPAmix produced well-calibrated p-values even in the presence of genetic and phenotypic heterogeneity in the population.
Discussion
Most of the current GWAS studies have been conducted on individuals of European descent, resulting in a lack of diversity and underpowered analyses for other populations. As a result, there is an urgent need for more diverse representation in GWAS studies to enable comprehensive and accurate analysis of genetic variants across different populations.
Admixed populations are groups of individuals whose genetic ancestries are from two or more than two ancestral populations. The population mixture provides unique opportunities to study genetic diversity, disease susceptibility, and complex traits [56,57,58,59,60]. However, the admixed population analysis is technically challenging as it requires addressing the complicated patterns of genetic and phenotypic diversities that arise from distinct genetic backgrounds and environmental exposures. While using SNP-derived PCs are commonly used to account for population stratifications, the simulation studies have shown that it could be inadequate which result in inflation when analyzing admixed samples in some cases. The variance heterogeneity of a quantitative trait is a classic example where regular methods may be ineffective. Several methods have been proposed for GWAS in admixed samples, but they are mostly designed for quantitative traits and binary traits. Scalable and accurate methods for complex traits with intricate structures (e.g., time-to-event and longitudinal traits) are still urgently needed.
In this work, we propose SPAmix, a scalable, accurate, and universal analysis framework to conduct GWASs in a biobank-scale dataset consisting of subjects with complicated genetic structure. SPAmix estimates individual-specific allele frequencies to characterize the individual-specific genetic ancestries using information from the SNP-derived PCs and raw genotype data in a model-free approach. Thus, it does not necessitate accurate specification of and the availability of appropriate reference population panels for the ancestries contributing to the individual, which might be unknown or not well defined [61]. SPAmix is a variant-level analysis and the core codes are written in C + +, which supports a rigorous control of memory usage. In addition, as a retrospective approach, SPAmix is not sensitive to model misspecification (e.g., missed or biased confounder-trait associations) and trait-based ascertainment [28]. In the case of model misspecification, SPAmix can control type I error rates well, but also could be less powerful, which could be interesting to explore in the future study.
In general, the novelty of the state-of-art methodology gives SPAmix the below advances in GWAS. First, in the SPAmix framework, only modeling residuals after fitting a null model is required and thus it can be easily used to analyze any type of complex trait. In the simulation studies and real data analysis, we demonstrated it via binary, quantitative, time-to-event, ordinal, and longitudinal traits. We note that REGENIE recently released a new version to support time-to-event trait. Preliminary analysis results demonstrated that the time-to-event trait analysis cannot control time one error rates, similar to SPACox, GATE, and Wald test (see Additional file 1: Supplementary notes). Second, SPAmix calibrates p values using a hybrid strategy to combine regular normal distribution approximation and SPA. Using SPA can more accurately approximate the null distribution of the test statistics, especially if the phenotypic distribution is unbalanced. Third, SPAmix only requires fitting one null model for a genome-wide analysis. In addition, SPAmix also uses some technical strategies, such as partial normal distribution approximation in SPA, to further reduce the computation time. UK Biobank real data analysis demonstrated that SPAmix is scalable to analyze a large-scale biobank with hundreds of thousands of individuals. Importantly, SPAmix framework does not need separating subjects into multiple ancestry-homozygous groups, and all subjects can be included into one genome-wide analysis. Finally, SPAmix has been incorporated into our well-developed R package GRAB (see URLs), in which PLINK and BGEN file formats are supported.
The original SPAmix assumes that the genetic effects of different ancestries are equal. Meanwhile, if the local ancestry information is available, SPAmix can also be straightforwardly extended to SPAmixlocal which generally performs similar to Tractor and be more accurate when analyzing a quantitative trait with phenotypic variance heterogeneity and a binary trait with low case–control ratio. In addition, we also proposed SPAmixCCT to combine the original SPAmix and SPAmixlocal to maximize statistical powers for various cross-ancestry genetic architectures. Its ability to account for ancestry-specific effect sizes is a valuable addition to regular approaches [17, 18, 32].
There are several limitations in SPAmix. First, SPAmix assumes that genotype in subpopulations follows a binomial distribution, which could be violated. Interestingly, in UK Biobank data analysis, although we used imputation dosage data and did not filter out variants based on HWE test, it seems no inflation/deflation were observed, which releases the concerns of the HWE assumption on SPAmix. More discussions about the binomial distribution assumption [62] can be found in Additional file 1: Supplementary notes. Second, SPAmix cannot adjust for family relatedness and thus only unrelated individuals can be analyzed. In the near future, we plan to extend SPAmix to account for sample relatedness via incorporating genetic relationship matrix (GRM) to adjust for the variance of the score statistics. Third, SPAmix cannot estimate genetic effect size via fitting an alternative model. SPAmix can serve as a screening process to prioritize variants. If genetic effect size is required for the subsequent analyses, users can select the top variants and then fit alternative models for these variants. Alternatively, we proposed an approach to estimate effect size, which adopted the strategies utilized in GATE [18], BOLT-LMM [63], and SAIGE [32] and performs comparably as the Wald test and GATE (see Additional file 1: Supplementary notes). Finally, the current version of SPAmix only supports analyzing diploid species such as human.
Currently, quantitative and binary traits remain the most commonly used types of traits in GWAS. Meanwhile, there is a growing trend toward utilizing traits with intricate structures, such as time-to-event, longitudinal, and ordinal traits, as they can convey essential (but often overlooked) information. Unfortunately, applying well-developed statistical models and testing methods in large-scale GWAS is usually challenging due to high computational burden and other practical challenges in genetics studies. To bridge the gap between statistics and GWAS, we previously introduced SPACox as a universal analysis framework to calibrate p values. And currently, we propose SPAmix to address the population stratification which is a significant practical challenge in GWAS, especially for cross-ancestry analyses. Through simulation studies and real data analyses, we demonstrated that SPAmix can efficiently analyze a dataset with over 400,000 individuals and adjust for type I error rates even if the phenotypic distribution is extremely unbalanced. Our method provides a scalable, accurate, and universal solution for GWAS in a large-scale biobank dataset including individuals from multiple genetic ancestries. UK Biobank may not be the best scenario as the majority of samples of the data are from a single homogeneous ancestry group. More ancestrally diverse and admixed dataset, for example, the AllofUs Program [64], the Trans-Omics for Precision Medicine (TOPMed) Program [65], the Mexican Biobank [66], and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) [67] could be better choices. The preliminary analysis using AllofUS dataset validates the robustness of SPAmix, and we anticipate the discovery of more valuable insights that may result from utilizing SPAmix.
Conclusions
We propose SPAmix as a new analysis framework which is scalable to analyze biobank-scale data, is accurate to analyze phenotypes with an unbalanced distribution and rare variants, and is universal for complex traits with intricate structures (e.g., time-to-event trait and longitudinal trait) and a multi-way admixed population.
URLs
GRAB (version 0.1.1), https://CRAN.R-project.org/package=GRAB. trajGWAS (version 0.4.6), https://openmendel.github.io/TrajGWAS.jl/stable/. gwasurvivr (version 1.18.0), https://bioconductor.org/packages/release/bioc/html/gwasurvivr.html. SAIGE (version 1.1.9), https://saigegit.github.io/SAIGE-doc/. REGENIE (version 3.3), https://rgcgithub.github.io/. GCTA (version 1.94.1) http://cnsgenomics.com/software/gcta/#fastGWA. Tractor (version 0.0.1), https://github.com/Atkinson-Lab/Tractor. UK Biobank analysis summary statistics, https://zenodo.org/record/8343684.
Methods
Model residuals of linear regression and logistic regression
Linear regression and logistic regression are regular models to fit quantitative traits and binary traits. For subject i, we define model residuals \({R}_{i}\) as the difference between response variable \({Y}_{i}\) and its estimated expectation \(\widehat{\text{E}\left({Y}_{i}\right)}\) under the null model. Suppose that the study cohort consists of \(n\) subjects, we let \({\varvec{G}}={\left({G}_{1},\dots ,{G}_{n}\right)}^{T}\) denote the genotype vector of a genetic variant. The score statistic to associate trait of interest and genetic variant is \(S=\sum_{i=1}^{n}{G}_{i}{R}_{i}={{\varvec{G}}}^{T}{\varvec{R}}\), where \({\varvec{R}}={\left({R}_{1},\dots ,{R}_{n}\right)}^{T}\).
Cox proportional hazard model and SPACox
For subject \(i\le n\), Cox proportional hazard model specifies the hazard function \(\uplambda \left(t;{X}_{i},{G}_{i}\right)\) for the failure (i.e., event occurs) time \({T}_{i}^{*}\) associated with \({G}_{i}\) and a \(p\times 1\) vector of covariates \({X}_{i}\) in the form of
where \({\uplambda }_{0}\left(t\right)\) is the baseline hazard function, \({X}_{i}\) is a \(k\times 1\) vector of covariates including age, sex, SNP-derived principal components (PCs), and the other confounding factors, \(\beta ={\left({\beta }_{1},\dots ,{\beta }_{k}\right)}^{T}\) is the coefficient vector for the covariates, and \(\gamma\) is the genetic effect \(.\) The observed time-to-event phenotype is \(\left({T}_{i},{\delta }_{i},{X}_{i},{C}_{i}\right)\), where \({C}_{i}\) denotes the censoring time, \({T}_{i}=\text{min}\left({T}_{i}^{*},{C}_{i}\right)\) denotes the observed time-to-event, \({\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)\) indicates that failure is observed, and \(\text{I}\left(.\right)\) is an indicator function.
To test for the null hypothesis \({\text{H}}_{0}: \gamma =0\), we fit the null Cox PH model as
and estimate the coefficients \(\widehat{\beta }\). Score statistics \(S=\sum_{i=1}^{n}{G}_{i}{R}_{i}={{\varvec{G}}}^{T}{\varvec{R}}\), where \({R}_{i}, i\le n\) is the martingale residual for subject \(i\) and \(\sum_{i=1}^{n}{R}_{i}=0\). For time-to-event data analysis, the approaches based on Cox PH model is more powerful than regular logistic regression-based methods [6, 12]. More details, including model fitting and theoretical derivations about the score statistics, can be found in previous work [6].
SPACox is a prospective approach that considers martingale residuals \({R}_{1},\dots ,{R}_{n}\) as independently and identically distributed (i.i.d.) random variables [6]. A hybrid strategy, including normal distribution approximation and saddlepoint approximation (SPA), is then used to calibrate p values. Since the null model is the same for all genetic variants, the model fitting and the empirical approximation of the distribution of the martingale residuals are only required once across the genome-wide analysis, which facilitates SPACox to be computationally efficient for time-to-event trait analysis in GWAS. In addition, SPACox is accurate as SPA uses an entire cumulant generating function (CGF) to approximate the null distribution of score statistics.
Longitudinal trajectories and trajGWAS
The empirical SPA strategy proposed by SPACox can be easily extended to other types of traits. For example, Ko et al. proposed trajGWAS in which the empirical SPA strategy was employed to analyze longitudinal trajectories [31]. Fitting a null linear mixed model to longitudinal measurements, score statistics \({S}_{{\beta }_{g}}={R}_{{\beta }_{g}}^{T}G\) and \({S}_{{\tau }_{g}}={R}_{{\tau }_{g}}^{T}G\) are then used to test the mean effect and within-subject (WS) variability for each genetic variant, in which \({R}_{{\beta }_{g}}\) and \({R}_{{\tau }_{g}}\) are the corresponding model residuals. As trajGWAS follows a similar empirical strategy as SPACox, its performance is expected to be similar as SPACox and thus we did not conduct simulation studies for longitudinal trait analysis. More details about the phenotype data clean can be found in the previous study [31]. The only exception is to retain all individuals, instead of only retaining White British individuals.
Ordinal traits and proportional odds logistic regression model
Ordinal traits are widely available in biobanks to measure human behaviors, satisfaction, and preferences. We let J denote the number of category levels. For subject \(i\le n\), let \({Y}_{i}\) = 1, 2, …, J denote the ordinal phenotype. To characterize ordinal traits, the proportional odds logistic regression models as below has been applied in GWAS [7].
where \({X}_{i}\) is a vector of covariates of confounding factors including age, sex, SNP-derived principal components (PCs), etc., \({G}_{i}\) is the genotype, \(\beta\) is the coefficient vector for the covariates, and \(\gamma\) is the genetic effect \(.\) The cumulative probability of \({Y}_{i}\le j\) conditional on \({X}_{i}\) and \({G}_{i}\) is denoted by \({\nu }_{ij}=\text{Pr}\left({Y}_{i}\le j|{X}_{i},{G}_{i}\right)\), and the cutpoints \({\varepsilon }_{j}, j\le J\) are to categorize the data. Following the above model, the score statistics to test parameter \(\gamma\) is \(S=\sum_{i=1}^{n}{G}_{i}{R}_{i}\). More details, including model fitting, theoretical derivations about the score statistics, and the model residuals \({R}_{i},i\le n\) can be found in previous work [7].
SPAmix uses individual-specific allele frequency to adjust for admixed population
The rationality of the empirical SPA strategy in SPACox heavily relies on the identical distribution assumption of the residuals. However, this assumption could be violated if the study cohort is from multiple genetic ancestries. Instead of characterizing the complicated ancestry-specific effect on traits, we used a retrospective approach.
In this section, we use Cox proportion hazard model and time-to-event phenotype as an example to demonstrate SPAmix. For each genetic variant, SPAmix uses both normal distribution and SPA to approximate the null distribution of score statistics \(S=\sum_{i=1}^{n}{G}_{i}{R}_{i}\) conditional on time-to-event phenotypes \({\varvec{T}}={\left({T}_{1},{T}_{2},\cdots ,{T}_{n}\right)}^{T}\), indicators \({\varvec{\delta}}={\left({\delta }_{1},{\delta }_{2},\cdots ,{\delta }_{n}\right)}^{T}\), and covariates \({\varvec{X}}\), or equivalently, conditional on martingale residuals \({R}_{i},i\le n\). In other words, SPAmix treats martingale residuals \({R}_{i},i\le n\) as fixed values and genotypes \({G}_{i},i\le n\) as random variables.
There have been previous studies considering genotype as a random variable following a binomial distribution but most of them assume the same allele frequency for different subjects [29, 68]. Here, we relax the assumption and use SNP-derived PCs and raw genotype to estimate the individual-specific allele frequency for each genetic variant. Intuitively, for individuals in the same ancestry, the estimated allele frequencies are always close to each other. Meanwhile, for individuals in different ancestries, the estimated allele frequencies could be different. In either case, the individual-specific allele frequency estimation is more accurate, since in the real world, well-delimited genetic populations are rarely found within species, and there is often heterogeneity within “ancestry groups” and a continuous spectrum of relatedness across them [69]. It is crucial to embrace a multidimensional and continuous understanding of ancestry [70]. The notion of homogeneous populations merely offers an approximate representation of human data and fails to capture the true distribution of genetic variants and the continuous movement of people which leads to varying degrees of mixture among global populations [69,70,71,72,73,74,75]. The individual-specific allele frequency estimation remains robust, regardless of ancestral diversity populations or admixture within populations.
Individual-specific allele frequency estimation
In this paper, we assume that individuals come from different populations with varying allele frequencies (AFs). We employ the term “individual-specific AFs” as defined by Conomos et al. [61] to describe the variation in genotype distribution among individuals from different populations. For a given genetic variant, let \({\varvec{q}}={\left({q}_{1},\dots ,{q}_{n}\right)}^{T}\) represent the individual-specific allele frequencies for \(n\) subjects, where \({q}_{i}\) denotes the allele frequency of the variant for subject \(i, i\le n\). Under the assumption of Hardy–Weinberg equilibrium (HWE), the genotype \({G}_{i}\) is modeled as following an independent binomial distribution Binom(2, \({q}_{i}\)) [61, 68].
If all \(n\) subjects belong to a homogeneous population, the allele frequencies are identical across individuals, i.e., \({q}_{1}={q}_{2}=\dots ={q}_{n}\). In this case, \({\varvec{q}}\) can be estimated as:
However, the above estimation becomes inaccurate if the subjects are from distinct subpopulations or an admixed population [53, 76]. Ancestry PCs are commonly used to characterize the population structure [76]. While PCs are typically included as covariates in model fitting, they also be leveraged to estimate allele frequencies [61]. Here, we propose a hybrid strategy to estimate \({\varvec{q}}\) using raw genotype data and PCs [61].
Based on a linear regression model, \({\varvec{q}}\) can be estimated as:
where \({\widetilde{{\varvec{X}}}}_{PC}=\left(1, {{\varvec{X}}}_{PC}\right)\) is a matrix consisting of a vector of ones and the top PCs [61]. However, \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\) may not always be optimal, as some elements of the vector \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\) can fall outside the interval [0,1], particularly for rare variants. To address this issue, we introduced a cutoff \({q}_{c}\) (default: \({q}_{c}=0\)) and update \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\) as
Alternatively, logistic regression model can be employed to estimate \({\varvec{q}}\). Here, the response variable is the indicator \({\text{I}}_{\left[{G}_{i}\ge 0.5\right]}\), and the predictors are the top PCs. The probability \(\text{Pr}\left({G}_{i}\ge 0.5\right)\) is estimated as \(\sigma ({\widehat{\eta }}_{i})\), where \({\widehat{\eta }}_{i}\) is the estimated linear predictor, and \(\sigma \left(x\right)=1/(1+\text{exp}(-x))\). Under HWE, the allele frequency is estimated as:
Notably, all elements of \({\widehat{{\varvec{q}}}}_{{\varvec{g}}{\varvec{l}}{\varvec{m}}}\) lie within the interval [0,1], making it more robust than \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\) especially for low-frequency and rare variants.
To balance accuracy and computational efficiency, we propose a hybrid strategy. First, we fit a linear regression model to calculate \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\). If the proportion of the elements in \({\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}\) that fall within [0,1] exceeds a predefined threshold \(r\) (default: 0.9), we use \(\widehat{{\varvec{q}}}={\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}{\varvec{c}}}\) as the final estimate. Otherwise, we extract PCs with p values less than 0.05 in the linear regression and compute \(\widehat{{\varvec{q}}}={\widehat{{\varvec{q}}}}_{{\varvec{g}}{\varvec{l}}{\varvec{m}}}\) as the final estimate.
Normal distribution approximation and saddlepoint approximation
Given individual-specific allele frequencies, the mean and variance of \(S\) conditional on \(\left({\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\) under the null hypothesis H0 are
and
respectively. For normal distribution approximation, we replace \({\varvec{q}}\) with the estimated \(\widehat{{\varvec{q}}}\) and then use a normal distribution with a mean of \(\widehat{\mu }=\sum_{i=1}^{n}2{R}_{i}{\widehat{q}}_{i}\) and a variance of \({\widehat{\sigma }}^{2}=\sum_{i=1}^{n}{R}_{i}^{2}\cdot 2{\widehat{q}}_{i}\left(1-{\widehat{q}}_{i}\right)\) to approximate the null distribution of score statistics \(S\). Under the null hypothesis, the probability \(\text{Pr}\left(S<s|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)=\text{Pr}\left(S-\widehat{\mu }<s-\widehat{\mu }|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\) can be estimated by \(\Phi \left\{\left(s-\widehat{\mu }\right)/\widehat{\sigma }\right\}\), where \(\Phi \left\{.\right\}\) is the cumulative distribution function (CDF) of the standard normal distribution.
The normal distribution approximation only relies on the first two cumulants (i.e., mean and variance) of the distribution. Thus, it cannot control type I error rates at a stringent genome-wide significance level if the event rate is low, particularly when testing low-frequency variants [15]. To address this issue, we propose a retrospective SPA method which utilizes knowledge of all cumulants expressed through values of the CGF. Compared to the normal distribution approximation, SPA is more accurate if the observed score statistic is at the tails of the distribution. More importantly, SPA is not restricted to a symmetric distribution [77, 78].
Suppose that genotype \({G}_{i},i\le n\) follow a binomial distribution Binom \((2,{q}_{i})\), the estimated moment generating function (MGF) of \({G}_{i}\) and its derivatives are
The corresponding cumulant generating function (CGF) is \({\widehat{K}}_{{G}_{i}}\left(t\right)=\text{ln}{\widehat{M}}_{{G}_{i}}(t)\), and its derivates are
Hence, under the null hypothesis, the estimated CGF of \(S\) conditional on \(\left({\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\), or equivalently \({\varvec{R}}\), is
and its first and second derivatives are
For any observed score \(s\) and observed martingale residuals \({R}_{i}, i\le n\), we first calculate \(\widehat{\upzeta }\) such that \({\widehat{H}}^{\prime}\left(\widehat{\upzeta }\right)=s\), then we calculate \(\widehat{\omega }=\text{sgn}\left(\widehat{\upzeta }\right)\sqrt{2\left(\widehat{\upzeta }s-\widehat{H}\left(\widehat{\upzeta }\right)\right)}\) and \(\widehat{\nu }=\widehat{\upzeta }\sqrt{{\widehat{H}}^{{\prime}{\prime}}\left(\widehat{\upzeta }\right)}\). Under the null hypothesis, the conditional probability \(\text{Pr}\left(S<s|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\) can be estimated by \(\Phi \left\{\widehat{\omega }+\frac{1}{\widehat{\omega }}\cdot \text{log}\left(\frac{\widehat{\nu }}{\widehat{\omega }}\right)\right\}\), where \(\Phi \left\{.\right\}\) is the CDF of the standard normal distribution. The proposed method is a straightforward extension of Barndorff-Nielsen’s formula [79].
To obtain p values efficiently while avoiding false positive discoveries, we adopt a hybrid strategy to combine normal distribution approximation and saddlepoint approximation. If the absolute value of the centered observed statistics \(\left|s-\widehat{\mu }\right|<r\widehat{\sigma }\), where \(r=2\) is a pre-specified value [4, 6, 15, 54], we use normal distribution approximation to obtain the p value. Otherwise, the saddlepoint approximation method is used to calibrate p values in tail areas. SPAmix outputs a two-sided p value of \({p}_{l}+{p}_{r}\), where the left-tailed and right-tailed p values are \({p}_{l}=\widehat{\text{Pr}}\left(S-\widehat{\mu }<-1\cdot \left|s-\widehat{\mu }\right||{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\) and \({p}_{r}=\widehat{\text{Pr}}\left(S-\widehat{\mu }>\left|s-\widehat{\mu }\right||{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)\), respectively.
Partial normal distribution approximation in SPA
To further reduce the computation time, we used a partial normal distribution in SPA, similar as fastSPA [15]. We check the residual distribution and define outliers. Suppose that the 25th and 75th percentiles of the residuals are \({\alpha }_{0.25}\) and \({\alpha }_{0.75}\), respectively, the interquartile range (IQR) is \({\alpha }_{0.75}-{\alpha }_{0.25}\). If the residual > \({\alpha }_{0.75}+\text{IQR}\cdot \gamma\) or < \({\alpha }_{0.25}-\text{IQR}\cdot \gamma\), then the residual is an outlier. In this paper, we use \(\gamma =1.5\).
Given the definitions of outlier, we reformulate the score statistics
where the residuals of the first \({n}_{1}\) individuals are outliers and the others are not. We let
then, the MGF and CGF are calculated for \({S}_{o}\) as in the previous sections, and normal distribution approximation is applied to calculate the MGF and corresponding CGF for \({S}_{non}.\) The partially normal approximation can greatly reduce computational complexity. More details on the accuracy trade-offs of the partial normal distribution approximation and the use of IQR for categorizing residuals into outliers and non-outliers can be found in Additional file 1: Supplementary notes.
SPAmixlocal tests for ancestry-specific genetic effect using ancestry-specific score statistics
Suppose that the study cohort consists of \(n\) subjects from an admixed population composed of \(K\) ancestries, we let \({\varvec{G}}={\left({G}_{1},\dots ,{G}_{n}\right)}^{T}\) denote the genotype vector of a genetic variant and \({{\varvec{G}}}^{\left(k\right)}={\left({G}_{1}^{\left(k\right)},\dots ,{G}_{n}^{\left(k\right)}\right)}^{T},k\le K\), denote the vector of the number of copies coming from the kth ancestry, i.e., the genotypes from the kth ancestry. Instead of associating genotype \({\varvec{G}}\) to a phenotype of interest, Tractor and SPAmixlocal are designed to test for ancestry-specific genetic effect, i.e., to associate ancestry-specific genotypes \({{\varvec{G}}}^{\left(k\right)}\) and the phenotype of interest. The original SPAmix calibrates p values by approximating the null distribution of score statistics \(S=\sum_{i=1}^{n}{R}_{i}{G}_{i}\) in which \({R}_{i}\) is the null model residual for subject \(i\). And we extend it to SPAmixlocal in which ancestry-specific p values are calculated corresponding to the ancestry-specific score statistics \({S}^{\left(k\right)}=\sum_{i=1}^{n}{R}_{i}{G}_{i}^{\left(k\right)}\).
For subject \(i, i\le n\), let \({h}_{i}^{\left(k\right)}\) denote the number of haplotypes, i.e., local ancestry count, of the kth ancestry at one locus, and \({{{\varvec{h}}}^{\left({\varvec{k}}\right)}=\left({h}_{1}^{\left(k\right)},\dots ,{h}_{n}^{\left(k\right)}\right)}^{T}\) denotes the corresponding vector for all subjects. Similar to the original SPAmix, suppose that the ancestry-specific allele frequencies \({q}^{\left(1\right)},\dots ,{q}^{\left(K\right)}\) are available, the mean and the variance of \({S}^{\left(k\right)}\) under the null hypothesis are
and
respectively. For SPAmixlocal, the ancestry-specific allele frequency \({q}^{\left(k\right)}\) is estimated by using \({\widehat{q}}^{\left(k\right)}=\sum_{i=1}^{n}{G}_{i}^{\left(k\right)}/\sum_{i=1}^{n}{h}_{i}^{\left(k\right)}\).
Similar to SPAmix, SPAmixlocal also applies a hybrid strategy including both normal distribution approximation and SPA to calibrate p values. For normal distribution approximation, the null distribution of \({S}^{\left(k\right)}\) is approximated using a normal distribution with a mean of \({\widehat{\mu }}^{\left(k\right)}={E}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}},{\widehat{q}}^{\left(k\right)}\right)\) and a variance of \({\left({\widehat{\sigma }}^{\left(k\right)}\right)}^{2}=Va{r}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}},{\widehat{q}}^{\left(k\right)}\right)\). Under the null hypothesis, the probability \(\text{Pr}\left({S}^{\left(k\right)}<s|{\varvec{R}}\right)=\Phi \left\{\left(s-{\widehat{\mu }}^{\left(k\right)}\right)/{\widehat{\sigma }}^{\left(k\right)}\right\}\) where \(\Phi \left\{.\right\}\) is the CDF of a standard normal distribution. For SPA, we assume that ancestry-specific genotype \({G}_{i}^{\left(k\right)}, i\le n\) follow a binomial distribution Binom \(({h}_{i}^{\left(k\right)},{\widehat{q}}^{\left(k\right)})\) in which \({h}_{i}^{\left(k\right)}=0, 1,\) or \(2\). Then, the estimated MGF and CGF of \({G}_{i}^{\left(k\right)}\) are
and \({\widehat{K}}_{{G}_{i}^{\left(k\right)}}\left(t\right)=\text{ln}{\widehat{M}}_{{G}_{i}^{\left(k\right)}}(t)\). And conditional on \({\varvec{R}}\), the estimated CGF of \({S}^{\left(k\right)}\) under the null hypothesis is
For ancestry-specific observed score statistics \({s}^{\left(k\right)}\), we followed the similar process as in the original SPAmix to calculate the first and the second derivatives of \({\widehat{H}}^{\left(k\right)}\left(t\right)\) and to calibrate p values based on Barndorff-Nielsen’s formula. The hybrid strategy to combine normal distribution approximation and SPA is the same as the original SPAmix.
Optimization guidelines for SPAmixlocal
SPAmixlocal is a framework designed to test for ancestry-specific genetic effects and operates through a two-step process. In step 1, SPAmixlocal requires fitting a null model to calculate model residuals, in which confounding factors such as age, sex, SNP-derived principal components (PCs), and other confounders are incorporated as covariates.
In step 2, SPAmix associates the traits of interest with ancestry-specific genotypes of single genetic variant by approximating the null distribution of ancestry-specific score statistics \({S}^{\left(k\right)}=\sum_{i=1}^{n}{R}_{i}{G}_{i}^{\left(k\right)}\), where \(k\) denotes the ancestry index. SPAmixlocal efficiently estimates ancestry-specific allele frequencies. SPAmixlocal uses a hybrid strategy combining normal distribution approximation and SPA to approximate the distribution of ancestry-specific score statistics under the null hypothesis. The approximation relies on the assumption that ancestry-specific genotype \({G}_{i}^{\left(k\right)}\) follows a binomial distribution Binom(2, \({\widehat{q}}^{\left(k\right)}\)). If the absolute value of the centered observed ancestry-specific statistics \(\left|{s}^{(k)}-{\widehat{\mu }}^{\left(k\right)}\right|<r{\widehat{\sigma }}^{\left(k\right)}\), normal distribution approximation is used to obtain the p value. Otherwise, the saddlepoint approximation is used to calibrate p values.
A key distinction between SPAmixlocal and the original SPAmix is the requirement of local ancestry count \({{{\varvec{h}}}^{\left(k\right)}=\left({h}_{1}^{\left(k\right)},\dots ,{h}_{n}^{\left(k\right)}\right)}^{T}\) for ancestry-specific analysis in admixed populations. Importantly, SPAmixlocal estimates ancestry-specific allele frequencies efficiently using the formula \({\widehat{q}}^{\left(k\right)}=\sum_{i=1}^{n}{G}_{i}^{\left(k\right)}/\sum_{i=1}^{n}{h}_{i}^{\left(k\right)}\). As it eliminates regression-based allele frequency estimation, which is the most computationally intensive step in the original SPAmix framework. Consequently, the incorporation of local ancestry data does not significantly increase the computational burden while enhancing statistical power for detecting ancestry-specific genetic effects.
This streamlined process makes SPAmixlocal a powerful and efficient tool for testing ancestry-specific genetic effects in admixed populations.
SPAmixCCT combines p values of SPAmix and SPAmixlocal to maximize powers
Suppose that the admixed population is composed of \(K\) ancestries. SPAmixlocal outputs K ancestry-specific p values, and the original SPAmix outputs one p value assuming that the genetic effects are the same for all ancestries. We proposed SPAmixCCT in which Cauchy combination is used to combine the K + 1 p values. Benefit from the advantage of Cauchy combination, the p value of SPAmixCCT can still control type one error rate well and remain powerful for both homogeneous and heterogeneous ancestry-specific genetic effect sizes.
Data simulation
For subject \(i\le n\), we let \({{\varvec{a}}}_{{\varvec{i}}}={\left({a}_{i}^{EUR}, {a}_{i}^{EAS}\right)}^{T}\) denote an ancestry vector, where \(1\ge {a}_{i}^{EUR}\ge 0\) and \(1\ge {a}_{i}^{EAS}\ge 0\) are to represent ancestry proportions of EUR and EAS, respectively, and \({a}_{i}^{EUR}+{a}_{i}^{EAS}=1\). We assumed that the first \(n/2=5000\) subjects were from a EUR-dominant community with an ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\) following a Dirichlet(9, 1) distribution, and the remaining \(n/2=5000\) subjects were from an EAS-dominant community with an ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\) following a Dirichlet(1, 9) distribution [61, 80,81,82,83]. The distribution of \({{\varvec{a}}}_{{\varvec{i}}},i\le n\) can be found in Additional file 1: Fig. S50. The allele frequency diversity between EUR and EAS commonly exists in real data (Additional file 1: Fig. S11).
Based on the difference of MAFs corresponding to the two populations, i.e., DiffMAF = \({q}^{EUR}-{q}^{EAS}\), we categorized variants into five groups: DiffMAF < < 0, DiffMAF < 0, DiffMAF ~ 0, DiffMAF > 0, and DiffMAF > > 0. Based on the minimal MAF value, i.e., \(\text{min}\left({q}^{EUR},{q}^{EAS}\right)\), we categorized variants into three groups of minMAFlow, minMAFmod, and minMAFhigh. Thus, all variants were categorized into 15 (5 \(\times\) 3) groups. More details of the grouping process and the proportion of variants in each group can be found in Additional file 1: Fig. S11. In each group, we randomly sampled 1000 pairs of \(\left({q}^{EUR},{q}^{EAS}\right)\) and simulated 1000 genetic variants correspondingly. For each variant, the genotype of individual \(i\) follows a Binom(2, \({q}_{i}\)) distribution, where \({q}_{i}={a}_{i}^{EUR}{q}^{EUR}+{a}_{i}^{EAS}{q}^{EAS}\). In addition, a total of 100,000 common SNPs (\({q}^{EUR}+{q}^{EAS}>0.1\)) were simulated to calculate SNP-derived PCs (Additional file 1: Fig. S51).
We followed a two-step process to simulate time-to-event traits as phenotypes. [6] We first simulated an underlying failure time \({T}_{i}^{*}\) and a censoring time \({C}_{i}\). The censoring time \({C}_{i}\) was simulated following a Weibull distribution with a shape parameter of 1 and a scale parameter of 0.15. The underlying failure time \({T}_{i}^{*}\) was simulated based on a Cox PH model with a Weibull baseline hazard function as \({T}_{i}^{*}=\lambda \sqrt{-\text{log}{U}_{i}/\text{exp}\left({\eta }_{i}\right)}\), where \({U}_{i}\) was simulated following a uniform distribution U(0,1), and linear predictor \({\eta }_{i}={\beta }_{1}{X}_{i1}+0.5{X}_{i2}+0.5{X}_{i3}+\gamma {G}_{i}\) was to characterize the effect from three covariates \({X}_{i1}\), \({X}_{i2}\), \({X}_{i3}\), and genotype \({G}_{i}\). Covariate \({X}_{i1}={a}_{i}^{EAS}=1-{a}_{i}^{EUR}\) was the proportion of EAS ancestry, \({X}_{i2}\) was simulated following a Bernoulli(0.5) distribution, and \({X}_{i3}\) was simulated following a standard normal distribution. Time-to-event phenotype \({T}_{i}=\text{min}\left({T}_{i}^{*}, {C}_{i}\right)\). Indicator \({\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)\) is to indicate whether event occurs. The scale parameter \(\lambda\) and the coefficient \({\beta }_{1}\) were selected to obtain the desired event rates \({\rho }_{EUR}\) and \({\rho }_{EAS}\) in EUR and EAS populations. Here, \({\rho }_{EUR}\) and \({\rho }_{EAS}\) are the expected event rates for a pure EUR population (i.e., \({X}_{i1}=0, i\le n\)) and pure EAS population (i.e., \({X}_{i1}=1, i\le n\)), respectively.
To assess type I error rates, we simulated phenotypes under two scenarios, either of which followed the null hypothesis of no genetic effects (\(\text{i}.\text{e}., \gamma =0\)).
-
Scenario 1. The event rates in EUR and EAS were the same, that is, \({{\varvec{\rho}}}_{{\varvec{E}}{\varvec{U}}{\varvec{R}}}={{\varvec{\rho}}}_{{\varvec{E}}{\varvec{A}}{\varvec{S}}}\). We consider three events rates including 0.01 (low event rate, ERlow), 0.1 (moderate event rate, ERmod), and 0.3 (high event rate, ERhigh).
-
Scenario 2. The event rates in EUR were higher than those in EAS, that is, \({{\varvec{\rho}}}_{{\varvec{E}}{\varvec{U}}{\varvec{R}}}>{{\varvec{\rho}}}_{{\varvec{E}}{\varvec{A}}{\varvec{S}}}\). We considered three pairs of event rates \(\left({\rho }_{EUR},{\rho }_{EAS}\right)=\) (0.1, 0.01) (low event rate, ERlow), (0.3, 0.05) (moderate event rate, ERmod), and (0.5, 0.2) (high event rate, ERhigh).
We did not consider the scenario in which the event rates in EAS were higher than those in EUR since it is exactly the opposite direction of scenario 2 and the results can be expected. In either scenario, 1,000,000 datasets of phenotypes and covariates were simulated, and thus a total of \({10}^{9}\) tests were conducted for each pair of MAF group and event rate, and type I error rates were evaluated at a significance level of \(\alpha =5\times {10}^{-8}\).
To evaluate powers, we simulated time-to-event phenotypes under the alternative hypothesis using linear predicator
where 10 SNPs were randomly selected from each of the 15 groups to consist of 150 causal SNPs, \({G}_{ij}\) is the genotype value for the \({j}^{th}\) causal SNP. To evaluate powers under different directions of genetic effect, we consider positive genetic effects \({\gamma }_{j}=-2\cdot \text{log}10\left({\widehat{q}}^{j}\right)\) and negative genetic effects \({\gamma }_{j}=2\cdot \text{log}10\left({\widehat{q}}^{j}\right)\) where \({\widehat{q}}^{j}=\left(\sum_{i=1}^{n}{G}_{ij}\right)/2n<1\) is the MAF estimation in the overall population. Here we set the same genetic effect for all individuals, as a recent study has found that causal effects on complex traits are highly similar for common variants across segments of different continental ancestries within admixed individuals if environments are well controlled [84]. The events rates were set the same as in type I error rates simulations. For either scenario, we simulated 1000 datasets of phenotypes and covariates, and thus, \({10}^{4}\) tests were conducted to evaluate powers at a significance level of \(\alpha =5\times {10}^{-8}\).
For both type I error rates and powers simulations, null model fitting incorporates covariates \({X}_{2}, {X}_{3}\), and the top 4 PCs derived from genotype data. In addition to SPAmix and SPACox, we also evaluated SPAmixnoSPA, in which p values of all variants are calibrated using only normal distribution approximation without SPA.
Simulation studies including three genetic ancestries
For subject \(i\le n\), we let \({{\varvec{a}}}_{{\varvec{i}}}={\left({a}_{i}^{EUR}, {a}_{i}^{AFR}, {a}_{i}^{EAS}\right)}^{T}\) denote an ancestry vector to represent ancestry proportions of EUR, AFR, and EAS, respectively, and \({a}_{i}^{EUR}+{a}_{i}^{AFR}+{a}_{i}^{EAS}=1\). We assumed that the ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\) of the first \(4000\) subjects from an EUR-dominant community follows a Dirichlet(8, 1, 1) distribution, the ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\) of the next \(4000\) subjects from an AFR-dominant community follows a Dirichlet(1, 8, 1) distribution, and the ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\) of the remaining 2000 subjects from an EAS-dominant community follows a Dirichlet(1, 1, 8) distribution. The distribution of \({{\varvec{a}}}_{{\varvec{i}}},i\le n\) can be found in Additional file 1: Fig. S52.
Similar to the previous section, we simulated genotypes using ancestry vectors and real allele frequency data and then calculated SNP-derived PCs using 100,000 common SNPs. Additional file 1: Fig. S53 shows the top PCs derived from 100,000 simulated genetic variants and compared them to three components of the ancestry vector \({{\varvec{a}}}_{{\varvec{i}}}\), which demonstrated that the three ancestry proportions are strongly correlated with and can be well captured by the top PCs. For example, as shown in Additional file 1: Fig. S53(c), PC1 has an approximately linear correlation with the ancestry proportion of AFR.
To simulate time-to-event phenotype, we use a linear predictor \({\eta }_{i}={\beta }^{EUR}\cdot {a}_{i}^{EUR}+{\beta }^{AFR}\cdot {a}_{i}^{AFR}+ {\beta }^{EAS}\cdot {a}_{i}^{EAS}+0.5{X}_{i2}+0.5{X}_{i3}+\gamma {G}_{i}\) where notations are the same as in the previous section. In this section, we fixed parameters \({\beta }^{EUR}=0.2, {\beta }^{AFR}=2,\) and \({\beta }^{EAS}=5.5\), and scale parameter \(\lambda =3\). Given \(\gamma =0\), the event rates for pure EUR (i.e., \({a}_{i}^{EUR}=1\)), AFR (i.e., \({a}_{i}^{AFR}=1\)), and EAS (i.e., \({a}_{i}^{EAS}=1\)) populations are around 0.01, 0.04, and 0.4, respectively, and the event rate in \(n\)=10,000 admixed subjects is around 0.086.
To evaluate the performance of SPAmix and SPACox in terms of type I error rates and powers, we consider two settings of MAF to simulate genotype.
-
Setting 1. MAF in EUR = 0.01, MAF in AFR = 0.1, MAF in EAS = 0.45.
-
Setting 2. MAF in EUR = 0.45, MAF in AFR = 0.1, MAF in EAS = 0.01.
For each variant, the genotype of subject \(i\) follows a Binom(2, \(q_i\)) distribution, where
is the ancestry vector of this subject.
To evaluate type I error rates, we simulated phenotypes under the null hypothesis (\(\text{i}.\text{e}., \gamma =0\)) and simulated 105 genetic variants for each setting of MAF. To evaluate powers, we simulated 10,000 datasets (including phenotypes and covariates) under the alternative hypothesis use a linear predictor
for two settings of MAF, respectively.
Simulation studies of quantitative traits in the case of variance heterogeneity
To evaluate the performance of SPAmix for quantitative traits in the case of variance heterogeneity, we simulated \(n\) = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors \({{\varvec{a}}}_{{\varvec{i}}}, i\leq n\) and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.
To assess type I error rates, we simulated quantitative phenotypes under the null hypothesis of no genetic effects in the case of variance heterogeneity corresponding to different ancestries from a linear regression model as below
where covariates \({X}_{i2}\) and \({X}_{i3}\) were simulated following a Bernoulli(0.5) distribution and a standard normal distribution, \({\varepsilon }_{i}\) was a random term simulated following a normal distribution \(N\left(0,{\sigma }_{i}^{2}\right)\), and \({Y}_{i}\) was a quantitative trait. We generated \({\sigma }_{i}\), the individual-specific standard deviation of subject \(i\), through a weighted sum of standard deviations of ancestry EUR and EAS as below
where \({\sigma }_{EUR}\) and \({\sigma }_{EAS}\) denotes the standard deviations of random terms in ancestry EUR and EAS, respectively. We consider four pairs of standard deviations (sd) (sdEUR, sdEAS) = (1, 0.1), (sdEUR, sdEAS) = (1, 0.5), (sdEUR, sdEAS) = (1, 0.8), (sdEUR, sdEAS) = (1, 1). We did not consider the scenario in which the standard deviations in EAS were higher than those in EUR since it is exactly the opposite direction of the scenario above and the results can be expected. In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 106 association tests were conducted for each pair of MAF group and standard deviation. We evaluated type I error rates of SPAmix, SPAmixnoSPA, Wald test, likelihood ratio test (LRT), SAIGE, REGENIE, and fastGWA at a significance level of \(\alpha =5\times {10}^{-5}\).
Simulation studies of time-to-event trait in the case of baseline hazard function heterogeneity
To evaluate the performance of SPAmix for time-to-event traits in the case of baseline hazard function heterogeneity, we simulated \(n\) = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors \({{\varvec{a}}}_{{\varvec{i}}},i\le n\) and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.
To assess type I error rates, we simulated time-to-event phenotypes under the null hypothesis of no genetic effects in the case of baseline hazard function heterogeneity corresponding to different ancestries. We first simulated an underlying failure time \({T}_{i}^{*}\) and a censoring time \({C}_{i}\). The censoring time \({C}_{i}\) was simulated following a Weibull distribution with an individual-specific shape parameter of \({v}_{i}\) and an individual-specific scale parameter of \({w}_{i}\).. The underlying failure time \({T}_{i}^{*}\) was simulated based on a Cox PH model with a Weibull baseline hazard function as \({T}_{i}^{*}=\lambda \sqrt{-\text{log}{U}_{i}/\text{exp}\left({\eta }_{i}\right)}\), where \({U}_{i}\) was simulated following a beta distribution Beta(1,\({b}_{i}\)). For each subject \(i\), we generated parameters \({v}_{i}\), \({w}_{i}\), and \({b}_{i}\) through weighted sums of the original parameters for EUR and EAS ancestry as below
where (1) \({v}_{EUR}\) and \({v}_{EAS}\), (2)\({w}_{EUR}\) and \({w}_{EAS}\), and (3) \({b}_{EUR}\) and \({b}_{EAS}\) denote (1) the shape parameter of Weibull distribution to simulate \({C}_{i}\), (2) the scale parameter of Weibull distribution to simulate \({C}_{i}\), and (3) the second shape parameter beta in EUR and EAS ancestry, respectively. We let \({v}_{EUR}=1\) and \({v}_{EAS}=2\), \({w}_{EUR}=0.15\) and \({w}_{EAS}=0.3\), and \({b}_{EUR}=4\) and \({b}_{EAS}=1\) in simulation studies.
The linear predictor \({\eta }_{i}=0.5{X}_{i2}+0.5{X}_{i3}\) was to characterize the effect from two covariates \({X}_{i2}\) and \({X}_{i3}\). Covariate \({X}_{i2}\) was simulated following a Bernoulli(0.5) distribution, and \({X}_{i3}\) was simulated following a standard normal distribution. Time-to-event phenotype \({T}_{i}=\text{min}\left({T}_{i}^{*}, {C}_{i}\right)\). Indicator \({\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)\) is to indicate whether event occurs. In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 106 association tests were conducted for each pair of MAF group. We evaluated type I error rates of SPAmix, SPAmixnoSPA, SPACox, Wald test, GATE, and REGENIE at a significance level of \(\alpha =5\times {10}^{-5}\).
Simulation studies of ordinal traits
To evaluate the performance of SPAmix for ordinal traits, we simulated \(n\) = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors \({{\varvec{a}}}_{{\varvec{i}}},i\le n\) and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.
Let J = 4 denote the number of category levels. For subject \(i\le n\), let \({Y}_{i}\) = 1, 2, …, J denote the ordinal phenotype. To assess type I error rates, we simulated ordinal phenotypes under the null hypothesis of no genetic effects from a proportional odds logistic regression model as below
where covariates \({X}_{i2}\) and \({X}_{i3}\) were simulated following a Bernoulli(0.5) distribution and a standard normal distribution. The individual-specific cutpoints \(\left\{{\varepsilon }_{ij}, i\le n,j\le J\right\}\) (\({\varepsilon }_{i1}<\cdots <{\varepsilon }_{iJ}=+\infty\)) are used to categorize the data. For subject \(i\), we generated \(\left\{{\varepsilon }_{ij}, i\le n,j\le J\right\}\) through a weighted sum of cutpoints of ancestry EUR and EAS as below
where \(\left\{{\varepsilon }_{j}^{EUR}, j\le J\right\}\) and \(\left\{{\varepsilon }_{j}^{EAS}, j\le J\right\}\) denotes the cutpoints selected to obtain the desired phenotypic distribution (ratio) in EUR and EAS populations, respectively. We consider five pairs of phenotypic distribution (Ratios) in EUR and EAS: (RatioEUR = 10:1:1:1, RatioEAS = 10:1:1:1), (RatioEUR = 100:1:1:1, RatioEAS = 100:1:1:1), (RatioEUR = 1000:1:1:1, RatioEAS = 1000:1:1:1), (RatioEUR = 10:1:1:1, RatioEAS = 1000:1:1:1), and (RatioEUR = 100:1:1:1, RatioEAS = 1000:1:1:1). In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 106 association tests were conducted for each pair of MAF group and phenotypic distribution (ratio). We evaluated type I error rates of SPAmix, SPAmixnoSPA, and POLMM at a significance level of \(\alpha =5\times {10}^{-5}\).
Simulation studies of binary and quantitative traits given ancestry-specific genetic effect sizes
To evaluate the performance of SPAmixlocal and SPAmixCCT, we simulated a two-way admixed population with sample size n = 10,000, including ancestry-specific genotypes, local ancestry counts, genotype-derived PCs, and phenotypes. We considered extensive scenarios of ancestry-specific genetic effect and minor allele frequencies to compare the SPAmix-based approaches and Tractor.
We followed procedure as in Mester et al. [27] to simulate genotypes. First, we generated an individual-level global ancestry proportion of ancestry 2 (denoted as \({d}_{i}, i\le n\)) from a normal distribution \(N\left(\theta ,{\sigma }^{2}\right)\) for each individual, in which \(\theta\) is the expected global ancestry proportion and \({\sigma }^{2}\) is the corresponding variance. We let \(\sigma =0.125\) and coerced \({d}_{i}\) between [0,1]. For individual \(i, i\le n\), we simulated local ancestry count \({h}_{i}^{\left(1\right)}\) and \({h}_{i}^{\left(2\right)}\), in which \({h}_{i}^{\left(2\right)}\) follows a binomial distribution \(\text{Binom}\left({d}_{i},2\right)\) and \({h}_{i}^{\left(1\right)}=2-{h}_{i}^{\left(2\right)}\). Then, we simulated ancestry-specific genotype \({G}_{i}^{\left(k\right)}\) following a binomial distribution \(\text{Binom}\left({h}_{i}^{\left(k\right)},{q}^{\left(k\right)}\right)\), where \({q}^{\left(k\right)}\) is the allele frequency corresponding to the ancestry \(k.\) Genotype \({G}_{i}={G}_{i}^{\left(1\right)}+{G}_{i}^{\left(2\right)}\). In simulation studies, we considered two fixed MAFs of 0.01 and 0.1 in ancestry 1 and four fixed MAFs including 0.01, 0.05, 0.1, and 0.3 in ancestry 2. We calculated SNP-derived PCs using 100,000 common SNPs. Additional file 1: Fig. S54 shows the top PCs and the global ancestry distribution for the 10,000 two-way admixed individuals.
Type I error simulations
We simulated binary and quantitative traits under the null hypothesis to evaluate type I error rates. We simulated binary and quantitative traits from a logistic regression model and a linear regression model as below
where covariates \({X}_{i2}\) and \({X}_{i3}\) were simulated following a Bernoulli(0.5) distribution and a standard normal distribution, \({\mu }_{i}\) is the probability of being a case for a binary trait, and \({Y}_{i}\) is a quantitative trait. For a binary trait, the intercept \({\beta }_{0}\) was determined to correspond to a certain disease prevalence. We considered two disease prevalences including 0.01 and 0.2. For a quantitative trait, random term \({\varepsilon }_{i}\) was simulated following a standard normal distribution. We simulated 100 datasets of phenotypes and covariates for each phenotypic distribution and 10,000 SNPs for each setting of MAF, and thus a total of 106 tests were conducted in each scenario.
Power simulations
We simulated binary and quantitative traits under the alternative hypothesis to evaluate powers. For both binary and quantitative traits, we added an overall genetic effect \(\sum_{j=1}^{10}\left[{\gamma }^{\left(1\right)}\cdot {G}_{i,j}^{\left(1\right)}+{\gamma }^{\left(2\right)}\cdot {G}_{i,j}^{\left(2\right)}\right]\) to the linear predicator \({\eta }_{i}\), in which \({G}_{i,j}^{\left(1\right)}\) and \({G}_{i,j}^{\left(2\right)}\) were the ancestry-specific genotype of subject i in SNP j from ancestries 1 and 2, respectively, and \({\gamma }^{\left(1\right)}\) and \({\gamma }^{\left(2\right)}\) were corresponding ancestry-specific genetic effect sizes. For binary traits, we set disease prevalence at 0.2.
We considered two scenarios including homogeneity and heterogeneity of genetic effect sizes for ancestries 1 and 2. For heterogeneous genetic effect sizes, we fixed \({\gamma }^{\left(1\right)}\) and increased \({\gamma }^{\left(2\right)}\) from 0. We simulated 100 datasets of phenotypes and covariates for each scenario, and thus a total of 1000 tests were conducted to evaluate powers. We calculated empirical powers at a genome-wide significance level 5 \(\times\) 10−8.
Association analysis of Tractor and SPAmixlocal in simulation studies
Tractor and SPAmixlocal calculated ancestry-specific p values based on likelihood ratio test and score test, respectively. In simulation studies, Tractor fitted a full model with covariates of \({X}_{i2}\), \({X}_{i3}\), top 4 genetically derived PCs, local ancestry count of \({h}_{i}^{\left(1\right)}\), and ancestry-specific genotypes \({G}_{i}^{\left(1\right)},{G}_{i}^{\left(2\right)}\). Meanwhile, SPAmixlocal fitted a null model with covariates of \({X}_{i2}\), \({X}_{i3}\), and top 4 genetically derived PCs. Regular logistic model and linear model were to fit binary and quantitative traits, respectively. Tractor and SPAmixlocal returned two p values corresponding to ancestry-specific effect \({\gamma }^{\left(1\right)}\) and \({\gamma }^{\left(2\right)}\). SPAmixCCT combined the two p values outputted by SPAmixlocal and one p value outputted by SPAmix to calculate one p value.
Simulation studies of quantitative traits with the utilization of local ancestry information in the case of variance heterogeneity
To evaluate the performance of SPAmixlocal and SPAmixCCT for quantitative traits in the case of variance heterogeneity with the utilization of local ancestry information, we simulated \(n\) = 10,000 subjects from an admixed population of EUR and EAS and assume that the global ancestry proportion vectors \({{\varvec{a}}}_{{\varvec{i}}},\boldsymbol{ }i\le n\) and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. For individual\(i, i\le n\), we simulated local ancestry count \({h}_{i}^{EUR}\) and \({h}_{i}^{EAS}\), in which \({h}_{i}^{EAS}\) follows a binomial distribution \(\text{Binom}\left({a}_{i}^{EAS},2\right)\) and \({h}_{i}^{EUR}=2-{h}_{i}^{EAS}\). Then, we simulated ancestry-specific genotype \({G}_{i}^{EUR}\) and \({G}_{i}^{EAS}\) following binomial distribution \(\text{Binom}\left({h}_{i}^{EUR},{q}^{EUR}\right)\) and \(\text{Binom}\left({h}_{i}^{EAS},{q}^{EAS}\right)\), respectively, where \({q}^{EUR}\) and \({q}^{EAS}\) are the allele frequency corresponding to the ancestry EUR and EAS, respectively \(.\) Thus, we obtained genotype \({G}_{i}={G}_{i}^{EUR}+{G}_{i}^{EAS}\). We considered three fixed MAFs of 0.01, 0.1, and 0.3 in EUR and EAS, and thus, 9 scenarios of MAFs were considered in total.
To assess type I error rates, we simulated quantitative phenotypes following the same procedure as in the previous section of type I error simulation studies of quantitative traits in the case of variance heterogeneity. We consider four pairs of standard deviations (sd) (sdEUR, sdEAS) = (1, 0.1), (sdEUR, sdEAS) = (1, 0.5), (sdEUR, sdEAS) = (1, 0.8), (sdEUR, sdEAS) = (1, 1). In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 106 association tests were conducted for each pair of MAF and standard deviation. We evaluated type I error rates of SPAmixlocal, SPAmixCCT, and Tractor at a significance level of \(\alpha =5\times {10}^{-5}\).
Application to the UK Biobank Data
We used SPAmix to conduct genome-wide analyses of 10 time-to-event traits in UK Biobank dataset, including Alzheimer’s disease, type 2 diabetes, essential hypertension, Parkinson’s disease, schizophrenia, multiple sclerosis, sickle cell anemia, osteoporosis, glaucoma, and asthma [5, 68]. Additional file 1: Table S1 provides the detailed summary information of the 10 time-to-event traits. UK Biobank contains 338,044 unrelated subjects with in-patient diagnosis data, of which 282,590 (83.6%) are white British participants and the remaining participants (16.4%) are from the other ancestries including African, Asian, admixed populations, and other ancestry groups. The ancestry information is from Field ID of 22,006, which is highly consistent to the genetic ancestry based on the genetically derived principal components. In order to identify affected subjects and create time-to-event phenotypes, we utilized the PheWAS code which is based on the International Statistical Classification of Diseases (ICD) codes version 9 and 10 [85, 86]. To analyze each disease, we assigned an event indicator \(\delta_i = 1\) and time-to-event \(T_i\) as the age at the first in-patient diagnosis date if at least one in-patient diagnosis was observed. For patients without an in-patient diagnosis, we set \(\delta_i = 0\) and time-to-event \(T_i\) as the age at the right-censoring date or lost to follow-up date. Additionally, the observed survival time was left truncated at the in-patient data collection date. For each phenotype, the top ten ancestry principal components (PCs), sex, and birthyear were included as covariates, and markers imputed by the Haplotype Reference Consortium (HRC) [87] panel with a MAF over 0.0001 and imputation INFO score > 0.6 were used in our analysis. In addition, we applied SPAmix to conduct genome-wide analyses of 11 longitudinal traits in UK Biobank dataset.
The majority of UK Biobank’s participants were of white British ancestry, and we selected 10,000 subjects based on genetic principle components (PCs) to mimic a study cohort with greater heterogeneity. Since the first two PCs (Field ID: 22,009, Additional file 1: Fig. S55a) of white British participants are close to zero, we define \(\text{d}=\sqrt{{\text{PC}}_{1}^{2}+{\text{PC}}_{2}^{2}}\) as an ancestry distance for each subject. A total of 5000 White British subjects with the smallest ancestry distance and 5000 non-white British subjects with the largest ancestry distance were selected (Additional file 1: Fig. S55b). Using the study cohort of the 10,000 subjects, we conducted a genome-wide analysis of the same 10 time-to-event traits. We increased the MAF cutoff to 0.001 as the sample size is smaller. Consistent to simulation studies, genetic variants were categorized into 15 groups, and SPAmix and SPACox were evaluated for each group.
Application to the ALL of US Data
To evaluate the performance of SPAmix in more diverse genomic contexts, we applied SPAmix to conduct genome-wide association analysis of essential hypertension (modeled as a time-to-event phenotype) using the All of Us dataset, a cohort with substantially greater genetic diversity. This analysis leveraged the cohort’s unique population diversity by conducting parallel GWAS in two subsets: (1) a multi-ancestry sample (sample size = 188,941), and (2) a European-ancestry subset (sample size = 112,080) for population-specific comparison.
Time-to-event phenotypes for essential hypertension (PheCode: 401.1) were constructed using electronic health records from the All of Us Research Program. Individuals with relevant diagnosis codes (mapped from ICD-9, ICD-10, and SNOMED) were classified as cases, while those without such diagnoses were treated as censored. The event time was defined as the age at first diagnosis of hypertension for cases, and as the age at the last recorded clinical observation for censored individuals. Age was calculated using each participant’s date of birth and either the diagnosis date or censoring date. To avoid confounding due to relatedness, we excluded all individuals identified as genetically related (kinship score > 0.1) based on a precomputed kinship file from All of Us dataset, ensuring a fully unrelated cohort for analysis.
Data availability
All data generated or analyzed during this study are included in this published article. The packages and codes used in the analysis are publicly available (see URLs section). SPAmix is implemented in R and is accessible at [88]. The scripts for generating simulation results and real data analyses are available at [89]. All aforementioned codes are freely accessible under a GNU General Public License v3.0 license. The source code used in the manuscript is available on Zenodo [90]. UK Biobank analysis summary statistics from the genome-wide association study analyses presented in this paper are available on Zenodo [91].
References
Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023. https://doi.org/10.1016/j.ajhg.2022.12.011.
Backman JD, Li AH, Marcketta A, Sun D, Mbatchou J, Kessler MD, et al. Exome sequencing and analysis of 454,787 UK biobank participants. Nature. 2021;599:628–34.
Beesley LJ, Salvatore M, Fritsche LG, Pandit A, Rao A, Brummett C, et al. The emerging landscape of health research based on biobanks linked to electronic health records: existing resources, statistical challenges, and potential opportunities. Stat Med. 2020;39:773–800.
Bi W, Lee S. Scalable and robust regression methods for phenome-wide association analysis on large-scale biobank data. Front Genet. 2021;12:682638.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
Bi W, Fritsche LG, Mukherjee B, Kim S, Lee S. A fast and accurate method for genome-wide time-to-event data analysis and its application to UK biobank. Am J Hum Genet. 2020;107:222–33.
Bi W, Zhou W, Dey R, Mukherjee B, Sampson JN, Lee S. Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet. 2021;108:825–39.
Lu Q, Du Y, Zhang Y, Chen Y, Li H, He W, et al. A genome-wide association study for susceptibility to axial length in highly myopic eyes. Phenomics. 2023;3:255–67.
Lapointe-Shaw L, Bouck Z, Howell NA, Lange T, Orchanian-Cheff A, Austin PC, et al. Mediation analysis with a time-to-event outcome: a review of use and reporting in healthcare research. BMC Med Res Methodol. 2018;18:1–12.
Thompson S, Kaptoge S, White I, Wood A, Perry P, Danesh J. Statistical methods for the time-to-event analysis of individual participant data from multiple epidemiological studies. Int J Epidemiol. 2010;39:1345–59.
Zhang M, Davidian M. “Smooth” semiparametric regression analysis for arbitrarily censored time-to-event data. Biometrics. 2008;64:567–76.
Pedersen EM, Agerbo E, Plana-Ripoll O, Steinbach J, Krebs MD, Hougaard DM, et al. ADuLT: an efficient and robust time-to-event GWAS. Nat Commun. 2023;14:5553. https://doi.org/10.1038/s41467-023-41210-z.
Daniels HE. Saddlepoint approximations in statistics. Ann Math Stat. 1954. https://doi.org/10.1214/aoms/1177728652.
Jensen JL. Saddlepoint approximations. Oxford Statistical Science Series. Vol. 16. The Clarendon Press Oxford University Press, New York: Oxford Science Publications; 1995.
Dey R, Schmidt EM, Abecasis GR, Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am J Hum Genet. 2017;101:37–49.
Jiang L, Zheng Z, Fang H, Yang J. A generalized linear mixed model association tool for biobank-scale data. Nat Genet. 2021;53:1616–21.
Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. 2021;53:1097–103.
Dey R, Zhou W, Kiiskinen T, Havulinna A, Elliott A, Karjalainen J, et al. FinnGen, Lee S: efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun. 2022;13:5437.
Loh P-R, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nat Genet. 2018;50:906–8.
Ko S, German CA, Jensen A, Shen J, Wang A, Mehrotra DV, et al. Gwas of longitudinal trajectories at biobank scale. Am J Hum Genet. 2022;109:433–45.
Peterson RE, Kuchenbaecker K, Walters RK, Chen C-Y, Popejoy AB, Periyasamy S, et al. Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations. Cell. 2019;179:589–603.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet. 2017;100:635–49.
Cavalli-Sforza LL. Genes, peoples, and languages. Proc Natl Acad Sci U S A. 1997;94:7719–24.
Conomos MP, Laurie CA, Stilp AM, Gogarten SM, McHugh CP, Nelson SC, et al. Genetic diversity and association studies in US Hispanic/Latino populations: applications in the Hispanic community health study/study of latinos. Am J Hum Genet. 2016;98:165–84.
Atkinson EG, Maihofer AX, Kanai M, Martin AR, Karczewski KJ, Santoro ML, et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet. 2021;53:195–204.
Mester R, Hou K, Ding Y, Meeks G, Burch KS, Bhattacharya A, et al. Impact of cross-ancestry genetic architecture on GWASs in admixed populations. Am J Hum Genet. 2023;110:927–39.
Jiang D, Mbatchou J, McPeek MS. Retrospective association analysis of binary traits: overcoming some limitations of the additive polygenic model. Hum Hered. 2015;80:187–95.
Hayeck TJ, Loh PR, Pollack S, Gusev A, Patterson N, Zaitlen NA, et al. Mixed model association with family-biased case-control ascertainment. Am J Hum Genet. 2017;100:31–9.
Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc. 2020;115:393–402.
Ko S, German CA, Jensen A, Shen J, Wang A, Mehrotra DV, et al. Gwas of longitudinal trajectories at biobank scale. Am J Hum Genet. 2022. https://doi.org/10.1016/j.ajhg.2022.01.018.
Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335–41.
Yamauchi T, Hara K, Maeda S, Yasuda K, Takahashi A, Horikoshi M, et al. A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet. 2010;42:864–8.
Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, et al. QTLbase: an integrative resource for quantitative trait loci across multiple human molecular phenotypes. Nucleic Acids Res. 2020;48(D1):D983-d991.
Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, et al. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020;48(D1):D807-d816.
Dong S, Zhao N, Spragins E, Kagda MS, Li M, Assis P, et al. Annotating and prioritizing human non-coding variants with RegulomeDB vol 2. Nat Genet. 2023;55:724–6.
Ferreira MA, Matheson MC, Tang CS, Granell R, Ang W, Hui J, et al. Genome-wide association analysis identifies 11 risk variants associated with the asthma with hay fever phenotype. J Allergy Clin Immunol. 2014;133:1564–71.
Chang D, Hunkapiller J, Bhangale T, Reeder J, Mukhyala K, Tom J, et al. A whole genome sequencing study of moderate to severe asthma identifies a lung function locus associated with asthma risk. Sci Rep. 2022;12:5574.
Gharahkhani P, Jorgenson E, Hysi P, Khawaja AP, Pendergrass S, Han X, et al. Genome-wide meta-analysis identifies 127 open-angle glaucoma loci with consistent effect across ancestries. Nat Commun. 2021;12(1):1258.
Lu SY, Rong SS, Wu Z, Huang C, Matsushita K, Ng TK, et al. Association of the CAV1-CAV2 locus with normal-tension glaucoma in Chinese and Japanese. Clin Experiment Ophthalmol. 2020;48:658–65.
Thorleifsson G, Walters GB, Hewitt AW, Masson G, Helgason A, DeWan A, et al. Common variants near CAV1 and CAV2 are associated with primary open-angle glaucoma. Nat Genet. 2010;42:906–9.
Wiggs JL, Kang JH, Yaspan BL, Mirel DB, Laurie C, Crenshaw A, et al. Common variants near CAV1 and CAV2 are associated with primary open-angle glaucoma in Caucasians from the USA. Hum Mol Genet. 2011;20:4707–13.
Chen F, Klein AP, Klein BE, Lee KE, Truitt B, Klein R, et al. Exome array analysis identifies CAV1/CAV2 as a susceptibility locus for intraocular pressure. Invest Ophthalmol Vis Sci. 2014;56:544–51.
Loomis SJ, Kang JH, Weinreb RN, Yaspan BL, Cooke Bailey JN, Gaasterland D, et al. Association of CAV1/CAV2 genomic variants with primary open-angle glaucoma overall and by gender and pattern of visual field loss. Ophthalmology. 2014;121:508–16.
Kim S, Kim K, Heo DW, Kim JS, Park CK, Kim CS, et al. Expression-associated polymorphisms of CAV1-CAV2 affect intraocular pressure and high-tension glaucoma risk. Mol Vis. 2015;21:548–54.
Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and scale: genetic architecture of 2068 traits in the VA million veteran program. Science. 2024;385:eadj1182.
Ivanova T, Churnosova M, Abramova M, Ponomarenko I, Reshetnikov E, Aristova I, et al. Risk effects of rs1799945 polymorphism of the HFE gene and intergenic interactions of GWAS-significant loci for arterial hypertension in the Caucasian population of Central Russia. Int J Mol Sci. 2023;24:8309.
Selvaraj S, Seidelmann S, Silvestre OM, Claggett B, Ndumele CE, Cheng S, et al. HFE H63D polymorphism and the risk for systemic hypertension, myocardial remodeling, and adverse cardiovascular events in the ARIC study. Hypertension. 2019;73:68–74.
Tyrmi JS, Kaartokallio T, Lokki AI, Jääskeläinen T, Kortelainen E, Ruotsalainen S, et al. Genetic risk factors associated with preeclampsia and hypertensive disorders of pregnancy. JAMA Cardiol. 2023;8:674–83.
Changalidis AI, Maksiutenko EM, Barbitoff YA, Tkachenko AA, Vashukova ES, Pachuliia OV, et al. Aggregation of genome-wide association data from FinnGen and UK biobank replicates multiple risk loci for pregnancy complications. Genes. 2022;13:2255.
Takeuchi F, Akiyama M, Matoba N, Katsuya T, Nakatochi M, Tabara Y, et al. Interethnic analyses of blood pressure loci in populations of East Asian and European descent. Nat Commun. 2018;9:5052.
Siva N. 1000 genomes project. Nat Biotechnol. 2008;26:256–7.
Meirmans PG, Hedrick PW. Assessing population structure: FST and related measures. Mol Ecol Resour. 2011;11:5–18.
Bi W, Zhao Z, Dey R, Fritsche LG, Mukherjee B, Lee S. A fast and accurate method for genome-wide scale phenome-wide G× E analysis and its application to UK Biobank. Am J Hum Genet. 2019;105:1182–92.
Jiang L, Zheng Z, Qi T, Kemper KE, Wray NR, Visscher PM, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet. 2019;51:1749–55.
Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015;96:37–53.
Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, Gignoux CR, et al. Reconstructing the population genetic history of the Caribbean. PLoS Genet. 2013;9:e1003925.
Schlebusch CM, Skoglund P, Sjödin P, Gattepaille LM, Hernandez D, Jay F, et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science. 2012;338:374–9.
Hou K, Bhattacharya A, Mester R, Burch KS, Pasaniuc B. On powerful GWAS in admixed populations. Nat Genet. 2021;53:1631–3.
Caliebe A, Tekola-Ayele F, Darst BF, Wang X, Song YE, Gui J, et al. Including diverse and admixed populations in genetic epidemiology research. Genetic epidemiology. 2022;46:347–1.
Conomos MP, Reiner AP, Weir BS, Thornton TA. Model-free estimation of recent genetic relatedness. Am J Hum Genet. 2016;98:127–48.
Deng H-W, Chen W-M, Recker RR. Qtl fine mapping by measuring and testing for Hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of populations. Am J Hum Genet. 2000;66:1027–45.
Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, et al. Efficient bayesian mixed-model analysis increases association power in large cohorts. Nat Genet. 2015;47:284–90.
All of Us Research Program Investigators. The “All of Us” research program. N Engl J Med. 2019;381:668–76.
Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature. 2021;590:290–9.
Sohail M, Palma-Martínez MJ, Chong AY, Quinto-Cortés CD, Barberena-Jonas C, Medina-Muñoz SG, et al. Mexican biobank advances population and medical genomics of diverse ancestries. Nature. 2023;622:775–83.
Schneiderman N, Llabre M, Cowie CC, Barnhart J, Carnethon M, Gallo LC, Giachello AL, Heiss G, Kaplan RC. LaVange LMJDc: Prevalence of diabetes among Hispanics/Latinos from diverse backgrounds: the Hispanic community health study/study of Latinos (HCHS/SOL). Diabetes care. 2014;37:2233–9.
Hayeck TJ, Zaitlen NA, Loh P-R, Vilhjalmsson B, Pollack S, Gusev A, et al. Mixed model with correction for case-control ascertainment increases association power. Am J Hum Genet. 2015;96:720–30.
Coop G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. arXiv preprint arXiv:2207. 11595. 2022.
Lewis AC, Molina SJ, Appelbaum PS, Dauda B, Di Rienzo A, Fuentes A, et al. Getting genetic ancestry right for science and society. Science. 2022;376:250–2.
Mathieson I, Scally A. What is ancestry? PLoS Genet. 2020;16:e1008624.
Krainc T, Fuentes A. Genetic ancestry in precision medicine is reshaping the race debate. Proc Natl Acad Sci U S A. 2022;119:e2203033119.
Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, et al. Toward a fine-scale population health monitoring system. Cell. 2021;184(2068–2083):e2011.
Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–8.
Ding Y, Hou K, Xu Z, Pimplaskar A, Petter E, Boulier K, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023;618:774–81.
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.
Butler RW: Saddlepoint approximations with applications. Vol.22. Cambridge University Press; 2007.
Goutis C, Casella G. Explaining the saddlepoint approximation. Am Stat. 1999;53:216–24.
Barndorff-Nielsen OE. Approximate interval probabilities. J R Stat Soc Series B Stat Methodol. 1990;52:485–96.
Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12.
Foreman LA, Smith AF, Evett IW. Bayesian analysis of DNA profiling data in forensic identification applications. J R Stat Soc Ser A Stat Soc. 1997;160:429–59.
Rannala B, Mountain JL. Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci U S A. 1997;94:9197–201.
Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.
Hou K, Ding Y, Xu Z, Wu Y, Bhattacharya A, Mester R, et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet. 2023;55:549–58.
Wu P, Gifford A, Meng X, Li X, Campbell H, Varley T, et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med Inform. 2019;7:e14325.
Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26:1205–10.
Loh P-R, Danecek P, Palamara PF, Fuchsberger C, A Reshef Y, K Finucane H, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016;48:1443–8.
Bi W, GRAB. CRAN.R-project. https://CRAN.R-project.org/package=GRAB. 2025.
Ma Y, Bi W: SPAmix: A scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations. Github. https://github.com/YuzhuoMa97/SPAmix. 2025.
Ma Y, Bi W. YuzhuoMa97/SPAmix: Code used in SPAmix project. 2025. Zenodo. https://doi.org/10.5281/zenodo.17139129.
Ma Y, Bi W: A scalable, accurate, and universal analysis framework using individual-level allele frequency for large-scale genetic association studies in an admixed population. Zenodo. https://zenodo.org/record/8343684. 2025.
Acknowledgements
UK Biobank data was accessed under the accession number 78795. We thank Dr. Seokho Jeong for providing the maximal independent set algorithm used in AllofUs data. This research is supported by high-performance computing platform of Peking University. This work was supported under the framework of international cooperation program managed by National Research Foundation of Korea (2024K2A9A2A0601368311).
Peer review information
Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 2.
Funding
This research was supported by National Natural Science Foundation of China (82441004, 62273010, 62411540239, W.B.; 32471202, P.Z.), Clinical Medicine Plus X—Young Scholars Project, Peking University (PKU2023LCXQ043, W.B.), National Research Foundation of Korea (2024K2A9A2A0601368311, S.L.), and Innovation Fund for Outstanding Doctoral Candidates of Peking University Health Science Center (BMU2025BSS001, Y.M.).
Author information
Authors and Affiliations
Contributions
Y.M., S.L., J.Z., P.Z., and W.B. designed the experiments. Y.M. and W.B. performed the experiments. H.X. constructed longitudinal traits from UK Biobank and conducted analysis using trajGWAS. Y.M., Y.L., and W.B. constructed the time-to-event traits from UK Biobank. H.K. constructed the time-to-event traits and performed GWAS in AllofUs. L.M. prepared a Docker image for SPAmix, ensuring that the software can be deployed reproducibly. Y.M., L.X., and W.B. wrote the manuscript with the assistance of P.X., F.M., X.Z, S.L., J.Z, and P.Z. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13059_2025_3827_MOESM1_ESM.pdf
Additional file 1. Supplementary information includes supplementary notes, supplementary figures S1-S72, and supplementary tables S1-S7.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ma, Y., Xu, H., Li, Y. et al. SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations. Genome Biol 26, 356 (2025). https://doi.org/10.1186/s13059-025-03827-9
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1186/s13059-025-03827-9