SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations

Ma, Yuzhuo; Xu, He; Li, Ying; Kim, Hyesung; Xu, Lin-lin; Miao, Lin; Xu, Peng; Mao, Fengbiao; Zhou, Xu-jie; Zhou, Wei; Lee, Seunggeun; Zhang, Ji-Feng; Zhang, Peipei; Bi, Wenjian

doi:10.1186/s13059-025-03827-9

Research
Open access
Published: 16 October 2025

SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations

Yuzhuo Ma^1,2,
He Xu¹,
Ying Li¹,
Hyesung Kim³,
Lin-lin Xu⁴,
Lin Miao¹,
Peng Xu⁵,
Fengbiao Mao⁶,
Xu-jie Zhou⁴,
Wei Zhou^7,8,
Seunggeun Lee³,
Ji-Feng Zhang^9,10,
Peipei Zhang^11,12 &
…
Wenjian Bi ORCID: orcid.org/0000-0002-5108-9311^1,13,14

Genome Biology volume 26, Article number: 356 (2025) Cite this article

902 Accesses
Metrics details

Abstract

Background

Inclusion of individuals with diverse or admixed genetic ancestries is crucial to discover novel findings that may be missed by genomics analyses rooted solely in European population.

Results

Here, we present an analysis framework, SPAmix, which is scalable to a large-scale biobank data analysis including hundreds of thousands of admixed individuals and is universally applicable to various types of complex traits including quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits. Since no alternative model is fitted, SPAmix primarily focuses on association p values. For each genetic variant, SPAmix uses genotype data and genetic principal components to estimate individual-specific allele frequency, which is subsequently used to calibrate p values via a retrospective analysis. A hybrid strategy including saddlepoint approximation (SPA) can greatly increase the accuracy to analyze rare genetic variants, especially if the phenotypic distribution is unbalanced or extremely unbalanced. We also propose SPAmix_local to incorporate local ancestry to calculate ancestry-specific p values. To maximize the statistical powers, SPAmix_CCT is proposed to combine the p values of SPAmix and SPAmix_local via Cauchy combination.

Conclusions

The SPAmix-based approaches are more accurate than Tractor to address phenotypic variance heterogeneity among ancestries when analyzing quantitative traits and to address an unbalanced case–control ratio when analyzing binary traits. SPAmix_CCT is an optimal unified approach for various cross-ancestry genetic architectures. Extensive simulation studies and real data analyses of 369,314 UK Biobank individuals from multiple ancestries demonstrated that SPAmix is scalable and can discover novel hits while controlling type I error rates well.

Background

Large-scale biobank data integrate in-depth genetic information and electronic health records (EHR) with extensive health information, enabling the response variable in genome-wide association studies (GWASs) to extend beyond quantitative and binary traits [1]. As a result, complex traits with intricate structures are increasingly attracting attention in large-scale GWAS researches [2,3,4,5,6,7,8]. Statistical models in GWAS have evolved to include more sophisticated models such as the Cox proportional hazards regression model to fit a time-to-event trait [6, 9,10,11,12].

Recently, SPACox was proposed as a general analysis framework to accurately and efficiently conduct GWAS for complex traits with intricate structures [6]. Using an empirical saddlepoint approximation (SPA) to approximate the null distribution of score statistics [13,14,15,16,17,18,19], SPACox is computationally efficient and remains powerful while controlling type I error rates for common variants (minor allele frequency, MAF > 5%), low-frequency variants (5% > MAF > 1%), and rare variants (MAF < 1%) [6]. Although demonstrated using time-to-event traits, SPACox is of high potential to be extended. For example, trajGWAS also employs the empirical SPA used in SPACox to analyze longitudinal trait [20]. However, SPACox was designed to analyze a homogeneous population and its performance in multiple subpopulations or admixed populations remains unclear.

Population stratification is a significant confounder in GWAS, and large biobanks may include a considerable number of individuals from heterogeneous ancestries, such as those in admixed populations [21,22,23]. The admixture can occur due to various reasons, such as migration, colonization, and intermarriage between individuals from different populations [24]. Incorporating SNP-derived principal components (PCs) as covariates in regular regression models can help control for confounding but still could result in inflation or deflation, for example, in the case of phenotypic heterogeneity (such as phenotypic variance heterogeneity for quantitative traits and baseline hazard function heterogeneity for time-to-event traits) among genetic ancestries [25]. Due to the concerns over population structure, these individuals are routinely excluded from analysis.

Recently, Tractor was proposed to enable the inclusion of unrelated admixed individuals in GWAS [26]. In addition, Tractor can incorporate local ancestry to conduct population-specific association studies, which can enhance power compared to regular approaches in the case of the effect size heterogeneity [27]. Although powerful, Tractor is computationally expensive because it requires fitting a full model for each genetic variant, which limited its application to a large-scale biobank dataset with hundreds of thousands of individuals. In addition, Tractor package does not support complex traits with intricate structures (e.g., time-to-event and longitudinal traits). If the effect size remains the same for different ancestries, Tractor is less powerful than regular approaches [27]. An alternative strategy is to divide all individuals into multiple ancestry groups and then conduct a meta-analysis. However, the division criteria are arbitrary and may result in the removal of individuals in transitional regions.

Retrospective association analysis treats the genotype of a genetic variant as a random variable and conducts association analysis conditional on the phenotype and covariates. Compared to prospective association analysis, the retrospective association analysis demonstrates less susceptible to power loss, exhibits greater robustness to control type I errors, and maintains statistical power, especially when dealing with trait model misspecification [28]. For example, LT-Fam uses a retrospective association score statistic to address family-biased case–control ascertainment [29]. For individuals with diverse or admixed genetic ancestries, phenotypes could be affected by latent ancestry-specific environment factors or lifestyles, potentially leading to model misspecification. Consequently, a retrospective association analysis may offer a more effective solution.

In this study, we propose SPAmix, a scalable and accurate retrospective analysis framework for a large-scale genome-wide analysis while accounting for population stratification in admixed populations. Since no alternative model is fitted, SPAmix primarily focuses on association p values. As a score test approach, SPAmix only requires fitting a null model which is the same for all genetic variants across the genome-wide analysis. And thus, the computational burden is much lower than the approaches based on Wald test or likelihood ratio test, especially for a large-scale biobank data analysis. The statistical model can be freely chosen to fit the trait of interest and only residuals obtained from the null model fitting are required for SPAmix. In this paper, we used binary traits, quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits as examples to demonstrate the performance of SPAmix. For each genetic variant, SPAmix estimates individual-specific allele frequencies using genotype data and SNP-derived PCs. And then, SPAmix treats genotype as a random variable and uses a hybrid strategy including both normal distribution approximation and SPA to calibrate p values. In addition, SPAmix can be extended to incorporate local ancestry if available and SPAmix_CCT can serve as an optimal unified approach for various cross-ancestry genetic architectures, leveraging the advantages of the Cauchy combination (CCT) [27, 30].

We evaluated SPAmix using extensive simulation studies in which individuals from multiple ancestries and various trait types were included. SPAmix can always accurately control type I error rates and maintain power in various settings in terms of genotypes (common, low-frequency, or rare variants) and phenotypes (phenotypic distribution is balanced or unbalanced). However, in the case of phenotypic variance heterogeneity among ancestries, regular linear regression approaches are invalid. And similarly, regular Cox regression approaches cannot control type one error rates if the baseline hazard functions for distinct ancestries are different. If the local ancestry is incorporated, SPAmix_CCT is more powerful than Tractor and can serve as an optimal unified approach regardless of effect size homogeneity and heterogeneity.

We applied SPAmix to analyze 10 time-to-event traits and 11 longitudinal traits using 369,314 unrelated individuals from all ancestries in UK Biobank data. Compared to the analysis limited to 308,328 (~ 83%) unrelated White British (WB) individuals, a total of 377 loci were additionally identified at a significance level of 5 × 10⁻⁸. The results highlight the importance of the ancestry diversities in GWAS.

Results

An overview of SPAmix

SPAmix is designed for conducting GWASs in a study cohort including multiple subpopulations and/or admixed populations. It is comprised of two steps (see Fig. 1). In step 1, SPAmix requires fitting a null model to calculate model residuals, in which confounding factors such as age, sex, SNP-derived principal components (PCs), and other confounders are incorporated as covariates. The null model specification and the corresponding model residuals can vary depending on the type of trait. In the online Methods section, we demonstrated regression models to fit binary, quantitative, time-to-event, ordinal, and longitudinal traits, and the corresponding model residuals.

In step 2, SPAmix approach associates the traits of interest to single genetic variant by approximating the null distribution of score statistics $S=\sum_{i=1}^{n}{G}_{i}{R}_{i}$, where $n$ is the number of individuals, and ${G}_{i}$ and ${R}_{i},i\le n$ are the genotypes and model residuals, respectively. To characterize the diversity of allele frequencies (AFs) of this genetic variant for different ancestries, we assume that individuals come from different populations with varying AF. Instead, linear and logistic regressions were used to estimate individual-specific AFs $\widehat{{\varvec{q}}}={\left({\widehat{q}}_{1},\dots ,{\widehat{q}}_{n}\right)}^{T}$, in which raw genotypes are outcomes and SNP-derived principal components are model predictors. SPAmix uses a hybrid strategy including both normal distribution approximation and SPA to approximate the distribution of score statistics under the null hypothesis. The approximation relies on the assumption that genotype ${G}_{i}$ follows a binomial distribution Binom(2, ${\widehat{q}}_{i}$). In addition, model residuals are categorized to outliers and non-outliers based on interquartile range, and a partial normal distribution is used to reduce computational burden. SPAmix supports simultaneously analyzing multiple traits whose types are the same or different, which can avoid repeated calculation of the AF estimation.

Association analysis in the UK Biobank data

We applied SPAmix, SPACox, and trajGWAS to analyze 10 time-to-event traits and 11 longitudinal traits in UK Biobank data. A total of 369,314 unrelated individuals from multiple ancestries were included in genome-wide analyses (SPAmix, SPACox_ALL, and trajGWAS_ALL). For SPACox and trajGWAS, we additionally analyzed 308,328 (~ 83%) unrelated White British (WB) individuals (SPACox_WB and trajGWAS_WB). We also evaluated SPAmix_noSPA approach in which only normal distribution approximation is used to calibrate p values. The exact sample size in analysis, phecode, and event rates in WB and non-WB for the 10 time-to-event traits can be found in Additional file 1: Table S1. In this paper, we defined WB based on Field ID of 22,006, which is highly consistent to the genetic ancestry based on the genetically derived principal components. We define non-WB group as all participants from diverse ancestries (except WB) or admixed populations in the UKB. Note that non-WB is not a monolith ancestry group. The 11 longitudinal traits were extracted from primary care data in a similar way as in previous studies [31]. The evaluation of SPACox_WB is to highlight the importance of ancestry diversities in GWAS. We did not consider more complicated study design, for example, using SPACox to analyze separate populations and then conducting meta-analysis. Although this design can increase power, it is still expected to be less powerful than SPAmix as subjects in transitional regions would be excluded.

SPAmix is scalable to analyze large-scale biobank data

We selected Alzheimer’s disease (event rate = 0.2%) and essential hypertension (event rate = 27.1%) as two time-to-event trait examples (sample size = 338,044) to evaluate computation time of SPAmix for low and high event rates, respectively. All analyses were conduct on a CPU model of Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz. Memory usage for each job was less than 4 GB. In addition to SPAmix and SPACox, we also evaluated an R package gwasurvivr in which Wald test was used to calibrate p values. As the package gwasurvivr does not support BGEN format, we converted the genotype data to plink format. It is expected that reading text-based formats (such as VCF format) is slower than reading binary format (such as plink and BGEN formats). We analyzed 10,000 genetic variants randomly selected in chromosome 1, recorded the computation time, and then projected it to all chromosomes including 18,583,853 genetic variants (Fig. 2, Additional file 1: Table S2). In Additional file 1: Supplementary notes, we demonstrated that the computational time scaled linearly (Additional file 1: Fig. S61).

For GWAS of Alzheimer’s disease and essential hypertension, gwasurvivr took ~ 6354 and 6229 CPU hours, respectively. Meanwhile, SPAmix only took 501 and 553 CPU hours, which were 12.7 and 11.3 times faster than gwasurvivr. Furthermore, if these two traits were analyzed simultaneously, it only took 574 CPU hours, which was much lower than the sum of the computation time of gwasurvivr and slightly lower than that of SPACox. This indicates that reading genotype and calculating individual-specific allele frequency take a high proportion of the computation time, which can be reduced if multiple traits were analyzed simultaneously.

As a score test approach, SPAmix is much faster than tools developed using Wald test or likelihood ratio test, especially when analyzing multiple traits. The high computational efficiency ensures that SPAmix is scalable to a large-scale GWAS including hundreds of thousands of subjects.

SPAmix is accurate to analyze multi-way admixed populations

We chose four time-to-event traits including type 2 diabetes, essential hypertension, glaucoma, and asthma as examples for demonstration. Manhattan plots and quantile–quantile (QQ) plots are shown in Fig. 3, and QQ plots stratified by minor allele frequency in WB are shown in Fig. 4. For all these four traits, SPAmix and SPACox_WB can well control type I error rates, and SPAmix was more powerful than SPACox_WB. This is expected as SPAmix incorporated individuals from all ancestries and SPACox_WB removed ~ 17% non-white British individuals.

The accuracy of SPACox_ALL highly depends on the diversity of event rates for different ancestries. For SPACox_ALL, the high diversity of event rates resulted in a large number of false discoveries across the whole genome. An example is the type 2 diabetes for which the event rate in WB is ~ 6.55% and the event rates in non-WB is ~ 9.69%. QQ plots stratified by allele frequencies demonstrated that SPACox_ALL identified spurious discoveries of low-frequency and rare variants (Fig. 4). For Alzheimer’s disease and multiple sclerosis, the diversity of event rates is moderate and SPACox_ALL gave similar analysis results as SPAmix.

SPAmix_noSPA approach approximates the null distribution of the score statistics only using normal distribution approximation, and its performance depends on whether the phenotypic distribution is balanced or unbalanced. Of the four time-to-event traits, SPAmix_noSPA performed comparably as SPAmix only for type 2 diabetes (event rate is 7.1%). For the other 3 traits (event rate < 0.42%), SPAmix_noSPA questionably identified more associations than SPAmix and the findings mostly corresponded to low-frequency and rare variants (MAF < 0.05). The results were consistent with previous studies, which validated that SPA is necessary to control type I error rates if event rate is low, especially for low-frequency and rare variants [6, 15, 32]. The corresponding Manhattan plots, QQ plots, and QQ plots stratified by MAF for the remaining 6 time-to-event traits can be found in Additional file 1: Fig. S1-S3.

The analysis results for 11 longitudinal traits are similar as in time-to-event data analysis (Additional file 1: Fig. S4-S9). For example, in the analysis of BMI, spurious positive findings of low-frequency and rare variants were observed when using trajGWAS_ALL, and trajGWAS_WB is less powerful than SPAmix (Additional file 1: Fig. S9k). The results indicate a unified advance of SPAmix over regular methods for different trait types of interest.

Generally speaking, the real data analysis in UK Biobank data suggests that SPAmix is more powerful than SPACox_WB and trajGWAS_WB, which is expected as SPAmix allows to include all individuals. Meanwhile, SPACox_ALL and trajGWAS_ALL are only valid when the trait distributions are close to each other for different ancestries. This is because the empirical SPA methods used in SPACox and trajGWAS heavily rely on the assumption that all model residuals follow an identical distribution [6]. In addition, SPA is also required to control type one error rates, especially for low-frequency and rare variants. The above conclusions are consistent with the simulation studies demonstrated in the next sections.

SPAmix identified 377 additional significant loci by including non-WB individuals

We clustered genetic variants within 200-kb region as one locus. At a genome-wide significance level $\alpha\ =$ 5 $\times$ 10⁻⁸, SPAmix totally identified 419 and 1500 loci for the 10 time-to-event traits and 11 longitudinal traits (testing for ${\beta }_{g}$), of which, 113 and 264 loci were not identified by SPACox_WB and trajGWAS_WB, respectively. On the contrary, only 27 (27/113 = 23.9%) and 86 (86/264 = 32.6%) loci were identified by SPACox_WB and trajGWAS_WB but were not identified by SPAmix. If we set the significance level at $\alpha =$ 5 $\times$ 10⁻⁸/21, SPAmix identified 281 and 1050 loci for time-to-event traits and longitudinal traits, of which, 73 and 155 loci were not identified by SPACox_WB and trajGWAS_WB, respectively. On the contrary, only 11 (11/73 = 15.1%) and 48 (48/155 = 31.0%) loci were identified by SPACox_WB and trajGWAS_WB but were not identified by SPAmix. The complete list of the identified loci, including the top significant genetic variants (whose p value is the smallest in a locus) and the association results, are shown in Supplementary Tables S3. In this section, 3 loci were highlighted to exemplify the advantage of SPAmix over regular approaches as below.

For type 2 diabetes, SPAmix identified SNP rs71317817 (p value = 7.03 $\times$ 10⁻⁹), which is an intronic variant in UBE2E2 that encodes ubiquitin conjugating enzyme E2. A GWAS in Japanese population has also identified the susceptibility loci for type 2 diabetes [33]. Within 10-Mb region centered on this SNP, this SNP exerts cis-eQTL effects on UBE2E2 in different tissues, especially in DM-related tissues such as the pancreas (p value = 2.03 $\times$ 10⁻¹¹) [34]. We then queried the CAUSALdb [35], a database which incorporates over 3000 public full GWAS summary data and identified credible sets of causality by uniformly processed fine-mapping. The SNP is the most possible causal variation in its linkage disequilibrium (LD) block with the highest posterior probability for diabetes mellitus. In addition, the SNP is located in a region overlapped with some epigenomics signals such as transcription factor binding and open chromatin [36].

For asthma, SPAmix identified SNP rs4833095, a missense variant in TLR1 which encodes Toll Like Receptor 1. Ferreira et al. identified this SNP as a risk variant associated with the asthma with hay fever phenotype [37]. In addition, Chang et al. found that SNP rs4833095 was associated with asthma with a p value 7 × 10⁻⁶ through a whole genome sequencing study [38]. The above two studies indicate that this SNP could be a variant associated with asthma, and larger sample sizes may be required to achieve the genome-wide significance. In addition, we found that SNP rs4833095 exerted cis-eQTL effects on TLR1 in different tissues, which showed the strongest effect in blood with a p value 2.63 $\times$ 10⁻²⁸⁴. CAUSALdb also suggests that this SNP is speculated the most possible causal variant of asthma as well as allergic disease in its corresponding region [35].

For glaucoma, SPAmix identified SNP rs10262524 (p value = 4.86 $\times$ 10⁻⁸), an intergenic variant between CAV2 gene and CAV1 gene. Gharahkhani et al. previously identified rs10262524 as open-angle glaucoma risk variant (p value = 3 $\times$ 10⁻¹⁴) by a large-scale GWAS [39]. The SNP exerts cis-eQTL effects on CAV2 and CAV1 in many tissue types, and CAV2 shows the strongest eQTL effect size (p value = 3.77 $\times$ 10⁻⁷¹). In addition, rs9665610 is the most possible causal variation in its linkage disequilibrium (LD) block with the highest posterior probability for glaucoma and related phenotypes. The location of rs10262524 overlaps with transcription factor binding sites and open chromatin regions, suggesting an underlying epigenetic regulatory mechanism. Other CAV1-CAV2 loci associated with glaucoma include rs4236601, a SNP in LD with the SNP rs10262524 (r² = 0.96) [40,41,42,43,44,45].

Association analysis in the ALL of Us Data

To validate SPAmix in diverse populations, we performed genome-wide association studies (GWAS) for time-to-event hypertension in the All of Us cohort, analyzing both multi-ancestry (sample size = 188,941, event rate = 40.4%) and the European-ancestry subgroups (sample size = 112,080, event rate = 41.6%). After excluding variants with missing p values or MAF < 0.01, SPAmix identified 18 significant SNPs at a genome-wide significance level $\alpha =$ 5 $\times$ 10⁻⁸ in the multi-ancestry analysis, a 38% increase compared to the European-only subgroup (13 significant SNPs), as shown in Tables S4 and S5. Manhattan plots (Additional file 1: Fig. S10) demonstrated the power enhancement through incorporation of multi-ancestry individuals.

To demonstrate the validity of SPAmix in AllofUs data analyses, we highlight key loci showing consistent biological relevance. The KCNK3 locus (rs12476527) was significantly associated with hypertension in both multi-ancestry and European-ancestry analyses, supporting previous reports linking KCNK3 to hypertension [46]. Similarly, SPAmix replicated the known hypertension signal at rs1799945 in HFE across both multi-ancestry and European analyses, consistent with established GWAS evidence [47, 48].

Beyond replications, we also highlight biologically plausible associations uniquely identified by multi-ancestry analysis: (1) SNP rs167479 in RGL3 (p value = 4.12 $\times$ 10⁻⁸),whose hypertension association has independent validation, [49, 50] and (2) rs10786736 (p value = 4.45 $\times$ 10⁻⁸), a 3' UTR variant in the CNNM2 and NT5C2 genes on chromosome 10 with hypertension associations reported specifically in East Asian populations [51].

Simulation studies

To assess the performance of SPAmix in terms of type I error rates and powers in admixed population analyses, we simulated genotypes and time-to-event traits of $n=\text{10,000}$ subjects, mimicking an admixed population of European (EUR) and East Asian (EAS). Other types of traits were simulated in the following sections.

For each genetic variant, we simulated genotypes using ancestry vectors and allele frequency $\left({q}^{EUR},{q}^{EAS}\right)$ downloaded from the 1000 Genomes Project [52]. Depending on the difference of MAFs (i.e., Diff_MAF = ${q}^{EUR}-{q}^{EAS}$) and the minimal MAF value (i.e., minMAF = $\text{min}\left({q}^{EUR},{q}^{EAS}\right)$) in populations EUR and EAS, genetic variants were categorized into 15 groups (Additional file 1: Fig. S11). Two scenarios were used to simulate time-to-event traits. In scenario 1, the event rates in EUR and EAS were the same; and in scenario 2, the event rate in EUR was higher than that in EAS. In each scenario, we simulated traits with low event rates (ER_low), moderate event rates (ER_mod), and high event rates (ER_high). More details about the data simulation can be found in the Data simulation subsection of the online Methods section.

Type I error rates

The empirical type I error rates for the two scenarios based on 10⁹ association tests at a genome-wide significance level 5 $\times$ 10⁻⁸ are presented in Fig. 5a and S12, and Tables S6 and S7. The QQ plots for randomly selected 10⁶ tests are shown in Additional file 1: Fig. S13 and S14.

If event rates in EUR and EAS were the same (i.e., in scenario 1), SPAmix and SPACox generally performed well and can control type I error rates under all settings of MAFs and event rates. In a limited number of settings, SPACox produced slightly deflated type I error rates (Additional file 1: Table S6). For example, if Diff_MAF < < 0, 0.01 $\le$ Min_MAF $<$ 0.05 (i.e., minMAF_mod), and event rates in both EUR and EAS are 0.01 (i.e., ER_low), the empirical type I error rates corresponding to SPAmix and SPACox were 4.7 $\times$ 10⁻⁸ (0.94 $\alpha$) and 2.1 $\times$ 10⁻⁸ (0.42 $\alpha$), respectively. Meanwhile, for low event rates or low-frequency variants, SPAmix_noSPA cannot control type I error rates, which results in false positive discoveries (Additional file 1: Fig. S12). For example, if Diff_MAF ~ 0, Min_MAF $<0.01$ (minMAF_low), and event rates in both EUR and EAS are 0.01 (ER_low), the empirical type I error rate of SPAmix_noSPA was 0.0003 (> 6000 $\alpha$). The result suggests that SPA approaches overperform normal distribution approximation in terms of type I error rates control, especially for low event rates and rare variants. This conclusion is consistent with previous studies and the real data analysis, which demonstrates the necessity to use SPA to approximate the null distribution of the score statistics [6, 20, 32, 53, 54].

If event rates in EUR and EAS were different (i.e., in scenario 2), SPAmix can still calibrate p values accurately. Meanwhile, despite incorporating ancestry PCs to fit the null model, SPACox cannot control type I error rates, which resulted in largely deflated or inflated type I error rates depending on Diff_MAF and event rates. If Diff_MAF < < 0 or Diff_MAF < 0, type I error rates were deflated; and if Diff_MAF > 0 or Diff_MAF > > 0, type I error rates were inflated. The degree of inflation or deflation was greater if the minimal of ${q}^{EUR}$ and ${q}^{EAS}$ (Min_MAF) was closer to 0 (Fig. 5a). For example, if Diff_MAF > 0, $\text{min}\left({q}^{EUR},{q}^{EAS}\right)<0.01$ (i.e., minMAF_low), and event rates in EUR and EAS are 0.1 and 0.01 (i.e., ER_low), respectively, the empirical type I error rate of SPACox was 4.1 $\times$ 10⁻⁶ (82.4 $\alpha$). This demonstrated that SPACox cannot control type I error rates if the event rates are different across ancestries. In addition, similar to scenario 1, using only normal distribution produced inflated type I error rates. The QQ plots for randomly selected 10⁶ tests (Additional file 1: Fig. S14) showed that only SPAmix produced well-calibrated p values, which further validated the accuracy of SPAmix over SPACox and normal distribution approximation in terms of type I error rates control in the presence of variation of event rates and MAFs across ancestries. In addition, SPAmix well controlled type I error rates when analyzing a sample consisting of 90% EUR individuals and 10% EAS individuals, as illustrated in Additional file 1: Fig. S15.

Empirical power

The empirical powers of SPAmix, SPAmix_noSPA, and SPACox were evaluated in two scenarios based on 10⁴ tests at a significance level $\alpha =5\times {10}^{-8}$ under the alternative model with positive genetic effects (Fig. 5b and S16) and negative genetic effects (Additional file 1: Fig. S17 and S18), respectively. The genetic effect sizes were the same for different ancestries in this section. More discussion about the effect size heterogeneity can be found in the next sections. The QQ plots were shown in Additional file 1: Fig. S19-S22. If event rates in EUR and EAS were the same (i.e., in scenario 1), SPAmix and SPACox were similarly powerful in all cases, whereas SPACox_noSPA yield spurious higher powers due to the inflated type I error rates.

In scenario 2, that is, the event rates in EUR and EAS were different, if Diff_MAF < < 0 and Diff_MAF < 0, SPAmix were more powerful than SPACox. For instance, if genetic effects are positive, Diff_MAF < < 0, $0.01\le$ minMAF $<0.05$ (i.e., minMAF_mod), and event rates in EUR and EAS are 0.3 and 0.05 (i.e., ER_mod), respectively, the empirical power of SPAmix (21.02%), which was nearly threefold higher than SPACox (7.25%). This is due to the deflation of SPACox as observed in the type I error rates simulation studies. Under settings of Diff_MAF ~ 0, Diff_MAF > 0, and Diff_MAF > > 0, SPACox produced spurious higher powers as shown in Fig. 5b. To fairly evaluate powers, we also used test-specific empirical significance levels that yield type I error rate $\alpha =5\times {10}^{-8}$, as shown in Fig. 5b and Additional file1: Fig. S16. Using the empirical significance levels, SPAmix_noSPA (denoted as SPAmix_noSPA|emp) and SPACox (denoted as SPACox|emp) were similarly powerful as SPAmix in all cases.

The empirical powers for negative genetic effects (Additional file 1: Fig. S17 and S18) and QQ plots (Additional file 1: Fig. S21 and S22) display a similar trend as those for positive genetic effects, demonstrating that the above discussion about the powers is also valid. Moreover, the empirical powers when the genetic effects sizes are heterogeneous across ancestry groups are shown in Additional file 1: Fig. S23-S28, which also demonstrated that SPAmix maintained power in such cases. Generally speaking, only if event rates in different populations are nearly identical, SPACox is still accurate. Otherwise, using SPACox can result in a large number of false positive discoveries or power loss. Meanwhile, SPAmix is an efficient and accurate approach in all scenarios. Although we did not evaluate other types of traits, the conclusions are expected to be similar.

Simulation of a three-way admixed population

To validate that SPAmix method is not limited to a two-way admixed scenario, we also simulated genotypes and time-to-event phenotypes of $n=\text{10,000}$ subjects to mimic a three-way admixed population of European (EUR), African (AFR), and East Asian (EAS). More details about the simulation can be found in the online Method section. The results are consistent to the previous simulation studies. The QQ plots (Additional file 1: Fig. S29 and S30) demonstrate that SPAmix can still well control type I error rates while remaining powerful, and SPACox displayed inflation or deflation depending on the simulation scenarios. Since ancestry proportions were characterized based on individual-specific allele frequency estimation, SPAmix can be straightforwardly applied to analyze populations with more than three ancestries, which is also validated in the UK Biobank real data analysis.

SPAmix can control type I error rates in the case of variance heterogeneity and baseline hazard function heterogeneity

To address potential confounding arising from population stratification, SNP-derived principal components (PCs) were regularly incorporated as covariates within regression models, including linear regression to analyze quantitative traits and Cox PH regression to analyze time-to-event traits. However, this strategy still cannot adjust for population stratification if the population-related heterogeneity cannot be well characterized in the regression model. For example, a basic assumption of the linear regression is the phenotypic variance homogeneity. If the variances corresponding to distinct ancestries are different, the regular linear regression is invalid, even if the SNP-derived PCs were incorporated. Similarly, Cox PH regression implicitly assumes that baseline hazard functions for all individuals are the same, which could also be violated in an admixed population.

We conducted simulation studies (more details can be seen in the online Methods section) to compare regular linear regression and SPAmix-based methods. To mimic the variance heterogeneity in an admixed population, we simulated quantitative traits in which the phenotypic variances in EUR and EAS are different. Additional file 1: Fig. S31 and S32 demonstrated the empirical type I error rates and the QQ plots under the null hypothesis. SPAmix-based approaches can well control type I error rates in all scenarios. Meanwhile, although incorporating SNP-derived PCs as covariates, the regular linear regression methods including Wald test, likelihood ratio test, SAIGE [32], REGENIE [17], and fastGWA [55] were still invalid in the case of variance heterogeneity. Even in the case where the variance heterogeneity is limited (standard derivations in EUR and EAS are 1 and 0.8, respectively), the inflated or deflated type I error rates still occurred. The variance heterogeneity issue has been discussed in previous studies [25].

For time-to-event trait analysis, we also conducted simulation studies to evaluate SPAmix-based methods, SPACox, Wald test, and GATE in the case of heterogeneous baseline hazard functions corresponding to ancestries EUR and EAS (more details can be seen in the online Methods section). The empirical type I error rates and QQ plots are presented in Additional file 1: Fig. S33 and S34. SPACox, Wald test, and GATE had inflated or deflated type one error rates. Meanwhile, SPAmix always controlled type I error rates well in all scenarios, which demonstrates the accuracy and robustness of SPAmix over other regular methods in the case of baseline hazard function heterogeneity in an admixed population analysis.

The above failure of the regular linear regression and Cox regression is due to the model misspecification resulted from the diverse phenotypic distribution corresponding to the distinct ancestries, many of which cannot be well characterized via incorporating PCs. Surprisingly, as a retrospective approach, SPAmix can well address this issue, which suggests that it can perform as a more robust solution in the case of other latent heterogeneities when analyzing complex phenotypes.

In addition, the retrospective idea of SPAmix also contributes to a more robust performance if the phenotypic distribution is extremely unbalanced. To demonstrate it, we simulated an ordinal trait with four groups (more details can be seen in the online Methods section). The empirical type I error rates are presented in Additional file 1: Fig. S35. If the ratio of the sample sizes in these four groups is 100:1:1:1, the regular methods can still control type one error rates. However, if the ratio is 1000:1:1:1, the regular methods had inflated type I error rates. Meanwhile, SPAmix can control type I error rates well in all scenarios. Although the simulation scenario seems too extreme to be available in real data analysis, it still can demonstrate the advantage of SPAmix in terms of robustness and the accuracy.

SPAmix framework can account for ancestry-specific environmental factors

The empirical type I error rates of SPAmix in the cases where environmental factors are ancestry-specific are presented in Additional file 1: Fig. S36 and S37 (more details can be seen in the Additional file 1: Supplementary notes). The results showed that SPAmix can still control type I error rates and maintain powers regardless of incorporating the interactions between environmental factors and genetic-derived PCs or not, while SPACox had inflated or deflated type I error rates.

SPAmix and SAIGE show comparable type I error rates and powers

We conducted simulation studies to evaluate SPAmix and SAIGE in binary trait analysis in terms of type I error rates and powers. The empirical type I error rates based on 10⁵ association tests at a significance level 5 $\times$ 10⁻⁴ and QQ plots are presented in Additional file 1: Fig. S38 and S39. The results demonstrated that both SPAmix and SAIGE can well control type I error rates. The empirical powers based on 1000 tests at a significance level $\alpha =5\times {10}^{-8}$ under the alternative model are presented in Additional file 1: Fig. S40, which showed that SPAmix and SAIGE were similarly powerful in all cases. Therefore, SPAmix and SAIGE showed comparable type I error rates and powers when analyzing admixed populations.

SPAmix framework can utilize local ancestry and Cauchy combination to maximize power for various cross-ancestry genetic architectures

Tractor proposed a framework to use local ancestry to analyze admixed population, which can enhance power in the case of effect size heterogeneity. SPAmix framework can also straightforwardly incorporate local ancestry (more details can be seen in the online Methods section). We refer to this as SPAmix_local to distinguish it from the original SPAmix in which the effect sizes for various ancestries were assumed to be equal. In addition, we proposed to use Cauchy combination to calculate one p value (SPAmix_CCT) using p values from SPAmix and SPAmix_local (see online Methods). Extensive simulations were conducted to evaluate SPAmix, SPAmix_local, SPAmix_CCT, and Tractor for binary and quantitative phenotypes.

Under the null hypothesis that there are no genetic effects of two ancestries, SPAmix, SPAmix_local, and SPAmix_CCT can always control type I error rates. However, for binary phenotypes, Tractor had inflated type I error rates if the case–control ratio was unbalanced, i.e., low prevalence (see Fig. 6). This is because Tractor package follows a regular logistic regression to fit binary traits, which does not incorporate SPA and cannot control type I error rates in case of unbalanced phenotypic distribution [32]. For quantitative phenotypes, both SPAmix-based approaches and Tractor can well control type I error rates (see Additional file 1: Fig. S41). The result suggested that incorporating SPA increased the accuracy, especially for unbalanced phenotypic distribution.

The empirical powers were evaluated for a binary trait with a moderate disease prevalence of 0.2, under which both SPAmix-based approaches and Tractor can control type I error rates well. If the genetic effect sizes of different ancestries were equal, SPAmix was always more powerful than Tractor and SPAmix_local (see Fig. 7a). To mimic the genetic effect size heterogeneity, we fixed the genetic effect size of ancestry 1, i.e., ${\gamma }_{ance1}$, and increased genetic effect size of ancestry 2, i.e., ${\gamma }_{ance2}$. In general, SPAmix_local demonstrated similar power as Tractor, and its superiority over SPAmix depends on cross-ancestry genetic architectures (see Fig. 7b, S42, and S43). For example, if the MAF in ancestry 2 was high and the corresponding effect size was close to 0, the power of SPAmix was close to 0 but SPAmix_local and Tractor can still identify the genetic variants. The empirical powers for quantitative traits were consistent and demonstrated in Additional file 1: Fig. S44-S47.

The above power comparisons between Tractor and regular approaches have been comprehensively discussed elsewhere [27]. In this paper, we proposed SPAmix_CCT which uses Cauchy combination to maximize powers. The simulation results suggested that SPAmix_CCT was always close to the most powerful methods for various cross-ancestry genetic architectures, which indicates that SPAmix_CCT can be an optimal unified approach (see Fig. 7 and Additional file 1: S42-S47).

SPAmix_local and SPAmix_CCT can control type I errors in the case of variance heterogeneity

We conducted simulation studies to evaluate SPAmix_local, SPAmix_CCT, and Tractor when analyzing quantitative traits in the scenarios of variance heterogeneity. Under the null hypothesis, QQ plots demonstrated that SPAmix_local and SPAmix_CCT can well control type I error rates in all scenarios (see Additional file 1: Fig. S47). However, Tractor cannot control type I error rates in the scenario of variance heterogeneity. When testing for the genetic effect of the ancestry with larger variance, Tractor results in inflated type I error rates. And when testing for the genetic effect of the ancestry with smaller variance, Tractor results in deflated type I error rates. The degree of inflation or deflation is positively correlated with the degree of the variance heterogeneity. The results well demonstrated the superiority of SPAmix framework over Tractor in terms of accurate type I error control for analysis of quantitative traits with variance heterogeneity in an admixed population.

UK Biobank data analysis validated that SPAmix overperformed SPACox

We selected 10,000 individuals in UK Biobank data to mimic simulation studies. More details about the individual selection can be found in the online Methods section. The QQ plots (Additional file 1: Fig. S49) demonstrated that SPAmix successfully identified a large number of variants while well controlling for type I error rates. In contrast, SPACox resulted in anticonservative or conservative p values depending on the correlation between event rate and MAF across ancestries, leading to a large number of false positive discoveries or loss of power. The results clearly demonstrated that SPACox cannot control type I error rates when analyzing a population with both genetic and phenotypic heterogeneity. The trend of inflation or deflation of p-values is completely consistent to the simulation studies. In contrast, SPAmix produced well-calibrated p-values even in the presence of genetic and phenotypic heterogeneity in the population.

Discussion

Most of the current GWAS studies have been conducted on individuals of European descent, resulting in a lack of diversity and underpowered analyses for other populations. As a result, there is an urgent need for more diverse representation in GWAS studies to enable comprehensive and accurate analysis of genetic variants across different populations.

Admixed populations are groups of individuals whose genetic ancestries are from two or more than two ancestral populations. The population mixture provides unique opportunities to study genetic diversity, disease susceptibility, and complex traits [56,57,58,59,60]. However, the admixed population analysis is technically challenging as it requires addressing the complicated patterns of genetic and phenotypic diversities that arise from distinct genetic backgrounds and environmental exposures. While using SNP-derived PCs are commonly used to account for population stratifications, the simulation studies have shown that it could be inadequate which result in inflation when analyzing admixed samples in some cases. The variance heterogeneity of a quantitative trait is a classic example where regular methods may be ineffective. Several methods have been proposed for GWAS in admixed samples, but they are mostly designed for quantitative traits and binary traits. Scalable and accurate methods for complex traits with intricate structures (e.g., time-to-event and longitudinal traits) are still urgently needed.

In this work, we propose SPAmix, a scalable, accurate, and universal analysis framework to conduct GWASs in a biobank-scale dataset consisting of subjects with complicated genetic structure. SPAmix estimates individual-specific allele frequencies to characterize the individual-specific genetic ancestries using information from the SNP-derived PCs and raw genotype data in a model-free approach. Thus, it does not necessitate accurate specification of and the availability of appropriate reference population panels for the ancestries contributing to the individual, which might be unknown or not well defined [61]. SPAmix is a variant-level analysis and the core codes are written in C + +, which supports a rigorous control of memory usage. In addition, as a retrospective approach, SPAmix is not sensitive to model misspecification (e.g., missed or biased confounder-trait associations) and trait-based ascertainment [28]. In the case of model misspecification, SPAmix can control type I error rates well, but also could be less powerful, which could be interesting to explore in the future study.

In general, the novelty of the state-of-art methodology gives SPAmix the below advances in GWAS. First, in the SPAmix framework, only modeling residuals after fitting a null model is required and thus it can be easily used to analyze any type of complex trait. In the simulation studies and real data analysis, we demonstrated it via binary, quantitative, time-to-event, ordinal, and longitudinal traits. We note that REGENIE recently released a new version to support time-to-event trait. Preliminary analysis results demonstrated that the time-to-event trait analysis cannot control time one error rates, similar to SPACox, GATE, and Wald test (see Additional file 1: Supplementary notes). Second, SPAmix calibrates p values using a hybrid strategy to combine regular normal distribution approximation and SPA. Using SPA can more accurately approximate the null distribution of the test statistics, especially if the phenotypic distribution is unbalanced. Third, SPAmix only requires fitting one null model for a genome-wide analysis. In addition, SPAmix also uses some technical strategies, such as partial normal distribution approximation in SPA, to further reduce the computation time. UK Biobank real data analysis demonstrated that SPAmix is scalable to analyze a large-scale biobank with hundreds of thousands of individuals. Importantly, SPAmix framework does not need separating subjects into multiple ancestry-homozygous groups, and all subjects can be included into one genome-wide analysis. Finally, SPAmix has been incorporated into our well-developed R package GRAB (see URLs), in which PLINK and BGEN file formats are supported.

The original SPAmix assumes that the genetic effects of different ancestries are equal. Meanwhile, if the local ancestry information is available, SPAmix can also be straightforwardly extended to SPAmix_local which generally performs similar to Tractor and be more accurate when analyzing a quantitative trait with phenotypic variance heterogeneity and a binary trait with low case–control ratio. In addition, we also proposed SPAmix_CCT to combine the original SPAmix and SPAmix_local to maximize statistical powers for various cross-ancestry genetic architectures. Its ability to account for ancestry-specific effect sizes is a valuable addition to regular approaches [17, 18, 32].

There are several limitations in SPAmix. First, SPAmix assumes that genotype in subpopulations follows a binomial distribution, which could be violated. Interestingly, in UK Biobank data analysis, although we used imputation dosage data and did not filter out variants based on HWE test, it seems no inflation/deflation were observed, which releases the concerns of the HWE assumption on SPAmix. More discussions about the binomial distribution assumption [62] can be found in Additional file 1: Supplementary notes. Second, SPAmix cannot adjust for family relatedness and thus only unrelated individuals can be analyzed. In the near future, we plan to extend SPAmix to account for sample relatedness via incorporating genetic relationship matrix (GRM) to adjust for the variance of the score statistics. Third, SPAmix cannot estimate genetic effect size via fitting an alternative model. SPAmix can serve as a screening process to prioritize variants. If genetic effect size is required for the subsequent analyses, users can select the top variants and then fit alternative models for these variants. Alternatively, we proposed an approach to estimate effect size, which adopted the strategies utilized in GATE [18], BOLT-LMM [63], and SAIGE [32] and performs comparably as the Wald test and GATE (see Additional file 1: Supplementary notes). Finally, the current version of SPAmix only supports analyzing diploid species such as human.

Currently, quantitative and binary traits remain the most commonly used types of traits in GWAS. Meanwhile, there is a growing trend toward utilizing traits with intricate structures, such as time-to-event, longitudinal, and ordinal traits, as they can convey essential (but often overlooked) information. Unfortunately, applying well-developed statistical models and testing methods in large-scale GWAS is usually challenging due to high computational burden and other practical challenges in genetics studies. To bridge the gap between statistics and GWAS, we previously introduced SPACox as a universal analysis framework to calibrate p values. And currently, we propose SPAmix to address the population stratification which is a significant practical challenge in GWAS, especially for cross-ancestry analyses. Through simulation studies and real data analyses, we demonstrated that SPAmix can efficiently analyze a dataset with over 400,000 individuals and adjust for type I error rates even if the phenotypic distribution is extremely unbalanced. Our method provides a scalable, accurate, and universal solution for GWAS in a large-scale biobank dataset including individuals from multiple genetic ancestries. UK Biobank may not be the best scenario as the majority of samples of the data are from a single homogeneous ancestry group. More ancestrally diverse and admixed dataset, for example, the AllofUs Program [64], the Trans-Omics for Precision Medicine (TOPMed) Program [65], the Mexican Biobank [66], and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) [67] could be better choices. The preliminary analysis using AllofUS dataset validates the robustness of SPAmix, and we anticipate the discovery of more valuable insights that may result from utilizing SPAmix.

Conclusions

We propose SPAmix as a new analysis framework which is scalable to analyze biobank-scale data, is accurate to analyze phenotypes with an unbalanced distribution and rare variants, and is universal for complex traits with intricate structures (e.g., time-to-event trait and longitudinal trait) and a multi-way admixed population.

URLs

GRAB (version 0.1.1), https://CRAN.R-project.org/package=GRAB. trajGWAS (version 0.4.6), https://openmendel.github.io/TrajGWAS.jl/stable/. gwasurvivr (version 1.18.0), https://bioconductor.org/packages/release/bioc/html/gwasurvivr.html. SAIGE (version 1.1.9), https://saigegit.github.io/SAIGE-doc/. REGENIE (version 3.3), https://rgcgithub.github.io/. GCTA (version 1.94.1) http://cnsgenomics.com/software/gcta/#fastGWA. Tractor (version 0.0.1), https://github.com/Atkinson-Lab/Tractor. UK Biobank analysis summary statistics, https://zenodo.org/record/8343684.

Methods

Model residuals of linear regression and logistic regression

Linear regression and logistic regression are regular models to fit quantitative traits and binary traits. For subject i, we define model residuals ${R}_{i}$ as the difference between response variable ${Y}_{i}$ and its estimated expectation $\widehat{\text{E}\left({Y}_{i}\right)}$ under the null model. Suppose that the study cohort consists of $n$ subjects, we let ${\varvec{G}}={\left({G}_{1},\dots ,{G}_{n}\right)}^{T}$ denote the genotype vector of a genetic variant. The score statistic to associate trait of interest and genetic variant is $S=\sum_{i=1}^{n}{G}_{i}{R}_{i}={{\varvec{G}}}^{T}{\varvec{R}}$, where ${\varvec{R}}={\left({R}_{1},\dots ,{R}_{n}\right)}^{T}$.

Cox proportional hazard model and SPACox

For subject $i\le n$, Cox proportional hazard model specifies the hazard function $\uplambda \left(t;{X}_{i},{G}_{i}\right)$ for the failure (i.e., event occurs) time ${T}_{i}^{*}$ associated with ${G}_{i}$ and a $p\times 1$ vector of covariates ${X}_{i}$ in the form of

$$\uplambda \left(t;{X}_{i},{G}_{i}\right)={\uplambda }_{0}\left(t\right)\text{exp}\left({X}_{i}^{T}\beta +{G}_{i}\gamma \right),$$

where ${\uplambda }_{0}\left(t\right)$ is the baseline hazard function, ${X}_{i}$ is a $k\times 1$ vector of covariates including age, sex, SNP-derived principal components (PCs), and the other confounding factors, $\beta ={\left({\beta }_{1},\dots ,{\beta }_{k}\right)}^{T}$ is the coefficient vector for the covariates, and $\gamma$ is the genetic effect $.$ The observed time-to-event phenotype is $\left({T}_{i},{\delta }_{i},{X}_{i},{C}_{i}\right)$, where ${C}_{i}$ denotes the censoring time, ${T}_{i}=\text{min}\left({T}_{i}^{*},{C}_{i}\right)$ denotes the observed time-to-event, ${\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)$ indicates that failure is observed, and $\text{I}\left(.\right)$ is an indicator function.

To test for the null hypothesis ${\text{H}}_{0}: \gamma =0$, we fit the null Cox PH model as

$$\uplambda \left(t;{X}_{i}\right)={\uplambda }_{0}\left(t\right)\text{exp}\left({X}_{i}^{T}\beta \right),$$

and estimate the coefficients $\widehat{\beta }$. Score statistics $S=\sum_{i=1}^{n}{G}_{i}{R}_{i}={{\varvec{G}}}^{T}{\varvec{R}}$, where ${R}_{i}, i\le n$ is the martingale residual for subject $i$ and $\sum_{i=1}^{n}{R}_{i}=0$. For time-to-event data analysis, the approaches based on Cox PH model is more powerful than regular logistic regression-based methods [6, 12]. More details, including model fitting and theoretical derivations about the score statistics, can be found in previous work [6].

SPACox is a prospective approach that considers martingale residuals ${R}_{1},\dots ,{R}_{n}$ as independently and identically distributed (i.i.d.) random variables [6]. A hybrid strategy, including normal distribution approximation and saddlepoint approximation (SPA), is then used to calibrate p values. Since the null model is the same for all genetic variants, the model fitting and the empirical approximation of the distribution of the martingale residuals are only required once across the genome-wide analysis, which facilitates SPACox to be computationally efficient for time-to-event trait analysis in GWAS. In addition, SPACox is accurate as SPA uses an entire cumulant generating function (CGF) to approximate the null distribution of score statistics.

Longitudinal trajectories and trajGWAS

The empirical SPA strategy proposed by SPACox can be easily extended to other types of traits. For example, Ko et al. proposed trajGWAS in which the empirical SPA strategy was employed to analyze longitudinal trajectories [31]. Fitting a null linear mixed model to longitudinal measurements, score statistics ${S}_{{\beta }_{g}}={R}_{{\beta }_{g}}^{T}G$ and ${S}_{{\tau }_{g}}={R}_{{\tau }_{g}}^{T}G$ are then used to test the mean effect and within-subject (WS) variability for each genetic variant, in which ${R}_{{\beta }_{g}}$ and ${R}_{{\tau }_{g}}$ are the corresponding model residuals. As trajGWAS follows a similar empirical strategy as SPACox, its performance is expected to be similar as SPACox and thus we did not conduct simulation studies for longitudinal trait analysis. More details about the phenotype data clean can be found in the previous study [31]. The only exception is to retain all individuals, instead of only retaining White British individuals.

Ordinal traits and proportional odds logistic regression model

Ordinal traits are widely available in biobanks to measure human behaviors, satisfaction, and preferences. We let J denote the number of category levels. For subject $i\le n$, let ${Y}_{i}$ = 1, 2, …, J denote the ordinal phenotype. To characterize ordinal traits, the proportional odds logistic regression models as below has been applied in GWAS [7].

$$\text{logit}\left({\nu }_{ij}\right)={\varepsilon }_{j}-{\eta }_{i}={\varepsilon }_{j}-{X}_{i}^{T}\beta -{G}_{i}\gamma , i\le n, j\le J,$$

where ${X}_{i}$ is a vector of covariates of confounding factors including age, sex, SNP-derived principal components (PCs), etc., ${G}_{i}$ is the genotype, $\beta$ is the coefficient vector for the covariates, and $\gamma$ is the genetic effect $.$ The cumulative probability of ${Y}_{i}\le j$ conditional on ${X}_{i}$ and ${G}_{i}$ is denoted by ${\nu }_{ij}=\text{Pr}\left({Y}_{i}\le j|{X}_{i},{G}_{i}\right)$, and the cutpoints ${\varepsilon }_{j}, j\le J$ are to categorize the data. Following the above model, the score statistics to test parameter $\gamma$ is $S=\sum_{i=1}^{n}{G}_{i}{R}_{i}$. More details, including model fitting, theoretical derivations about the score statistics, and the model residuals ${R}_{i},i\le n$ can be found in previous work [7].

SPAmix uses individual-specific allele frequency to adjust for admixed population

The rationality of the empirical SPA strategy in SPACox heavily relies on the identical distribution assumption of the residuals. However, this assumption could be violated if the study cohort is from multiple genetic ancestries. Instead of characterizing the complicated ancestry-specific effect on traits, we used a retrospective approach.

In this section, we use Cox proportion hazard model and time-to-event phenotype as an example to demonstrate SPAmix. For each genetic variant, SPAmix uses both normal distribution and SPA to approximate the null distribution of score statistics $S=\sum_{i=1}^{n}{G}_{i}{R}_{i}$ conditional on time-to-event phenotypes ${\varvec{T}}={\left({T}_{1},{T}_{2},\cdots ,{T}_{n}\right)}^{T}$, indicators ${\varvec{\delta}}={\left({\delta }_{1},{\delta }_{2},\cdots ,{\delta }_{n}\right)}^{T}$, and covariates ${\varvec{X}}$, or equivalently, conditional on martingale residuals ${R}_{i},i\le n$. In other words, SPAmix treats martingale residuals ${R}_{i},i\le n$ as fixed values and genotypes ${G}_{i},i\le n$ as random variables.

There have been previous studies considering genotype as a random variable following a binomial distribution but most of them assume the same allele frequency for different subjects [29, 68]. Here, we relax the assumption and use SNP-derived PCs and raw genotype to estimate the individual-specific allele frequency for each genetic variant. Intuitively, for individuals in the same ancestry, the estimated allele frequencies are always close to each other. Meanwhile, for individuals in different ancestries, the estimated allele frequencies could be different. In either case, the individual-specific allele frequency estimation is more accurate, since in the real world, well-delimited genetic populations are rarely found within species, and there is often heterogeneity within “ancestry groups” and a continuous spectrum of relatedness across them [69]. It is crucial to embrace a multidimensional and continuous understanding of ancestry [70]. The notion of homogeneous populations merely offers an approximate representation of human data and fails to capture the true distribution of genetic variants and the continuous movement of people which leads to varying degrees of mixture among global populations [69,70,71,72,73,74,75]. The individual-specific allele frequency estimation remains robust, regardless of ancestral diversity populations or admixture within populations.

Individual-specific allele frequency estimation

In this paper, we assume that individuals come from different populations with varying allele frequencies (AFs). We employ the term “individual-specific AFs” as defined by Conomos et al. [61] to describe the variation in genotype distribution among individuals from different populations. For a given genetic variant, let ${\varvec{q}}={\left({q}_{1},\dots ,{q}_{n}\right)}^{T}$ represent the individual-specific allele frequencies for $n$ subjects, where ${q}_{i}$ denotes the allele frequency of the variant for subject $i, i\le n$. Under the assumption of Hardy–Weinberg equilibrium (HWE), the genotype ${G}_{i}$ is modeled as following an independent binomial distribution Binom(2, ${q}_{i}$) [61, 68].

If all $n$ subjects belong to a homogeneous population, the allele frequencies are identical across individuals, i.e., ${q}_{1}={q}_{2}=\dots ={q}_{n}$. In this case, ${\varvec{q}}$ can be estimated as:

$${\widehat{{\varvec{q}}}}^{0}={\left({\widehat{q}}_{1}^{0},\dots ,{\widehat{q}}_{n}^{0}\right)}^{T},\text{ where }{\widehat{q}}_{1}^{0}={\widehat{q}}_{2}^{0}=\dots ={\widehat{q}}_{n}^{0}=\left(\sum_{i=1}^{n}{G}_{i}\right)/2n.$$

However, the above estimation becomes inaccurate if the subjects are from distinct subpopulations or an admixed population [53, 76]. Ancestry PCs are commonly used to characterize the population structure [76]. While PCs are typically included as covariates in model fitting, they also be leveraged to estimate allele frequencies [61]. Here, we propose a hybrid strategy to estimate ${\varvec{q}}$ using raw genotype data and PCs [61].

Based on a linear regression model, ${\varvec{q}}$ can be estimated as:

$${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}=\left(1/2\right)\cdot {\widetilde{{\varvec{X}}}}_{PC}{\left({\widetilde{{\varvec{X}}}}_{PC}^{T}{\widetilde{{\varvec{X}}}}_{PC}\right)}^{-1}{\widetilde{{\varvec{X}}}}_{PC}^{T}{\varvec{G}},$$

where ${\widetilde{{\varvec{X}}}}_{PC}=\left(1, {{\varvec{X}}}_{PC}\right)$ is a matrix consisting of a vector of ones and the top PCs [61]. However, ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$ may not always be optimal, as some elements of the vector ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$ can fall outside the interval [0,1], particularly for rare variants. To address this issue, we introduced a cutoff ${q}_{c}$ (default: ${q}_{c}=0$) and update ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$ as

$${\widehat{q}}_{lmc}={\left({\widehat{q}}_{lmc}^{1},\dots ,{\widehat{q}}_{lmc}^{n}\right)}^{T}, \text{where}\ {\widehat{q}}_{lmc}^{i}={\widehat{q}}_{lm}^{i}{\text{I}}_{\left[{q}_{c}\le {\widehat{q}}_{lm}^{i}\le 1-{q}_{c}\right]}+{q}_{c}{\text{I}}_{\left[{\widehat{q}}_{lm}^{i}<{q}_{c}\right]}+\left(1-{q}_{c}\right){\text{I}}_{\left[{\widehat{q}}_{lm}^{i}>1-{q}_{c}\right]}.$$

Alternatively, logistic regression model can be employed to estimate ${\varvec{q}}$. Here, the response variable is the indicator ${\text{I}}_{\left[{G}_{i}\ge 0.5\right]}$, and the predictors are the top PCs. The probability $\text{Pr}\left({G}_{i}\ge 0.5\right)$ is estimated as $\sigma ({\widehat{\eta }}_{i})$, where ${\widehat{\eta }}_{i}$ is the estimated linear predictor, and $\sigma \left(x\right)=1/(1+\text{exp}(-x))$. Under HWE, the allele frequency is estimated as:

$${\widehat{{\varvec{q}}}}_{{\varvec{g}}{\varvec{l}}{\varvec{m}}}={\left({\widehat{q}}_{glm}^{1},\dots ,{\widehat{q}}_{glm}^{n}\right)}^{T},\text{ where }{\widehat{q}}_{glm}^{i}=1-\sqrt{\widehat{\text{Pr}}\left({G}_{i}=0\right)}=1-\sqrt{1-\sigma \left({\widehat{\eta }}_{i}\right)}.$$

Notably, all elements of ${\widehat{{\varvec{q}}}}_{{\varvec{g}}{\varvec{l}}{\varvec{m}}}$ lie within the interval [0,1], making it more robust than ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$ especially for low-frequency and rare variants.

To balance accuracy and computational efficiency, we propose a hybrid strategy. First, we fit a linear regression model to calculate ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$. If the proportion of the elements in ${\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}}$ that fall within [0,1] exceeds a predefined threshold $r$ (default: 0.9), we use $\widehat{{\varvec{q}}}={\widehat{{\varvec{q}}}}_{{\varvec{l}}{\varvec{m}}{\varvec{c}}}$ as the final estimate. Otherwise, we extract PCs with p values less than 0.05 in the linear regression and compute $\widehat{{\varvec{q}}}={\widehat{{\varvec{q}}}}_{{\varvec{g}}{\varvec{l}}{\varvec{m}}}$ as the final estimate.

Normal distribution approximation and saddlepoint approximation

Given individual-specific allele frequencies, the mean and variance of $S$ conditional on $\left({\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$ under the null hypothesis H₀ are

$${E}_{{\text{H}}_{0}}\left(S|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)={E}_{{\text{H}}_{0}}\left({{\varvec{R}}}^{T}{\varvec{G}}|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)=\sum_{i=1}^{n}{R}_{i}\cdot 2{q}_{i}$$

and

$$Va{r}_{{\text{H}}_{0}}\left(S|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)=Va{r}_{{\text{H}}_{0}}\left({{\varvec{R}}}^{T}{\varvec{G}}|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)=\sum_{i=1}^{n}{R}_{i}^{2}\cdot 2{q}_{i}\left(1-{q}_{i}\right),$$

respectively. For normal distribution approximation, we replace ${\varvec{q}}$ with the estimated $\widehat{{\varvec{q}}}$ and then use a normal distribution with a mean of $\widehat{\mu }=\sum_{i=1}^{n}2{R}_{i}{\widehat{q}}_{i}$ and a variance of ${\widehat{\sigma }}^{2}=\sum_{i=1}^{n}{R}_{i}^{2}\cdot 2{\widehat{q}}_{i}\left(1-{\widehat{q}}_{i}\right)$ to approximate the null distribution of score statistics $S$. Under the null hypothesis, the probability $\text{Pr}\left(S<s|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)=\text{Pr}\left(S-\widehat{\mu }<s-\widehat{\mu }|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$ can be estimated by $\Phi \left\{\left(s-\widehat{\mu }\right)/\widehat{\sigma }\right\}$, where $\Phi \left\{.\right\}$ is the cumulative distribution function (CDF) of the standard normal distribution.

The normal distribution approximation only relies on the first two cumulants (i.e., mean and variance) of the distribution. Thus, it cannot control type I error rates at a stringent genome-wide significance level if the event rate is low, particularly when testing low-frequency variants [15]. To address this issue, we propose a retrospective SPA method which utilizes knowledge of all cumulants expressed through values of the CGF. Compared to the normal distribution approximation, SPA is more accurate if the observed score statistic is at the tails of the distribution. More importantly, SPA is not restricted to a symmetric distribution [77, 78].

Suppose that genotype ${G}_{i},i\le n$ follow a binomial distribution Binom $(2,{q}_{i})$, the estimated moment generating function (MGF) of ${G}_{i}$ and its derivatives are

$${\widehat{M}}_{{G}_{i}}\left(t\right)={\left(1-{\widehat{q}}_{i}+{\widehat{q}}_{i}{e}^{t}\right)}^{2}, {\widehat{M}}_{{G}_{i}}^{\prime}\left(t\right)=2{\widehat{q}}_{i}{e}^{t}\cdot \left(1-{\widehat{q}}_{i}+{\widehat{q}}_{i}{e}^{t}\right),$$

$${\widehat{M}}_{{G}_{i}}^{{\prime}{\prime}}\left(t\right)=2{\left({\widehat{q}}_{i}{e}^{t}\right)}^{2}+2{\widehat{q}}_{i}{e}^{t}\cdot \left(1-{\widehat{q}}_{i}+{\widehat{q}}_{i}{e}^{t}\right).$$

The corresponding cumulant generating function (CGF) is ${\widehat{K}}_{{G}_{i}}\left(t\right)=\text{ln}{\widehat{M}}_{{G}_{i}}(t)$, and its derivates are

$${\widehat{K}^{\prime}}_{{G}_{i}}\left(t\right)=\frac{{\widehat{M}}_{{G}_{i}}^{\prime}\left(t\right)}{{\widehat{M}}_{{G}_{i}}\left(t\right)},\ {\widehat{K}}_{{G}_{i}}^{{\prime}{\prime}}\left(t\right)=\frac{{\widehat{M}}_{{G}_{i}}^{{\prime}{\prime}}\left(t\right){\widehat{M}}_{{G}_{i}}\left(t\right)-{\left[{\widehat{M}}_{{G}_{i}}^{\prime}\left(t\right)\right]}^{2}}{{\left[{\widehat{M}}_{{G}_{i}}\left(t\right)\right]}^{2}}.$$

Hence, under the null hypothesis, the estimated CGF of $S$ conditional on $\left({\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$, or equivalently ${\varvec{R}}$, is

$$\widehat{H}\left(t\right)={\sum }_{i=1}^{n}{\widehat{K}}_{{G}_{i}}({R}_{i}t)={\sum }_{i=1}^{n}\text{ln}{\widehat{M}}_{{G}_{i}}\left({R}_{i}t\right).$$

and its first and second derivatives are

$${\widehat{H}}^{\prime}\left(t\right)={\sum }_{i=1}^{n}{R}_{i}{\widehat{K}}_{{G}_{i}}^{\prime}({R}_{i}t), {\widehat{H}}^{{\prime}{\prime}}\left(t\right)={\sum }_{i=1}^{n}{\left({R}_{i}\right)}^{2}{\widehat{K}}_{{G}_{i}}^{{\prime}{\prime}}\left({R}_{i}t\right).$$

For any observed score $s$ and observed martingale residuals ${R}_{i}, i\le n$, we first calculate $\widehat{\upzeta }$ such that ${\widehat{H}}^{\prime}\left(\widehat{\upzeta }\right)=s$, then we calculate $\widehat{\omega }=\text{sgn}\left(\widehat{\upzeta }\right)\sqrt{2\left(\widehat{\upzeta }s-\widehat{H}\left(\widehat{\upzeta }\right)\right)}$ and $\widehat{\nu }=\widehat{\upzeta }\sqrt{{\widehat{H}}^{{\prime}{\prime}}\left(\widehat{\upzeta }\right)}$. Under the null hypothesis, the conditional probability $\text{Pr}\left(S<s|{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$ can be estimated by $\Phi \left\{\widehat{\omega }+\frac{1}{\widehat{\omega }}\cdot \text{log}\left(\frac{\widehat{\nu }}{\widehat{\omega }}\right)\right\}$, where $\Phi \left\{.\right\}$ is the CDF of the standard normal distribution. The proposed method is a straightforward extension of Barndorff-Nielsen’s formula [79].

To obtain p values efficiently while avoiding false positive discoveries, we adopt a hybrid strategy to combine normal distribution approximation and saddlepoint approximation. If the absolute value of the centered observed statistics $\left|s-\widehat{\mu }\right|<r\widehat{\sigma }$, where $r=2$ is a pre-specified value [4, 6, 15, 54], we use normal distribution approximation to obtain the p value. Otherwise, the saddlepoint approximation method is used to calibrate p values in tail areas. SPAmix outputs a two-sided p value of ${p}_{l}+{p}_{r}$, where the left-tailed and right-tailed p values are ${p}_{l}=\widehat{\text{Pr}}\left(S-\widehat{\mu }<-1\cdot \left|s-\widehat{\mu }\right||{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$ and ${p}_{r}=\widehat{\text{Pr}}\left(S-\widehat{\mu }>\left|s-\widehat{\mu }\right||{\varvec{T}},{\varvec{\delta}},{\varvec{X}}\right)$, respectively.

Partial normal distribution approximation in SPA

To further reduce the computation time, we used a partial normal distribution in SPA, similar as fastSPA [15]. We check the residual distribution and define outliers. Suppose that the 25th and 75th percentiles of the residuals are ${\alpha }_{0.25}$ and ${\alpha }_{0.75}$, respectively, the interquartile range (IQR) is ${\alpha }_{0.75}-{\alpha }_{0.25}$. If the residual > ${\alpha }_{0.75}+\text{IQR}\cdot \gamma$ or < ${\alpha }_{0.25}-\text{IQR}\cdot \gamma$, then the residual is an outlier. In this paper, we use $\gamma =1.5$.

Given the definitions of outlier, we reformulate the score statistics

$$S={\sum }_{i=1}^{n}{G}_{i}{R}_{i}={\sum }_{i=1}^{{n}_{1}}{G}_{i}{R}_{i}+{\sum }_{i={n}_{1}+1}^{n}{G}_{i}{R}_{i},$$

where the residuals of the first ${n}_{1}$ individuals are outliers and the others are not. We let

$${S}_{o}={\sum }_{i=1}^{{n}_{1}}{G}_{i}{R}_{i}, {S}_{non}={\sum }_{i={n}_{1}+1}^{n}{G}_{i}{R}_{i},$$

then, the MGF and CGF are calculated for ${S}_{o}$ as in the previous sections, and normal distribution approximation is applied to calculate the MGF and corresponding CGF for ${S}_{non}.$ The partially normal approximation can greatly reduce computational complexity. More details on the accuracy trade-offs of the partial normal distribution approximation and the use of IQR for categorizing residuals into outliers and non-outliers can be found in Additional file 1: Supplementary notes.

SPAmix_local tests for ancestry-specific genetic effect using ancestry-specific score statistics

Suppose that the study cohort consists of $n$ subjects from an admixed population composed of $K$ ancestries, we let ${\varvec{G}}={\left({G}_{1},\dots ,{G}_{n}\right)}^{T}$ denote the genotype vector of a genetic variant and ${{\varvec{G}}}^{\left(k\right)}={\left({G}_{1}^{\left(k\right)},\dots ,{G}_{n}^{\left(k\right)}\right)}^{T},k\le K$, denote the vector of the number of copies coming from the kth ancestry, i.e., the genotypes from the kth ancestry. Instead of associating genotype ${\varvec{G}}$ to a phenotype of interest, Tractor and SPAmix_local are designed to test for ancestry-specific genetic effect, i.e., to associate ancestry-specific genotypes ${{\varvec{G}}}^{\left(k\right)}$ and the phenotype of interest. The original SPAmix calibrates p values by approximating the null distribution of score statistics $S=\sum_{i=1}^{n}{R}_{i}{G}_{i}$ in which ${R}_{i}$ is the null model residual for subject $i$. And we extend it to SPAmix_local in which ancestry-specific p values are calculated corresponding to the ancestry-specific score statistics ${S}^{\left(k\right)}=\sum_{i=1}^{n}{R}_{i}{G}_{i}^{\left(k\right)}$.

For subject $i, i\le n$, let ${h}_{i}^{\left(k\right)}$ denote the number of haplotypes, i.e., local ancestry count, of the kth ancestry at one locus, and ${{{\varvec{h}}}^{\left({\varvec{k}}\right)}=\left({h}_{1}^{\left(k\right)},\dots ,{h}_{n}^{\left(k\right)}\right)}^{T}$ denotes the corresponding vector for all subjects. Similar to the original SPAmix, suppose that the ancestry-specific allele frequencies ${q}^{\left(1\right)},\dots ,{q}^{\left(K\right)}$ are available, the mean and the variance of ${S}^{\left(k\right)}$ under the null hypothesis are

$${E}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}}\right)={E}_{{\text{H}}_{0}}\left({{\varvec{R}}}^{T}{{\varvec{G}}}^{\left({\varvec{k}}\right)}|{\varvec{R}}\right)={q}^{\left(k\right)}\cdot \sum_{i=1}^{n}{R}_{i}\cdot {h}_{i}^{\left(k\right)}$$

and

$$Va{r}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}}\right)=Va{r}_{{\text{H}}_{0}}\left({{\varvec{R}}}^{T}{{\varvec{G}}}^{\left({\varvec{k}}\right)}|{\varvec{R}}\right)=\sum_{i=1}^{n}{R}_{i}^{2}\cdot {h}_{i}^{\left(k\right)}\cdot {q}^{\left(k\right)}\cdot \left(1-{q}^{\left(k\right)}\right),$$

respectively. For SPAmix_local, the ancestry-specific allele frequency ${q}^{\left(k\right)}$ is estimated by using ${\widehat{q}}^{\left(k\right)}=\sum_{i=1}^{n}{G}_{i}^{\left(k\right)}/\sum_{i=1}^{n}{h}_{i}^{\left(k\right)}$.

Similar to SPAmix, SPAmix_local also applies a hybrid strategy including both normal distribution approximation and SPA to calibrate p values. For normal distribution approximation, the null distribution of ${S}^{\left(k\right)}$ is approximated using a normal distribution with a mean of ${\widehat{\mu }}^{\left(k\right)}={E}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}},{\widehat{q}}^{\left(k\right)}\right)$ and a variance of ${\left({\widehat{\sigma }}^{\left(k\right)}\right)}^{2}=Va{r}_{{\text{H}}_{0}}\left({S}^{\left(k\right)}|{\varvec{R}},{\widehat{q}}^{\left(k\right)}\right)$. Under the null hypothesis, the probability $\text{Pr}\left({S}^{\left(k\right)}<s|{\varvec{R}}\right)=\Phi \left\{\left(s-{\widehat{\mu }}^{\left(k\right)}\right)/{\widehat{\sigma }}^{\left(k\right)}\right\}$ where $\Phi \left\{.\right\}$ is the CDF of a standard normal distribution. For SPA, we assume that ancestry-specific genotype ${G}_{i}^{\left(k\right)}, i\le n$ follow a binomial distribution Binom $({h}_{i}^{\left(k\right)},{\widehat{q}}^{\left(k\right)})$ in which ${h}_{i}^{\left(k\right)}=0, 1,$ or $2$. Then, the estimated MGF and CGF of ${G}_{i}^{\left(k\right)}$ are

$${\widehat{M}}_{{G}_{i}^{\left(k\right)}}\left(t\right)={\left[1-{\widehat{q}}^{\left(k\right)}+{\widehat{q}}^{\left(k\right)}\cdot {e}^{t}\right]}^{{h}_{i}^{\left(k\right)}}$$

and ${\widehat{K}}_{{G}_{i}^{\left(k\right)}}\left(t\right)=\text{ln}{\widehat{M}}_{{G}_{i}^{\left(k\right)}}(t)$. And conditional on ${\varvec{R}}$, the estimated CGF of ${S}^{\left(k\right)}$ under the null hypothesis is

$${\widehat{H}}^{\left(k\right)}\left(t\right)={\sum }_{i=1}^{n}{\widehat{K}}_{{G}_{i}^{\left(k\right)}}\left({R}_{i}\cdot t\right)={\sum }_{i=1}^{n}\text{ln}{\widehat{M}}_{{G}_{i}^{\left(k\right)}}\left({R}_{i}\cdot t\right).$$

For ancestry-specific observed score statistics ${s}^{\left(k\right)}$, we followed the similar process as in the original SPAmix to calculate the first and the second derivatives of ${\widehat{H}}^{\left(k\right)}\left(t\right)$ and to calibrate p values based on Barndorff-Nielsen’s formula. The hybrid strategy to combine normal distribution approximation and SPA is the same as the original SPAmix.

Optimization guidelines for SPAmix_local

SPAmix_local is a framework designed to test for ancestry-specific genetic effects and operates through a two-step process. In step 1, SPAmix_local requires fitting a null model to calculate model residuals, in which confounding factors such as age, sex, SNP-derived principal components (PCs), and other confounders are incorporated as covariates.

In step 2, SPAmix associates the traits of interest with ancestry-specific genotypes of single genetic variant by approximating the null distribution of ancestry-specific score statistics ${S}^{\left(k\right)}=\sum_{i=1}^{n}{R}_{i}{G}_{i}^{\left(k\right)}$, where $k$ denotes the ancestry index. SPAmix_local efficiently estimates ancestry-specific allele frequencies. SPAmix_local uses a hybrid strategy combining normal distribution approximation and SPA to approximate the distribution of ancestry-specific score statistics under the null hypothesis. The approximation relies on the assumption that ancestry-specific genotype ${G}_{i}^{\left(k\right)}$ follows a binomial distribution Binom(2, ${\widehat{q}}^{\left(k\right)}$). If the absolute value of the centered observed ancestry-specific statistics $\left|{s}^{(k)}-{\widehat{\mu }}^{\left(k\right)}\right|<r{\widehat{\sigma }}^{\left(k\right)}$, normal distribution approximation is used to obtain the p value. Otherwise, the saddlepoint approximation is used to calibrate p values.

A key distinction between SPAmix_local and the original SPAmix is the requirement of local ancestry count ${{{\varvec{h}}}^{\left(k\right)}=\left({h}_{1}^{\left(k\right)},\dots ,{h}_{n}^{\left(k\right)}\right)}^{T}$ for ancestry-specific analysis in admixed populations. Importantly, SPAmix_local estimates ancestry-specific allele frequencies efficiently using the formula ${\widehat{q}}^{\left(k\right)}=\sum_{i=1}^{n}{G}_{i}^{\left(k\right)}/\sum_{i=1}^{n}{h}_{i}^{\left(k\right)}$. As it eliminates regression-based allele frequency estimation, which is the most computationally intensive step in the original SPAmix framework. Consequently, the incorporation of local ancestry data does not significantly increase the computational burden while enhancing statistical power for detecting ancestry-specific genetic effects.

This streamlined process makes SPAmix_local a powerful and efficient tool for testing ancestry-specific genetic effects in admixed populations.

SPAmix_CCT combines p values of SPAmix and SPAmix_local to maximize powers

Suppose that the admixed population is composed of $K$ ancestries. SPAmix_local outputs K ancestry-specific p values, and the original SPAmix outputs one p value assuming that the genetic effects are the same for all ancestries. We proposed SPAmix_CCT in which Cauchy combination is used to combine the K + 1 p values. Benefit from the advantage of Cauchy combination, the p value of SPAmix_CCT can still control type one error rate well and remain powerful for both homogeneous and heterogeneous ancestry-specific genetic effect sizes.

Data simulation

For subject $i\le n$, we let ${{\varvec{a}}}_{{\varvec{i}}}={\left({a}_{i}^{EUR}, {a}_{i}^{EAS}\right)}^{T}$ denote an ancestry vector, where $1\ge {a}_{i}^{EUR}\ge 0$ and $1\ge {a}_{i}^{EAS}\ge 0$ are to represent ancestry proportions of EUR and EAS, respectively, and ${a}_{i}^{EUR}+{a}_{i}^{EAS}=1$. We assumed that the first $n/2=5000$ subjects were from a EUR-dominant community with an ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$ following a Dirichlet(9, 1) distribution, and the remaining $n/2=5000$ subjects were from an EAS-dominant community with an ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$ following a Dirichlet(1, 9) distribution [61, 80,81,82,83]. The distribution of ${{\varvec{a}}}_{{\varvec{i}}},i\le n$ can be found in Additional file 1: Fig. S50. The allele frequency diversity between EUR and EAS commonly exists in real data (Additional file 1: Fig. S11).

Based on the difference of MAFs corresponding to the two populations, i.e., Diff_MAF = ${q}^{EUR}-{q}^{EAS}$, we categorized variants into five groups: Diff_MAF < < 0, Diff_MAF < 0, Diff_MAF ~ 0, Diff_MAF > 0, and Diff_MAF > > 0. Based on the minimal MAF value, i.e., $\text{min}\left({q}^{EUR},{q}^{EAS}\right)$, we categorized variants into three groups of minMAF_low, minMAF_mod, and minMAF_high. Thus, all variants were categorized into 15 (5 $\times$ 3) groups. More details of the grouping process and the proportion of variants in each group can be found in Additional file 1: Fig. S11. In each group, we randomly sampled 1000 pairs of $\left({q}^{EUR},{q}^{EAS}\right)$ and simulated 1000 genetic variants correspondingly. For each variant, the genotype of individual $i$ follows a Binom(2, ${q}_{i}$) distribution, where ${q}_{i}={a}_{i}^{EUR}{q}^{EUR}+{a}_{i}^{EAS}{q}^{EAS}$. In addition, a total of 100,000 common SNPs (${q}^{EUR}+{q}^{EAS}>0.1$) were simulated to calculate SNP-derived PCs (Additional file 1: Fig. S51).

We followed a two-step process to simulate time-to-event traits as phenotypes. [6] We first simulated an underlying failure time ${T}_{i}^{*}$ and a censoring time ${C}_{i}$. The censoring time ${C}_{i}$ was simulated following a Weibull distribution with a shape parameter of 1 and a scale parameter of 0.15. The underlying failure time ${T}_{i}^{*}$ was simulated based on a Cox PH model with a Weibull baseline hazard function as ${T}_{i}^{*}=\lambda \sqrt{-\text{log}{U}_{i}/\text{exp}\left({\eta }_{i}\right)}$, where ${U}_{i}$ was simulated following a uniform distribution U(0,1), and linear predictor ${\eta }_{i}={\beta }_{1}{X}_{i1}+0.5{X}_{i2}+0.5{X}_{i3}+\gamma {G}_{i}$ was to characterize the effect from three covariates ${X}_{i1}$, ${X}_{i2}$, ${X}_{i3}$, and genotype ${G}_{i}$. Covariate ${X}_{i1}={a}_{i}^{EAS}=1-{a}_{i}^{EUR}$ was the proportion of EAS ancestry, ${X}_{i2}$ was simulated following a Bernoulli(0.5) distribution, and ${X}_{i3}$ was simulated following a standard normal distribution. Time-to-event phenotype ${T}_{i}=\text{min}\left({T}_{i}^{*}, {C}_{i}\right)$. Indicator ${\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)$ is to indicate whether event occurs. The scale parameter $\lambda$ and the coefficient ${\beta }_{1}$ were selected to obtain the desired event rates ${\rho }_{EUR}$ and ${\rho }_{EAS}$ in EUR and EAS populations. Here, ${\rho }_{EUR}$ and ${\rho }_{EAS}$ are the expected event rates for a pure EUR population (i.e., ${X}_{i1}=0, i\le n$) and pure EAS population (i.e., ${X}_{i1}=1, i\le n$), respectively.

To assess type I error rates, we simulated phenotypes under two scenarios, either of which followed the null hypothesis of no genetic effects ($\text{i}.\text{e}., \gamma =0$).

Scenario 1. The event rates in EUR and EAS were the same, that is, ${{\varvec{\rho}}}_{{\varvec{E}}{\varvec{U}}{\varvec{R}}}={{\varvec{\rho}}}_{{\varvec{E}}{\varvec{A}}{\varvec{S}}}$. We consider three events rates including 0.01 (low event rate, ER_low), 0.1 (moderate event rate, ER_mod), and 0.3 (high event rate, ER_high).
Scenario 2. The event rates in EUR were higher than those in EAS, that is, ${{\varvec{\rho}}}_{{\varvec{E}}{\varvec{U}}{\varvec{R}}}>{{\varvec{\rho}}}_{{\varvec{E}}{\varvec{A}}{\varvec{S}}}$. We considered three pairs of event rates $\left({\rho }_{EUR},{\rho }_{EAS}\right)=$ (0.1, 0.01) (low event rate, ER_low), (0.3, 0.05) (moderate event rate, ER_mod), and (0.5, 0.2) (high event rate, ER_high).

We did not consider the scenario in which the event rates in EAS were higher than those in EUR since it is exactly the opposite direction of scenario 2 and the results can be expected. In either scenario, 1,000,000 datasets of phenotypes and covariates were simulated, and thus a total of ${10}^{9}$ tests were conducted for each pair of MAF group and event rate, and type I error rates were evaluated at a significance level of $\alpha =5\times {10}^{-8}$.

To evaluate powers, we simulated time-to-event phenotypes under the alternative hypothesis using linear predicator

$${\eta }_{i}={\beta }_{1}{X}_{i1}+0.5{X}_{i2}+0.5{X}_{i3}+{\gamma }_{j}\cdot \sum_{j=1}^{150}{G}_{ij}, 1\le i\le n,$$

where 10 SNPs were randomly selected from each of the 15 groups to consist of 150 causal SNPs, ${G}_{ij}$ is the genotype value for the ${j}^{th}$ causal SNP. To evaluate powers under different directions of genetic effect, we consider positive genetic effects ${\gamma }_{j}=-2\cdot \text{log}10\left({\widehat{q}}^{j}\right)$ and negative genetic effects ${\gamma }_{j}=2\cdot \text{log}10\left({\widehat{q}}^{j}\right)$ where ${\widehat{q}}^{j}=\left(\sum_{i=1}^{n}{G}_{ij}\right)/2n<1$ is the MAF estimation in the overall population. Here we set the same genetic effect for all individuals, as a recent study has found that causal effects on complex traits are highly similar for common variants across segments of different continental ancestries within admixed individuals if environments are well controlled [84]. The events rates were set the same as in type I error rates simulations. For either scenario, we simulated 1000 datasets of phenotypes and covariates, and thus, ${10}^{4}$ tests were conducted to evaluate powers at a significance level of $\alpha =5\times {10}^{-8}$.

For both type I error rates and powers simulations, null model fitting incorporates covariates ${X}_{2}, {X}_{3}$, and the top 4 PCs derived from genotype data. In addition to SPAmix and SPACox, we also evaluated SPAmix_noSPA, in which p values of all variants are calibrated using only normal distribution approximation without SPA.

Simulation studies including three genetic ancestries

For subject $i\le n$, we let ${{\varvec{a}}}_{{\varvec{i}}}={\left({a}_{i}^{EUR}, {a}_{i}^{AFR}, {a}_{i}^{EAS}\right)}^{T}$ denote an ancestry vector to represent ancestry proportions of EUR, AFR, and EAS, respectively, and ${a}_{i}^{EUR}+{a}_{i}^{AFR}+{a}_{i}^{EAS}=1$. We assumed that the ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$ of the first $4000$ subjects from an EUR-dominant community follows a Dirichlet(8, 1, 1) distribution, the ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$ of the next $4000$ subjects from an AFR-dominant community follows a Dirichlet(1, 8, 1) distribution, and the ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$ of the remaining 2000 subjects from an EAS-dominant community follows a Dirichlet(1, 1, 8) distribution. The distribution of ${{\varvec{a}}}_{{\varvec{i}}},i\le n$ can be found in Additional file 1: Fig. S52.

Similar to the previous section, we simulated genotypes using ancestry vectors and real allele frequency data and then calculated SNP-derived PCs using 100,000 common SNPs. Additional file 1: Fig. S53 shows the top PCs derived from 100,000 simulated genetic variants and compared them to three components of the ancestry vector ${{\varvec{a}}}_{{\varvec{i}}}$, which demonstrated that the three ancestry proportions are strongly correlated with and can be well captured by the top PCs. For example, as shown in Additional file 1: Fig. S53(c), PC1 has an approximately linear correlation with the ancestry proportion of AFR.

To simulate time-to-event phenotype, we use a linear predictor ${\eta }_{i}={\beta }^{EUR}\cdot {a}_{i}^{EUR}+{\beta }^{AFR}\cdot {a}_{i}^{AFR}+ {\beta }^{EAS}\cdot {a}_{i}^{EAS}+0.5{X}_{i2}+0.5{X}_{i3}+\gamma {G}_{i}$ where notations are the same as in the previous section. In this section, we fixed parameters ${\beta }^{EUR}=0.2, {\beta }^{AFR}=2,$ and ${\beta }^{EAS}=5.5$, and scale parameter $\lambda =3$. Given $\gamma =0$, the event rates for pure EUR (i.e., ${a}_{i}^{EUR}=1$), AFR (i.e., ${a}_{i}^{AFR}=1$), and EAS (i.e., ${a}_{i}^{EAS}=1$) populations are around 0.01, 0.04, and 0.4, respectively, and the event rate in $n$=10,000 admixed subjects is around 0.086.

To evaluate the performance of SPAmix and SPACox in terms of type I error rates and powers, we consider two settings of MAF to simulate genotype.

Setting 1. MAF in EUR = 0.01, MAF in AFR = 0.1, MAF in EAS = 0.45.
Setting 2. MAF in EUR = 0.45, MAF in AFR = 0.1, MAF in EAS = 0.01.

For each variant, the genotype of subject $i$ follows a Binom(2, $q_i$) distribution, where

$${q}_{i}={a}_{i}^{EUR}{q}^{EUR}+{a}_{i}^{AFR}{q}^{AFR}+{a}_{i}^{EAS}{q}^{EAS}\ \mathrm{and} \ \ {{\varvec{a}}}_{{\varvec{i}}}={\left({a}_{i}^{EUR}, {a}_{i}^{AFR}, {a}_{i}^{EAS}\right)}^{T}$$

is the ancestry vector of this subject.

To evaluate type I error rates, we simulated phenotypes under the null hypothesis ($\text{i}.\text{e}., \gamma =0$) and simulated 10⁵ genetic variants for each setting of MAF. To evaluate powers, we simulated 10,000 datasets (including phenotypes and covariates) under the alternative hypothesis use a linear predictor

$${\eta }_{i}={\beta }^{EUR}\cdot {a}_{i}^{EUR}+{\beta }^{AFR}\cdot {a}_{i}^{AFR}+ {\beta }^{EAS}\cdot {a}_{i}^{EAS}+0.5{X}_{i2}+0.5{X}_{i3}+0.5{G}_{i}$$

for two settings of MAF, respectively.

Simulation studies of quantitative traits in the case of variance heterogeneity

To evaluate the performance of SPAmix for quantitative traits in the case of variance heterogeneity, we simulated $n$ = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors ${{\varvec{a}}}_{{\varvec{i}}}, i\leq n$ and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.

To assess type I error rates, we simulated quantitative phenotypes under the null hypothesis of no genetic effects in the case of variance heterogeneity corresponding to different ancestries from a linear regression model as below

$${Y}_{i}={\eta }_{i}+{\varepsilon }_{i}=0.5{X}_{i2}+0.5{X}_{i3}+{\varepsilon }_{i}, i\le n$$

where covariates ${X}_{i2}$ and ${X}_{i3}$ were simulated following a Bernoulli(0.5) distribution and a standard normal distribution, ${\varepsilon }_{i}$ was a random term simulated following a normal distribution $N\left(0,{\sigma }_{i}^{2}\right)$, and ${Y}_{i}$ was a quantitative trait. We generated ${\sigma }_{i}$, the individual-specific standard deviation of subject $i$, through a weighted sum of standard deviations of ancestry EUR and EAS as below

$${\sigma }_{i}={a}_{i}^{EUR}{\sigma }_{EUR}+{a}_{i}^{EAS}{\sigma }_{EAS},$$

where ${\sigma }_{EUR}$ and ${\sigma }_{EAS}$ denotes the standard deviations of random terms in ancestry EUR and EAS, respectively. We consider four pairs of standard deviations (sd) (sd_EUR, sd_EAS) = (1, 0.1), (sd_EUR, sd_EAS) = (1, 0.5), (sd_EUR, sd_EAS) = (1, 0.8), (sd_EUR, sd_EAS) = (1, 1). We did not consider the scenario in which the standard deviations in EAS were higher than those in EUR since it is exactly the opposite direction of the scenario above and the results can be expected. In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 10⁶ association tests were conducted for each pair of MAF group and standard deviation. We evaluated type I error rates of SPAmix, SPAmix_noSPA, Wald test, likelihood ratio test (LRT), SAIGE, REGENIE, and fastGWA at a significance level of $\alpha =5\times {10}^{-5}$.

Simulation studies of time-to-event trait in the case of baseline hazard function heterogeneity

To evaluate the performance of SPAmix for time-to-event traits in the case of baseline hazard function heterogeneity, we simulated $n$ = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors ${{\varvec{a}}}_{{\varvec{i}}},i\le n$ and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.

To assess type I error rates, we simulated time-to-event phenotypes under the null hypothesis of no genetic effects in the case of baseline hazard function heterogeneity corresponding to different ancestries. We first simulated an underlying failure time ${T}_{i}^{*}$ and a censoring time ${C}_{i}$. The censoring time ${C}_{i}$ was simulated following a Weibull distribution with an individual-specific shape parameter of ${v}_{i}$ and an individual-specific scale parameter of ${w}_{i}$.. The underlying failure time ${T}_{i}^{*}$ was simulated based on a Cox PH model with a Weibull baseline hazard function as ${T}_{i}^{*}=\lambda \sqrt{-\text{log}{U}_{i}/\text{exp}\left({\eta }_{i}\right)}$, where ${U}_{i}$ was simulated following a beta distribution Beta(1,${b}_{i}$). For each subject $i$, we generated parameters ${v}_{i}$, ${w}_{i}$, and ${b}_{i}$ through weighted sums of the original parameters for EUR and EAS ancestry as below

$${v}_{i}={a}_{i}^{EUR}{v}_{EUR}+{a}_{i}^{EAS}{v}_{EAS}$$

$${w}_{i}={a}_{i}^{EUR}{w}_{EUR}+{a}_{i}^{EAS}{w}_{EAS}$$

$$b_i=a_i^{EUR}b_{EUR}+a_i^{EAS}b_{EAS},$$

where (1) ${v}_{EUR}$ and ${v}_{EAS}$, (2)${w}_{EUR}$ and ${w}_{EAS}$, and (3) ${b}_{EUR}$ and ${b}_{EAS}$ denote (1) the shape parameter of Weibull distribution to simulate ${C}_{i}$, (2) the scale parameter of Weibull distribution to simulate ${C}_{i}$, and (3) the second shape parameter beta in EUR and EAS ancestry, respectively. We let ${v}_{EUR}=1$ and ${v}_{EAS}=2$, ${w}_{EUR}=0.15$ and ${w}_{EAS}=0.3$, and ${b}_{EUR}=4$ and ${b}_{EAS}=1$ in simulation studies.

The linear predictor ${\eta }_{i}=0.5{X}_{i2}+0.5{X}_{i3}$ was to characterize the effect from two covariates ${X}_{i2}$ and ${X}_{i3}$. Covariate ${X}_{i2}$ was simulated following a Bernoulli(0.5) distribution, and ${X}_{i3}$ was simulated following a standard normal distribution. Time-to-event phenotype ${T}_{i}=\text{min}\left({T}_{i}^{*}, {C}_{i}\right)$. Indicator ${\delta }_{i}=\text{I}\left({T}_{i}^{*}\le {C}_{i}\right)$ is to indicate whether event occurs. In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 10⁶ association tests were conducted for each pair of MAF group. We evaluated type I error rates of SPAmix, SPAmix_noSPA, SPACox, Wald test, GATE, and REGENIE at a significance level of $\alpha =5\times {10}^{-5}$.

Simulation studies of ordinal traits

To evaluate the performance of SPAmix for ordinal traits, we simulated $n$ = 10,000 subjects from an admixed population of EUR and EAS. We assume that the ancestry proportion vectors ${{\varvec{a}}}_{{\varvec{i}}},i\le n$ and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. One thousand datasets of genotypes were simulated and categorized following the same procedures as in the previous section of data simulation.

Let J = 4 denote the number of category levels. For subject $i\le n$, let ${Y}_{i}$ = 1, 2, …, J denote the ordinal phenotype. To assess type I error rates, we simulated ordinal phenotypes under the null hypothesis of no genetic effects from a proportional odds logistic regression model as below

$$\text{logit}\left({\nu }_{ij}\right)={\varepsilon }_{ij}-{\eta }_{i}={\varepsilon }_{ij}-{X}_{i}^{T}\beta -{G}_{i}\gamma , i\le n, j\le J,$$

where covariates ${X}_{i2}$ and ${X}_{i3}$ were simulated following a Bernoulli(0.5) distribution and a standard normal distribution. The individual-specific cutpoints $\left\{{\varepsilon }_{ij}, i\le n,j\le J\right\}$ (${\varepsilon }_{i1}<\cdots <{\varepsilon }_{iJ}=+\infty$) are used to categorize the data. For subject $i$, we generated $\left\{{\varepsilon }_{ij}, i\le n,j\le J\right\}$ through a weighted sum of cutpoints of ancestry EUR and EAS as below

$${\varepsilon }_{ij}={a}_{i}^{EUR}{\varepsilon }_{j}^{EUR}+{a}_{i}^{EAS}{\varepsilon }_{j}^{EAS}, i\le n,j\le J,$$

where $\left\{{\varepsilon }_{j}^{EUR}, j\le J\right\}$ and $\left\{{\varepsilon }_{j}^{EAS}, j\le J\right\}$ denotes the cutpoints selected to obtain the desired phenotypic distribution (ratio) in EUR and EAS populations, respectively. We consider five pairs of phenotypic distribution (Ratios) in EUR and EAS: (Ratio_EUR = 10:1:1:1, Ratio_EAS = 10:1:1:1), (Ratio_EUR = 100:1:1:1, Ratio_EAS = 100:1:1:1), (Ratio_EUR = 1000:1:1:1, Ratio_EAS = 1000:1:1:1), (Ratio_EUR = 10:1:1:1, Ratio_EAS = 1000:1:1:1), and (Ratio_EUR = 100:1:1:1, Ratio_EAS = 1000:1:1:1). In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 10⁶ association tests were conducted for each pair of MAF group and phenotypic distribution (ratio). We evaluated type I error rates of SPAmix, SPAmix_noSPA, and POLMM at a significance level of $\alpha =5\times {10}^{-5}$.

Simulation studies of binary and quantitative traits given ancestry-specific genetic effect sizes

To evaluate the performance of SPAmix_local and SPAmix_CCT, we simulated a two-way admixed population with sample size n = 10,000, including ancestry-specific genotypes, local ancestry counts, genotype-derived PCs, and phenotypes. We considered extensive scenarios of ancestry-specific genetic effect and minor allele frequencies to compare the SPAmix-based approaches and Tractor.

We followed procedure as in Mester et al. [27] to simulate genotypes. First, we generated an individual-level global ancestry proportion of ancestry 2 (denoted as ${d}_{i}, i\le n$) from a normal distribution $N\left(\theta ,{\sigma }^{2}\right)$ for each individual, in which $\theta$ is the expected global ancestry proportion and ${\sigma }^{2}$ is the corresponding variance. We let $\sigma =0.125$ and coerced ${d}_{i}$ between [0,1]. For individual $i, i\le n$, we simulated local ancestry count ${h}_{i}^{\left(1\right)}$ and ${h}_{i}^{\left(2\right)}$, in which ${h}_{i}^{\left(2\right)}$ follows a binomial distribution $\text{Binom}\left({d}_{i},2\right)$ and ${h}_{i}^{\left(1\right)}=2-{h}_{i}^{\left(2\right)}$. Then, we simulated ancestry-specific genotype ${G}_{i}^{\left(k\right)}$ following a binomial distribution $\text{Binom}\left({h}_{i}^{\left(k\right)},{q}^{\left(k\right)}\right)$, where ${q}^{\left(k\right)}$ is the allele frequency corresponding to the ancestry $k.$ Genotype ${G}_{i}={G}_{i}^{\left(1\right)}+{G}_{i}^{\left(2\right)}$. In simulation studies, we considered two fixed MAFs of 0.01 and 0.1 in ancestry 1 and four fixed MAFs including 0.01, 0.05, 0.1, and 0.3 in ancestry 2. We calculated SNP-derived PCs using 100,000 common SNPs. Additional file 1: Fig. S54 shows the top PCs and the global ancestry distribution for the 10,000 two-way admixed individuals.

Type I error simulations

We simulated binary and quantitative traits under the null hypothesis to evaluate type I error rates. We simulated binary and quantitative traits from a logistic regression model and a linear regression model as below

$$\text{logit}\left({\mu }_{i}\right)={\eta }_{i}={\beta }_{0}+0.5{X}_{i2}+0.5{X}_{i3}, i\le n,$$

$${Y}_{i}={\eta }_{i}+{\varepsilon }_{i}=0.5{X}_{i2}+0.5{X}_{i3}+{\varepsilon }_{i}, i\le n,$$

where covariates ${X}_{i2}$ and ${X}_{i3}$ were simulated following a Bernoulli(0.5) distribution and a standard normal distribution, ${\mu }_{i}$ is the probability of being a case for a binary trait, and ${Y}_{i}$ is a quantitative trait. For a binary trait, the intercept ${\beta }_{0}$ was determined to correspond to a certain disease prevalence. We considered two disease prevalences including 0.01 and 0.2. For a quantitative trait, random term ${\varepsilon }_{i}$ was simulated following a standard normal distribution. We simulated 100 datasets of phenotypes and covariates for each phenotypic distribution and 10,000 SNPs for each setting of MAF, and thus a total of 10⁶ tests were conducted in each scenario.

Power simulations

We simulated binary and quantitative traits under the alternative hypothesis to evaluate powers. For both binary and quantitative traits, we added an overall genetic effect $\sum_{j=1}^{10}\left[{\gamma }^{\left(1\right)}\cdot {G}_{i,j}^{\left(1\right)}+{\gamma }^{\left(2\right)}\cdot {G}_{i,j}^{\left(2\right)}\right]$ to the linear predicator ${\eta }_{i}$, in which ${G}_{i,j}^{\left(1\right)}$ and ${G}_{i,j}^{\left(2\right)}$ were the ancestry-specific genotype of subject i in SNP j from ancestries 1 and 2, respectively, and ${\gamma }^{\left(1\right)}$ and ${\gamma }^{\left(2\right)}$ were corresponding ancestry-specific genetic effect sizes. For binary traits, we set disease prevalence at 0.2.

We considered two scenarios including homogeneity and heterogeneity of genetic effect sizes for ancestries 1 and 2. For heterogeneous genetic effect sizes, we fixed ${\gamma }^{\left(1\right)}$ and increased ${\gamma }^{\left(2\right)}$ from 0. We simulated 100 datasets of phenotypes and covariates for each scenario, and thus a total of 1000 tests were conducted to evaluate powers. We calculated empirical powers at a genome-wide significance level 5 $\times$ 10⁻⁸.

Association analysis of Tractor and SPAmixlocal in simulation studies

Tractor and SPAmix_local calculated ancestry-specific p values based on likelihood ratio test and score test, respectively. In simulation studies, Tractor fitted a full model with covariates of ${X}_{i2}$, ${X}_{i3}$, top 4 genetically derived PCs, local ancestry count of ${h}_{i}^{\left(1\right)}$, and ancestry-specific genotypes ${G}_{i}^{\left(1\right)},{G}_{i}^{\left(2\right)}$. Meanwhile, SPAmix_local fitted a null model with covariates of ${X}_{i2}$, ${X}_{i3}$, and top 4 genetically derived PCs. Regular logistic model and linear model were to fit binary and quantitative traits, respectively. Tractor and SPAmix_local returned two p values corresponding to ancestry-specific effect ${\gamma }^{\left(1\right)}$ and ${\gamma }^{\left(2\right)}$. SPAmix_CCT combined the two p values outputted by SPAmix_local and one p value outputted by SPAmix to calculate one p value.

Simulation studies of quantitative traits with the utilization of local ancestry information in the case of variance heterogeneity

To evaluate the performance of SPAmix_local and SPAmix_CCT for quantitative traits in the case of variance heterogeneity with the utilization of local ancestry information, we simulated $n$ = 10,000 subjects from an admixed population of EUR and EAS and assume that the global ancestry proportion vectors ${{\varvec{a}}}_{{\varvec{i}}},\boldsymbol{ }i\le n$ and ancestry PCs are the same as in the previous section of time-to-event traits simulation, as shown in Additional file 1: Fig. S50 and S51. For individual$i, i\le n$, we simulated local ancestry count ${h}_{i}^{EUR}$ and ${h}_{i}^{EAS}$, in which ${h}_{i}^{EAS}$ follows a binomial distribution $\text{Binom}\left({a}_{i}^{EAS},2\right)$ and ${h}_{i}^{EUR}=2-{h}_{i}^{EAS}$. Then, we simulated ancestry-specific genotype ${G}_{i}^{EUR}$ and ${G}_{i}^{EAS}$ following binomial distribution $\text{Binom}\left({h}_{i}^{EUR},{q}^{EUR}\right)$ and $\text{Binom}\left({h}_{i}^{EAS},{q}^{EAS}\right)$, respectively, where ${q}^{EUR}$ and ${q}^{EAS}$ are the allele frequency corresponding to the ancestry EUR and EAS, respectively $.$ Thus, we obtained genotype ${G}_{i}={G}_{i}^{EUR}+{G}_{i}^{EAS}$. We considered three fixed MAFs of 0.01, 0.1, and 0.3 in EUR and EAS, and thus, 9 scenarios of MAFs were considered in total.

To assess type I error rates, we simulated quantitative phenotypes following the same procedure as in the previous section of type I error simulation studies of quantitative traits in the case of variance heterogeneity. We consider four pairs of standard deviations (sd) (sd_EUR, sd_EAS) = (1, 0.1), (sd_EUR, sd_EAS) = (1, 0.5), (sd_EUR, sd_EAS) = (1, 0.8), (sd_EUR, sd_EAS) = (1, 1). In each scenario, 1000 datasets of phenotypes and covariates were simulated, and thus a total of 10⁶ association tests were conducted for each pair of MAF and standard deviation. We evaluated type I error rates of SPAmix_local, SPAmix_CCT, and Tractor at a significance level of $\alpha =5\times {10}^{-5}$.

Application to the UK Biobank Data

We used SPAmix to conduct genome-wide analyses of 10 time-to-event traits in UK Biobank dataset, including Alzheimer’s disease, type 2 diabetes, essential hypertension, Parkinson’s disease, schizophrenia, multiple sclerosis, sickle cell anemia, osteoporosis, glaucoma, and asthma [5, 68]. Additional file 1: Table S1 provides the detailed summary information of the 10 time-to-event traits. UK Biobank contains 338,044 unrelated subjects with in-patient diagnosis data, of which 282,590 (83.6%) are white British participants and the remaining participants (16.4%) are from the other ancestries including African, Asian, admixed populations, and other ancestry groups. The ancestry information is from Field ID of 22,006, which is highly consistent to the genetic ancestry based on the genetically derived principal components. In order to identify affected subjects and create time-to-event phenotypes, we utilized the PheWAS code which is based on the International Statistical Classification of Diseases (ICD) codes version 9 and 10 [85, 86]. To analyze each disease, we assigned an event indicator $\delta_i = 1$ and time-to-event $T_i$ as the age at the first in-patient diagnosis date if at least one in-patient diagnosis was observed. For patients without an in-patient diagnosis, we set $\delta_i = 0$ and time-to-event $T_i$ as the age at the right-censoring date or lost to follow-up date. Additionally, the observed survival time was left truncated at the in-patient data collection date. For each phenotype, the top ten ancestry principal components (PCs), sex, and birthyear were included as covariates, and markers imputed by the Haplotype Reference Consortium (HRC) [87] panel with a MAF over 0.0001 and imputation INFO score > 0.6 were used in our analysis. In addition, we applied SPAmix to conduct genome-wide analyses of 11 longitudinal traits in UK Biobank dataset.

The majority of UK Biobank’s participants were of white British ancestry, and we selected 10,000 subjects based on genetic principle components (PCs) to mimic a study cohort with greater heterogeneity. Since the first two PCs (Field ID: 22,009, Additional file 1: Fig. S55a) of white British participants are close to zero, we define $\text{d}=\sqrt{{\text{PC}}_{1}^{2}+{\text{PC}}_{2}^{2}}$ as an ancestry distance for each subject. A total of 5000 White British subjects with the smallest ancestry distance and 5000 non-white British subjects with the largest ancestry distance were selected (Additional file 1: Fig. S55b). Using the study cohort of the 10,000 subjects, we conducted a genome-wide analysis of the same 10 time-to-event traits. We increased the MAF cutoff to 0.001 as the sample size is smaller. Consistent to simulation studies, genetic variants were categorized into 15 groups, and SPAmix and SPACox were evaluated for each group.

Application to the ALL of US Data

To evaluate the performance of SPAmix in more diverse genomic contexts, we applied SPAmix to conduct genome-wide association analysis of essential hypertension (modeled as a time-to-event phenotype) using the All of Us dataset, a cohort with substantially greater genetic diversity. This analysis leveraged the cohort’s unique population diversity by conducting parallel GWAS in two subsets: (1) a multi-ancestry sample (sample size = 188,941), and (2) a European-ancestry subset (sample size = 112,080) for population-specific comparison.

Time-to-event phenotypes for essential hypertension (PheCode: 401.1) were constructed using electronic health records from the All of Us Research Program. Individuals with relevant diagnosis codes (mapped from ICD-9, ICD-10, and SNOMED) were classified as cases, while those without such diagnoses were treated as censored. The event time was defined as the age at first diagnosis of hypertension for cases, and as the age at the last recorded clinical observation for censored individuals. Age was calculated using each participant’s date of birth and either the diagnosis date or censoring date. To avoid confounding due to relatedness, we excluded all individuals identified as genetically related (kinship score > 0.1) based on a precomputed kinship file from All of Us dataset, ensuring a fully unrelated cohort for analysis.

Data availability

All data generated or analyzed during this study are included in this published article. The packages and codes used in the analysis are publicly available (see URLs section). SPAmix is implemented in R and is accessible at [88]. The scripts for generating simulation results and real data analyses are available at [89]. All aforementioned codes are freely accessible under a GNU General Public License v3.0 license. The source code used in the manuscript is available on Zenodo [90]. UK Biobank analysis summary statistics from the genome-wide association study analyses presented in this paper are available on Zenodo [91].

References

Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023. https://doi.org/10.1016/j.ajhg.2022.12.011.
Article PubMed PubMed Central Google Scholar
Backman JD, Li AH, Marcketta A, Sun D, Mbatchou J, Kessler MD, et al. Exome sequencing and analysis of 454,787 UK biobank participants. Nature. 2021;599:628–34.
Article CAS PubMed PubMed Central Google Scholar
Beesley LJ, Salvatore M, Fritsche LG, Pandit A, Rao A, Brummett C, et al. The emerging landscape of health research based on biobanks linked to electronic health records: existing resources, statistical challenges, and potential opportunities. Stat Med. 2020;39:773–800.
Article PubMed Google Scholar
Bi W, Lee S. Scalable and robust regression methods for phenome-wide association analysis on large-scale biobank data. Front Genet. 2021;12:682638.
Article PubMed PubMed Central Google Scholar
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
Article CAS PubMed PubMed Central Google Scholar
Bi W, Fritsche LG, Mukherjee B, Kim S, Lee S. A fast and accurate method for genome-wide time-to-event data analysis and its application to UK biobank. Am J Hum Genet. 2020;107:222–33.
Article CAS PubMed PubMed Central Google Scholar
Bi W, Zhou W, Dey R, Mukherjee B, Sampson JN, Lee S. Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet. 2021;108:825–39.
Article CAS PubMed PubMed Central Google Scholar
Lu Q, Du Y, Zhang Y, Chen Y, Li H, He W, et al. A genome-wide association study for susceptibility to axial length in highly myopic eyes. Phenomics. 2023;3:255–67.
Article CAS PubMed Google Scholar
Lapointe-Shaw L, Bouck Z, Howell NA, Lange T, Orchanian-Cheff A, Austin PC, et al. Mediation analysis with a time-to-event outcome: a review of use and reporting in healthcare research. BMC Med Res Methodol. 2018;18:1–12.
Article Google Scholar
Thompson S, Kaptoge S, White I, Wood A, Perry P, Danesh J. Statistical methods for the time-to-event analysis of individual participant data from multiple epidemiological studies. Int J Epidemiol. 2010;39:1345–59.
Article PubMed PubMed Central Google Scholar
Zhang M, Davidian M. “Smooth” semiparametric regression analysis for arbitrarily censored time-to-event data. Biometrics. 2008;64:567–76.
Article PubMed Google Scholar
Pedersen EM, Agerbo E, Plana-Ripoll O, Steinbach J, Krebs MD, Hougaard DM, et al. ADuLT: an efficient and robust time-to-event GWAS. Nat Commun. 2023;14:5553. https://doi.org/10.1038/s41467-023-41210-z.
Article CAS PubMed PubMed Central Google Scholar
Daniels HE. Saddlepoint approximations in statistics. Ann Math Stat. 1954. https://doi.org/10.1214/aoms/1177728652.
Article Google Scholar
Jensen JL. Saddlepoint approximations. Oxford Statistical Science Series. Vol. 16. The Clarendon Press Oxford University Press, New York: Oxford Science Publications; 1995.
Dey R, Schmidt EM, Abecasis GR, Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am J Hum Genet. 2017;101:37–49.
Article CAS PubMed PubMed Central Google Scholar
Jiang L, Zheng Z, Fang H, Yang J. A generalized linear mixed model association tool for biobank-scale data. Nat Genet. 2021;53:1616–21.
Article CAS PubMed Google Scholar
Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. 2021;53:1097–103.
Article CAS PubMed Google Scholar
Dey R, Zhou W, Kiiskinen T, Havulinna A, Elliott A, Karjalainen J, et al. FinnGen, Lee S: efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun. 2022;13:5437.
Article CAS PubMed PubMed Central Google Scholar
Loh P-R, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nat Genet. 2018;50:906–8.
Article CAS PubMed PubMed Central Google Scholar
Ko S, German CA, Jensen A, Shen J, Wang A, Mehrotra DV, et al. Gwas of longitudinal trajectories at biobank scale. Am J Hum Genet. 2022;109:433–45.
Article CAS PubMed PubMed Central Google Scholar
Peterson RE, Kuchenbaecker K, Walters RK, Chen C-Y, Popejoy AB, Periyasamy S, et al. Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations. Cell. 2019;179:589–603.
Article CAS PubMed PubMed Central Google Scholar
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.
Article CAS PubMed PubMed Central Google Scholar
Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet. 2017;100:635–49.
Article CAS PubMed PubMed Central Google Scholar
Cavalli-Sforza LL. Genes, peoples, and languages. Proc Natl Acad Sci U S A. 1997;94:7719–24.
Article CAS PubMed PubMed Central Google Scholar
Conomos MP, Laurie CA, Stilp AM, Gogarten SM, McHugh CP, Nelson SC, et al. Genetic diversity and association studies in US Hispanic/Latino populations: applications in the Hispanic community health study/study of latinos. Am J Hum Genet. 2016;98:165–84.
Article CAS PubMed PubMed Central Google Scholar
Atkinson EG, Maihofer AX, Kanai M, Martin AR, Karczewski KJ, Santoro ML, et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet. 2021;53:195–204.
Article CAS PubMed PubMed Central Google Scholar
Mester R, Hou K, Ding Y, Meeks G, Burch KS, Bhattacharya A, et al. Impact of cross-ancestry genetic architecture on GWASs in admixed populations. Am J Hum Genet. 2023;110:927–39.
Article CAS PubMed PubMed Central Google Scholar
Jiang D, Mbatchou J, McPeek MS. Retrospective association analysis of binary traits: overcoming some limitations of the additive polygenic model. Hum Hered. 2015;80:187–95.
Article CAS PubMed Google Scholar
Hayeck TJ, Loh PR, Pollack S, Gusev A, Patterson N, Zaitlen NA, et al. Mixed model association with family-biased case-control ascertainment. Am J Hum Genet. 2017;100:31–9.
Article CAS PubMed Google Scholar
Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc. 2020;115:393–402.
Article CAS PubMed Google Scholar
Ko S, German CA, Jensen A, Shen J, Wang A, Mehrotra DV, et al. Gwas of longitudinal trajectories at biobank scale. Am J Hum Genet. 2022. https://doi.org/10.1016/j.ajhg.2022.01.018.
Article PubMed PubMed Central Google Scholar
Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335–41.
Article CAS PubMed PubMed Central Google Scholar
Yamauchi T, Hara K, Maeda S, Yasuda K, Takahashi A, Horikoshi M, et al. A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet. 2010;42:864–8.
Article CAS PubMed Google Scholar
Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, et al. QTLbase: an integrative resource for quantitative trait loci across multiple human molecular phenotypes. Nucleic Acids Res. 2020;48(D1):D983-d991.
Article CAS PubMed Google Scholar
Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, et al. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020;48(D1):D807-d816.
CAS PubMed Google Scholar
Dong S, Zhao N, Spragins E, Kagda MS, Li M, Assis P, et al. Annotating and prioritizing human non-coding variants with RegulomeDB vol 2. Nat Genet. 2023;55:724–6.
Article CAS PubMed PubMed Central Google Scholar
Ferreira MA, Matheson MC, Tang CS, Granell R, Ang W, Hui J, et al. Genome-wide association analysis identifies 11 risk variants associated with the asthma with hay fever phenotype. J Allergy Clin Immunol. 2014;133:1564–71.
Article CAS PubMed Google Scholar
Chang D, Hunkapiller J, Bhangale T, Reeder J, Mukhyala K, Tom J, et al. A whole genome sequencing study of moderate to severe asthma identifies a lung function locus associated with asthma risk. Sci Rep. 2022;12:5574.
Article CAS PubMed PubMed Central Google Scholar
Gharahkhani P, Jorgenson E, Hysi P, Khawaja AP, Pendergrass S, Han X, et al. Genome-wide meta-analysis identifies 127 open-angle glaucoma loci with consistent effect across ancestries. Nat Commun. 2021;12(1):1258.
Article CAS PubMed PubMed Central Google Scholar
Lu SY, Rong SS, Wu Z, Huang C, Matsushita K, Ng TK, et al. Association of the CAV1-CAV2 locus with normal-tension glaucoma in Chinese and Japanese. Clin Experiment Ophthalmol. 2020;48:658–65.
Article PubMed Google Scholar
Thorleifsson G, Walters GB, Hewitt AW, Masson G, Helgason A, DeWan A, et al. Common variants near CAV1 and CAV2 are associated with primary open-angle glaucoma. Nat Genet. 2010;42:906–9.
Article CAS PubMed Google Scholar
Wiggs JL, Kang JH, Yaspan BL, Mirel DB, Laurie C, Crenshaw A, et al. Common variants near CAV1 and CAV2 are associated with primary open-angle glaucoma in Caucasians from the USA. Hum Mol Genet. 2011;20:4707–13.
Article CAS PubMed PubMed Central Google Scholar
Chen F, Klein AP, Klein BE, Lee KE, Truitt B, Klein R, et al. Exome array analysis identifies CAV1/CAV2 as a susceptibility locus for intraocular pressure. Invest Ophthalmol Vis Sci. 2014;56:544–51.
Article PubMed Google Scholar
Loomis SJ, Kang JH, Weinreb RN, Yaspan BL, Cooke Bailey JN, Gaasterland D, et al. Association of CAV1/CAV2 genomic variants with primary open-angle glaucoma overall and by gender and pattern of visual field loss. Ophthalmology. 2014;121:508–16.
Article PubMed Google Scholar
Kim S, Kim K, Heo DW, Kim JS, Park CK, Kim CS, et al. Expression-associated polymorphisms of CAV1-CAV2 affect intraocular pressure and high-tension glaucoma risk. Mol Vis. 2015;21:548–54.
CAS PubMed PubMed Central Google Scholar
Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and scale: genetic architecture of 2068 traits in the VA million veteran program. Science. 2024;385:eadj1182.
Article CAS PubMed Google Scholar
Ivanova T, Churnosova M, Abramova M, Ponomarenko I, Reshetnikov E, Aristova I, et al. Risk effects of rs1799945 polymorphism of the HFE gene and intergenic interactions of GWAS-significant loci for arterial hypertension in the Caucasian population of Central Russia. Int J Mol Sci. 2023;24:8309.
Article CAS PubMed PubMed Central Google Scholar
Selvaraj S, Seidelmann S, Silvestre OM, Claggett B, Ndumele CE, Cheng S, et al. HFE H63D polymorphism and the risk for systemic hypertension, myocardial remodeling, and adverse cardiovascular events in the ARIC study. Hypertension. 2019;73:68–74.
Article CAS PubMed Google Scholar
Tyrmi JS, Kaartokallio T, Lokki AI, Jääskeläinen T, Kortelainen E, Ruotsalainen S, et al. Genetic risk factors associated with preeclampsia and hypertensive disorders of pregnancy. JAMA Cardiol. 2023;8:674–83.
Article PubMed PubMed Central Google Scholar
Changalidis AI, Maksiutenko EM, Barbitoff YA, Tkachenko AA, Vashukova ES, Pachuliia OV, et al. Aggregation of genome-wide association data from FinnGen and UK biobank replicates multiple risk loci for pregnancy complications. Genes. 2022;13:2255.
Article CAS PubMed PubMed Central Google Scholar
Takeuchi F, Akiyama M, Matoba N, Katsuya T, Nakatochi M, Tabara Y, et al. Interethnic analyses of blood pressure loci in populations of East Asian and European descent. Nat Commun. 2018;9:5052.
Article PubMed PubMed Central Google Scholar
Siva N. 1000 genomes project. Nat Biotechnol. 2008;26:256–7.
Article PubMed Google Scholar
Meirmans PG, Hedrick PW. Assessing population structure: FST and related measures. Mol Ecol Resour. 2011;11:5–18.
Article PubMed Google Scholar
Bi W, Zhao Z, Dey R, Fritsche LG, Mukherjee B, Lee S. A fast and accurate method for genome-wide scale phenome-wide G× E analysis and its application to UK Biobank. Am J Hum Genet. 2019;105:1182–92.
Article CAS PubMed PubMed Central Google Scholar
Jiang L, Zheng Z, Qi T, Kemper KE, Wray NR, Visscher PM, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet. 2019;51:1749–55.
Article CAS PubMed Google Scholar
Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015;96:37–53.
Article CAS PubMed PubMed Central Google Scholar
Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, Gignoux CR, et al. Reconstructing the population genetic history of the Caribbean. PLoS Genet. 2013;9:e1003925.
Article PubMed PubMed Central Google Scholar
Schlebusch CM, Skoglund P, Sjödin P, Gattepaille LM, Hernandez D, Jay F, et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science. 2012;338:374–9.
Article CAS PubMed PubMed Central Google Scholar
Hou K, Bhattacharya A, Mester R, Burch KS, Pasaniuc B. On powerful GWAS in admixed populations. Nat Genet. 2021;53:1631–3.
Article CAS PubMed PubMed Central Google Scholar
Caliebe A, Tekola-Ayele F, Darst BF, Wang X, Song YE, Gui J, et al. Including diverse and admixed populations in genetic epidemiology research. Genetic epidemiology. 2022;46:347–1.
Conomos MP, Reiner AP, Weir BS, Thornton TA. Model-free estimation of recent genetic relatedness. Am J Hum Genet. 2016;98:127–48.
Article CAS PubMed PubMed Central Google Scholar
Deng H-W, Chen W-M, Recker RR. Qtl fine mapping by measuring and testing for Hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of populations. Am J Hum Genet. 2000;66:1027–45.
Article CAS PubMed PubMed Central Google Scholar
Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, et al. Efficient bayesian mixed-model analysis increases association power in large cohorts. Nat Genet. 2015;47:284–90.
Article CAS PubMed PubMed Central Google Scholar
All of Us Research Program Investigators. The “All of Us” research program. N Engl J Med. 2019;381:668–76.
Article Google Scholar
Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature. 2021;590:290–9.
Article CAS PubMed PubMed Central Google Scholar
Sohail M, Palma-Martínez MJ, Chong AY, Quinto-Cortés CD, Barberena-Jonas C, Medina-Muñoz SG, et al. Mexican biobank advances population and medical genomics of diverse ancestries. Nature. 2023;622:775–83.
Article CAS PubMed PubMed Central Google Scholar
Schneiderman N, Llabre M, Cowie CC, Barnhart J, Carnethon M, Gallo LC, Giachello AL, Heiss G, Kaplan RC. LaVange LMJDc: Prevalence of diabetes among Hispanics/Latinos from diverse backgrounds: the Hispanic community health study/study of Latinos (HCHS/SOL). Diabetes care. 2014;37:2233–9.
Hayeck TJ, Zaitlen NA, Loh P-R, Vilhjalmsson B, Pollack S, Gusev A, et al. Mixed model with correction for case-control ascertainment increases association power. Am J Hum Genet. 2015;96:720–30.
Article CAS PubMed PubMed Central Google Scholar
Coop G. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. arXiv preprint arXiv:2207. 11595. 2022.
Lewis AC, Molina SJ, Appelbaum PS, Dauda B, Di Rienzo A, Fuentes A, et al. Getting genetic ancestry right for science and society. Science. 2022;376:250–2.
Article CAS PubMed PubMed Central Google Scholar
Mathieson I, Scally A. What is ancestry? PLoS Genet. 2020;16:e1008624.
Article CAS PubMed PubMed Central Google Scholar
Krainc T, Fuentes A. Genetic ancestry in precision medicine is reshaping the race debate. Proc Natl Acad Sci U S A. 2022;119:e2203033119.
Article CAS PubMed PubMed Central Google Scholar
Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, et al. Toward a fine-scale population health monitoring system. Cell. 2021;184(2068–2083):e2011.
Google Scholar
Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–8.
Article CAS PubMed PubMed Central Google Scholar
Ding Y, Hou K, Xu Z, Pimplaskar A, Petter E, Boulier K, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023;618:774–81.
Article CAS PubMed PubMed Central Google Scholar
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.
Article PubMed PubMed Central Google Scholar
Butler RW: Saddlepoint approximations with applications. Vol.22. Cambridge University Press; 2007.
Goutis C, Casella G. Explaining the saddlepoint approximation. Am Stat. 1999;53:216–24.
Article Google Scholar
Barndorff-Nielsen OE. Approximate interval probabilities. J R Stat Soc Series B Stat Methodol. 1990;52:485–96.
Article Google Scholar
Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12.
Article CAS PubMed Google Scholar
Foreman LA, Smith AF, Evett IW. Bayesian analysis of DNA profiling data in forensic identification applications. J R Stat Soc Ser A Stat Soc. 1997;160:429–59.
Article Google Scholar
Rannala B, Mountain JL. Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci U S A. 1997;94:9197–201.
Article CAS PubMed PubMed Central Google Scholar
Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.
Article CAS PubMed PubMed Central Google Scholar
Hou K, Ding Y, Xu Z, Wu Y, Bhattacharya A, Mester R, et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet. 2023;55:549–58.
Article CAS PubMed PubMed Central Google Scholar
Wu P, Gifford A, Meng X, Li X, Campbell H, Varley T, et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med Inform. 2019;7:e14325.
Article PubMed PubMed Central Google Scholar
Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26:1205–10.
Article CAS PubMed PubMed Central Google Scholar
Loh P-R, Danecek P, Palamara PF, Fuchsberger C, A Reshef Y, K Finucane H, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016;48:1443–8.
Article CAS PubMed PubMed Central Google Scholar
Bi W, GRAB. CRAN.R-project. https://CRAN.R-project.org/package=GRAB. 2025.
Ma Y, Bi W: SPAmix: A scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations. Github. https://github.com/YuzhuoMa97/SPAmix. 2025.
Ma Y, Bi W. YuzhuoMa97/SPAmix: Code used in SPAmix project. 2025. Zenodo. https://doi.org/10.5281/zenodo.17139129.
Ma Y, Bi W: A scalable, accurate, and universal analysis framework using individual-level allele frequency for large-scale genetic association studies in an admixed population. Zenodo. https://zenodo.org/record/8343684. 2025.

Download references

Acknowledgements

UK Biobank data was accessed under the accession number 78795. We thank Dr. Seokho Jeong for providing the maximal independent set algorithm used in AllofUs data. This research is supported by high-performance computing platform of Peking University. This work was supported under the framework of international cooperation program managed by National Research Foundation of Korea (2024K2A9A2A0601368311).

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2.

Funding

This research was supported by National Natural Science Foundation of China (82441004, 62273010, 62411540239, W.B.; 32471202, P.Z.), Clinical Medicine Plus X—Young Scholars Project, Peking University (PKU2023LCXQ043, W.B.), National Research Foundation of Korea (2024K2A9A2A0601368311, S.L.), and Innovation Fund for Outstanding Doctoral Candidates of Peking University Health Science Center (BMU2025BSS001, Y.M.).

Author information

Authors and Affiliations

Department of Medical Genetics, School of Basic Medical Sciences, Peking University, Beijing, China
Yuzhuo Ma, He Xu, Ying Li, Lin Miao & Wenjian Bi
Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
Yuzhuo Ma
Graduate School of Data Science, Seoul National University, Seoul, Korea
Hyesung Kim & Seunggeun Lee
Renal Division, Peking University First Hospital, Peking University Institute of Nephrology, Beijing, China
Lin-lin Xu & Xu-jie Zhou
State Key Laboratory of Radiation Medicine and Protection, Cyrus Tang Medical Institute, National Clinical Research Center for Hematologic Diseases, Collaborative Innovation Center of Hematology, Soochow University, Suzhou, China
Peng Xu
Institute of Medical Innovation and Research, Peking University Third Hospital, Beijing, China
Fengbiao Mao
Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
Wei Zhou
Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
Wei Zhou
School of Automation and Electrical Engineering, Zhongyuan University of Technology, Zhengzhou, China
Ji-Feng Zhang
State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Ji-Feng Zhang
Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China
Peipei Zhang
Key Laboratory for Neuroscience, Ministry of Education/National Health and Family Planning Commission, Peking University, Beijing, China
Peipei Zhang
Center for Medical Genetics, School of Basic Medical Sciences, Peking University, Beijing, China
Wenjian Bi
Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
Wenjian Bi

Authors

Yuzhuo Ma
View author publications
Search author on:PubMed Google Scholar
He Xu
View author publications
Search author on:PubMed Google Scholar
Ying Li
View author publications
Search author on:PubMed Google Scholar
Hyesung Kim
View author publications
Search author on:PubMed Google Scholar
Lin-lin Xu
View author publications
Search author on:PubMed Google Scholar
Lin Miao
View author publications
Search author on:PubMed Google Scholar
Peng Xu
View author publications
Search author on:PubMed Google Scholar
Fengbiao Mao
View author publications
Search author on:PubMed Google Scholar
Xu-jie Zhou
View author publications
Search author on:PubMed Google Scholar
Wei Zhou
View author publications
Search author on:PubMed Google Scholar
Seunggeun Lee
View author publications
Search author on:PubMed Google Scholar
Ji-Feng Zhang
View author publications
Search author on:PubMed Google Scholar
Peipei Zhang
View author publications
Search author on:PubMed Google Scholar
Wenjian Bi
View author publications
Search author on:PubMed Google Scholar

Contributions

Y.M., S.L., J.Z., P.Z., and W.B. designed the experiments. Y.M. and W.B. performed the experiments. H.X. constructed longitudinal traits from UK Biobank and conducted analysis using trajGWAS. Y.M., Y.L., and W.B. constructed the time-to-event traits from UK Biobank. H.K. constructed the time-to-event traits and performed GWAS in AllofUs. L.M. prepared a Docker image for SPAmix, ensuring that the software can be deployed reproducibly. Y.M., L.X., and W.B. wrote the manuscript with the assistance of P.X., F.M., X.Z, S.L., J.Z, and P.Z. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Ji-Feng Zhang, Peipei Zhang or Wenjian Bi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13059_2025_3827_MOESM1_ESM.pdf

Additional file 1. Supplementary information includes supplementary notes, supplementary figures S1-S72, and supplementary tables S1-S7.

Additional file 2. Peer review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Ma, Y., Xu, H., Li, Y. et al. SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations. Genome Biol 26, 356 (2025). https://doi.org/10.1186/s13059-025-03827-9

Download citation

Received: 17 December 2023
Accepted: 03 October 2025
Published: 16 October 2025
Version of record: 16 October 2025
DOI: https://doi.org/10.1186/s13059-025-03827-9

SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations

Abstract

Background

Results

Conclusions

Background

Results

An overview of SPAmix

Association analysis in the UK Biobank data

SPAmix is scalable to analyze large-scale biobank data

SPAmix is accurate to analyze multi-way admixed populations

SPAmix identified 377 additional significant loci by including non-WB individuals

Association analysis in the ALL of Us Data

Simulation studies

Type I error rates

Empirical power

Simulation of a three-way admixed population

SPAmix can control type I error rates in the case of variance heterogeneity and baseline hazard function heterogeneity

SPAmix framework can account for ancestry-specific environmental factors

SPAmix and SAIGE show comparable type I error rates and powers

SPAmix framework can utilize local ancestry and Cauchy combination to maximize power for various cross-ancestry genetic architectures

SPAmixlocal and SPAmixCCT can control type I errors in the case of variance heterogeneity

UK Biobank data analysis validated that SPAmix overperformed SPACox

Discussion

Conclusions

URLs

Methods

Model residuals of linear regression and logistic regression

Cox proportional hazard model and SPACox

Longitudinal trajectories and trajGWAS

Ordinal traits and proportional odds logistic regression model

SPAmix uses individual-specific allele frequency to adjust for admixed population

Individual-specific allele frequency estimation

Normal distribution approximation and saddlepoint approximation

Partial normal distribution approximation in SPA

SPAmixlocal tests for ancestry-specific genetic effect using ancestry-specific score statistics

Optimization guidelines for SPAmixlocal

SPAmixCCT combines p values of SPAmix and SPAmixlocal to maximize powers

Data simulation

Simulation studies including three genetic ancestries

Simulation studies of quantitative traits in the case of variance heterogeneity

Simulation studies of time-to-event trait in the case of baseline hazard function heterogeneity

Simulation studies of ordinal traits

Simulation studies of binary and quantitative traits given ancestry-specific genetic effect sizes

Type I error simulations

Power simulations

Association analysis of Tractor and SPAmixlocal in simulation studies

Simulation studies of quantitative traits with the utilization of local ancestry information in the case of variance heterogeneity

Application to the UK Biobank Data

Application to the ALL of US Data

Data availability

References

Acknowledgements

Peer review information

Review history

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary Information

13059_2025_3827_MOESM1_ESM.pdf

Additional file 2. Peer review history.

Rights and permissions

About this article

Cite this article

Share this article

Genome Biology

Contact us

SPAmix_local and SPAmix_CCT can control type I errors in the case of variance heterogeneity

SPAmix_local tests for ancestry-specific genetic effect using ancestry-specific score statistics

Optimization guidelines for SPAmix_local

SPAmix_CCT combines p values of SPAmix and SPAmix_local to maximize powers