Main

Genome-wide association studies (GWAS) have identified thousands of loci associated with diseases and traits and have contributed to our molecular understanding of human phenotypes1,2,3,4,5,6,7,8,9. However, for most of these loci, we still do not fully understand the causal mechanisms of the associations. This is partly because of insufficient resolution of associations and limited population sources of genetic associations. Non-European large-scale association studies with sufficient resolution of variants would expand the causal mechanisms implicated by population-specific associations and variants. Additionally, the limited availability of sensitive fine-mapping strategies has hindered our understanding of causal variants10,11. Furthermore, a substantial fraction of the lead variants and their linked variants exist in noncoding regions, where functional interpretation is still challenging. Enrichment of causal variants in functional annotations would provide clues about the underlying mechanisms12.

To overcome these challenges and improve our understanding of causal genetic relationships, we adopted the following strategies. First, using 3,256 high-depth whole-genome sequencing (WGS) data from individuals of Japanese ancestry combined with the 1000 Genomes Project13, we developed a new genotype imputation reference panel. This high-quality reference panel enabled us to impute population-specific rare coding and noncoding variants with high accuracy at the population scale. Second, we performed GWAS analyses in up to 260,000 Japanese individuals. Third, we applied statistical fine-mapping to decompose the observed associations into independent causal signals, leveraging the precise linkage disequilibrium (LD) determined by our dense WGS reference panel. Lastly, we conducted comprehensive in silico analyses and follow-up biological experiments for functional interpretation of noncoding variants.

Results

GWAS for 63 quantitative traits

We compiled a new genotype imputation reference panel and used the WGS data to impute the genotypes of 203,216 Japanese individuals; we then performed GWAS analyses for 63 quantitative traits and up to 15,907,072 variants. To replicate the results and maximize statistical power to find new associations, we additionally analyzed 53,083 individuals for 26 traits in another Japanese dataset (Methods, Extended Data Fig. 1, Supplementary Tables 1 and 2, and Supplementary Note 1). We observed a calibrated distribution of test statistics according to the polygenicity of these traits (median LD Score intercept = 1.06; Supplementary Table 3) and high replication rates (Supplementary Note 2). We identified 4,423 genome-wide significant associated loci, including 601 previously unreported loci (Supplementary Tables 4 and 5 and Supplementary Note 2). Statistical fine-mapping revealed 826 phenotype–variant pairs (associations) with a marginal posterior probability of inclusion (PPI) greater than 0.9 and 9,406 with a PPI greater than 0.1 (Supplementary Note 3 and Supplementary Tables 612), which, as shown in previous studies14,15,16, included loci with multiple signals.

New associations with rare functional coding variants

We found rare Japanese-specific coding variants driving new associations and directly implicating probable causal genes. One such example is a very rare missense variant rs730881101 in TNNT2 (ENST00000509001:c.422G>A, p.R141Q), minor allele frequency (MAF) = 0.1%) associated with decreased systolic heart function (reduced ejection fraction (EF) and increased left ventricular end-systolic diameter (LVDS)) (MAF = 0.1%, βEF = − 0.925, PEF = 5.9 × 10−11, PPIEF = 0.50, βLVDS = 0.830, PLVDS = 2.7 × 10−9, PPILVDS = 0.50; Fig. 1a–c). Notably, the effect size of this variant was more than 80% of the s.d. TNNT2 is a causal gene for dilated cardiomyopathy and has not been reported for its association with cardiac function in a population-scale GWAS. We also found that this variant was strongly associated with the prevalence of heart failure with a large effect size (odds ratio (OR) = 4.5 (3.1–6.5), P = 1.4 × 10−15).

Fig. 1: New rare putative causal coding variants associated with human quantitative traits implicate candidate causal genes.
figure 1

a, Deleterious coding variant in TNNT2 (rs730881101) showing strong associations with cardiac functions. The horizontal axis indicates the genomic coordinates; the vertical axis indicates the negative log10(P). Statistical significance was tested using a linear mixed model. The displayed P values are two-sided and not adjusted for multiple testing. b, Three-dimensional structure of Troponin-T and putative effect of the coding variant. c, β estimates, PPI and alternative allele frequency (AAF) of rs730881101. The error bar for the β estimates indicates the 95% CI. The number of individuals included in the analysis is shown after the trait names. d, A deleterious coding variant in TNFRSF17 (rs150352299) showing strong associations with AG ratio and non-ALB protein levels. The horizontal axis indicates the genomic coordinates; the vertical axis indicates the negative log10(P). e, β estimates, PPI and AAF of rs150352299. f, Bulk tissue expression of TNFRSF17 in the GTEx. The number of samples is shown after the organ name. The violin plots show the distribution of gene expression in transcripts per million (TPM). The box plot shows the median value as the centerline; the box boundaries show the first and third quartiles and the whiskers extend 1.5 times the interquartile range. g, OR for 29 diseases of rs150352299 in unrelated Biobank Japan (BBJ) participants. Case counts are shown after the outcomes (nTotal = 169,020). The squares indicate the OR; the error bars indicate the 95% CI. Statistical significance was tested using a logistic regression with two-sided test at P < 0.05/29. The displayed P values were not adjusted for multiple testing. h, Deleterious coding variant in RYR1 (rs192863857) associated with CK levels. i, β estimate, PPI and AAF of rs192863857. j, Bulk tissue expression of RYR1 in the GTEx. AFR, African; AMR, Admixed American; ASJ, Ashkenazi Jewish; Ca, cancer; FIN, Finnish; NFE, non-Finnish European; OTH, others. The AAF was obtained from the Genome Aggregation Database (gnomAD) dataset. The number of individuals included in the association analysis is found in Supplementary Table 1; the abbreviations for the phenotypes are found in Supplementary Table 2.

Another example is rs150352299 in TNFRSF17 (ENST00000053243:457G>A, p.A153T; Fig. 1d–g and Supplementary Table 5). This rare (MAF = 0.38%) Japanese-specific missense variant was significantly associated with a higher albumin :globulin ratio (AG) (βAG = 0.306, PAG = 3.9 × 10−22), lower non-albumin protein (NAP) (βNAP = − 0.327, PNAP = 3.3 × 10−25) and lower total protein (TP) (βTP = − 0.183, PTP = 6.5 × 10−10). These associations suggested decreased globulin concentration in the blood. TNFRSF17 encodes B cell maturation antigen (BMA), which is specifically expressed in mature B cells and is responsible for antibody production (Extended Data Fig. 2a). Furthermore, we identified an increased risk of chronic obstructive pulmonary disease with rs150352299, which is consistent with several reports of primary immunodeficiency as an underlying cause of chronic obstructive pulmonary disease17. BMA is known to interact with B cell activating factor encoded by TNFRSF13B, in which we also identified a Japanese-specific rare loss-of-function variant, rs769165409, associated with the AG with high PPI (MAF = 0.1%, βAG = 0.353, PAG = 3.9 × 10−7; Supplementary Note 4.1). These results provide genetic evidence for critical roles of BMA–B cell activating factor interaction in the immunoglobulin production of B cells.

Other examples include associations with creatine kinase (CK) levels. RYR1 encodes the ryanodine receptor, a crucial calcium channel in muscle. rs192863857, a rare missense substitution in RYR1 (ENST00000359596:c.5317C>T, p.P1773S), was associated with CK (MAF = 1.48%, βCK = 0.134, PCK= 2.5 × 10−15, PPICK = 0.67; Fig. 1h–j and Supplementary Table 5). We also identified a new missense variant associated CK levels in CACNA1S, which encodes the main subunit of the calcium channel (MAF = 3.5%, βCK = − 0.064, PCK= 1.5 × 10−9, PPICK = 1.00; Extended Data Fig. 2). These genes are specifically expressed in skeletal muscle (Extended Data Fig. 2b,c) and are involved in malignant hyperthermia (MH), a disease characterized by massive CK elevations precipitated by exposure to certain anesthetics. The results suggest that high serum CK levels in the absence of the causative stressors of MH may reflect the effects of variants in these causal genes.

Another plausible example is a Japanese-specific rare deleterious missense variant of USP47 associated with glucose levels (Extended Data Fig. 3 and Supplementary Table 4). USP47 was reported to be associated with several cancers in humans, but was not reported in the context of glucose levels. To support this finding, knockout of Usp47 in mice resulted in increased glucose levels (Supplementary Note 4.1).

We also found other new associations between quantitative traits and missense variants that are rare and specific to, or more prevalent in, East Asians (EAS). These include associations of ARHGAP36 with sodium levels, RFWD2 with basophil counts, S1PR4 with segmented neutrophil counts, EVC with estimated glomerular filtration rate (eGFR), MYCT1 with red blood cell count, EGLN1 with eGFR, STAB2 with activated partial thromboplastin time and SLC12A3 with chloride levels (Supplementary Note 4.1 and Extended Data Fig. 3). These new associations deepen our understanding of mechanisms underlying complex traits by providing highly likely causal genes and variants. In line with these findings, putative causal variants (PPI > 0.9) were significantly enriched in protein-altering or protein-truncating variants (PTVs) (ORprotein-altering = 45 (95% CI = 37–55), P = 5.4 × 10−147; ORprotein-truncating = 106 (95% CI = 56–184), P = 2.7 × 10−22; Extended Data Fig. 3 and Supplementary Tables 7 and 8).

New population-specific noncoding associations

We also found other new associations with noncoding variants, including Japanese-specific rare variants. New associations of noncoding variants included genes whose functions are largely unknown or known in limited contexts that are not associated with quantitative traits. In particular, we could connect long noncoding RNAs with quantitative traits. rs78568419 in LINC00670, an EAS-specific variant (also present at very low frequency in admixed American populations), was associated with platelet (PLT) count (Supplementary Table 4 and Extended Data Fig. 4). This long noncoding RNA is called cardinal and is expressed mainly in arterial tissues (coronary artery and aorta in the Genotype-Tissue Expression (GTEx) project12). In line with these findings, this variant showed a pleiotropic association with coronary artery disease (P = 0.001)7. Another example is the association of an EAS-specific variant in LINC01094 (long intergenic non-protein-coding RNA 1094) with total cholesterol (TC) and high-density lipoprotein cholesterol (HDLC) (Supplementary Note 4.1). LINC01094 has been associated with gastric cancer and renal cell carcinoma18. Other examples include associations of zinc-finger protein genes, such as ZNF365 with HDLC, ZNF787 with basophil count, ZNF423 with hematocrit and hemoglobin, ZNF468 with eGFR and blood urea nitrogen and ZNF444 with white blood cell (WBC) count (Supplementary Table 4 and Supplementary Note 4.1).

Previously unreported associations of noncoding variants in genes of known function include the association of an upstream variant of CD118 with low-density lipoprotein cholesterol (LDL) levels (Extended Data Fig. 4); CD118 encodes a leukemia inhibitory factor receptor; this variant was not present in Europeans. Other examples include the association of hematocrit and hemoglobin with an upstream variant in HEY1 that is highly specific to Asians, and the association of eosinophils with an ETV6 variant (Extended Data Fig. 4 and Supplementary Note 4.1). HEY1 encodes a crucial transcription factor involved in the NOTCH pathway, which was suggested to have critical roles in erythropoiesis19. ETV6 is implicated in myeloid lymphoma; several ETV6 fusion protein-positive acute myeloid lymphomas have been associated with clonal eosinophilia20.

We also found new associations of noncoding variants in genes relevant to complex traits. These include an association between glucose levels and an EAS-specific rare variant upstream of PAX4 (Extended Data Fig. 4 and Supplementary Note 4.1). As PAX4 is a master regulator of β-cells in the pancreas, this association suggests that this variant affects the development or function of β-cells via altered PAX4 expression or activity, resulting in increased glucose levels even in individuals without diabetes. Other examples include an association of an intronic variant in AQP1, more frequent in EAS than other populations, with eGFR and serum creatinine levels. AQP1 is a widely expressed water channel, especially in the kidney. Other associations include ATM with hemoglobin and hematocrit, SIRT1 with hemoglobin, RRAS2 with PLTs and CD163 with aspartate aminotransferase (AST) levels (Supplementary Table 4 and Supplementary Note 4.1).

Characterization of putatively causal noncoding variants

We found reasonable enrichment of noncoding causal variants for DNase I hypersensitivity sites (DHS) and consensus footprints (CFPs) (Fig. 2a). To further assess the functionality of these noncoding regulatory elements quantitatively, we applied a deep-learning-based method to predict the pathogenicity (disease impact score) of noncoding variants. The disease impact score showed a strong positive association with PPI (P < 2.9 × 10−30; Fig. 2b, Extended Data Fig. 5 and Methods). A typical example was rs146018792 (MAF = 0.48%), a rare Japanese-specific noncoding variant in an intron of CCND3 significantly associated with red blood cell-related traits (mean corpuscular volume (MCV) and mean corpuscular hemoglobin (MCH); PMCV = 7.8 × 10−14, PPIMCV = 1.00; PMCH = 8.5 × 10−11, PPIMCH = 0.87). rs146018792 is in a CFP within a myeloid-specific DHS (Fig. 2c,d). This variant had one of the highest disease impact scores (99.98 percentile; Extended Data Fig. 5). Specifically, this variant strongly decreased the affinity of the cFos/JUN transcription factor in the myeloid cell line K562 (Extended Data Fig. 5).

Fig. 2: Noncoding rare variants associated with human quantitative traits represent a substantial fraction of putative causal variants.
figure 2

a, Enrichment of variants within the regulatory region in variants with high PPI. The vertical axis indicates the OR of variants in each PPI bin within the DHS/CFP or not in comparison with the variants with the lowest PPI bin (0–0.1). The error bars indicate the 95% CIs. The circles and stars indicate noncoding and coding variants, respectively. b, Higher predicted pathogenicity of noncoding putative causal variants. The vertical axis indicates the disease impact score predicted from its sequence changes (Methods). The box plot shows the median value as the centerline; the box boundaries show the first and third quartiles and the whiskers extend 1.5 times the interquartile range. c, A rare Japanese-specific noncoding variant rs146018792 in CCND3 strongly associated with MCV and MCH is in the CFP of the myeloid cell line K562. d, β estimates, PPI and AAF of rs146018792. The error bar for the β estimates indicates the 95% CI. The number of individuals included in the analysis is shown after the trait names. e, Distribution of the absolute β estimates of associations with a PPI > 0.9. The dashed line shows the median absolute β estimate of protein-truncating associations (median |βPTV | = 0.261). The colored dots indicate large effect associations with |β| > 0.261. f, Distribution of the MAFs of associations with a PPI > 0.9. The colored dots indicate the large effect associations defined in e. g, Proportion of population-specific variants within each PPI bin. The y axis indicates the fraction of variants found in only one population in each indicated PPI bin. The color indicates the population in which the variants were found. The AAF was obtained from the gnomAD dataset. The number of individuals included in the association analysis is found in Supplementary Table 1; the abbreviations for the phenotypes are found in Supplementary Table 2.

High-impact variants were strongly constrained across protein-truncating, protein-altering and noncoding variants (Fig. 2e,f). Importantly, we found that these putative causal variants (especially those with high PPI) were highly specific to EAS (Fig. 2g). We observed an array of rare, noncoding variants with comparable effect sizes to coding variants. As an example, rs542962114, a Japanese-specific rare noncoding variant located upstream of LDHB, was significantly associated with lactate dehydrogenase (LDH) levels with a large effect size (Extended Data Fig. 6; MAF = 1.13%, βLDH = − 0.317, PLDH = 1.6 × 10−70). As most causal variants are noncoding, if we compare the number of variants, more than twice as many noncoding high-impact associations as coding variants were observed (Supplementary Note 5). In line with this finding, among all the putative causal associations (PPI > 0.1), a noncoding variant showed the largest effect size (rs33981098 and MCV; βMCV = − 1.67, PMCV = 1.2 × 10−89, PPIMCV = 1.00). rs33981098 is a known noncoding pathogenic variant for β-thalassemia located upstream of HBB (hemoglobin B)21. These findings underscore the importance of clarifying the mechanisms underlying causal noncoding variants.

Thus, high-quality whole-genome imputation enabled us to assess the impact of these rare noncoding variants on human phenotypes and rare coding variants at the population scale.

New population-specific causal variants in known loci

We also found new EAS or Japanese-specific causal variants (coding and noncoding) in genes previously known for their associations with quantitative traits (variant-level new associations). We found seven signals in the PCSK9 locus associated with LDL (PPI > 0.1), including four population-specific rare coding variants. A very rare new noncoding variant showed the strongest effect size among the seven variants (Extended Data Fig. 7 and Supplementary Note 4.2).

Among these associations, we found rare Asian-specific variants as probably causal via altered splicing. We observed strong enrichment of predicted cryptic splicing variants by Splice-AI (predicted cryptic splice scores > 0.2) in variants with high PPI (PPI > 0.9, OR = 8.4 (3.8–16.1), P = 2.1 × 10−6; Fig. 3a). rs76080105 is an intronic variant in FLT3 (ENST00000241453:c.2208-14A>G, MAF = 0.76%), which encodes a tyrosine kinase and whose distinct cryptic splicing variant was recently reported to cause autoimmune thyroid disease in Europeans22. rs76080105 was predicted to cause a splice acceptor loss; we found significant associations with several immunological traits supported by high PPI (Supplementary Note 4.2). We also found that rs76080105 was associated with rheumatoid arthritis and systemic lupus erythematosus in the Japanese population23 (Supplementary Note 4.2). We experimentally validated this cryptic splice alteration (Fig. 3b–d; P = 1.26 × 10−6, Fisher’s exact test). Another example is rs141440582, a rare missense variant in MMP2 (ENST00000219070:c.1453A>T, p.I485F), MAF = 0.63%), which is associated with height (P = 6.9 × 10−9, PPI = 0.15). This missense variant is predicted to introduce a splice donor gain, resulting in a 25-bp frameshift deletion, which we validated experimentally (P = 2.9 × 10−11, Fisher’s exact test; Fig. 3e,f and Supplementary Note 4.2). In agreement with our observation, Mmp2 knockout in mice resulted in short stature and abnormal bone formation24.

Fig. 3: Rare population-specific putative causal splice variants and pathogenic variants associated with human quantitative traits.
figure 3

a, Enrichment of putative cryptic splice variants among variants with high PPI. The vertical axis indicates the OR and 95% CI of the cryptic splice variants (Splice-AI delta score > 0.2) for each PPI bin (the horizontal axis) to the lowest PPI bin. The OR and 95% CI were estimated using a Fisher’s exact test. The number of variants included in the analysis is shown after the PPI bins. b, Schematic representation of the in vitro splicing assay. cf, Schematic representation of alternative splicing, effect size, PPI and population frequency of the cryptic splice variant rs76080105 (FLT3, c,d) and rs141440582 (MMP2, e,f). The error bar for the β estimates indicates the 95% CI. The number of individuals included in the analysis is shown after the trait names. The horizontal axes indicate the genomic coordinate. The vertical axes indicate the exon coverage of the RNA sequence from the reference construct (top) and the alternate construct (bottom). Variant sites are indicated in red. g, Enrichment of ClinVar variants among variants with a high PPI. The vertical axis indicates the categories in ClinVar. The horizontal axis indicates the OR of a high PPI using benign variants as reference and the 95% CI estimated using a Fisher’s exact test. The number of variants included in the analysis is shown after the variant annotations. h, Fraction of deleterious to tolerated variants evaluated using PolyPhen or sorting intolerant from tolerant (SIFT) in each PPI bin (horizontal axis). i, Schematic representation of the CD36 locus where rs75326924 is located. j, β estimates, PPI and AAF of rs75326924. k, Schematic representation of the ABCG5 locus where rs119480069 is located. l, β estimates, PPI and AAF of rs119480069. The AAF was obtained from the gnomAD dataset. The number of individuals included in the association analysis is found in Supplementary Table 1; the abbreviations for the phenotypes are found in Supplementary Table 2.

High PPI variants enriched in pathogenic variants

We observed a 15-fold enrichment of pathogenic variants in ClinVar among putative causal variants (PPI > 0.1, P = 2.2 × 10−10; Fig. 3g,h and Supplementary Table 9). All these clinically determined pathogenic variants were connected with quantitative trait-relevant diseases (Supplementary Note 5.2). For example, rs75326924, a missense variant in CD36 and known to be causal for CD36 deficiency25, showed putative causal associations with multiple quantitative traits, including PLT count, fatty acids, ejection fraction and heart failure (Fig. 3i,j and Supplementary Note 5.2), in line with the biological functions of CD36. rs119480069, a known pathogenic missense variant in ABCG5, showed causal associations with TC and LDL (Fig. 3k,l).

Functional annotations for noncoding causal variants

To further elucidate the consequences of putative causal noncoding variants on functional annotations that are in line with previous studies26, we assessed the overlap of putative causal noncoding variants in DHS and CFP regions and found significant enrichment in a phenotype-relevant tissue-specific manner (Supplementary Table 11 and Extended Data Fig. 8). One of the most significant enrichments was observed in height-associated noncoding variants in musculoskeletal-specific DHS (ORHeight = 3.2 (2.4–4.1), PHeight = 7.2 × 10−15). Noncoding variants associated with hematological traits and antibody production were significantly enriched in myeloid-specific and lymphoid-specific DHS, respectively. We also found significant enrichment of causal variants for causal expression quantitative trait locus (eQTL) variants in the GTEx (Extended Data Fig. 9 and Supplementary Note 6.1).

We performed an enrichment analysis of functional annotations of noncoding variants neither in DHS nor CFPs to explore potential mechanisms. We found that such variants with high PPI are strongly enriched in the 3′ UTR or 5′ UTR of transcripts, suggesting crucial roles and distinct mechanisms of these regions on quantitative traits and underscoring diverse mechanisms of causal variants (Fig. 4a). These enrichments were also observed in the UK Biobank (UKB) (Extended Data Fig. 9 and Supplementary Note 6.2).

Fig. 4: Enrichment of putative causal noncoding variants for functional annotations and a new mechanism of causal variants in 3′ UTR.
figure 4

a, Enrichment of causal noncoding variants for functional annotations. Each point and error bar indicates the OR of variants with a high PPI ((0.1, 0.9] or (0.9, 1]) and the 95% CI, respectively. The 95% CIs were estimated using a Fisher’s exact test. b, Regional association plot and strong associations of the IL6 locus. The horizontal axis indicates the genomic coordinates and the vertical axis indicates a negative log10(P). Statistical significance was tested using a linear mixed model. The displayed P values are two-sided and were not adjusted for multiple testing. The β estimate, PPI and AAF in the global population of rs13306436 are shown. The error bar for the β estimates indicates the 95% CI. The number of individuals included in the analysis is shown after the trait name. c, rs13306436 showed resistance to regnase-1-mediated inhibition of a IL6 3′ UTR reporter. Overexpression of regnase-1 (10 ng per well) decreased expression of the reporter harboring the IL6 3′ UTR of both the wild-type (WT) (G) and variant (A) alleles of rs13306436, but the variant (A) allele of rs13306436 exhibited less of a decrease. The results are representative of experiments carried out in triplicate. Statistical significance was assessed using two-sided t-test. d, Working hypothesis of rs13306436 altering the posttranscriptional regulation of IL6 expression. Regnase-1 recognizes the stem-loop structure within the 3′ UTR of IL6 and leads to mRNA degradation. rs13306436 is located close to the stem-loop sequence; the variant (A) allele is more structured, which might suppress regnase-1-mediated degradation, thereby making the mRNA more stable (Supplementary Note 6.3). Short transcripts indicate degraded ones. e, OR for 29 diseases among carriers of rs13306436 in unrelated BBJ participants. Case counts are shown after the outcomes (nTotal = 169,020). The squares indicate the OR; the error bars indicate the 95% CI. Statistical significance was tested using a logistic regression with a two-sided test at P < 0.05/29. The displayed P values were not adjusted for multiple testing. The AAF was obtained from the gnomAD dataset. The number of individuals included in the association analysis is found in Supplementary Table 1; the abbreviations for the phenotypes are found in Supplementary Table 2.

One such example was rs13306436, a rare EAS-specific variant in the 3′ UTR of IL6, associated with ten traits with high PPI (PPI > 0.9: fibrinogen (FBG), NAP, C-reactive protein (CRP), PLT, MCH and AG; PPI > 0.1: alkaline phosphatase (ALP), WBC, LDH and mean corpuscular hemoglobin concentration; Fig. 4b), concordant with the multipotency of IL6. The direction of the effects of this variant suggested an increase in immunogenicity (increased FBG, CRP, NAP and WBC). This variant is located near the binding site of regnase-1, an RNA-binding protein targeting the 3′ UTR of transcripts to degrade mRNA and control protein levels27,28. We experimentally showed that a reporter carrying the IL6 mRNA 3′ UTR with the rare minor allele rs13306436 was resistant to degradation by regnase-1 (Fig. 4c), suggesting that the mRNA structure is altered by this variant, resulting in stable mRNA, increased interleukin-6 levels and consequently increased immunogenicity (Fig. 4d and Supplementary Note 6.3). We found that this variant decreased the risk of tuberculosis (Fig. 4e), in agreement with a previous report that identified an increased risk of tuberculosis infection in Il6 knockout mice29. As regnase-1 targets several immune-related genes, our results may suggest more potential 3′ UTR variants as targets of regnase-1. In line with these findings, genes in which we found probable causal variants at the 3′ UTR showed enrichment for the target genes of regnase-1 (hypergeometric test, P = 5.2 × 10−5; Supplementary Note 6.3).

Possibility of drug repurposing

Genes with putatively causal coding variants were enriched in known genes causing monogenic disorders (genes with ‘pathogenic’ variants in the ClinVar database) or drug targets (Extended Data Fig. 10). These genes were more frequently included in protein–protein networks, regardless of genes with coding and noncoding variants (Extended Data Fig. 10). Genes encoding currently available drug targets that showed new associations in the current study included CACNA1S, RYR1, PDE10A, SIRT1 and CYP19A1 (Supplementary Tables 13 and 14). These raise the possibility of repurposing drugs currently available to other phenotypes or diseases (or potential side effects of the currently available drugs), taking advantage of direct or indirect connections to drug targets via the molecular network.

Discussion

In the current study, we combined finely imputed genotypes with a high-density Japanese-orientated imputation reference panel, well-powered multi-trait GWAS in a homogeneous population, a sensitive algorithm to determine the likelihood of causality of variants in the associated loci and in silico and experimental functional analyses. Together, these enabled us to detect many new associations especially specific to EAS, characterize putative causal genes and variants, and find new mechanisms for how causal noncoding variants affect complex traits, despite the bias of existing resources for these purposes toward European populations.

The new associations and putative causal associations in this study will be valuable for further functional follow-up studies for several traits with high sensitivity (Supplementary Note 4 and Data availability). Our study also provides several insights into the genetic architecture of causal associations, especially for coding variants. Discovery of many new associations in rare causal variants indicates the presence of many population-specific rare variants (both for coding and noncoding variants, in line with previous studies30,31) and the importance of deeply analyzing large-scale single populations. We also found that many new associations were driven by common variants in Japanese or EAS populations that were much more frequent in these populations, not limiting to rare variants (Supplementary Note 4).

We found a possible new mechanism underlying causal variants at 3′ UTRs. Further analyses would expand the yet-to-be-identified mechanisms of noncoding variants. The noncoding putative causal variants included rare variants having strong effect sizes on the phenotypes comparable with damaging-coding variants. As most associations are driven by noncoding causal variants, our observations suggest that we could drastically extend potential intervention targets as therapeutic or preemptive options targeting noncoding causal variants.

Our results coincide with the current effort to transition from whole-exome to whole-genome space32. Continuous efforts to expand WGS in single populations by leveraging large sample sizes and extending to the global population would uncover further population-specific associations and variants, from which we can identify causal variants and mechanisms and advance efforts toward personalized medicine.

Methods

Ethics oversights

All participants provided written informed consent according to the protocols approved by following institutional ethical committees: the RIKEN Center for Integrative Medical Sciences; the Institute of Medical Sciences; the University of Tokyo; the National Center for Geriatrics and Gerontology (NCGG); the Tohoku University Graduate School of Medicine; and Iwate Medical University.

Study cohorts

First, we included three different datasets constructed from the contemporary Japanese population (BBJ first cohort, BBJ second cohort, NCGG cohort). We subjected these datasets to imputation using our reference panel (described below) to obtain the results for harmonized variants and then meta-analyzed the results. Additionally, to maximize statistical power to identify new signals especially specific to the EAS population, we also analyzed the quantitative trait data of 53,083 individuals from the Tohoku Medical Megabank Organization (ToMMo) community-based cohort study (67K; Extended Data Fig. 1 and Supplementary Note 1).

The BBJ33,34 is a nationwide hospital-based biobank with 12 collaborating medical institutions. The first cohort targeted 47 diseases and recruited 200,000 people between 2003 and 2013; the second cohort targeted 38 diseases and recruited 67,000 people between 2013 and 2018 (https://biobankjp.org/en/index.html). In this study, 12,098 people with available genotypes were included from the BBJ second cohort. The NCGG Biobank is a hospital-based biobank maintained by the NCGG since 2012. Participants were recruited from the NCGG hospital and nearby medical institutes (https://www.ncgg.go.jp/english/index.html). ToMMo is a population-based cohort in which study participants were recruited from the health checkups conducted in two prefectures of Northeastern Japan: Miyagi (n = 32,459) and Iwate (n = 20,906).

WGS and creation of the imputation reference panel

The procedures for WGS and reference panel construction are described elsewhere35. The 3,256 individuals sequenced are from the BBJ cohort. Briefly, 1,502 individuals were sequenced aiming at a 30× coverage (high coverage) and 1,786 at a 15× coverage (medium coverage) with a HiSeq 2500 (Rapid mode or V4, Illumina) or HiSeq X Five platform. Samples with low sequence quality or from closely related individuals were removed. Sequenced reads were aligned to a human reference genome (hg19) using the Burrows–Wheeler Aligner36; duplicated reads were removed. Then, we conducted joint genotype calling using HaplotypeCaller and GenotypeGVCFs implemented by the Genome Analysis Toolkit37 (v.3.5-0, v.3.8-0 for high coverage, v.3.6-0 for medium coverage, v.3.8-0 for joint calling) according to germline short variant discovery best practice workflows. We removed variants with: (1) read depth (DP) < 5 from high coverage samples; DP < 2 from medium coverage samples; (2) genotype quality (GQ) < 20; (3) DP > 60 and GQ < 95 from high coverage; (4) failed in variant quality score recalibration. The procedures for reference panel construction were as follows. From WGS VCF files generated as above, we removed multiallelic or monomorphic sites, singleton variants and variants deviating from Hardy–Weinberg equilibrium (P < 1 × 10−6). The genotypes from the 1000 Genomes Project13 (phase 3, v.5) were similarly processed. Then, these datasets were merged using IMPUTE2 (ref. 38) v.2.3.2. For the X chromosome, we used BEAGLE39 v.4.1 to merge the male WGS genotypes, and then combined them with the female genotypes. For the ToMMo dataset, the 3,552 Japanese genomes in ToMMo were sequenced using the HiSeq platform and the sequenced reads were aligned to the human reference genome (GRCh37). Genotypes were called using the Genome Analysis Toolkit best practice pipeline and used as a reference panel (ToMMo 3.5KJPNv2)40.

Haplotype phasing and imputation

Genotypes were determined using either (1) the Illumina HumanOmniExpressExome BeadChip or (2) a combination of Illumina HumanOmniExpress and HumanExome arrays for the BBJ first cohort. For the BBJ second cohort, genotypes were determined using the HumanOmniExpressExome BeadChip; for the NCGG cohort, genotypes were determined using the Illumina AsianScreeningArray (the NCGG data were obtained from the NCGG Biobank database). Quality control (QC) was performed by removing individuals who withdrew consent, had call rates lower than 98%, gender mismatch or non-East Asian ancestry. Any samples overlapping with those in the reference panels were also removed. QC on variants excluded those with a call rate lower than 99%, fewer than five heterozygotes, extreme deviation from the Hardy–Weinberg equilibrium (P < 1 × 10−6) and palindromic variants. We also compared the array genotype and WGS to exclude variants with a concordance rate lower than 99.5%. After QC, the BBJ first cohort, BBJ second cohort and NCGG cohort were separately phased using SHAPEIT2 (ref. 41) (BBJ first cohort, v.2.837) or EAGLE2 (ref. 42) (BBJ second cohort and NCGG cohort, v.2.39), followed by whole-genome imputation using Minimac4 (ref. 43) (v.1.0.0).

For the ToMMo dataset, the array dataset in PLINK binary format (659,326 SNPs) and the imputed genotype dataset in the Oxford BGEN format (54,041,917 variants) for 53,365 study participants were obtained. The genotyping and imputation procedures have been described elsewhere40. All samples were genotyped using the Affymetrix Axiom Japonica array. After QC, autosomal variants were phased using SHAPEIT2 (v2.R837) and subsequently imputed using IMPUTE2 (v.2.3.2). We conducted further QC and excluded samples with (1) an array call rate lower than 97% or (2) non-Japanese ancestry identified using principal component analysis with all samples from the 1000 Genomes Phase III dataset. For variants, we excluded variants with an imputation INFO score lower than 0.3 from the downstream analysis. The final dataset consisted of 37,167,587 variants for 53,083 individuals.

Variant annotation

We used VEP44 v.87 to annotate the tested variants. To obtain a single annotation for a variant, we used the --pick option to prioritize annotation on the canonical transcript. rsIDs were assigned using VEP; if an rsID was not assigned, we annotated the variant as chromosome:position:reference allele:alternate allele. The summary and definition of variant annotation are summarized in Supplementary Table 15.

Quantitative phenotype curation, QC and normalization

Quantitative phenotypes were extracted from the BBJ participant health records. The NCGG phenotype data were obtained from the NCGG Biobank database. The ToMMo data were obtained from the ToMMo database. Raw phenotype data were filtered using the mean ± four s.d. Then, phenotype-specific corrections were applied as follows. For individuals taking a lipid-lowering agent, TC and LDL were divided by 0.8 and 0.7, respectively. For individuals taking antihypertensive agents, systolic and diastolic blood pressure were added (15 and 10 mmHg, respectively). Phenotype-specific exclusion criteria were also applied as follows. Individuals taking an antiuremic agent were excluded from the uric acid analysis; individuals taking warfarin were excluded from the prothrombin time analysis; and individuals with diabetes were excluded from the HbA1c and blood sugar analyses. The raw phenotypes were regressed and residualized according to age, sex and principal components (PCs) 1–10 used as covariates. Additionally, we introduced 47 target disease statuses for the BBJ first cohort, 38 target disease statuses for BBJ second cohort and prefecture of enrollment for the ToMMo cohort into the model6. Then, residuals were inverse-rank-normalized and used as quantitative phenotypes. After normalization, we conducted association analysis using the BOLT-LMM algorithm without covariates. The distributions of phenotypes are summarized in Supplementary Tables 1 and 2.

Quantitative and case-control association analysis and meta-analysis

For the quantitative phenotypes, we applied BOLT-LMM45 (v.2.3.4) for the single-variant association test in each cohort separately. When the model was not converged, we applied a linear regression model implemented in PLINK2 (ref. 46) software excluding related individuals (defined as PI-HAT > 0.25). For the X chromosome, males and females were tested separately and meta-analyzed using inverse-variance-weighted fixed-effect meta-analysis implemented in the METAL47 software. The applied model is summarized in Supplementary Table 1. Then, the results were meta-analyzed with the METAL software using an inverse-variance-weighted fixed-effect meta-analysis. After the meta-analysis, the variant with an overall MAF < 0.1% or P value for heterogeneity < 1 × 10−6 were excluded from the results. For the case-control analysis, we conducted the logistic regression analysis implemented in PLINK2 to associate the genetic dosage and case-control status registered in the BBJ first cohort, introducing age, sex and PCs 1–10 as covariates excluding related individuals (PI-HAT > 0.25).

LD Score regression

Lambda GC, LD Score regression intercept and its ratio were determined using the LDSC software48 (v.1.0.1). We used the LD Score calculated from the 1000 Genomes Project EAS individuals using the LDSC software.

Locus definition

Genome-wide significant loci were determined as follows: (1) extracting variants with P < 5 × 10−8; (2) adding a 5 × 105 base length to each position of these variants bilaterally; (3) merging any overlapping regions. For the variants located in the major histocompatibility complex region (defined as chromosome 6 coordinates from 25000000 to 35000000), 1× 106 base length was added to the position of variants with genome-wide significance. If the locus did not contain coordinates with previously reported genome-wide significant variants, the locus was annotated as a new.

Statistical fine-mapping

We applied FINEMAP49 (v.1.4) for each genome-wide significant locus. We used the meta-analysis results of the primary datasets (BBJ first, BBJ second and NCGG; Supplementary Note 1). We uniformly used the genotype dosage of the first cohort of the BBJ to calculate LD matrices using the Ldstore software (v.2.0) as it was the largest cohort in this study. The maximum number of causal variants in the locus was used as ten in the first round. If the number of causal variants was estimated at ten, we reran FINEMAP using 20 as the maximum number of causal variants (Supplementary Note 3). To control fine-mapping quality, we first excluded 48 loci overlapping the major histocompatibility complex region (chromosome 6 25000000–35000000) because of its extensive LD structure50. In addition, we removed 16 loci where the causalities of the variants were not supported by the conditional analysis. In total, we completed statistical fine-mapping for 3,309 of the 3,390 genome-wide significant loci (97.6%). The marginal PPI was used for each variant throughout the study. Detailed processes are described in Supplementary Note 3. For the UKB, we downloaded the summary statistics generated previously (http://www.nealelab.is/uk-biobank) for 37 corresponding phenotypes. We used the LD matrix calculated using the dosage data for White British individuals in the UKB using LDstore. Otherwise, we defined the loci, ran FINEMAP and processed the output data as described for the BBJ.

Estimation of enrichment and PPI

For each PPI bin, the ORs of the variants annotated as ‘high’ or ‘moderate’ (Supplementary Table 15) by the VEP software to the variants annotated as ‘modifier’ were calculated in comparison with the lowest PPI bin (0–0.1) and tested using a Fisher’s exact test.

ClinVar annotation

We downloaded the VCF file from the ClinVar51 website (https://www.ncbi.nlm.nih.gov/clinvar, 27 January 2020). For each PPI bin, the ORs of variants with each level of clinical significance to variants with ‘benign’ annotation were calculated in comparison with the lowest PPI bin (0–0.1) and tested using a Fisher’s exact test.

Protein visualization

We used the PyMOL software (https://pymol.org/2) to visualize the three-dimensional (3D) structure of proteins. We obtained the 3D protein structures from the Protein Data Bank website (https://www.rcsb.org). The following accession codes were used for the visualization: 6KN8, 5LGD and 5DO7 for Troponin-T, CD36 and ABCG5/ABCG8, respectively.

Drug target

The list of genes encoding drug targets was defined using a previous report52. We counted the number of genes with high-PPI coding and noncoding variants overlapping such drug target genes. A Fisher’s exact test was used to estimate the OR and 95% CI. The P value was calculated by comparing genes with the highest PPI > 0.1 and highest PPI ≤ 0.1.

Protein–protein interactions

Protein–protein interaction data were downloaded from the STRING53 website (https://string-db.org/). High-confidence protein–protein interactions were determined using a combined score greater than 0.9. We counted the number of edges from each gene within this set of interactions. Then, we computed the mean number of interactions for genes in each PPI bin. A Wilcoxon rank-sum test was used to test the difference in the number of protein–protein interactions between genes with a gene PPI > 0.1 and genes with a gene PPI ≤ 0.1.

DeepSEA and disease model

We applied the DeepSEA-based disease impact score predicting model54 to all noncoding variants in genome-wide significant loci (n = 7,289,211, https://hb.flatironinstitute.org/asdbrowser/about). The baseline DeepSEA55 model returned the probability differences for 2,002 epigenetic features. Then, the disease impact score was estimated from these predicted probability differences as a single scalar value for each variant. We estimated the effect sizes of PPI on the disease impact score; we conducted linear regression modeling in the variants with the highest PPI in the loci as follows:

$${\mathrm{Disease}}\; {\mathrm{impact}}\; {\mathrm{score}} \sim {\beta }_{1}{\mathrm{minor}}\; {\mathrm{allele}}\; {\mathrm{frequency}}+{\beta }_{2}{\mathrm{PPI}}$$

Splice-AI

We downloaded the precomputed Splice-AI score (https://basespace.illumina.com). The precomputed score file contained all the substitutions around the exon–intron boundary, provided the delta score and predicted the position for the alternative splicing for these substitutions. We annotated all the variants in the genome-wide-significant loci using the score. As the cutoff, we applied a delta score greater than 0.2 (high sensitivity cutoff56).

Splicing assay

The precise method for the in vitro alternative splicing assay was described elsewhere57. Briefly, we cloned exon–intron–exon structures harboring reference and alternate alleles for the predicted cryptic splice variant on the minigene construct. Each construct was transfected into HEK 293T cells. After 24 h of incubation, RNA was extracted and sequenced using the Illumina MiSeq platform. Sequenced reads were processed using our open-source software (https://github.com/SplicingVariant/SplicingVariants_Beta) to quantify the number of non-splicing, normal splicing and aberrant splicing. We calculated the P value using a Fisher’s exact test by normalizing the analyzed reads to 100 for each allele. The oligonucleotide sequences used in this study are provided in Supplementary Table 24.

Regulatory element and tissue enrichment analysis

We obtained the definitions of DHS and CFP from the ENCODE3 projects58; then, we mapped the positions of these elements to the hg19 coordinates using the liftOver software. Next, we counted the overlap of variants with each regulatory element and calculated the OR of variants in each PPI bin to the lowest PPI bin. For each DHS–phenotype pair, we created a contingency table including: (1) variants with a high PPI (0.1–1.0] and located in the DHS of interest; (2) variants with a high PPI and not located in any DHS; (3) variants with a low PPI [0–0.1] and located in the DHS of interest; (4) variants with a low PPI and not located in any of the DHS. We tested the OR using a Fisher’s exact test. For the tissue enrichment analysis, P values were Bonferroni-adjusted; only the associations with an adjusted P < 0.05 were considered significant and are displayed in Extended Data Fig. 8.

For noncoding variants not in CFP or DHS, we tested the enrichment of functional annotations. The ORs of variants with a high PPI ((0.1–0.9] and (0.9–1.0]) in reference to the intergenic variants with a low PPI [0–0.1] were tested using a Fisher’s exact test.

Population-specific alleles defined using gnomAD

We download the site VCF files, including the allele frequency information, from the gnomAD59 website (v.2.1.1, https://gnomad.broadinstitute.org/downloads). We extracted the variants of interest; if a variant was found only in a single population, we defined it as a population-specific variant. We excluded variants found only in the Japanese WGS (current dataset) from the analysis.

Plasmids

To construct the luciferase reporter vector, the human IL6 3′ UTR sequence (1–428) was amplified using genomic DNA derived from HeLa cells as a template and was inserted into the pGL3-Promoter vector (Promega Corporation) using the In-Fusion HD Cloning Kit (Takara Bio). The rs13306436 point mutation was introduced using the QuikChange Lightning Site-Directed Mutagenesis Kit (Agilent Technologies). The regnase-1 expression vector was constructed by inserting the coding sequence of regnase-1 into the pcDNA3.1(+) vector (Invitrogen).

Luciferase assay

Both IL6 WT and IL6 rs13306436 mutant reporter plasmids were cotransfected with Renilla luciferase plasmid into HeLa cells using Lipofectamine 2000 (Invitrogen) according to the manufacturer’s instructions. A pGL3-Promoter vector without IL6 3′ UTR (empty) was used as the control. After 24-h incubation, cells were lysed and the luciferase activity was determined using the Dual-Luciferase Reporter Assay system (Promega Corporation). We further examined the luciferase activity under regnase-1 overexpression. The fold of repression due to regnase-1 was calculated by normalizing the luciferase level of regnase-1-overexpressing cells with that of empty vector transfected cells.

Statistical analysis of luciferase assay and enrichment for target genes of regnase-1

Data are presented as the mean ± s.d. Statistical significance was calculated with a Student’s t-test. The significance level at P < 0.05 (*) is shown. We analyzed the enrichment of genes where causal variants were in 3′ UTR for the target genes of regnase-1, which were experimentally validated28. We used a hypergeometric test for this enrichment (Supplementary Note 6.4).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.