Main

Studies of genetic regulation of mRNA expression (expression quantitative trait locus (eQTL) studies) is highly informative in interpreting associations between genetic variation and human diseases1,2. However, mRNA expression is a limited proxy of protein expression, which affects human phenotypes in a more direct manner3,4,5.

Analysis of genetic regulation of protein expressions (protein QTL (pQTL) studies) is gaining popularity6,7,8,9,10,11, owing to the development of high-throughput affinity-based assays that enable one-shot measurements of thousands of proteins in large-scale biobank cohorts. Sun et al.7 performed a pQTL study in 3,301 samples on 3,622 plasma proteins measured using SOMAscan and identified nearly 2,000 pQTLs. Zhang et al.8 performed large-scale pQTL fine-mapping in European and African populations and discussed druggability. The UK Biobank (UKB) Pharma Proteomics Project (PPP) has performed pQTL mapping in more than 50,000 samples12,13,14, uncovering common and rare variant contributions to variation in protein expression that do not necessarily involve the effect of eQTLs in major tissues.

Complementing such studies, having another layer of diversity by studying East Asian (EAS) populations would be promising because multi-cohort genetic and proteomic studies have been effective in identifying drug targets15,16,17,18. In addition, instead of using an external eQTL dataset to examine colocalization between eQTLs and pQTLs, building on earlier studies with multimodal measurements of mRNA and protein expression in the same sample set19 in this era of large-scale proteomics would be effective in identifying disease-causing variants in multiple layers as it minimizes the potential loss of discovery power due to differences in technical and clinical backgrounds.

In this study, we systematically characterized shared versus mRNA expression or protein-specific QTLs using a collection of 1,405 genotyped samples, whole-blood mRNA and 2,932 plasma protein expression data (measured using Olink Explore 3072) in the Japan COVID-19 Task Force (JCTF20,21), including 998 samples and 2,211 genes with both mRNA and protein measurements (Extended Data Fig. 1). Using statistical fine-mapping, we describe the distinctive features for each class of QTLs, show the relevance of protein-specific QTLs to complex traits and provide examples where protein-specific QTLs explain variant–disease associations. We then discuss the dynamics of mRNA and protein expression as a function of disease (COVID-19) severity. Finally, we characterize that eQTLs and pQTLs in complex linkage disequilibrium (LD) in the ABO locus might possibly create a negative correlation between mRNA and protein expression.

Results

An expanded catalog of fine-mapped whole-blood eQTLs

We identified 3,464 putative causal (posterior inclusion probability (PIP) > 0.9) cis-eQTLs from statistical fine-mapping of 165,410,953 variant–gene pairs in 1,019 samples. The power gain compared to the previous version of our fine-mapping call from 465 samples20 is represented by a 2.96-fold increase in the number of putative causal eQTLs (from 1,169 to 3,464; Extended Data Fig. 2a); the largely maintained quality of fine-mapping was validated by functional enrichment and consistency with external datasets (Extended Data Fig. 2b, and Supplementary Figs. 1 and 2). The list of splice QTLs (sQTLs) was also updated (Supplementary Fig. 3).

For further validation of statistically fine-mapped eQTLs, we performed a massively parallel reporter assay (MPRA22,23) targeting over 10,000 variants with nontrivial causal evidence in the previous call (PIP ≥ 0.1). The fraction of MPRA hits (Supplementary Data 1 and Methods) increased along with the PIPs. For example, 13.9% of variants with a PIP of 1 presented expression modifier effects in either or both K562 and HepG2 at a false discovery rate (FDR) of less than 0.1 (2.2-fold enrichment compared to 6.3% at a PIP < 0.1; Fisher’s exact test P = 2.5 × 10−6; Extended Data Fig. 2c). In addition, we observed increased concordance between the direction of the eQTL effects and the effects in the MPRA, along with the confidence in each metric. For example, when focusing on 60 variants with a PIP greater than 0.99 and an MPRA hit FDR lower than 0.01 (that is, tier 1) in either tissue, an effect direction concordance of around 80% was observed in both cell types (82.5% in K562 and 79.4% in HepG2 cells; Extended Data Fig. 2d). Taken together, these results provide orthogonal support for our fine-mapping results.

A fine-mapped pQTL resource from 1,384 EAS samples

Using the expression of 2,932 plasma proteins measured with the Olink Explore 3072 assay for 1,384 samples, we performed genome-wide cis-pQTL calling and fine-mapping (Supplementary Fig. 4 and Methods). More than 40% of the measured proteins (n = 1,191 proteins, 40.6%) had at least one variant with P < 5 × 10−8 (Fig. 1a). Statistical fine-mapping identified n = 582 putative causal (PIP ≥ 0.9) pQTLs.

Fig. 1: A fine-mapped pQTL resource from 1,384 EAS samples.
figure 1

a, Fraction of genes passing the minimum P threshold. b, Concordance of variant effect size in an external large pQTL dataset (the UKB PPP), stratified according to the population in the UKB PPP (row) and the PIP of our pQTL fine-mapping (column). For simplicity, only variants with P < 0.05 in the UKB PPP are included. c, Concordance of PIP with an external pQTL fine-mapping dataset that used the SOMAscan platform (ARIC)8.

To validate our pQTL calling, we compared our result with multiple external studies. First, we compared our results with the recent large-scale pQTL mapping results from the UKB PPP study12 and observed a high correlation in effect size estimates (Fig. 1b). The correlation increased along with the causal evidence (PIP) in our dataset; it was the highest when the population background matched (that is, when compared to n = 262 EAS samples in the UKB; Pearson r = 0.97 when PIP = 1; first row in Fig. 1b). The second highest correlation was observed in European (EUR) samples in the UKB (n > 40,000 samples; Pearson r = 0.93 when PIP = 1; second row in Fig. 1b), followed by African (AFR) samples in the UKB (n = 931; Pearson r = 0.85 when PIP = 1). The fraction of the lead variants in the UKB PPP dataset also increased along with the PIP (as an approximation of causality in the absence of fine-mapping data in the UKB PPP; Supplementary Fig. 5). Such concordance between two datasets, both measured using the Olink Explore 3072 assay, validate our pQTL fine-mapping. Lower consistency was observed when comparing with other datasets based on different technological platforms8,24,25,26 (Fig. 1c), as discussed in detail in Supplementary Figs. 6 and 7 and in Supplementary Note.

Functional characterization of fine-mapped pQTLs

To characterize fine-mapped pQTLs, we tested the enrichment of major functional annotations. For noncoding annotations, the 5′ and 3′ untranslated regions (UTRs) were significantly enriched for putative causal pQTLs (521.5-fold and 167.6-fold; Fisher’s exact test P = 8.8 × 10−77 and 5.3 × 10−86, respectively); for coding annotations, missense and predicted loss of function (pLoF) showed enrichments an order of magnitude higher (2109.2-fold and 8046.9-fold; Fisher’s exact test P < 10−100 for both), which is consistent with our understanding that pQTL effects could be driven by both transcriptional and posttranscriptional regulations (Fig. 2a).

Fig. 2: Functional characterization of fine-mapped pQTLs.
figure 2

a, Enrichment of major functional annotations along with pQTL PIPs. The missing dot in the category corresponds to n = 0. b, Enrichment of protein-level structure and localization annotations along with pQTL PIPs for missense variants. c, Concordance between the fine-mapped pQTL PIPs in our dataset and the ARIC dataset, stratified according to major functional annotations. d, Enrichment of missense variation in possibly causal (PIP > 0.1) pQTLs, with inconsistent effect direction between datasets. e, Enrichment of colocalization score between pQTLs and 49 fine-mapped eQTLs in the GTEx. *P < 0.05, **P < 0.05/49. a,b, n = 27,354,135, 164,350, 32,868, 4,257, 562, 508 and 81 variant–gene pairs are included in each category. c, n = 1,008, 1,304, 1,759 and 4,705 variant–gene pairs are included in each panel. e, All 27,556,761 variant–gene pairs are included. Those missing in the GTEx were omitted from the downsampling process (Methods).

Focusing on missense pQTLs, we tested the enrichment of protein structure and localization-related annotations (Fig. 2b). Our analysis highlighted two factors: (1) putative causal pQTLs were enriched for protein domains that form an alpha-helix or beta-sheet structure (2.48-fold and 1.96-fold; Fisher’s exact test P = 7.6 × 10−6 and 1.3 × 10−3, respectively) and (2) they were enriched for extracellular domains (or depleted for cytoplasmic domains; 1.43-fold and 0.23-fold; Fisher’s exact test P = 4.5 × 10−3 and 1.8 × 10−3, respectively). These observations suggest that missense mutations could show pQTL effects by affecting protein structure and stability, especially in an extracellular environment, as discussed in ref. 9, who reported an enrichment of secreted protein-related features in genes harboring pQTLs (Supplementary Fig. 7).

On the other hand, as protein expression in our study was measured with an affinity-based assay using antibodies specific to each protein, the observed pQTL effects might reflect differences in binding affinity introduced by genomic variations, rather than differences in physiological protein abundance. To evaluate such potential ‘epitope effects’, we investigated the replication rate of fine-mapped pQTLs stratified according to missense and other functional annotations (ARIC8 instead of the UKB PPP was used because fine-mapping data were only available for ARIC). Missense variations were the most enriched functional annotation, even when focusing on possibly causal pQTLs replicated in the ARIC study (PIP > 0.1 in both studies; Fig. 2c and Supplementary Fig. 8). On the other hand, replicated possibly causal missense pQTLs often presented opposite effect directions (Fig. 2d; 13 of 57 variants). A possible model derived from these observations is that a missense variant increases affinity to an antibody in one platform while decreasing affinity to the probe in another platform; thus, the ‘true’ effect size and direction in the physiological context is uncertain, warranting future large-scale experimental validation of pQTLs as done in the MPRA for eQTLs.

A subset of causal pQTLs in noncoding regions possibly presents pQTL effects caused by transcriptional regulation in related tissues. To characterize such pQTL variants as colocalizing27 with eQTL variants, we compared our fine-mapping results with the eQTL fine-mapping results from the Genotype-Tissue Expression (GTEx) project v.8 (ref. 2) in 49 major tissues (Supplementary Fig. 9a,b and Methods). The liver presented the highest enrichment, followed by whole blood and spleen, which is consistent with our understanding that a large proportion of plasma proteins are secreted from the liver28, as well as hematopoietic organs (2.44-fold, 2.15-fold and 1.92-fold, respectively; P = 2 × 10−3 for the spleen; Fig. 2e). The testis and most brain-related tissues were significantly depleted for colocalization (for example, 0.33-fold, P = 1 × 10−3 for the testis), which is consistent with the previous literature29.

Contrasting eQTLs and pQTLs highlights distinct functional features

Although colocalization between eQTLs and pQTLs has been investigated previously6,7,9,12,19,30,31, when eQTL and pQTL data are from different cohorts, the correlation between two association statistics can be low because of LD structure differences even when the underlying causal variants are the same. Calling eQTLs and pQTLs from the same samples is better suited to colocalization analysis as LD contributes equally to both QTL calls. Thus, we used n = 998 samples and 2,211 genes where both RNA and protein expression profiles existed, fine-mapped the eQTLs and pQTLs within the 998 samples once more and compared their effect sizes (Fig. 3a). While the correlation was relatively high for putative causal eQTLs and pQTLs (Fig. 3a, bottom right; r = 0.80 when both eQTL and pQTL PIPs are greater than 0.9), there were many fine-mapped eQTLs and pQTLs that are probably causal in either mRNA or protein expression in the blood but not both (for example, top-right and bottom-left in Fig. 3a).

Fig. 3: Characterization of mRNA-specific and protein-specific or colocalizing QTLs and genes.
figure 3

a, Correlation between eQTL and pQTL effect sizes in our dataset, stratified according to the PIPs from the fine-mapping of each QTL. b, Fraction of fine-mapped sQTLs for different QTL categories. c, Enrichment of major functional annotations for mRNA-specific or protein-specific QTL PIP bins. d, Number of genes harboring mRNA-specific or protein-specific or colocalizing QTLs. e, Distribution of constraint score (LOEUF) and number of mRNA-expressing (TPM > 1) tissues in the GTEx for different QTL categories. f, Ridge plot showing the distribution of correlation between mRNA and protein expression in 998 samples, stratified according to the percentage of whole-blood expression in the GTEx. g, Fraction with colocalization evidence according to varying thresholds, stratified according to the percentage of whole-blood expression in the GTEx. Each stratum contains 1,429, 510 and 272 (174 + 55 + 43) genes, as visible in f. h, Locus zoom around the NDST1 gene, as an example of blood mRNA-specific regulation on a constrained gene.

Further, we created mRNA-adjusted protein expression and protein-adjusted mRNA expression profiles, and performed adjusted mRNA or protein expression QTL19 calling followed by statistical fine-mapping (Supplementary Fig. 10). Using a catalog of fine-mapped eQTLs, pQTLs, protein-adjusted eQTLs and mRNA-adjusted pQTLs, we classified putative causal variant–gene pairs into three classes: (1) ‘colocalizing QTLs’, confidently acting on both mRNA and protein expression; (2) ‘mRNA-specific QTLs’; and (3) ‘protein-specific QTLs’ (Methods). Nearly half of colocalizing QTLs are probably explained by splice variation (17 of 42 (40.5%) had an sQTL PIP > 0.9, in contrast to 20.2% for mRNA-specific QTLs and 6.4% for protein-specific QTLs; Fisher’s exact test P = 5.3 × 10−3 and 5.3 × 10−8; Fig. 3b). When focusing on noncoding annotations, mRNA-specific QTLs were enriched for transcription factor binding sites and marginally for 5′ UTR variants (1.96-fold and 1.91-fold compared to protein-specific QTLs; P = 1.2 × 10−2 and 5.3 × 10−2 testing the difference of the odds ratio). The enrichment level of 3′ UTR variants for mRNA and protein-specific QTLs was comparable (67.8-fold and 86.0-fold, P > 0.05), supporting the notion that RNA sequencing (RNA-seq)-based methods for testing the effects of 3′ UTR variants32,33, which often involve posttranscriptional modifications, are harder than those of 5′ UTR variants. For coding annotations, missense variants were strongly enriched for protein-specific QTLs as expected (8.95-fold compared to mRNA-specific QTLs, P = 1.1 × 10−15; Fig. 3c and Supplementary Fig. 11a).

Buffering of blood mRNA expression variation in constrained genes

Turning to a gene-centric view, we classified 557 genes carrying variants belonging to one (or, rarely, more than one) of the three QTL classes defined above (Supplementary Data 2 and Fig. 3d). We then examined their mRNA expression level across tissues in the GTEx, as well as their constraint from the Genome Aggregation Database (gnomAD) (LoF observed/expected upper bound fraction (LOEUF)34; Fig. 3e and Supplementary Fig. 11b). Genes carrying mRNA-specific QTLs had a higher number of expressed tissues in the GTEx (expressed in 46 versus 41 or 34 tissues on average, t-test P = 6.3 × 10−4 and 1.4 × 10−3), and were more highly constrained compared to those carrying colocalization QTLs or protein-specific QTLs (mean LOEUF = 0.90 versus 1.29 or 1.02, t-test P = 1.7 × 10−5 and 4.0 × 10−3). Our observation suggests that, in highly constrained and ubiquitously expressed genes, transcriptional regulation in the blood is more likely to be buffered in plasma, presumably resulting in maintained functionality.

Limited colocalization between eQTLs and pQTLs

We next shifted our focus to the intersection of eQTL and pQTL effects (that is, colocalization). In our QTL classification, even with a colocalization posterior probability (CLPP) threshold of 0.1 (Methods), the number of genes carrying possibly colocalizing QTLs were far fewer than those carrying mRNA or protein-specific QTLs (68 versus 260 or 158). Using an alternative method (Mendelian randomization; Supplementary Fig. 12 and Methods) also suggested limited overlap; we explored possible reasons for such a low colocalization level.

First, we tested colocalization with a larger blood eQTL dataset (eQTLgen) to evaluate the potential power loss attributed to a low sample size. Even with a lenient assumption that a significant P value in eQTLgen is a sign of colocalization (in reality, many are noncausal, tagged variants), the colocalization level remained low (only less than 25% were resolved; Supplementary Fig. 13a–d). Second, we compared our colocalization detection method with another state-of-the-art method (SuSiE-coloc). While more colocalization could be declared depending on the threshold setting, the overall fraction of genes with colocalizing evidence remained low (<15% at a PP.H4 = 0.1 threshold; Supplementary Fig. 13e,f).

Last is the contribution of tissues other than blood. We stratified the gene sets into distinct classes based on the mRNA expression level in whole blood relative to other tissues in the GTEx and observed that genes with a higher percentage of mRNA expression had a higher correlation of blood mRNA and plasma protein expression (average Pearson r = 0.26 for genes with more than 50% expression in the blood versus 0.021 for genes with less than 1% expression, t-test P = 1.2 × 10−8, where percentage expression was defined as the transcripts per million (TPM) in whole blood divided by the sum of TPM across 54 tissues in the GTEx; Fig. 3f). Such differences in the mRNA expression level in whole blood presented major differences in colocalization level (Fig. 3g). Only 5% of lowly (<1%) mRNA-expressed genes had colocalization evidence at a 0.1 CLPP threshold, while over 20% did for highly (>5%) mRNA-expressed genes. Considering additional biological properties, such as active secretion of proteins into the bloodstream (Supplementary Fig. 13g,h) or testing colocalization with fine-mapped variants in the GTEx across 49 tissues also increased the colocalization signal (from approximately 5% to over 10%; Supplementary Figs. 9c and 13i). Nevertheless, the fraction of genes with colocalization evidence remained low, even with the most lenient threshold (<33% at an ultra-lenient 0.001 CLPP threshold across GTEx tissues).

Thus, our observations are in line with previous reports19 that only a sizable fraction of plasma pQTLs are mediated through eQTL effects in major tissues.

Examples of limited colocalization between eQTLs and pQTLs

As an example of blood mRNA-specific regulation on a constrained gene, we highlight rs12519827 (chr5:150488763:G:A), an intronic variant on the NDST1 gene (Fig. 3h). The variant has a strong negative effect on blood mRNA expression (P < 1 × 10−200, PIP = 1, β = −0.68) but little or no effect on the protein expression level in plasma (P = 0.41, PIP = 0). NDST1 is a nonsecreted protein, highly constrained (LoF observed/expected = 3/41.1 = 0.07 in gnomAD) and mouse homozygous knockout lethal35 with known association with severe intellectual developmental disorder36. Associations with complex traits did not reach genome-wide significance in the Biobank Japan (BBJ)37,38 (P = 2.6 × 10−7 for basophil count and P > 1 × 10−4 for others). Notably, the mRNA expression of NDST1 in whole blood was the lowest among all tissues in the GTEx (median TPM = 8.9); the eQTL effect of rs12519827 is highly specific to whole blood (P = 1.3 × 10−97 in whole blood but more than 5 × 10−8 in all other tissues), suggesting that the whole-blood-specific eQTL effect has only a negligible effect and is buffered by relatively high and stable mRNA expression in other tissues relevant to plasma protein expression. As another example of a gene with limited colocalization evidence, we highlight the unresolved ALDH2 locus39,40,41,42 in Supplementary Note, Supplementary Fig. 14 and Supplementary Table 1.

pQTLs may be more relevant to complex traits than eQTLs

The observation that variation in mRNA expression in constrained genes is often buffered suggests possible differences in contribution to complex traits. We used large-scale fine-mapping results for 79 traits from the BBJ to quantify the level of trait-causal variant colocalization (Fig. 4a, Supplementary Fig. 15 and Supplementary Data 3). Protein-specific QTLs had significantly higher enrichment compared to mRNA-specific QTLs (7.6-fold versus 5.6-fold, P = 2.7 × 10−3, ratio of means test for PIP > 0.01 combined), suggesting that alteration in protein expression is more relevant to complex traits. We replicated these observations using the fine-mapping results for 96 traits in the UKB38,43, stratified according to trait categories, validating that protein-specific QTL enrichment is not simply driven by a specific biological trait (protein-specific QTL enrichment was greater; P < 0.05 except for psychological traits; Fig. 4b). Enrichment is not likely to be driven by simple measurement traits such as blood cell count or lipid levels as enrichment persisted in several major disease and complex traits, although not universal to all traits as the opposite enrichment pattern was observed for body mass index (Supplementary Fig. 15a). Stratification according to the percentage of mRNA expression of the genes suggested that the power gain of protein-specific QTLs is mainly from genes that are relatively lowly mRNA-expressed in the blood (Fig. 4c and Supplementary Fig. 15d), highlighting the value of the multi-tissue origin of plasma proteins. Comparison with nonspecific QTLs is more nuanced, as described in detail in the Supplementary Note.

Fig. 4: Complex trait colocalization.
figure 4

a, Number of colocalizing genes for each of the major BBJ trait–QTL pairs, along with gene examples. The color coding of the genes corresponds to the QTL classifications. b, Enrichment of complex trait PIPs from the UKB according to trait category, for variants with putative mRNA-specific or protein-specific causal QTL effects. c, Enrichment of complex trait PIPs from the UKB, stratified according to the percentage of whole-blood expression in the GTEx. d,e, Specific examples where pQTLs colocalize with complex trait-causal variants (TNFRSF11A (d) and APOE (e)). b, n = 3,206 protein-specific and n = 5,308 mRNA expression-specific QTLs passing the PIP > 0.1 threshold were included. c, Those QTLs were further divided into n = 3,181, 1,344 and 799 mRNA-specific QTLs and n = 1,908, 723 and 585 protein-specific QTLs in each whole-blood mRNA expression level stratum (from low to high), after removing the ones where GTEx data were missing. AID, autoimmune disease; AG, albumin to globulin ratio; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; CAD, coronary artery disease; CRP, C-reactive protein; eGFR, estimated glomerular filtration rate; GD, Graves’ disease; GGT, γ-glutamyltransferase; HDLC, high-density lipoprotein cholesterol; LDLC, low-density lipoprotein cholesterol; MAP, mean arterial pressure; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; Meno, age at menopause; MI, myocardial infarction; NAP, nucleosome assembly protein; RA, rheumatoid arthritis; RBC, red blood cell; SBP, systolic blood pressure; sCr, serum creatinine response; T2D, type 2 diabetes; WBC, white blood cell.

As specific examples, we highlighted three scenarios where plasma pQTL fine-mapping is beneficial in addition to eQTL fine-mapping in whole blood for complex trait-causal variant interpretation: (1) a 3′ UTR variant possibly involved in posttranscriptional regulation (rs884205 on the TNFRSF11A gene; Fig. 4d); (2) a missense variant probably affecting protein stability (rs429358, the well-known constituent of the APOE-ε4 allele44; Fig. 4e and Supplementary Fig. 16); and (3) mRNA expression regulation taking place in non-blood tissue (rs13395911 on EFHD1 and rs28372783 on CDH1; Supplementary Fig. 15h,i), with detailed interpretation in Supplementary Note.

Characterizing the trans-pQTL effects

The regulatory effects of genetic variants on distal genes (that is, trans-regulation) are much lower than those on a nearby gene (that is, cis-regulation). Thus, identifying genome-wide trans-regulatory variants requires a sample size much larger than that of a canonical cis-QTL analysis45,46,47,48. In this study, we used two different approaches to maximize the prior and to control for the multiple test burden (Supplementary Note).

First, we focused on the lead cis-pQTL (that is, the variant with the lowest cis-pQTL P value) for each gene and tested their genome-wide trans-effects. We observed a distribution shift in the test statistics for the lead cis-variant in trans-pQTL effects compared to random variants (Fig. 5a), validating the strategy. Notably, we replicated the observation of a trans-pQTL ‘hotspot’ in the ABO gene, regulating the expressions of 26 genes genome-wide (at a Bonferroni significance; P < 1.1 × 10−8; Fig. 5b).

Fig. 5: The trans-QTL analysis.
figure 5

a, Q–Q plot comparing the P distribution when testing trans-pQTL effects for random variants (blue) or lead cis-pQTL variants (orange). b, Overview of the trans-pQTL effects. c, Miami plot for genome-wide trans-eQTL and trans-pQTL effects of variation in HLA genes. Genes passing the suggestive P < 1 × 10−5 threshold have been enlarged and annotated (in bold if the gene had both mRNA and protein measurements). The largest association in eQTL effect for RNF5P1 was probably driven by a sequencing error61. In the plot, we omitted chromosome X, which contained only one significant pQTL association (CFP; P = 9 × 10−5), for visual simplicity.

Second, we focused on variation in the classical alleles of the HLA49 genes (and MICA, an HLA-like gene located within the major histocompatibility complex (MHC); Fig. 5c) and examined the genome-wide trans-eQTL and trans-pQTL effect. This identified significant associations (P < 1 × 10−5) for 42 genes (Supplementary Data 4), including those between class 1 HLA alleles (especially, HLA-C) and killer immunoglobulin-like receptor (KIR) genes (KIRDL2 at the mRNA level and KIRDL3 at the protein level; P = 7.2 × 10−9 and P = 1.2 × 10−25), providing a genetic basis for the well-known molecular interaction between KIR and class 1 HLA at the protein–protein level50,51,52.

For HLA, comparing trans-mRNA-regulated and protein-regulated genes, there was zero overlap at the threshold of P < 1 × 10−5, potentially suggesting that plasma pQTL analysis better captures the dynamics of immunological responses across cells and organs led by the variation in HLA alleles, whereas the blood transcriptome is more narrowly focused on peripheral immune cell responses.

Buffering of context-specific (severe COVID-19) eQTL effects

Our study cohort was ascertained for COVID-19-positive cases, classified into four levels of severity. We used this phenotype to understand the dynamics of mRNA and protein expression in response to infectious diseases.

First, we observed an increase in the correlation between mRNA and protein expression in patients with severe COVID-19 (Fig. 6a and Supplementary Fig. 17), presumably because of the natural immune response triggering an increase in relevant genes at both the blood mRNA and plasma protein level. As an example, interleukin-1 receptor-like 1 (IL1RL1), a well-known chemokine, had high correlation specifically in the severe and most severe states (Fig. 6b; r = 0.50 in the most severe disease and 0.08 in the asymptomatic state). In addition, the shared biological entities of blood mRNA and plasma protein dynamics were visible at the module level (Supplementary Fig. 18).

Fig. 6: COVID-19 severity interaction eQTLs (ieQTLs) and interaction pQTLs (ipQTLs).
figure 6

a, Increase of correlation between mRNA and protein expression in severe COVID-19 across a ranging fraction of whole-blood expression in the GTEx. The boxes inside the violin plots show the 25%, 50% and 75% quantiles. b, An example of increased correlation between mRNA and protein expression along with COVID-19 severity (IL1RL1 gene). c, Scatter plot comparing the significance (−log10(P)) of ieQTLs (x axis) and ipQTLs (y axis), colored according to the rank of expression in lymphocytes in the GTEx, when significant. d, Scatter plot showing the effect of rs77767746 on IL1RL1 mRNA and protein expression. e, Scatter plot showing the effect of rs11202918 on FAS mRNA and protein expression. f, Scatter plot showing the effect of rs11055602 on CLEC4C mRNA and protein expression.

Then, we investigated the genetic regulation landscape in several different infectious disease statuses53. Specifically, we performed COVID-19 severity interaction QTL (iQTL) analysis as described previously in20 and compared the interaction landscape at the mRNA and protein levels (Supplementary Figs. 19 and 20 and Supplementary Data 5). The interaction between disease status and pQTL effects was less significant compared to eQTL effects (Fig. 6c–f). We attribute this to buffering from high and stable expression of mRNAs in other tissues, such as lymphocytes (Fig. 6c), the dynamics of solid components missing in plasma and the limitation of detection (LOD) specific to the quantification of affinity-based protein expression (Supplementary Fig. 21), with further details described in Supplementary Note.

Interpreting a negative correlation between mRNA and protein expression

Consistent with previous studies54,55, the correlation between mRNA and protein expression was limited in our dataset (Fig. 3f and Supplementary Fig. 17a,b; mean Pearson r = 0.047); a large fraction of genes had a negative correlation between mRNA and protein expression (866 of 2,211 (39.1%)). While multiple mechanisms could contribute to such a negative correlation56, we highlight ABO (r = −0.26; Fig. 7a), where multiple distinct causal variants on the same gene acting at different stages of the central dogma in the opposite direction (eQTLs and pQTLs; Fig. 7b) might be a potential cause of the negative correlation. The ABO gene has partially overlapping eQTL and pQTL peaks with a different effect direction; statistical fine-mapping identified five distinct putative cis-causal variants (three separate eQTLs plus two pQTLs, hereafter called e1, e2, e3, p1 and p2, with further details shown in Supplementary Table 2). Although p1 had a significant positive effect on protein expression and a negative effect on mRNA expression (Fig. 7c), controlling the LD with e1 and e2 (and e3, which is in near-perfect LD with e2) resolved the negative effect on mRNA expression. Similarly, although e2 had a significant positive effect on protein expression and a negative effect on mRNA expression (Fig. 7d), controlling the LD with p1 nearly resolved the negative effect on protein expression. Functional annotations of these variants further supported the distinct causal mechanisms. p1 (chr9:133257521:T:TC, rs8176719) is a frameshift variant on exon 6, thereby acting during translation, whereas e1 (chr9:133252214:G:A, rs9411476) and e2 (chr9:133301911:T:C, rs656105) had medium-to-high sQTL evidence (sQTL PIP = 0.97 and 0.11, respectively, although not canonical splice sites), that is, in the earlier stage of the central dogma and more subject to buffering at the multi-tissue level. For example, a stable supply of the specific mRNA isoform in tissues other than blood may dominate the observed eQTL and sQTL effect on blood mRNA and possibly mask the effect of the variant at the plasma protein level. These observations highlight a complex regulation landscape of ABO mRNA and protein expression by multiple entangled causal variants that form a negative correlation between mRNA and protein expression in the blood.

Fig. 7: Negative correlation between whole-blood mRNA and plasma protein expression in ABO.
figure 7

a, Density scatter plot showing negatively correlated mRNA and protein expression for the ABO gene. b, Locus zoom around the ABO locus containing two putative causal eQTLs and three putative causal pQTLs, and visualization of LD between those five causal variants. c, Normalized protein (x axis) and mRNA (y axis) expression for the rs8176719 genotype, with and without controlling for LD with two nearby putative causal eQTLs. d, Normalized protein (x axis) and mRNA (y axis) expression for the rs656105 genotype, with and without controlling for LD with another nearby putative causal pQTL.

Discussion

In this study, we presented a collection of whole-blood mRNA and plasma protein expression data from 1,405 genotyped Japanese samples. Our eQTL analysis presented an expanded catalog of 3,464 fine-mapped putative causal eQTLs at single-variant resolution, including 932 validated by an MPRA, thereby allowing functional prioritization of putative causal mRNA regulatory variants even in the case of tight LD. Our pQTL analysis of 2,932 proteins presented a catalog of 582 fine-mapped putative causal pQTLs, allowing us to carry out detailed functional characterization, ranging from mediation by eQTL effects in several tissues (for example, liver or spleen) to protein structure disruption.

Combining mRNA and protein expression for 998 samples, we compared mRNA-specific or protein-specific versus shared regulatory effects to highlight distinct characteristics, such as enrichment of sQTLs for shared regulatory variants and higher constraint for genes specifically regulated by mRNA. We reported a limited level of colocalization between causal eQTLs and pQTLs, attributed to fundamental differences in mRNA versus plasma protein in their origin and biological property. Furthermore, we reported a higher proportion of trait-causal and disease-causal variant colocalization for protein-specific QTLs compared to mRNA-specific QTLs, especially in genes with a low mRNA expression fraction in the blood, as well as clear differences in the trans-regulatory landscape of class I HLA variations and their connection to KIR families, all highlighting the value of plasma protein expression studies on top of blood mRNA expression studies.

We also showed that the interaction of QTL effects with COVID-19 severity was milder for protein expression compared to mRNA expression, probably because of active expression of mRNA in lymphocytes regardless of mRNA regulation in the blood, as well as the limited dynamic range of affinity-based protein measurement (LOD). Finally, taking the ABO locus as an example, we showed that negative correlation could arise due to the LD of distinct nearby causal eQTLs and pQTLs, warranting the need to consider the combination of variants57 acting on different layers of regulatory mechanisms.

The limitations of our study include: (1) the existence of measurement-specific effects by altering epitope-binding sites (epitope effects). Although our analysis using multiple cohorts spanning multiple technological platforms suggested a certain level of consistency of pQTL architecture, it also suggested that epitope effects have a role in inconsistent effect size estimations, especially in missense variants; (2) the existence of measurements below the LOD, which probably decreased the power of pQTL calling, especially when the LOD fraction correlated with the biological properties of the sample (for example, COVID-19 severity); and (3) the fact that our pQTL analysis was restricted to approximately 3,000 genes where measurements were available.

With refined understanding of epitope effects, identification of larger number of putative causal pQTLs by increasing the study size and the integration of more detailed machine learning-derived variant features, such as effects on folding58,59, we envision a future where functional priors specific for pQTLs similar to our previous work on eQTLs60 could be leveraged to better understand protein regulation. Finally, although we presented an example where nearby eQTLs and pQTLs with an opposite effect direction exist in LD, a systematic, genome-wide evaluation of such QTL ‘entanglement’ is yet to be performed. Digesting regulatory mechanisms in light of multiple regulatory layers, including splicing, isoform level expression and the dynamics of mRNA and protein turnover, would be important for several aspects, such as estimation of disease heritability mediated at different layers of the central dogma or developments of drugs targeting mRNA or protein expression specifically.

Our study, leveraging a rich dataset of 1,405 EAS individuals with genomics, transcriptomics and proteomics measurements, serves as an important step forward for deciphering the complex multiomics landscape of human genetic variation in health and disease. All the eQTL and pQTL summary statistics, and the MPRA results, are publicly available via the Japan Omics Browser v.0.1 (https://japan-omics.jp/).

Methods

Ethics

We complied with all relevant ethical regulations. This study was approved by the ethics committees of the Keio University School of Medicine, the Osaka University Graduate School of Medicine and affiliated institutes. Written informed consent was obtained from all participants.

The JCTF

The study participants were recruited through the Japan COVID-19 Task Force (JCTF), which is described in detail in 21. In this study, 1,405 genotyped samples from the JCTF cohort presenting varying levels of the COVID-19 phenotype at the time of recruitment and passing stringent quality control (QC) steps (as described in detail in the next sections) were analyzed. COVID-19 severity was categorized according to four levels: most severe for patients in the intensive care unit or requiring intubation and ventilation (n = 501); severe for others requiring oxygen support (n = 494); mild for other symptomatic patients (for example, those with shortness of breath; n = 332); and asymptomatic for those without COVID-19-related symptoms (n = 78). In 1,405 genotyped samples, mRNA expression in the whole blood was measured with RNA-seq in 1,019 samples, protein expression in the plasma was measured in 1,384 samples and 998 samples were at the intersection.

Genotyping, RNA-seq and protein expression measurement

Genotyping was performed using an Infinium Asian Screening Array (Illumina). Stringent sample and variant-level QC filters were applied (for example, sample call rate greater than 0.98, variant call rate greater than 0.99), resulting in n = 1,405 samples and a total of n = 13,355,923 variants (n = 502,364 genotyped and n = 12,853,559 imputed). For imputation, we extended our in-house and population-specific imputation reference panel; the extended, Japanese-specific reference panel included n = 4,561 whole-genome sequenced (WGS) data from multiple studies (for example, n = 1,939 from the BBJ study39 and n = 141 WGS data from 62), and was higher in number and population specificity compared to the imputation panel used in our previous transcriptomics study of the JCTF20, which consisted of WGS data from 1,037 Japanese samples63 plus those from the 1000 Genomes Project. We expected potential bias because the disease-ascertained nature of the WGS data in the reference panel was relatively low, as we observed an improvement in the eQTL fine-mapping results when compared to using the previous reference panel (Supplementary Fig. 1), suggesting a major benefit of a larger sample size and population specificity (Supplementary Note).

We lifted over the hg19 genotypes to hg38 using the Genome Analysis Toolkit (GATK) LiftoverVcf tool, and filtered out those without unique mapping. For the downstream QTL calling and fine-mapping analyses, we applied additional filtering based on the minor allele count (MAC) and imputation quality, while using a relatively lenient threshold (MAC > 2 and imputation R2 > 0.6). The loose threshold setting was based on recent simulation-based observations38 that fine-mapping benefits more from including low-frequency variants even with a relatively limited quality, and was validated with functional enrichment analyses (although warning us that low-frequency variants could be slightly enriched for false positives; Supplementary Fig. 2i–l and Supplementary Note).

RNA-seq was performed using the NovaSeq 6000 platform (Illumina) with paired end reads (read length of 100 bp), using the S4 Reagent Kit (200 cycles). Plasma protein expression was measured using the Olink Explore 3072 platform. The QC steps are described in the next sections.

eQTL and sQTL fine-mapping

We followed our previous pipeline for eQTL calling in principle, which is based on the GTEx pipeline and described in detail in 20. To quantify mRNA expression, RNA-seq data were first aligned to the hg38 human reference genome using STAR v.2.5.3a. Transcripts were quantified using RSEM v.1.3.0. The following criteria were applied for sample QC: 0.5 × 108 < number of mapped reads < 3 × 108, mapping rate > 0.97, intergenic rate <0.05, rRNA rate <0.05, base mismatch rate <0.005 and intersample correlation deviation measure > −15 (the threshold was slightly different from the GTEx pipeline or our previous pipeline to fit the observed data distribution; Supplementary Fig. 22). mRNA expression was then trimmed mean of M component (TMM)-normalized while low-expression data were filtered out as in the GTEx pipeline. An eQTL call was performed using fastQTL, including 60 PEER factors, sex and five genotype principal components as covariates. Potential variations originating from technical factors, such as RNA integrity number, were thought to be captured by the PEER factors (Supplementary Fig. 23). Performing eQTL mapping for each disease severity stratum and combining the test statistics using a fixed effects model yielded generally consistent results (Supplementary Fig. 24).

Fine-mapping of eQTLs and other QTLs was performed based on our previous pipeline (that is, using FINEMAP v.1.3.1 and susieR v.0.11.43 with default parameters, where the inputs were the summary statistics and in-sample covariate-adjusted LD matrix), with two minor changes. First, we changed the number of single effects in the SuSiE model from ten to five, to be consistent with the default parameter in FINEMAP. This change in number led to slightly conservative but largely consistent results, as previously observed in ref. 20. Second, we performed fine-mapping of all genes regardless of their minimum P value (whereas in a previous version20, we restricted this to genes with a minimum P < 5 × 10−8), as we had a larger sample size and higher power. This could result in a small but limited number of false positives (described in detail in ref. 64). Throughout the article, we report PIP as the minimum of the output from FINEMAP and SuSiE, and define putative causal QTLs as PIP greater than 0.9.

sQTL calling and fine-mapping were also performed mainly based on the previous pipeline (that is, the splicing level was quantified using LeafCutter v.0.2.7 and 15 PEER factors were included) with the changes described above and the following additional modifications to reduce possible false positive calls due to noisy annotation-free splice variant quantification: (1) we defined the cis-window to be ±0.1 Mb from the center of the intron cluster; and (2) we omitted several additional filters from the GTEx pipeline (WASP and z-score-based filtering). Instead, we applied an additional stringent filtering step after nominal sQTL calling, which filtered out intron clusters with a minimum P > 5 × 10−8, analogous to our eQTL fine-mapping in a smaller sample20.

pQTL fine-mapping

The Olink Explore 3072 platform quantifies the expression of each protein in a normalized scale (normalized protein expression (NPX)). As measuring proteins was separated into three batches for logistical reasons, we bridge-normalized the NPX values using the OlinkAnalyze R package65, using 16 intersecting samples as bridging samples. The distribution of COVID-19 severity was similar between batches (Supplementary Fig. 4). To be consistent with the eQTL fine-mapping pipeline and other major pQTL studies using Olink data, we further inverse-normal transformed the bridge-normalized NPX matrix. Samples with QC warning flags were removed. We did not apply additional sample QC after confirming that there were no outlier and major batch effects in the first two principal component spaces (Supplementary Fig. 4) after bridge normalization. Although we observed a high percentage of measurements below the LOD for a subset of genes, we followed the guideline on the Olink website (see ‘How is LOD estimated for Olink Explore 3072/384’ and ‘Explore HT and what is recommended downstream usage?’ at https://olink.com/faq) and did not explicitly remove or adjust those entries. We instead evaluated the potential effect of samples below the LOD post hoc and found no major bias introduced by the inclusion of samples below the LOD (Supplementary Fig. 21 and Supplementary Note).

The gene names from the Olink platform were converted into canonical gene names and Ensembl gene IDs based on gencode v.30. For the assays corresponding to multiple gene names separated by an underbar ‘_’ (for example, CGB3_CGB5_CGB8), we empirically split the entry into multiple entries with an identical value (in the case above, we interpreted that three proteins, CGB3, CGB5 and CGB8, had the same measurement value). Small numbers of genes where any gene name alias in GeneCards (https://www.genecards.org/) did not match the gene name in gencode v.30 were excluded from the analysis. When a gene name was mapped to multiple Ensembl IDs, we exploded the matrix and included all the Ensembl IDs separately. When multiple measurements (either multiple gene names or assays) were mapped to an Ensembl ID, we collapsed the measurements by taking the mean. In all such cases, the differences between measurements were minimal, limiting the bias introduced by this step (Supplementary Fig. 4).

After creating the protein expression matrix as described above, QTL calling and fine-mapping were performed according to the same steps as in eQTL fine-mapping (that is, sex, five genotype principal components and 60 PEER factors aimed at capturing sample-to-sample technical variation, each recalculated within the genotype and expression matrix), except that the step corresponding to filtering the genes based on TPM and the number of samples with nonzero TPM did not exist. Thus, 2,932 proteins were included in QTL calling. We acknowledge the slight difference in the number of unique proteins measured compared to a recent large-scale analysis using the same Olink Explore 3072 platform, such as the UKB PPP, because of minor differences in the data processing step, as described above. The biological properties of these 2,932 genes compared to all coding genes; for example, enrichment in inflammatory functions, as noted on their website (https://olink.com/products/olink-explore-3072-384) are summarized in Supplementary Fig. 25.

pQTL replication

The UKB PPP data were downloaded from the Synapse portal (https://www.synapse.org/#!Synapse:syn51365301). We matched each canonical gene name as in gencode v.30 and the variant ID in hg19 to perform the comparison. For the effect size concordance analysis, pQTLs with P > 0.05 in the UKB PPP dataset were not included for simplicity. The ARIC8 data (the full summary statistics and PIPs) were kindly shared by the authors (http://nilanjanchatterjeelab.org/pwas/). We matched the Ensembl gene IDs based on gencode v.30 and the variant IDs in hg38, removed any unmatched entries and performed the comparison. For the EPIC-Norfolk study, we downloaded the supplementary data from ref. 9 and matched the Ensembl gene IDs and the variant IDs (lifted to hg38). For the mass spectrometry-derived pQTL data, we downloaded the supplementary data from ref. 15 (Supplementary Data 3, summary of identified pQTLs) and matched them on the variant ID (lifted to hg38). As the mass spectrometry data did not directly nominate the affected genes, we matched the variant ID alone and assumed that the gene with the most significant cis-pQTL effect in our dataset was the affected gene to make the comparison. Different prefiltering strategies could contribute to the differences in replication rate, deeming our comparison of the replication rate between cohorts as semiquantitative.

mRNA-specific and protein-specific eQTL fine-mapping

To investigate the mRNA-specific or protein-specific expression of QTLs, we focused on 998 samples and 2,211 genes with both QC-passed mRNA and protein expression measurements. Normalization was reperformed within this reduced sample-by-gene matrix, for mRNA and protein expression separately. For each normalized mRNA (or protein) expression datum, linear regression using protein (or mRNA) expression of the same gene as the only variable was performed to calculate the regression coefficient. The mRNA (or protein) expression data where the linear effect of protein (or mRNA) were regressed out using the model described above were used as the mRNA-specific or protein-specific expression matrix. QTL calling and fine-mapping were performed in the same manner as described by others (that is, preparing the same set of covariates in the QTL calling and with the same parameter setting in the fine-mapping algorithm).

Targeted trans-eQTL and trans-pQTL calling

We first focused on the common lead pQTLs (variants with the lowest pQTL P value for each gene, minor allele frequency > 0.01) and tested their genome-wide pQTL effects (Bonferroni-corrected P = 0.05/4,569,247, where 4,569,247 is 1,569 common lead variants times 2,932 genes and minus 31,061 cis-variant–gene pairs within a 5-Mb distance to the TSS) using tensorQTL, including the same set of covariates as in cis-pQTL mapping. Both the lead pQTLs and corresponding random control variants were filtered to minor allele frequency >0.01 for consistency.

When focusing on the genetic variation in classical HLA genes to test genome-wide trans-eQTL and trans-pQTL effects, DEEP*HLA66 was used to impute HLA alleles for the 998 samples with both mRNA and protein measurements. We focused on 144 four-digit alleles passing QC (imputation R2 > 0.7 and minor allele frequency > 0.01) mainly on classical HLA genes (HLA-A, HLA-B, HLA-C, DRB1, DQA1, DQB1, DPA1, DPB1 and MICA). tensorQTL was used to test the trans-effect (defined by >5 Mb distance between the center of the MHC region and the gene TSS) of each of the four-digit alleles on genome-wide mRNA and protein expression. For each of the HLA genes, the minimum P value over all the four-digit alleles were displayed in a Miami plot; those passing a suggestive P < 1 × 10−5 threshold were annotated. The MHC region (defined as chr6:25726063–33400644) was not included in the cis-QTL fine-mapping analyses because of their high complexity leading to lower mapping quality (that is, while we controlled any biases due to a difference in LD structure by focusing on the same set of 998 samples, we left the fine-mapping of causal variants within HLA, whether in cis or trans, on mRNA or protein expression, as future work).

ieQTL and ipQTL calling

COVID-19 severity ieQTL calling (ieQTL calling) was performed according to the pipeline described previously in ref. 20. Briefly, we used COVID-19 severity classified into four levels as the interaction term, kept all the other terms such as principal components and PEER factors, and used tensorQTL to obtain the P value from the likelihood ratio test. ipQTL calling was performed in a similar fashion. Results were largely consistent when excluding age from the covariates (Supplementary Fig. 20); they were far from identical when testing for the interaction with age instead of COVID-19 phenotype, suggesting that the inclusion of age was not introducing bias and that the COVID-19 interaction effects were not simply driven by shifts in age distribution (Supplementary Note). Samples with both RNA-seq and protein measurements (n = 998) were used for ieQTL and ipQTL calling.

MPRA

Massively parallel reporter assay (MPRA) is a high-throughput method that allows quantification of variant effects by measuring the transcriptional activity of many reporters inserted with different sequence elements. Our MPRA library contained 24,000 oligonucleotides, allowing us to test nearly 12,000 variants across the genome (that is, (24,000 − no. of controls)/2). While the library design is described in a separate manuscript in preparation, using K562 and HepG2 cells, we systematically tested variants with PIP > 0.1 in the previous eQTL fine-mapping from the JCTF20, PIP > 0.1 in the previous functionally informed fine-mapping of the GTEx whole-blood eQTLs60, and a small number of additional variants with possible phenotypic effects, while removing variants with an insertion and deletion length greater than 70.

The MPRA experiments were performed according to the steps in ref. 23. Briefly, the MPRA library was synthesized by Agilent, amplified using PCR, adding random barcodes and cloned into the pLS-SceI vector (no. 137725, Addgene). Sequence–barcode associations in the plasmid library were determined by sequencing using the NextSeq Mid Output 300 cycle kit. The plasmid library was packaged with lentivirus and infected into 2.8 million HepG2 or 10 million K562 cells at a multiplicity of infection of 50 and 10, respectively. For each cell line, three independent infections were performed to obtain three biological replicates. After 3 days, genomic DNA and total RNA were extracted using an AllPrep DNA/RNA Mini Kit (no. 80204, QIAGEN). Total RNA was reverse-transcribed to generate complementary DNA, using Superscript IV Reverse Transcriptase (Thermo Fisher Scientific). Integrated and transcribed barcodes were amplified with specific primers while incorporating the 16-bp unique molecular identifier and Illumina sequencing adapters. Barcodes were then sequenced using the NextSeq High Output 75 cycle kit.

MPRAflow23 was used to associate the barcodes to individual 200-bp sequences and to count the number of DNA and RNA reads (Supplementary Fig. 26). After filtering noisy barcodes with fewer than five DNA counts in a replicate, as well as noisy variants with fewer than five total barcode–replicate pairs in reference or alternative alleles, we aggregated all barcode counts for each sequence across replicates and defined the allelic effect of each variant as log2((total RNA count/total DNA count in alt)/(total RNA count/total DNA count in ref)). We then performed permutation-based significance tests to obtain the P value for each variant and used the Benjamini–Hochberg method to estimate the FDR. Variants passing an FDR threshold of 0.01 were defined as tier 1 EMVars; other variants passing an FDR threshold of 0.1 were defined as tier 2 EMVars. As described in detail in the Supplementary Note, our analysis was based on a simplified model where RNA count follows a Poisson distribution with \(\lambda \,=\,\alpha \times ({\mathrm{DNA}}\; {\mathrm{count}})\), where the translation rate \(\alpha\) is a function of the 200-bp sequence alone.

As shown in Extended Data Fig. 2c,d, we did not observe a major difference in enrichment level between K562 and HepG2 cells, even though K562 cells are physiologically more relevant to blood, suggesting that the effect of culturability, transduction efficiency and other methodological details could differ depending on cell type.

Defining shared and specific QTLs

To investigate shared and distinct QTL effects on mRNA and protein expression, we calculated the product of the eQTL and pQTL PIP (recalculated using the identical set of n = 998 samples for each of the intersecting n = 2,211 genes) to define the CLPP. As eQTL and pQTL calling from identical samples were nonindependent, we deemed this CLPP assignment as conservative, and thus took different thresholds to declare (single-variant-level) colocalization: CLPP ≥ 0.9 for ‘colocalizing QTLs’ and CLPP ≥ 0.1 for ‘possibly colocalizing QTLs’. We deem how to set the threshold to declare colocalization as an open problem, although as analyzed in the Results section, using different thresholds did not qualitatively change the results.

When performing coloc, we turned the P values into Bayes factors assuming a single causal variant per gene and calculated the PP.H1 to PP.H4 with the default prior probability setting described in their manuscript67. When performing SuSiE-coloc, we calculated the PP.H1 to PP.H4 for each combination of pure mRNA and protein QTL credible sets in a gene and let the maximum PP.H4 be the colocalization probability for a gene. (We did not relax the purity filter for credible sets, which could result in slightly conservative results compared to their original implementation.) For the main results, we did not use canonical colocalization67 because we observed more than one causal eQTL or pQTL effect in each. SuSiE-coloc68 was also not chosen as our main method because we aimed to also use the results from FINEMAP. We took the minimum PIPs from SuSiE and FINEMAP, resulting in a slightly more conservative CLPP estimation than using one of the two (Supplementary Fig. 13i).

mRNA-specific QTLs were defined as those with a protein-adjusted eQTL PIP ≥ 0.9 and CLPP < 0.01 (CLPP as described above); protein-specific QTLs were defined as mRNA-adjusted pQTL PIP ≥ 0.9 and CLPP < 0.01.

Functional enrichment

Variant Effect Predictor (VEP) v.108 was used to annotate the variant–gene pairs with several functional annotations. Instead of using the canonical ‘most severe consequence’, which could be tagged by the effect on other genes, we parsed the annotation at the unit of the variant–gene pairs to be more specific. When multiple annotations were included, we did not filter to the most severe ones canonically; thus, each variant–gene pair could be positive for more than one annotation. Protein-specific annotations were obtained as BED files from UniProt through the UCSC Genome Browser. We used pybedtools to check whether a variant intersected with each specific protein annotation.

Enrichment of a category C specific functional annotation given a bin B PIP was defined as the likelihood ratio compared to a random draw (that is, \(p(v\in {C\; |\; v}\in B)\,/{p}(v\in C)\)). The error bar denotes the s.e.m. of the numerator (that is, we assumed that the denominator contained a large number of variant–gene pairs and that the error could be trivial). Plots are often displayed in log-scale for visualization purposes. The colocalization enrichment score between our pQTL calling and eQTL calling in GTEx v.8 across 49 tissues (Fig. 2e) was calculated as follows: (1) We obtained variant–gene–tissue triads with PIP > 0.001 in GTEx v.8 (where the PIP was calculated with a uniform prior and the minimum of SuSiE and FINEMAP was taken, as described in ref. 60) and filtered out missense variants. We annotated the pQTL PIP in JCTF as a function of variant–gene while filling with zero when either the PIP was lower than 0.001 or missing in the JCTF. (2) Then, for each tissue T, we calculated the average of the CLPP, defined as the product of the eQTL and pQTL PIPs. Let this be UT. (3) We randomly selected the same number of variant–gene–tissue triads from tissues other than T and calculated the average CLPP. We repeated the process (that is, sampling the same number of variant–gene pairs excluding the tissue of interest and calculating the average CLPP) for 1,000 times (let these be VT,1, VT,2,.‥,VT,1000); (4) our point estimate of the enrichment score is the mean of UT divided by VT and the error bar is the 2.5–97.5% quantile.Although the estimation could be deflated as it does not distinguish between variants missing versus having a PIP below the 0.001 threshold, and the power was dependent on the sample size for each GTEx tissue, we expected to qualitatively characterize the different level of colocalization between plasma proteome and each tissue.

Complex trait and disease analysis

BBJ and UKB fine-mapping data, described in detail in ref. 38, were downloaded from the National Bioscience Database Center (NBDC) Human Database (accession no. hum0197) and the Finucane lab (https://www.finucanelab.org/data). PIP enrichment was defined for each variant–gene–trait triad as the \({\mathrm{sum}}({\mathrm{PI}}{\mathrm{P}}_{{\mathrm{molQTL}}}\times {\mathrm{PI}}{\mathrm{P}}_{\mathrm{{trait}}})\) (those with \({\mathrm{PI}}{\mathrm{P}}_{{\mathrm{trait}}}\) <0.001 were not included in this calculation; this had a minimal effect on the quantification); then, the maximum over genes was taken to obtain the per-variant scores. When the analysis was not trait-specific, or according to the trait category, the maximum over the traits in a category was taken. When visualizing the number of colocalizing genes per trait (Fig. 4a), we used a custom threshold of \(0.1\le \min ({\mathrm{PI}{\mathrm{P}}}_{{\mathrm{molQTL}}},{\mathrm{PI}}{\mathrm{P}}_{{\mathrm{trait}}})\), instead of using the CLPP, to allow comparison between shared versus specific QTLs (that is, for shared QTLs, the threshold was \(0.1\le \min ({\mathrm{PI}}{\mathrm{P}}_{{\mathrm{eQTL}}},{\mathrm{PI}}{\mathrm{P}}_{{\mathrm{pQTL}}},{\mathrm{PI}}{\mathrm{P}}_{{\mathrm{trait}}})\)).

Negative expression correlation analysis

For the analysis of mRNA and protein QTL entanglement at the ABO locus, we selected five variants corresponding to the top PIP variants from two eQTL credible sets and two pQTL credible sets from SuSiE that were in LD, removing the ones with limited Bayes factors or far away from the variant of highest interest (=p1). When comparing the QTL effect with and without controlling for the others, we restricted the analysis to n = 998 samples with both mRNA and protein measurements.

Statistical analysis

All the statistical tests were two-sided. No adjustment was made for the reported P values unless it was clearly stated as ‘adjusted P value’. The error bars denote the s.e.m. unless noted otherwise. When showing enrichment, the enrichment error bar denotes the standard error of the numerator divided by the denominator.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.