+
Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies

This article has been updated

A preprint version of the article is available at bioRxiv.

Abstract

Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropic genes and regions. Existing multi-trait methods have limited ability to perform rare variant analysis of large-scale WGS data. We propose MultiSTAAR, a statistical framework and computationally scalable analytical pipeline for functionally informed multi-trait rare variant analysis in large-scale WGS studies. MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits, and further empowers rare variant association analysis by incorporating multiple functional annotations. We applied MultiSTAAR to jointly analyze three lipid traits in 61,838 multi-ethnic samples from the Trans-Omics for Precision Medicine (TOPMed) Program. We discovered and replicated new associations with lipid traits missed by single-trait analysis.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: MultiSTAAR framework and pipeline.
Fig. 2: Manhattan plots and Q–Q plots for unconditional gene-centric coding, noncoding and ncRNA multi-trait analysis of LDL-C, HDL-C and TG using TOPMed data (n = 61,838).
Fig. 3: TOPMed genetic-region (2-kb sliding window) unconditional multi-trait analysis results for LDL-C, HDL-C and TG using TOPMed data (n = 61,838).

Similar content being viewed by others

Data availability

This Article used TOPMed Freeze 8 WGS data and lipids phenotype data. Genotype and phenotype data are both available in the database of Genotypes and Phenotypes. The TOPMed WGS data were from the following 20 study cohorts (accession numbers provided in parentheses): Old Order Amish (phs000956.v1.p1), Atherosclerosis Risk in Communities Study (phs001211), Mt Sinai BioMe Biobank (phs001644), Coronary Artery Risk Development in Young Adults (phs001612), Cleveland Family Study (phs000954), Cardiovascular Health Study (phs001368), Diabetes Heart Study (phs001412), Framingham Heart Study (phs000974), Genetic Study of Atherosclerosis Risk (phs001218), Genetic Epidemiology Network of Arteriopathy (phs001345), Genetic Epidemiology Network of Salt Sensitivity (phs001217), Genetics of Lipid Lowering Drugs and Diet Network (phs001359), Hispanic Community Health Study—Study of Latinos (phs001395), Hypertension Genetic Epidemiology Network and Genetic Epidemiology Network of Arteriopathy (phs001293), Jackson Heart Study (phs000964), Multi-Ethnic Study of Atherosclerosis (phs001416), San Antonio Family Heart Study (phs001215), Genome-Wide Association Study of Adiposity in Samoans (phs000972), Taiwan Study of Hypertension using Rare Variants (phs001387) and Women’s Health Initiative (phs001237). The sample sizes, ancestry and phenotype summary statistics of these cohorts are provided in Supplementary Table 2. Source data for Figs. 2 and 3 and Extended Data Figs. 1 and 2 are available via Zenodo (https://doi.org/10.5281/zenodo.14213842)58. The UK Biobank analyses were conducted using the UK Biobank resource under application 52008. The functional annotation data are publicly available and can be downloaded from the following links: GRCh38 CADD v1.4 (https://cadd.gs.washington.edu/download); ANNOVAR dbNSFP v3.3a (https://annovar.openbioinformatics.org/en/latest/user-guide/download); LINSIGHT (https://github.com/CshlSiepelLab/LINSIGHT); FATHMM-XF (http://fathmm.biocompute.org.uk/fathmm-xf); FANTOM5 CAGE (https://fantom.gsc.riken.jp/5/data); GeneCards (https://www.genecards.org; v4.7 for hg38); and Umap/Bismap (https://bismap.hoffmanlab.org; ‘before March 2020’ version). In addition, recombination rate and nucleotide diversity were obtained from ref. 59. The whole-genome individual functional annotation data were assembled from a variety of sources, and the computed annotation PCs are available at the Functional Annotation of Variant-Online Resource (FAVOR) site (https://favor.genohub.org)60 and the FAVOR database (https://doi.org/10.7910/DVN/1VGTJI)61.

Code availability

MultiSTAAR is implemented as an open-source R package available at https://github.com/xihaoli/MultiSTAAR and https://hsph.harvard.edu/research/lin-lab/software. Data analysis was performed in R (4.1.0). STAAR v0.9.7 and MultiSTAAR v0.9.7 were used in simulation and real data analysis and implemented as open-source R packages available at https://github.com/xihaoli/STAAR (ref. 56) and https://github.com/xihaoli/MultiSTAAR (ref. 54). The assembled functional annotation data were downloaded from FAVOR using Wget (https://www.gnu.org/software/wget/wget.html).

Change history

  • 05 March 2025

    In the version of the article initially published, DOIs were missing from refs. 54–58 and have now been added to the HTML and PDF versions of the article.

References

  1. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

    Google Scholar 

  2. The All of Us Research Program Investigators. The ‘All of Us’ research program. New Engl. J. Med. 381, 668–676 (2019).

    Google Scholar 

  3. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

    MATH  Google Scholar 

  4. Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Human Genet. 95, 5–23 (2014).

    Google Scholar 

  5. Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Human Genet. 83, 311–321 (2008).

    MATH  Google Scholar 

  6. Madsen, B. E. & Browning, S. R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384 (2009).

    Google Scholar 

  7. Morris, A. P. & Zeggini, E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 34, 188–193 (2010).

    MATH  Google Scholar 

  8. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Human Genet. 89, 82–93 (2011).

    MATH  Google Scholar 

  9. Liu, Y. et al. ACAT: a fast and powerful P value combination method for rare-variant analysis in sequencing studies. Am. J. Human Genet. 104, 410–421 (2019).

    MATH  Google Scholar 

  10. Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M. & Smoller, J. W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 14, 483–495 (2013).

    Google Scholar 

  11. Sivakumaran, S. et al. Abundant pleiotropy in human complex diseases and traits. Am. J. Human Genet. 89, 607–618 (2011).

    MATH  Google Scholar 

  12. Abdellaoui, A., Yengo, L., Verweij, K. J. H. & Visscher, P. M. 15 years of GWAS discovery: realizing the promise. Am. J. Human Genet 110, 179–194 (2023).

    MATH  Google Scholar 

  13. Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).

    MATH  Google Scholar 

  14. Wu, B. & Pankow, J. S. Sequence kernel association test of multiple continuous phenotypes. Genet. Epidemiol. 40, 91–100 (2016).

    MATH  Google Scholar 

  15. Dutta, D., Scott, L., Boehnke, M. & Lee, S. Multi-SKAT: general framework to test for rare-variant association with multiple phenotypes. Genet. Epidemiol. 43, 4–23 (2019).

    Google Scholar 

  16. Luo, L. et al. Multi-trait analysis of rare-variant association summary statistics using MTAR. Nat. Commun. 11, 2850 (2020).

    MATH  Google Scholar 

  17. Broadaway, K. A. et al. A statistical approach for testing cross-phenotype effects of rare variants. Am. J. Human Genet. 98, 525–540 (2016).

    MATH  Google Scholar 

  18. Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).

    MATH  Google Scholar 

  19. Sammel, M., Lin, X. & Ryan, L. Multivariate linear mixed models for multiple outcomes. Stat. Med. 18, 2479–2492 (1999).

    MATH  Google Scholar 

  20. Conomos, M. P., Miller, M. B. & Thornton, T. A. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol. 39, 276–293 (2015).

    MATH  Google Scholar 

  21. Conomos, M. P., Reiner, A. P., Weir, B. S. & Thornton, T. A. Model-free estimation of recent genetic relatedness. Am. J. Human Genet. 98, 127–148 (2016).

    MATH  Google Scholar 

  22. Gogarten, S. M. et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics 35, 5346–5348 (2019).

    MATH  Google Scholar 

  23. Lee, P. H. et al. Principles and methods of in silico prioritization of noncoding regulatory variants. Human Genet. 137, 15–30 (2018).

  24. Li, Z. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat. Methods 19, 1599–1611 (2022).

  25. Morrison, A. C. et al. Practical approaches for whole-genome sequence analysis of heart-and blood-related traits. Am. J. Human Genet. 100, 205–215 (2017).

    MATH  Google Scholar 

  26. Selvaraj, M. S. et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat. Commun. 13, 5995 (2022).

    MATH  Google Scholar 

  27. Liu, Z. & Lin, X. Multiple phenotype association tests using summary statistics in genome-wide association studies. Biometrics 74, 165–175 (2018).

    MathSciNet  MATH  Google Scholar 

  28. Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).

    MATH  Google Scholar 

  29. Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).

    MATH  Google Scholar 

  30. Natarajan, P. et al. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat. Commun. 9, 3391 (2018).

    MATH  Google Scholar 

  31. Stilp, A. M. et al. A system for phenotype harmonization in the National Heart, Lung and Blood Institute Trans-Omics for Precision Medicine (TOPMed) program. Am. J. Epidemiol. 190, 1977–1992 (2021).

    MATH  Google Scholar 

  32. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

    MATH  Google Scholar 

  33. Dong, C. et al. Comparison and integration of deleteriousness prediction methods for non-synonymous SNVs in whole exome sequencing studies. Human Mol. Genet. 24, 2125–2137 (2014).

  34. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

    MATH  Google Scholar 

  35. Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

  36. Rogers, M. F. et al. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics 34, 511–513 (2017).

    MATH  Google Scholar 

  37. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).

    Google Scholar 

  38. Klarin, D. et al. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat. Genet. 50, 1514–1523 (2018).

    MATH  Google Scholar 

  39. Forrest, A. R. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).

    MATH  Google Scholar 

  40. Abascal, F. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).

    MATH  Google Scholar 

  41. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).

    MATH  Google Scholar 

  42. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).

    Google Scholar 

  43. DiCorpo, D. et al. Whole genome sequence association analysis of fasting glucose and fasting insulin levels in diverse cohorts from the NHLBI TOPMed program. Commun. Biol. 5, 756 (2022).

    Google Scholar 

  44. Jiang, M.-Z. et al. Whole genome sequencing based analysis of inflammation biomarkers in the Trans-Omics for Precision Medicine (TOPMed) consortium. Human Mol. Genet 33, 1429–1441 (2024).

    MATH  Google Scholar 

  45. Dijk, W. et al. Identification of a gain-of-function LIPC variant as a novel cause of familial combined hypocholesterolemia. Circulation 146, 724–739 (2022).

    MATH  Google Scholar 

  46. Ottensmann, L. et al. Genome-wide association analysis of plasma lipidome identifies 495 genetic associations. Nat. Commun. 14, 6934 (2023).

    MATH  Google Scholar 

  47. Guo, T. et al. Association between the DOCK7, PCSK9 and GALNT2 gene polymorphisms and serum lipid levels. Sci. Rep. 6, 19079 (2016).

    Google Scholar 

  48. Li, Z. et al. Dynamic scan procedure for detecting rare-variant association regions in whole-genome sequencing studies. Am. J. Human Genet. 104, 802–814 (2019).

    MATH  Google Scholar 

  49. McCaw, Z. R., Gao, J., Lin, X. & Gronsbell, J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat. Genet. 56, 1527–1536 (2024).

    Google Scholar 

  50. Li, X. et al. Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies. Nat. Genet. 55, 154–164 (2023).

    MATH  Google Scholar 

  51. Chen, H. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Human Genet. 98, 653–666 (2016).

    MATH  Google Scholar 

  52. Chen, H. et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am. J. Human Genet. 104, 260–274 (2019).

    MATH  Google Scholar 

  53. Liu, Y. & Xie, J. Cauchy combination test: a powerful test with analytic P-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 115, 393–402 (2020).

    MathSciNet  MATH  Google Scholar 

  54. Li, X. xihaoli/MultiSTAAR: MultiSTAAR_v0.9.7 (v0.9.7). Zenodo. https://doi.org/10.5281/zenodo.13955413 (2024).

  55. Li, X. & Li, Z. xihaoli/STAARpipeline: STAARpipeline_v0.9.7 (v0.9.7). Zenodo. https://doi.org/10.5281/zenodo.10098313 (2023).

  56. Li, X. xihaoli/STAAR: STAAR_v0.9.7 (v0.9.7). Zenodo. https://doi.org/10.5281/zenodo.10060210 (2023).

  57. Li, X. & Li, Z. xihaoli/STAARpipelineSummary: STAARpipelineSummary_v0.9.7 (v0.9.7). Zenodo. https://doi.org/10.5281/zenodo.10113310 (2023).

  58. Li, X. et al. Source data of the MultiSTAAR manuscript “A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies”. [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14213842 (2024).

  59. Gazal, S. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).

    MATH  Google Scholar 

  60. Zhou, H. et al. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res. 51, D1300–D1311 (2023).

    Google Scholar 

  61. Zhou, H., Arapoglou, T., Li, X., Li, Z. & Lin, X. FAVOR Essential Database. V1 edn (Harvard Dataverse, 2022).

  62. Moors, J. et al. A Polynesian-specific missense CETP variant alters the lipid profile. Human Genet. Genomics Adv. 4, 100204 (2023).

    MATH  Google Scholar 

Download references

Acknowledgements

This work was supported by grants R35-CA197449, U19-CA203654, U01-HG012064 and U01-HG009088 (X. Lin), NHLBI TOPMed Fellowship 75N92021F00229 (X. Li and M.S.S.), 1R01AG086379-01 (Z. Liu), R01-HL142711 and R01-HL127564 (P.N. and G.M.P.), R00HG012956-02 (Z.Y.), 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR001881, DK063491, R01-HL071051, R01-HL071205, R01-HL071250, R01-HL071251, R01-HL071258, R01-HL071259 and UL1-RR033176 (J.I.R.), HHSN268201800001I and U01-HL137162 (K.M.R.), DK078616 and HL151855 (J.B.M.), 1R35-HL135818, R01-HL113338 and HL046389 (S.R.), HL105756 (B.M.P.), HHSN268201600018C, HHSN268201600001C, HHSN268201600002C, HHSN268201600003C and HHSN268201600004C (C.K.), R01-MD012765 and R01-DK117445 (N.F.), R01-HL153805 and R03-HL154284 (B.E.C.), HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I and HHSN268201700004I (E.B.), U01-HL072524, R01-HL104135-04S1, U01-HL054472, U01-HL054473, U01-HL054495, U01-HL054509 and R01-HL055673-18S1 (D.K.A.) and U01-HL72518, HL087698, HL49762, HL59684, HL58625, HL071025, HL112064, NR0224103 and M01-RR000052 (to the Johns Hopkins General Clinical Research Center). The Diabetes Heart Study (DHS) was supported by R01 HL92301, R01 HL67348, R01 NS058700, R01 AR48797, R01 DK071891 and R01 AG058921, the General Clinical Research Center of the Wake Forest University School of Medicine (M01 RR07122, F32 HL085989), the American Diabetes Association and a pilot grant from the Claude Pepper Older Americans Independence Center of Wake Forest University Health Sciences (P60 AG10484). The Framingham Heart Study (FHS) acknowledges the support of contracts NO1-HC-25195, HHSN268201500001I, 75N92019D00031, 1R01HL064753, R01HL076784 and 1R01AG028321 from the National Heart, Lung and Blood Institute and grant supplement R01 HL092577-06S1 for this research. We also acknowledge the dedication of the FHS study participants, without whom this research would not be possible. R.S.V. is supported in part by the Evans Medical Foundation and the Jay and Louis Coffman Endowment from the Department of Medicine, Boston University Chobanian & Avedisian School of Medicine. The Jackson Heart Study (JHS) is supported and conducted in collaboration with Jackson State University (HHSN268201800013I), Tougaloo College (HHSN268201800014I), the Mississippi State Department of Health (HHSN268201800015I) and the University of Mississippi Medical Center (HHSN268201800010I, HHSN268201800011I and HHSN268201800012I) contracts from the National Heart, Lung and Blood Institute (NHLBI) and the National Institute on Minority Health and Health Disparities (NIMHD). We also thank the staff and participants of the JHS. Support for GENOA was provided by the NHLBI (U01HL054457, U01HL054464, U01HL054481, R01HL119443 and R01HL087660) of the National Institutes of Health (NIH). Collection of the San Antonio Family Study data was supported in part by NIH grants P01 HL045522, MH078143, MH078111 and MH083824, and whole-genome sequencing of SAFS subjects was supported by U01 DK085524 and R01 HL113323. Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the NHLBI. Core support, including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract no. HHSN268201800002I). Core support, including phenotype harmonization, data management, sample-identity quality control and general program coordination were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract no. HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. The full study-specific acknowledgements are detailed in the Supplementary Information.

Author information

Authors and Affiliations

Authors

Consortia

Contributions

X. Li, H.C., Z. Li, Z. Liu and X. Lin designed the experiments. X. Li, H.C., Z. Li and X. Lin performed the experiments. X. Li, H.C., M.S.S., E.V.B., Y.W., R.S., Z.R.M., Z.Y., M.-Z.J., D.D., S.M.G., R.D., D.K.A., E.J.B., J.C.B., J.B., E.B., D.W.B., J.A.B., B.E.C., A.P.C., J.C.C., N.C., Y.-D.I.C., J.E.C., P.S.d.V., M.F., N.F., B.I.F., C.G., N.L.H.C., J.H., L.H., Y.-J.H., M.R.I., R.C.K., S.L.R.K., T.N.K., I.K., C.K., B.G.K., C.L., Y.L., H.L., C.-T.L. R.J.F.L., M.C.M., L.W.M., R.A.M., B.D.M., M.E.M., A.C.M., T.N., K.E.N., N.D.P., P.A.P., B.M.P., S.R., A.P.R., S.S.R., C.M.S., J.A.S., K.D.T., H.K.T., R.S.V., S.V., Z.W., J.W., L.R.Y., B.Y., J.D., J.B.M., P.L.A., L.M.R., A.K.M., K.M.R., J.I.R., G.M.P., P.N., Z. Li, H.Z., Z. Liu and X. Lin acquired, analyzed or interpreted data. G.M.P., P.N. and the NHLBI TOPMed Lipids Working Group provided administrative, technical or material support. X. Li, Z. Li, Z. Liu and X. Lin drafted the paper and revised it according to suggestions by the coauthors. All authors critically reviewed the paper, suggested revisions as needed, and approved the final version.

Corresponding authors

Correspondence to Zilin Li, Zhonghua Liu or Xihong Lin.

Ethics declarations

Competing interests

Z.R.M. and R.D. are employees of Insitro. S.M.G. is an employee of Regeneron Genetics Center. M.E.M. receives research funding from Regeneron Pharmaceutical Inc., unrelated to this project. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. L.M.R. and S.S.R. are consultants for the TOPMed Administrative Coordinating Center (via Westat). P.N. reports research grants from Allelica, Amgen, Apple, Boston Scientific, Genentech/Roche and Novartis, personal fees from Allelica, Apple, AstraZeneca, Blackstone Life Sciences, Creative Education Concepts, CRISPR Therapeutics, Eli Lilly & Co, Esperion Therapeutics, Foresite Capital, Foresite Labs, Genentech/Roche, GV, HeartFlow, Magnet Biomedicine, Merck, Novartis, TenSixteen Bio and Tourmaline Bio, equity in Bolt, Candela, Mercury, MyOme, Parameter Health, Preciseli and TenSixteen Bio, and spousal employment at Vertex Pharmaceuticals, all unrelated to the present work. X. Lin is a consultant of AbbVie Pharmaceuticals and Verily Life Sciences. The other authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Yuehua Cui, Yukinori Okada and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Manhattan plots and Q-Q plots for unconditional gene-centric coding, noncoding and genetic region (2-kb sliding window) multi-trait analysis of fasting glucose (FG) and fasting insulin (FI) using TOPMed data (n = 21,731).

a, Manhattan plots for unconditional gene-centric coding analysis of protein-coding genes. The horizontal line indicates a genome-wide MultiSTAAR-O P value threshold of \(5.00\times {10}^{-7}\). The significant threshold is defined by multiple comparisons using the Bonferroni correction (\(0.05/\left(\mathrm{20,000}\times 5\right)=5.00\times {10}^{-7}\)). Different symbols represent the MultiSTAAR-O P value of the protein-coding gene using different functional categories (putative loss-of-function, putative loss-of-function and disruptive missense, missense, disruptive missense, synonymous). b, Quantile-quantile plots for unconditional gene-centric coding analysis of protein-coding genes. Different symbols represent the MultiSTAAR-O P-value of the gene using different functional categories. c, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide MultiSTAAR-O P value threshold of \(3.57\times {10}^{-7}\). The significant threshold is defined by multiple comparisons using the Bonferroni correction (\(0.05/\left(\mathrm{20,000}\times 7\right)=3.57\times {10}^{-7}\)). Different symbols represent the MultiSTAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters with overlap of Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer-predicted regions with the overlap of CAGE sites and DHS sites for a given gene, respectively. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the MultiSTAAR-O P-value of the gene using different functional categories. e, Manhattan plot showing the associations of 2.68 million 2-kb sliding windows versus \(-{\log }_{10}(P)\) of MultiSTAAR-O. The horizontal line indicates a genome-wide P value threshold of \(1.86\times {10}^{-8}\). f, Quantile-quantile plot of 2-kb sliding window MultiSTAAR-O P values. In panels, a, c and e, the chromosome number are indicated by the colors of dots. In all panels, MultiSTAAR-O is a two-sided test.

Extended Data Fig. 2 Manhattan plots and Q-Q plots for unconditional gene-centric coding, noncoding and genetic region (2-kb sliding window) multi-trait analysis of C-reactive protein (CRP), interleukin-6 (IL-6), lipoprotein-associated phospholipase A2 (Lp-PLA2) activity, and lipoprotein-associated phospholipase A2 (Lp-PLA2) mass using TOPMed data (n = 9,380).

a, Manhattan plots for unconditional gene-centric coding analysis of protein-coding genes. The horizontal line indicates a genome-wide MultiSTAAR-O P value threshold of \(5.00\times {10}^{-7}\). The significant threshold is defined by multiple comparisons using the Bonferroni correction (\(0.05/\left(\mathrm{20,000}\times 5\right)=5.00\times {10}^{-7}\)). Different symbols represent the MultiSTAAR-O P value of the protein-coding gene using different functional categories (putative loss-of-function, putative loss-of-function and disruptive missense, missense, disruptive missense, synonymous). b, Quantile-quantile plots for unconditional gene-centric coding analysis of protein-coding genes. Different symbols represent the MultiSTAAR-O P-value of the gene using different functional categories. c, Manhattan plots for unconditional gene-centric noncoding analysis of protein-coding genes. The horizontal line indicates a genome-wide MultiSTAAR-O P value threshold of \(3.57\times {10}^{-7}\). The significant threshold is defined by multiple comparisons using the Bonferroni correction (\(0.05/\left(\mathrm{20,000}\times 7\right)=3.57\times {10}^{-7}\)). Different symbols represent the MultiSTAAR-O P value of the protein-coding gene using different functional categories (upstream, downstream, UTR, promoter_CAGE, promoter_DHS, enhancer_CAGE, enhancer_DHS). Promoter_CAGE and promoter_DHS are the promoters with overlap of Cap Analysis of Gene Expression (CAGE) sites and DNase hypersensitivity (DHS) sites for a given gene, respectively. Enhancer_CAGE and enhancer_DHS are the enhancers in GeneHancer predicted regions with the overlap of CAGE sites and DHS sites for a given gene, respectively. d, Quantile-quantile plots for unconditional gene-centric noncoding analysis of protein-coding genes. Different symbols represent the MultiSTAAR-O P-value of the gene using different functional categories. e, Manhattan plot showing the associations of 2.67 million 2-kb sliding windows versus \(-{\log }_{10}(P)\) of MultiSTAAR-O. The horizontal line indicates a genome-wide P value threshold of \(1.87\times {10}^{-8}\). f, Quantile-quantile plot of 2-kb sliding window MultiSTAAR-O P values. In panels, a, c and e, the chromosome number are indicated by the colors of dots. In all panels, MultiSTAAR-O is a two-sided test.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, X., Chen, H., Selvaraj, M.S. et al. A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies. Nat Comput Sci 5, 125–143 (2025). https://doi.org/10.1038/s43588-024-00764-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1038/s43588-024-00764-8

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载