  • Review Article

Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses

Abstract

Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.

Fig. 1: Overview of data processing steps and quality filtering for samples, genotypes and variants for sequencing data.
Fig. 2: The effects of filtering heterozygosity ratio with criteria from different samples stratified by ancestry.
Fig. 3: Distributions of sample QC metrics stratified by ancestry for WES (left) and WGS (right) from the 1KGP+HGDP dataset.
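The Fig. 2 caption refers to filtering heterozygosity ratio with criteria derived from different sample groupings. As a minimal sketch of why the grouping can matter, the Python below flags outlier samples either within each ancestry group or across the pooled cohort. The median ± 4 MAD rule, the function names and the toy values are assumptions made for illustration; they are not the thresholds or code used in the Tutorial.

```python
from collections import defaultdict
from statistics import median


def mad_bounds(values, n_mads=4.0):
    """Return (low, high) bounds at median +/- n_mads * MAD (hypothetical rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


def flag_outliers(samples, n_mads=4.0, stratify=True):
    """Flag samples whose metric falls outside the bounds of their group.

    samples: list of (sample_id, ancestry_label, het_ratio) tuples.
    stratify: if True, bounds are computed per ancestry label;
              if False, one set of bounds is computed for the pooled cohort.
    """
    groups = defaultdict(list)
    for sample_id, ancestry, value in samples:
        groups[ancestry if stratify else "all"].append((sample_id, value))

    flagged = set()
    for members in groups.values():
        low, high = mad_bounds([v for _, v in members], n_mads)
        flagged.update(sid for sid, v in members if not low <= v <= high)
    return flagged


# Toy data (made up): two groups with different baseline heterozygosity ratios.
samples = [
    ("A1", "anc_A", 1.48), ("A2", "anc_A", 1.50), ("A3", "anc_A", 1.52),
    ("A4", "anc_A", 1.49), ("A5", "anc_A", 1.51), ("A6", "anc_A", 1.90),
    ("B1", "anc_B", 1.98), ("B2", "anc_B", 2.00), ("B3", "anc_B", 2.02),
    ("B4", "anc_B", 1.99), ("B5", "anc_B", 2.01),
]

stratified = flag_outliers(samples, stratify=True)
pooled = flag_outliers(samples, stratify=False)
print("stratified:", sorted(stratified))
print("pooled:", sorted(pooled))
```

In this toy cohort, sample A6 is flagged only when bounds are computed within its own ancestry group; with pooled bounds, the difference in baseline between the two groups widens the interval enough to hide it.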

Data availability

Figures 2 and 3 and Table 3 were created using the publicly available 1000 Genomes Project phase 3 and Human Genome Diversity Project data. These datasets can be directly loaded into Hail as a matrix table using the dataset repository (https://hail.is/docs/0.2/datasets.html).

Code availability

Python code for conducting sample and variant filtering using Hail is available at https://github.com/jsealock1/sequencing_qc.

Acknowledgements

This work is supported by the Novo Nordisk Foundation (NNF21SA0072102) with the following funding sources: R37MH107649, U01MH125047, R01MH101244.

Author information

Contributions

This tutorial was designed, developed and written by J.M.S.; F.I., C.L., S.C., C.C., K.J.K., D.P.H. and B.M.N. provided critical feedback and manuscript edits; and K.J.K., D.P.H. and B.M.N. supervised the work. All authors approved the final manuscript.

Corresponding author

Correspondence to Julia M. Sealock.

Ethics declarations

Competing interests

B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences, AlloDx and Vor Biosciences, and a member of the scientific advisory board of Nurture Genomics.

Peer review

Peer review information

Nature Protocols thanks Valerio Napolioni, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note on Sequencing Generation, Supplementary Fig. 1 describing the structure of a Hail matrix table and references for the Supplementary Note.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sealock, J.M., Ivankovic, F., Liao, C. et al. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc (2025). https://doi.org/10.1038/s41596-025-01169-1

  • DOI: https://doi.org/10.1038/s41596-025-01169-1
