Abstract
Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
Figures 2 and 3 and Table 3 were created using the publicly available 1000 Genomes Project phase 3 and Human Genome Diversity Project data. These datasets can be directly loaded into Hail as a matrix table using the dataset repository (https://hail.is/docs/0.2/datasets.html).
Code availability
Python code for conducting sample and variant filtering using Hail can be found here at https://github.com/jsealock1/sequencing_qc.
References
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Goldfeder, R. L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
Nielsen, R., Paul, J. S., Albrechtsen, A. & Song, Y. S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443–451 (2011).
Olson, N. D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. https://doi.org/10.1038/s41576-023-00590-0 (2023).
Carson, A. R. et al. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC Bioinforma. 15, 125 (2014).
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
Behera, S. et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02382-1 (2024).
Lam, M. et al. RICOPILI: Rapid Imputation for COnsortias PIpeLIne. Bioinformatics 36, 930–933 (2020).
Guo, Y. et al. Illumina human exome genotyping array clustering and quality control. Nat. Protoc. 9, 2643–2662 (2014).
Rehm, H. L. et al. ACMG clinical laboratory standards for next-generation sequencing. Genet. Med. 15, 733–747 (2013).
Marshall, C. R. et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genom. Med. 5, 1–12 (2020).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Hu, T., Chitnis, N., Monos, D. & Dinh, A. Next-generation sequencing technologies: an overview. Hum. Immunol. 82, 801–811 (2021).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 208–219 (2018).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Pedersen, B. S. & Quinlan, A. R. Vcfexpress: flexible, rapid user-expressions to filter and format VCFs. Preprint at bioRxiv https://doi.org/10.1101/2024.11.05.622129 (2024).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Timothy, P. et al. The scalable variant call representation: enabling genetic analysis beyond one million genomes. Bioinformatics 41, btae746 (2025).
Orlov, Y. L. & Potapov, V. N. Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Res. 32, W628–W633 (2004).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022).
Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
Muyas, F. et al. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum. Mutat. 40, 115–126 (2019).
Zhang, F. et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res. 30, 185–194 (2020).
Lu, W. et al. CHARR efficiently estimates contamination from DNA sequencing data. Am. J. Hum. Genet. 110, 2068–2076 (2023).
Gaspar, H. A. & Breen, G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinforma. 20, 116 (2019).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463 (2010).
Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
Anderson, C. A. et al. Data quality control in genetic case-control association studies. Nat. Protoc. 5, 1564–1573 (2010).
Guo, Y. et al. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 103, 323–328 (2014).
Guo, Y., Ye, F., Sheng, Q., Clark, T. & Samuels, D. C. Three-stage quality control strategies for DNA re-sequencing data. Brief. Bioinform. 15, 879–889 (2014).
Ng, S. B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Neuman, J. A., Isakov, O. & Shomron, N. Analysis of insertion–deletion from deep-sequencing data: software evaluation for optimal detection. Brief. Bioinform. 14, 46–55 (2013).
Boltz, T. A. et al. A blended genome and exome sequencing method captures genetic variation in an unbiased, high-quality, and cost-effective manner. Preprint at bioRxiv https://doi.org/10.1101/2024.09.06.611689 (2024).
Acknowledgements
This work is supported by the Novo Nordisk Foundation (NNF21SA0072102) with the following funding sources: R37MH107649, U01MH125047, R01MH101244.
Author information
Authors and Affiliations
Contributions
This tutorial was designed, developed, and written by J.M.S.; F.I., C.L., S.C., C.C., K.J.K., D.P.H. and B.M.N. provided critical feedback and manuscript edits; and K.J.K., D.P.H. and and B.M.N. supervised the work. All authors approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences, AlloDx and Vor Biosciences, and a member of the scientific advisory board of Nurture Genomics.
Peer review
Peer review information
Nature Protocols thanks Valerio Napolioni, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note on Sequencing Generation, Supplementary Fig. 1 describing the structure of a Hail matrix table and references for the Supplementary Note.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sealock, J.M., Ivankovic, F., Liao, C. et al. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc (2025). https://doi.org/10.1038/s41596-025-01169-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41596-025-01169-1