Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses

Sealock, Julia M.; Ivankovic, Franjo; Liao, Calwing; Chen, Siwei; Churchhouse, Claire; Karczewski, Konrad J.; Howrigan, Daniel P.; Neale, Benjamin M.

doi:10.1038/s41596-025-01169-1

Review Article
Published: 28 March 2025

Nature Protocols volume 20, pages 2372–2382 (2025)

Abstract

Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.

**Fig. 1: Overview of data processing steps and quality filtering for samples, genotypes and variants for sequencing data.**

**Fig. 2: The effects of filtering heterozygosity ratio with criteria from different samples stratified by ancestry.**

**Fig. 3: Distributions of sample QC metrics stratified by ancestry for WES (left) and WGS (right) from the 1KGP+HGDP dataset.**

Data availability

Figures 2 and 3 and Table 3 were created using the publicly available 1000 Genomes Project phase 3 and Human Genome Diversity Project data. These datasets can be directly loaded into Hail as a matrix table using the dataset repository (https://hail.is/docs/0.2/datasets.html).

Code availability

Python code for conducting sample and variant filtering using Hail is available at https://github.com/jsealock1/sequencing_qc.
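
As a rough illustration of the ancestry-stratified sample filtering summarized in Fig. 2, the sketch below flags samples whose heterozygosity ratio is far from the median of their own ancestry group, and compares that with applying a single pooled criterion. It is not taken from the authors' repository and does not use Hail: the function name `flag_outliers_by_group`, the group labels, the synthetic values and the cutoff of 4 median absolute deviations (MADs) are all assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import median


def flag_outliers_by_group(samples, metric, group_key="ancestry", n_mads=4.0):
    """Return IDs of samples whose metric lies more than n_mads MADs from the
    median of their group. With group_key=None, all samples form one group.

    Hypothetical helper for illustration; the 4-MAD default is an assumption.
    """
    by_group = defaultdict(list)
    for s in samples:
        key = s[group_key] if group_key is not None else None
        by_group[key].append(s)

    flagged = set()
    for members in by_group.values():
        values = [m[metric] for m in members]
        center = median(values)
        mad = median([abs(v - center) for v in values])
        if mad == 0:
            continue  # no spread in this group, so no cutoff can be defined
        for m in members:
            if abs(m[metric] - center) > n_mads * mad:
                flagged.add(m["id"])
    return flagged


# Synthetic data (assumed values): two groups with different baseline ratios.
ratios = {
    "grp_A": [1.98, 2.00, 2.02, 2.01, 1.99, 2.60],  # last sample is unusual
    "grp_B": [1.48, 1.50, 1.52, 1.51, 1.49, 1.50],
}
samples = [
    {"id": f"{group[-1]}{i + 1}", "ancestry": group, "het_ratio": value}
    for group, values in ratios.items()
    for i, value in enumerate(values)
]

stratified = flag_outliers_by_group(samples, "het_ratio")
pooled = flag_outliers_by_group(samples, "het_ratio", group_key=None)
print("stratified:", sorted(stratified))
print("pooled:", sorted(pooled))
```

With these synthetic values, the stratified call flags only sample `A6`, whereas the pooled call flags nothing, because the difference between the two group baselines inflates the pooled spread. Real cutoffs should be chosen from the observed distributions of the dataset at hand.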

Acknowledgements

This work is supported by the Novo Nordisk Foundation (NNF21SA0072102) with the following funding sources: R37MH107649, U01MH125047, R01MH101244.

Author information

Authors and Affiliations

Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Konrad J. Karczewski & Benjamin M. Neale
Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Konrad J. Karczewski & Benjamin M. Neale

Contributions

This tutorial was designed, developed and written by J.M.S.; F.I., C.L., S.C., C.C., K.J.K., D.P.H. and B.M.N. provided critical feedback and manuscript edits; and K.J.K., D.P.H. and B.M.N. supervised the work. All authors approved the final manuscript.

Corresponding author

Correspondence to Julia M. Sealock.

Ethics declarations

Competing interests

B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences, AlloDx and Vor Biosciences, and a member of the scientific advisory board of Nurture Genomics.

Peer review

Peer review information

Nature Protocols thanks Valerio Napolioni and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note on Sequencing Generation, Supplementary Fig. 1 describing the structure of a Hail matrix table and references for the Supplementary Note.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sealock, J.M., Ivankovic, F., Liao, C. et al. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc 20, 2372–2382 (2025). https://doi.org/10.1038/s41596-025-01169-1

Received: 28 June 2024
Accepted: 04 March 2025
Published: 28 March 2025
Version of record: 28 March 2025
Issue date: September 2025
DOI: https://doi.org/10.1038/s41596-025-01169-1