Abstract
Long-read sequencing technologies substantially overcome the limitations of short reads but have not been considered a feasible replacement for population-scale projects, because they have been some combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer’s and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with an F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance with Illumina indel calls elsewhere. Further, we can discover structural variants with an F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.
Data availability
The cell line data (HG002, HG00733 and HG02723) are openly available through the AnVIL workspace: https://anvil.terra.bio/#workspaces/anvil-datastorage/ANVIL_NIA_CARD_Coriell_Cell_Lines_Open. Human brain sequencing datasets are under controlled access and require a dbGaP application (phs001300.v4). Afterwards, the data will be available through the restricted AnVIL workspace: https://anvil.terra.bio/#workspaces/anvil-datastorage/ANVIL_NIA_CARD_LR_WGS_NABEC_GRU. Matching Illumina data used for cell line evaluations are available at: https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. HPRC assemblies are available at: https://github.com/human-pangenomics/HPP_Year1_Data_Freeze_v1.0. GIAB benchmarks are available at: https://www.nist.gov/programs-projects/genome-bottle.
Code availability
The Napu implementation in WDL is available at: https://github.com/nanoporegenomics/napu_wf. Hapdup is available as a standalone tool at: https://github.com/KolmogorovLab/hapdup. Hapdiff is available at: https://github.com/KolmogorovLab/hapdiff.
References
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
100,000 Genomes Project Pilot Investigators et al. 100,000 Genomes pilot on rare-disease diagnosis in health care—preliminary report. N. Engl. J. Med. 385, 1868–1880 (2021).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Huang, K.-L. et al. Pathogenic germline variants in 10,389 adult cancers. Cell 173, 355–370.e14 (2018).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
Zarate, S. et al. Parliament2: accurate structural variant calling at scale. Gigascience 9, giaa145 (2020).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
Lee, H. & Schatz, M. C. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics 28, 2097–2105 (2012).
Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).
Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Lin, J.-H., Chen, L.-C., Yu, S.-C. & Huang, Y.-T. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 38, 1816–1822 (2022).
Mahmoud, M., Doddapaneni, H., Timp, W. & Sedlazeck, F. J. PRINCESS: comprehensive detection of haplotype resolved SNVs, SVs, and methylation. Genome Biol. 22, 268 (2021).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Jarvis, E. D. et al. Automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
Rautiainen, M. et al. Verkko: telomere-to-telomere assembly of diploid chromosomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01662-6 (2023).
Billingsley, K. J. et al. Processing human frontal cortex brain tissue for population-scale Oxford Nanopore long-read DNA sequencing SOP v2. protocols.io https://doi.org/10.17504/protocols.io.kxygxzmmov8j/v2 (2022).
Baker, B. et al. Processing human frontal cortex brain tissue for population-scale SQK-LSK114 Oxford Nanopore long-read DNA sequencing SOP v1. protocols.io https://doi.org/10.17504/protocols.io.kxygx3zzog8j/v1 (2022).
Alvarez Jerez, P. et al. Processing frozen cells for population-scale Oxford Nanopore long-read DNA sequencing SOP v1. protocols.io https://doi.org/10.17504/protocols.io.5jyl8pnk7g2w/v1 (2022).
Gibbs, J. R. et al. Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain. PLoS Genet. 6, e1000952 (2010).
Schatz, M. C. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom. 2, 100085 (2022).
Li, H. yak: yet another k-mer analyzer. GitHub https://github.com/lh3/yak (2023).
Smolka, M. et al. Comprehensive structural variant detection: from mosaic to population-level. Preprint at bioRxiv https://doi.org/10.1101/2022.04.04.487055 (2022).
English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
Yang, J. & Chaisson, M. J. P. TT-Mars: structural variants assessment based on haplotype-resolved assemblies. Genome Biol. 23, 110 (2022).
Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).
Kirsche, M. et al. Jasmine: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).
Chowdhury, M., Pedersen, B. S., Sedlazeck, F. J., Quinlan, A. R. & Layer, R. M. Searching thousands of genomes to classify somatic and novel structural variants using STIX. Nat. Methods 19, 445–448 (2022).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440.e19 (2022).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl Acad. Sci. USA 113, E8396–E8405 (2016).
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022).
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 5519–5521 (2020).
Razaghi, R. et al. Modbamtools: analysis of single-molecule epigenetic data for long-range profiling, heterogeneity, and clustering. Preprint at bioRxiv https://doi.org/10.1101/2022.07.07.499188 (2022).
Acknowledgements
This work was supported in part by the Intramural Research Program of the National Cancer Institute (M.K.), the National Human Genome Research Institute (A.M.P.), the National Institute on Aging (B.J.T.) and the Center for Alzheimer’s and Related Dementias (C.B.), within the Intramural Research Program of the NIA and the National Institute of Neurological Disorders and Stroke (grant nos. ZIANS003154, ZIAAG000538), National Institutes of Health (grant no. AG000538). The Brain and Body Donation Program has been supported by the National Institute of Neurological Disorders and Stroke (grant no. U24 NS072026 National Brain and Tissue Resource for Parkinson’s Disease and Related Disorders), the National Institute on Aging (grant nos. P30AG19610 and P30AG072980, Arizona Alzheimer’s Disease Center), the Arizona Department of Health Services (contract 211002, Arizona Alzheimer’s Research Center), the Arizona Biomedical Research Commission (contracts 4001, 0011, 05–901 and 1001 to the Arizona Parkinson’s Disease Consortium) and the Michael J. Fox Foundation for Parkinson’s Research. B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. M. Mastoras was supported by NIH grant no. T32HG012344. K.D. was supported by the JSPS Research Fellowship for Japanese Biomedical and Behavioral Researchers at NIH. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We acknowledge the support of Oxford Nanopore Technologies staff in generating this dataset, in particular A. Markham. We acknowledge the support of the Circulomics Inc. team in generating this protocol, in particular K. Liu, J. Burke, M. Kim and D. Kilburn. We also acknowledge the Terra support team for their help with the data storage and cloud computing solutions. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). 
We thank members of the North American Brain Expression Consortium (NABEC) for providing samples derived from brain tissue. We are grateful to the Banner Sun Health Research Institute Brain and Body Donation Program of Sun City, Arizona for the provision of human biological materials.
Author information
Contributions
M.K., K.J.B., C.B. and B.P. conceptualized and designed the study. K.J.B., P.A.J., L.M., R.D., X.R., R.M.G., K.D. and M.J. were responsible for protocol optimization and sequencing. M.K., M. Mastoras, M. Meredith, J.M., M.A., K.S., T.P., J.P. and P.C. were responsible for algorithmic development. M.K., K.J.B., M. Mastoras, M. Meredith, J.M., R.L.-R., M.A., P.A.J., R.M.G., K.D., S.B., K.S., T.P., P.C., J.Y., A.R., M.J., W.T., M.C., F.J.S., C.B. and B.P. performed data analysis. M.K., K.J.B., S.W.S., B.J.T., K.H.M., M.J., W.T., A.M.P., M.C., F.J.S., C.B. and B.P. interpreted data and oversaw the study. M.K. and B.P. drafted the manuscript. All authors provided feedback and helped revise the manuscript.
Ethics declarations
Competing interests
K.S. is an employee of Google LLC and owns Alphabet stock as part of the standard compensation package; authors from Google LLC did not have access to the cell line and brain tissue sample data. W.T. has two patents (8,748,091 and 8,394,584) licensed to Oxford Nanopore Technologies. F.J.S. received research support from Illumina, Pacific Biosciences and Oxford Nanopore Technologies. S.W.S. serves on the Scientific Advisory Council of the Lewy Body Dementia Association and the Multiple System Atrophy Coalition. S.W.S. and B.J.T. receive research support from Cerevel Therapeutics. B.J.T. holds patents on the clinical testing and therapeutic implications of the C9orf72 repeat expansion. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Justin Zook and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editors: Hui Hua and Lei Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Variant calling and methylation analysis using Napu.
Raw ONT sequencing reads are basecalled by Guppy 6.1.2, which simultaneously produces methylation tags. A diploid, de novo phased assembly is produced using a combination of Shasta and Hapdup. These assemblies are used to call SVs with Hapdiff. Small variants are called against a reference genome with PEPPER-Margin-DeepVariant. The phased alignment file generated by Margin is used to produce haplotype-resolved methylation calls. Small variants and SVs are jointly phased by Margin, producing a single harmonized VCF.
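The data flow in this caption can be sketched as a small dependency graph. This is an illustration only, not the Napu WDL implementation: the stage names are hypothetical labels, and the dependencies are inferred from the caption text.

```python
from graphlib import TopologicalSorter

# Hypothetical stage labels. Each stage maps to the set of stages whose
# output it consumes, as read from the caption (an assumption, not the
# actual workflow definition).
NAPU_STAGES = {
    "basecalling_guppy": set(),
    "assembly_shasta_hapdup": {"basecalling_guppy"},
    "sv_calling_hapdiff": {"assembly_shasta_hapdup"},
    "small_variants_pepper_margin_deepvariant": {"basecalling_guppy"},
    "haplotype_methylation": {
        "basecalling_guppy",
        "small_variants_pepper_margin_deepvariant",
    },
    "joint_phasing_margin": {
        "small_variants_pepper_margin_deepvariant",
        "sv_calling_hapdiff",
    },
}


def execution_order(stages):
    """Return one valid order in which the stages could be run."""
    return list(TopologicalSorter(stages).static_order())


if __name__ == "__main__":
    for stage in execution_order(NAPU_STAGES):
        print(stage)
```

The ordering makes explicit that joint phasing is the last step, since it needs both the small variant calls and the assembly-based SV calls.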
Extended Data Fig. 2 Assemblies of 14 brain tissues and 3 cell lines generated by Shasta+Hapdup.
(A) NG50 and NGA50 contiguity measured using QUAST. Sample 06_66 had the lowest contiguity due to its decreased sequencing yield. (B) Assembly length. (C) Mean assembly QV computed using yak. (D) Contiguity of phased blocks, broken at phase switches. An increased value for HG02723 suggests an increased heterozygosity rate. Cell lines are marked with asterisks.
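For readers unfamiliar with the contiguity metric in panel A, the following is a minimal sketch of the standard NG50 definition: the contig length at which contigs of that length or longer cover at least half of the estimated genome size. The function name and inputs are illustrative and do not reproduce QUAST's code. NGA50 applies the same statistic to aligned blocks after contigs are broken at misassemblies.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for a list of contig lengths and a genome size estimate.

    Returns 0 if the contigs together cover less than half the genome.
    """
    half = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


if __name__ == "__main__":
    print(ng50([50, 40, 30, 20, 10], genome_size=200))
```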
Extended Data Fig. 3 Assembly metrics comparison against HG002 assemblies produced in Jarvis et al. (2022).
Our assemblies are highlighted in green. Flye (ONT+trio) were produced using standard ONT reads at 60x coverage and Illumina parental information; Flye (ONT UL + trio) is similar, but using ultra-long ONT extraction. HiCanu and hifiasm used 34x HiFi reads and Illumina parental sequencing. DipAsm used 34x HiFi reads and 60x Hi-C reads. Original evaluations from Jarvis et al. are shown. See Supplementary Table 5 for more detail.
Extended Data Fig. 4 TT-Mars evaluation of Hapdup and Sniffles2 calls.
SV calls from Hapdup and Sniffles2 were compared to the assemblies from the HPRC for HG002 (top), HG00733 (middle), and HG02723 (bottom) with TT-Mars. The calls were either validated by the alignment (green), not validated (orange), or could not be annotated by TT-Mars (blue). We evaluated all SVs across the genome (left), as well as the subset of SVs that do not overlap centromeres or segmental duplications larger than 10 Kbp (right).
Extended Data Fig. 5 Flagger results based on HiFi alignments to cell line CARD and HPRC-Y1 assemblies.
The y-axis of each panel indicates the unreliability percentage, which is the total number of bases flagged as misassembled divided by the total assembly length and multiplied by one hundred.
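The y-axis quantity is simple arithmetic, sketched below. The function and argument names are illustrative and are not part of Flagger.

```python
def unreliability_percent(flagged_bases, assembly_length):
    """Percentage of assembly bases flagged as misassembled."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return 100.0 * flagged_bases / assembly_length


if __name__ == "__main__":
    # Hypothetical numbers: 3 Mbp flagged in a 6 Gbp diploid assembly.
    print(unreliability_percent(3_000_000, 6_000_000_000))
```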
Extended Data Fig. 6 Flagger results based on ONT alignments to cell line CARD and HPRC-Y1 assemblies.
The y-axis of each panel indicates the unreliability percentage, which is the total number of bases flagged as misassembled divided by the total assembly length and multiplied by one hundred.
Extended Data Fig. 7 Lenient SV catalog.
Similar to Fig. 5a, but including SVs close to centromeres, telomeres, or within segmental duplications. Number of SVs across samples. In the left panel, SVs were annotated with three SV catalogs (the gnomAD-SV database, a long-read-based SV catalog, and the HPRC v1.0 SV catalog). SVs are matched if they have at least 10% genomic overlap. The colors highlight the maximum frequency across these catalogs, the lighter blue showing ‘rare’ SVs (with an allele frequency below 1%) in the catalogs, or unmatched. SVs may be unmatched, either because they are novel or due to the difficulties in the database comparison. The right panel shows the number of rare SVs in protein-coding genes, grouped by their impact on the gene structure.
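The 10% overlap criterion could be sketched as follows. The caption does not say how the overlap fraction is normalized, so this example assumes it is measured against the longer of the two intervals (a reciprocal-style criterion). It also assumes half-open reference coordinates on the same chromosome. Insertions, which have no reference span, would need separate handling that is not shown.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap length divided by the length of the longer interval.

    Assumption: normalizing by the longer interval; the source does not
    specify the denominator.
    """
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    longest = max(a_end - a_start, b_end - b_start)
    return overlap / longest if longest > 0 else 0.0


def svs_match(sv_a, sv_b, min_overlap=0.10):
    """True if two (start, end) intervals overlap by at least min_overlap."""
    return overlap_fraction(*sv_a, *sv_b) >= min_overlap


if __name__ == "__main__":
    print(svs_match((100, 200), (190, 290)))
    print(svs_match((100, 200), (195, 295)))
```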
Extended Data Fig. 8 IGV view of a 4.2 Kbp heterozygous deletion of a transcription start site and exon of RBFOX1.
The coverage histogram (dark grey) shows the drop in read coverage. The alignments of about half of the reads, labelled by strand (red/blue), support the deletion. The GENCODE track, ENCODE candidate cis-regulatory elements, and conservation tracks are shown at the bottom.
Extended Data Fig. 10 F1-score for SV inside clusters of different sizes.
The HiFi calls for the HG002 genome were used as reference, and calls within 2 kbp were clustered using single-linkage clustering. The number of true positive calls in each category is shown as text. When VNTR grouping is enabled, all insertions and deletions within the same haplotype in a single VNTR are combined into a single call. A substantial portion of the reduced Sniffles2 concordance is explained by the differences in representation of SV clusters by the assembly-based and mapping-based approaches.
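On a single chromosome, single-linkage clustering with a fixed distance threshold reduces to chaining sorted positions whose consecutive gaps stay within the threshold. The sketch below illustrates that idea. It assumes each call is represented by one coordinate and that "within 2 kbp" means a gap of at most 2,000 bp; neither detail is stated in the caption.

```python
def cluster_calls(positions, max_gap=2000):
    """Group positions on one chromosome by single linkage.

    Two calls share a cluster if they are connected by a chain of calls
    in which each consecutive pair is at most max_gap apart.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # extends the current chain
        else:
            clusters.append([pos])  # gap too large: start a new cluster
    return clusters


if __name__ == "__main__":
    for cluster in cluster_calls([100, 1500, 3400, 10000, 20000, 21000]):
        print(len(cluster), cluster)
```

Cluster size (the number of calls per group) is then the quantity by which the F1-scores in the figure are stratified.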
Supplementary information
Supplementary Information
Supplementary methods.
Supplementary Tables
Supplementary Tables 1–22.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kolmogorov, M., Billingsley, K.J., Mastoras, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat Methods 20, 1483–1492 (2023). https://doi.org/10.1038/s41592-023-01993-x
DOI: https://doi.org/10.1038/s41592-023-01993-x