Phased diploid genome assembly with single-molecule real-time sequencing

Chin, Chen-Shan; Peluso, Paul; Sedlazeck, Fritz J; Nattestad, Maria; Concepcion, Gregory T; Clum, Alicia; Dunn, Christopher; O'Malley, Ronan; Figueroa-Balderas, Rosa; Morales-Cruz, Abraham; Cramer, Grant R; Delledonne, Massimo; Luo, Chongyuan; Ecker, Joseph R; Cantu, Dario; Rank, David R; Schatz, Michael C

doi:10.1038/nmeth.4035

Published: 17 October 2016

Nature Methods volume 13, pages 1050–1054 (2016)

Abstract

While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.

**Figure 1: Overview of FALCON and FALCON-Unzip.**

**Figure 2: SNP density and structural variation in the FALCON-Unzip F1 *Arabidopsis* assembly.**

Accession codes

Accessions

Sequence Read Archive

Acknowledgements

The sequencing of the Cabernet Sauvignon genome was supported in part by a gift from the J. Lohr Vineyards and Wines to D.C. We would also like to thank F. Neto for providing an early-release BUSCO plant data set. Clavicorona pyxidata DNA was provided by L. Nagy (Institute of Biochemistry Biological Research Centre of the Hungarian Academy of Sciences). We thank J. Puglisi, F. Jupe, A. Copeland, and A. Wenger for reading and critiquing the manuscript. The project was supported in part by National Institutes of Health award (R01-HG006677 to M.C.S.) and by National Science Foundation awards (DBI-1350041 and IOS-1237880 to M.C.S.; MCB 0929402; and MCB 1122246 to J.R.E.). J.R.E. is an investigator at the Howard Hughes Medical Institute and Gordon and Betty Moore Foundation (GBMF 3034).

Author information

Chen-Shan Chin and Paul Peluso: These authors contributed equally to this work.

Authors and Affiliations

Pacific Biosciences, Menlo Park, California, USA
Chen-Shan Chin, Paul Peluso, Gregory T Concepcion, Christopher Dunn & David R Rank
Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
Fritz J Sedlazeck & Michael C Schatz
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
Maria Nattestad & Michael C Schatz
DOE Joint Genome Institute, Walnut Creek, California, USA
Alicia Clum
Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, California, USA
Ronan O'Malley, Chongyuan Luo & Joseph R Ecker
Department of Viticulture and Enology, University of California Davis, Davis, California, USA
Rosa Figueroa-Balderas, Abraham Morales-Cruz & Dario Cantu
Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada, USA
Grant R Cramer
Dipartimento di Biotecnologie, Università degli Studi di Verona, Verona, Italy
Massimo Delledonne
Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
Michael C Schatz

Contributions

C.-S.C., P.P., A.C., D.R.R., and M.C.S. conceived the idea of the FALCON–FALCON-Unzip assembler. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.R.R., D.C., and M.C.S. designed the experiments and performed the analysis. P.P., D.C., D.R.R., and M.C.S. collected the sequencing data. R.O'M., C.L., and J.R.E. constructed the Col-0-Cvi-1. A.C., R.O'M., R.F.-B., A.M.-C., G.R.C., M.D., C.L., J.R.E., and D.C. collected the samples and prepared DNA for sequencing. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.C., D.R.R., and M.C.S. wrote the manuscript. C.-S.C. and C.D. implemented the computer code.

Corresponding authors

Correspondence to Chen-Shan Chin or Michael C Schatz.

Ethics declarations

Competing interests

C.-S.C., P.P., G.T.C., C.D., and D.R.R. are employees and shareholders of Pacific Biosciences, a company commercializing DNA sequencing technology.

Integrated supplementary information

Supplementary Figure 1 Schematics of the software and data process modules and the FALCON-Unzip assembly graph process for resolving haplotypes.

(a) Data dependence flow and software modules inside FALCON and FALCON-Unzip

(b) Left: Initial assembly graph of a contig in the Arabidopsis F1 hybrid assembly. The different colors represent different haplotype blocks and phases. Right: The assembly graph after “unzipping”. Conceptually, the unzipping step identifies the heterozygous SNPs and uses them to remove overlaps between reads from different haplotypes. After such overlaps are removed, nodes from different haplotypes in the assembly graph no longer have edges between them. This allows FALCON-Unzip to identify long haplotype-specific paths and construct haplotigs from them. The dashed circle indicates haplotype blocks that can be extended through a bubble region.
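The unzipping idea described in this caption can be illustrated with a minimal sketch. It is not the FALCON-Unzip implementation: the function names, the `(block, phase)` tag format, and the toy read identifiers are all assumptions made for illustration. Overlaps between reads tagged with the same haplotype block but different phases are dropped, and the remaining graph falls apart into haplotype-specific components.

```python
from collections import defaultdict

def remove_cross_haplotype_overlaps(overlaps, read_phase):
    """Keep an overlap unless both reads are phased in the same block
    with different phases. Unphased reads (missing from read_phase)
    keep all of their overlaps. Hypothetical helper, for illustration."""
    kept = []
    for a, b in overlaps:
        pa, pb = read_phase.get(a), read_phase.get(b)
        if pa is not None and pb is not None:
            same_block = pa[0] == pb[0]
            different_phase = pa[1] != pb[1]
            if same_block and different_phase:
                continue  # overlap joins two haplotypes: remove it
        kept.append((a, b))
    return kept

def connected_components(nodes, edges):
    """Connected components of an undirected graph, in node order."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(adjacency[current] - seen)
        components.append(component)
    return components

# Toy data: r1-r3 carry phase 0 and r4-r5 carry phase 1 of block 0.
reads = ["r1", "r2", "r3", "r4", "r5"]
read_phase = {"r1": (0, 0), "r2": (0, 0), "r3": (0, 0),
              "r4": (0, 1), "r5": (0, 1)}
overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r4"),
            ("r4", "r5"), ("r2", "r4")]

kept = remove_cross_haplotype_overlaps(overlaps, read_phase)
components = connected_components(reads, kept)
```

In this toy example the two cross-haplotype overlaps are removed, leaving one component per haplotype; a real assembler would go on to find paths through each component rather than stop at connectivity.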

Supplementary Figure 2 Reverse accumulative read length distribution of the three diploid genome datasets

Supplementary Figure 3 SOAPdenovo assembly sizes and N50 and NG50 sizes of the three genomes for different values of k, using the raw reads and reads corrected by Lighter.
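For readers less familiar with the contiguity metrics named in this caption, the following sketch computes them under their standard definitions: N50 is the contig length at which the longest contigs, added in descending order, first cover half of the total assembly size, and NG50 uses an estimated genome size in place of the assembly size. The function name and the toy contig lengths are illustrative assumptions, not values from the paper.

```python
def nx50(contig_lengths, target_size=None):
    """Return N50 when target_size is None, otherwise NG50 relative to
    target_size. Returns None if the contigs never cover half of the
    target (possible for NG50 when the assembly is too small)."""
    total = sum(contig_lengths) if target_size is None else target_size
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None

# Toy contig lengths (arbitrary units), total 300.
contigs = [80, 70, 50, 40, 30, 20, 10]

n50 = nx50(contigs)                         # half of 300 reached at the 70 contig
ng50 = nx50(contigs, target_size=400)       # half of 400 reached at the 50 contig
ng50_large = nx50(contigs, target_size=1000)  # half of 1000 is never reached
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total length to be compared on the same footing.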

Supplementary Figure 4 Assemblytic analysis comparison of the Arabidopsis F1 assemblies from FALCON-Unzip, Platanus, and SOAPdenovo.

(a) Cumulative sequence length of three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo compared to the TAIR10 reference. (b) Variants called using Assemblytics from three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo.

Supplementary Figure 5 Variation comparison between the inbred line assemblies and the F1 hybrid for all Arabidopsis chromosomes, along with the TAIR10 reference.

Supplementary Figure 6 Homopolymer length and frequency in the TAIR10 Assembly.

Supplementary Figure 7 Assembly comparison: FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus V. vinifera reference genome

(a) MUMmerplot of the FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus the V. vinifera reference genome. For clarity, only alignments >= 10,000 bp long to the primary chromosomes are displayed. (b) Synteny between PN40024 Chr1, from the 5′ telomere to the centromere (green line), and the longest contig 000000F (black line) with its associated haplotigs (blue lines). The vertical green and blue lines indicate homologous coding sequences between the sequences. The cyan lines at the bottom indicate synteny between the primary contig and other primary contigs. (c) Synteny alignment between two primary contigs, 000334F and 000000F. (d) Synteny alignment between two primary contigs, 000057F and 000075F.

Supplementary Figure 8 Comparison of the distribution of het-SNP site density in the three genomes

(a) The distribution of the number of het-SNPs observed in the reads used for phasing the longest contig of each genome, shown as a semi-log plot. (b) Fits of the distributions to an exponential function (density ~ c * exp(-a * het-SNP count)). We use het-SNP count ranges of 10 to 200 for Arabidopsis, 50 to 200 for Vitis, and 10 to 100 for Clavicorona to capture the exponentially decaying part. The fitted parameters are a = 0.0222, 0.0216, and 0.0412 for Arabidopsis, Vitis, and Clavicorona, respectively. The fastest decay rate, for Clavicorona, indicates that it has the least variation between haplotypes among the three genomes. From this fit, we expect about 45 (Arabidopsis), 46 (Vitis), and 24 (Clavicorona) het-SNPs per 10 kb in the regions of interest.
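The expected counts quoted in this caption appear consistent with the mean of an exponential distribution, 1/a (for example, 1/0.0222 is approximately 45). A minimal sketch of such a fit follows, assuming an ordinary least-squares line through log(density) against het-SNP count; the fitting procedure actually used by the authors is not stated here, and the function name, the prefactor 0.05, and the synthetic data are illustrative assumptions.

```python
import math

def fit_exponential_decay(counts, densities):
    """Fit density ~ c * exp(-a * count) by least squares on
    log(density) = log(c) - a * count. Returns (a, c)."""
    xs = list(counts)
    ys = [math.log(d) for d in densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = -slope                                 # decay rate (positive for decay)
    c = math.exp(mean_y - slope * mean_x)      # prefactor from the intercept
    return a, c

# Synthetic, noise-free data over the range the caption gives for Arabidopsis.
counts = range(10, 201)
densities = [0.05 * math.exp(-0.0222 * k) for k in counts]

a, c = fit_exponential_decay(counts, densities)
expected_het_snps = 1.0 / a   # mean of the fitted exponential distribution
```

Restricting the fit to the decaying tail, as the caption describes, keeps the low-count region (which does not follow the exponential) from biasing the slope.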

Supplementary Figure 9 Example of a low heterozygosity region observed in Clavicorona genome.

The het-SNPs are called with FreeBayes on the alignments of the short read data to only the primary contigs. The contig 00003F has a low heterozygosity region from ~1.2Mb to ~2.7Mb.

Supplementary Figure 10 General schematic about how different levels of heterozygosity can affect the contig layout.

Supplementary Figure 11 Candidates for differentially expressed alleles from RNA-seq data.

(a, b) We mapped both genomic reads (middle panel) and cDNA reads (lower panel) to the primary contigs from our Clavicorona pyxidata assembly. We also show curated CDS sequences mapped to the contig (top panel). The genomic reads show both alleles, whereas only one major allele is observed in the transcript reads.

Supplementary Figure 12 An example of how the FALCON-sense algorithm generates consensus sequence.

Supplementary Figure 13 (a) Summary of the graph reduction from sequence overlaps to contigs. (b) Example on constructing haplotigs in the Clavicorona pyxidata assembly.

Supplementary Figure 14 Summary of the graph reduction from sequence overlaps to contigs.

Supplementary Figure 15 Summary of the greedy SNP phasing algorithm.

(a) All pairs of het-SNPs that are covered by multiple reads are evaluated. A “coupling score” is calculated from the number of reads that support the current haplotype assignment of the het-SNPs. (b)(c) We linearly scan through the het-SNP positions. If the total score is improved by flipping the haplotype assigned at one location, then we flip the assignment. (d) An example showing the “coupling score” before the flipping process (unphased het-SNP assignment) and afterward (phased het-SNP assignment).
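The greedy procedure outlined in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the caption does not give the exact scoring rule, so the sketch assumes each read contributes +1 for every pair of het-SNPs it places on the same haplotype under the current assignment and -1 otherwise. The read representation (a dict from site index to observed allele, 0 or 1) and all function names are hypothetical.

```python
from itertools import combinations

def coupling_score(reads, phase):
    """Sum over reads and het-SNP pairs: +1 if the read's alleles at both
    sites map to the same haplotype under `phase`, else -1 (assumed rule)."""
    score = 0
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            hap_i = read[i] ^ phase[i]
            hap_j = read[j] ^ phase[j]
            score += 1 if hap_i == hap_j else -1
    return score

def greedy_phase(reads, n_sites, max_sweeps=10):
    """Scan sites left to right, flipping a site's assignment whenever the
    flip raises the total score; repeat until a sweep makes no change."""
    phase = [0] * n_sites
    best = coupling_score(reads, phase)
    for _ in range(max_sweeps):
        improved = False
        for pos in range(n_sites):
            phase[pos] ^= 1                 # tentatively flip this site
            score = coupling_score(reads, phase)
            if score > best:
                best, improved = score, True
            else:
                phase[pos] ^= 1             # no gain: undo the flip
        if not improved:
            break
    return phase, best

# Toy reads drawn from haplotype [0, 1, 1, 0] and its complement.
reads = [
    {0: 0, 1: 1, 2: 1},   # haplotype A, sites 0-2
    {1: 0, 2: 0, 3: 1},   # haplotype B, sites 1-3
    {0: 1, 1: 0},         # haplotype B, sites 0-1
    {2: 1, 3: 0},         # haplotype A, sites 2-3
]

initial_score = coupling_score(reads, [0, 0, 0, 0])
phase, final_score = greedy_phase(reads, n_sites=4)
```

Recomputing the full score after every tentative flip keeps the sketch short; a practical implementation would update only the pairs that involve the flipped site. Like any greedy search, this can stop at a local optimum on noisy data.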

About this article

Cite this article

Chin, CS., Peluso, P., Sedlazeck, F. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13, 1050–1054 (2016). https://doi.org/10.1038/nmeth.4035

Received: 06 June 2016
Accepted: 25 August 2016
Published: 17 October 2016
Issue Date: December 2016
DOI: https://doi.org/10.1038/nmeth.4035
