Abstract
Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.
Acknowledgements
We thank M. Eberle and colleagues at Illumina for early access to the Moleculo data. The study was funded by grants from GlaxoSmithKline and grant 100956/Z/13/Z from the Wellcome Trust to G.M., a Nuffield Department of Medicine Fellowship to Z.I. and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to Z.I. (102541/Z/13/Z).
Author information
Contributions
G.M. designed the experiment. A.D. and C.C. performed analyses. Z.I., M.R.N. and G.M. supervised the research. A.D. and G.M. wrote the manuscript with the assistance of co-authors.
Ethics declarations
Competing interests
C.C. and M.R.N. are employed by GlaxoSmithKline (GSK) and may own GSK stock. GSK does not sell or market any software or services related to genetic analysis or the generation of genetic data. G.M. is a founder and shareholder of Genomics, Ltd. G.M. and A.D. are partners in Peptide Groove, LLP.
Integrated supplementary information
Supplementary Figure 1 Relationship between the nucleotide and k-mer PRGs.
The nucleotide PRG is a directed, acyclic graph constructed from a multiple-sequence alignment and reflects the variation within the aligned sequences. A k-mer PRG is constructed from the nucleotide PRG by enumerating the possible paths of length k and the relationships between them. A multi-PRG is generated by collapsing each non-branching stretch of levels in the k-mer PRG into a single level, with edges labeled with multiple k-mers.
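The step from a nucleotide graph to its k-mers can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: the function names `build_nucleotide_prg` and `enumerate_kmers` are hypothetical, the graph is reduced to one node per alignment level (so every combination of adjacent alleles is a valid path), and gap characters are not handled.

```python
from collections import defaultdict


def build_nucleotide_prg(alignment):
    """Build a toy level-structured DAG from equal-length aligned sequences.

    Assumption for illustration: each alignment column is one level, and the
    level stores one outgoing edge label per distinct character seen there.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    labels = defaultdict(set)  # level -> edge labels leading to level + 1
    for seq in alignment:
        for level, base in enumerate(seq):
            labels[level].add(base)
    return [sorted(labels[level]) for level in range(length)]


def enumerate_kmers(prg, k):
    """Map each start level to the sorted k-mers spelled by paths of k edges."""
    kmers_by_level = {}
    for start in range(len(prg) - k + 1):
        paths = [""]
        for level in range(start, start + k):
            # Extend every partial path by every edge label at this level.
            paths = [path + base for path in paths for base in prg[level]]
        kmers_by_level[start] = sorted(paths)
    return kmers_by_level


# Two aligned haplotypes that differ at a single position.
toy_prg = build_nucleotide_prg(["ACGT", "ATGT"])
toy_kmers = enumerate_kmers(toy_prg, 3)
```

Here `toy_kmers` holds two 3-mers per start level, one for each allele at the variant site. Levels at which only a single k-mer exists are the non-branching stretches that the multi-PRG would merge.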
Supplementary Figure 2 NA12878 k-mer recovery within classical HLA loci for four approaches.
Each panel shows the fraction of k-mers recovered at single-nucleotide resolution from chromotypes inferred by the four methods using the high-coverage data from NA12878. The average over the locus is also shown.
Supplementary Figure 3 Example of a non-MHC region with low k-mer recovery from mapping-based analysis.
k-mer recovery on chromosome 8 in a region containing multiple members of the ubiquitin-specific peptidase 17–like gene family. In several 10-kb intervals, <90% of the k-mers predicted to exist from the Platypus VCF are recovered from high-coverage sequencing data on NA12878.
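As a rough illustration of the k-mer recovery statistic used in these figures, the sketch below computes the fraction of k-mers predicted by an inferred sequence that also occur in a set of reads. The helper names `kmers` and `kmer_recovery` are hypothetical, and the sketch assumes exact matching on one strand with no coverage threshold, which is simpler than any real analysis would be.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_recovery(inferred, reads, k):
    """Fraction of k-mers predicted by `inferred` that occur in any read."""
    observed = {kmer for read in reads for kmer in kmers(read, k)}
    predicted = kmers(inferred, k)
    if not predicted:
        return float("nan")  # sequence shorter than k: recovery undefined
    return sum(kmer in observed for kmer in predicted) / len(predicted)


full = kmer_recovery("ACGTAC", ["ACGT", "GTAC"], 3)
partial = kmer_recovery("ACGTAC", ["ACGT"], 3)
```

In this toy case `full` is 1.0 because the two reads together cover every predicted 3-mer, while `partial` is 0.5 because the single read supports only two of the four. Applying the same function to consecutive windows of an inferred sequence would give a per-interval profile of the kind described in the caption above.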
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–3, Supplementary Tables 1–9 and Supplementary Note (PDF 3181 kb)
Supplementary Data Set: Discrepancies between SNP array and PRG genotypes in NA12878.
Compressed (zip) file with screenshots showing read mapping at the 55 sites where the Viterbi-inferred genotype from the PRG disagrees with the SNP array genotype and where the PRG specifies a gap character. A manual evaluation of these sites is also provided as an Excel file. (ZIP 792 kb)
About this article
Cite this article
Dilthey, A., Cox, C., Iqbal, Z. et al. Improved genome inference in the MHC using a population reference graph. Nat Genet 47, 682–688 (2015). https://doi.org/10.1038/ng.3257
DOI: https://doi.org/10.1038/ng.3257