Abstract
Genomes usually contain some non-repetitive sequences that are missing from the reference genome and occur only in a population subset. Such non-repetitive, non-reference (NRNR) sequences have remained largely unexplored in terms of their characterization and downstream analyses. Here we describe 3,791 breakpoint-resolved NRNR sequence variants called using PopIns from whole-genome sequence data of 15,219 Icelanders. We found that over 95% of the 244 NRNR sequences that are 200 bp or longer are present in chimpanzees, indicating that they are ancestral. Furthermore, 149 variant loci are in linkage disequilibrium (r² > 0.8) with a genome-wide association study (GWAS) catalog marker, suggesting disease relevance. Additionally, we report an association (P = 3.8 × 10⁻⁸, odds ratio (OR) = 0.92) with myocardial infarction (23,360 cases, 300,771 controls) for a 766-bp NRNR sequence variant. Our results underline the importance of including variation of all complexity levels when searching for variants that associate with disease.
Author information
Contributions
B.K., P.M., B.V.H. and K.S. designed the experiments. B.K., P.M., A.G., S.K., D.F.G. and B.V.H. implemented the methodology and analyzed the call set. B.K., A. Helgadottir, H. Holm, P.S., D.F.G. and B.V.H. interpreted the association results. B.K., H. Helgason and G.H.H. analyzed gene expression. Aslaug Jonasdottir, Adalbjorg Jonasdottir, and A.S. performed PCR verification and Sanger sequencing. U.T. oversaw the operations of the genotyping facilities. G.T., I.O., H. Holm and U.T. were responsible for phenotype data acquisition. B.K. prepared tables and figures. B.K., H.J., A. Helgason and B.V.H. wrote the manuscript. All authors reviewed and approved the final manuscript. K.S. supervised the study.
Ethics declarations
Competing interests
B.K., A. Helgadottir, P.M., H.J., H. Helgason, Adalbjorg Jonasdottir, Aslaug Jonasdottir, A.S., A.G., G.H.H., S.K., H. Holm, P.S., U.T., A. Helgason, D.F.G., B.V.H. and K.S. are all employees of deCODE Genetics/Amgen, Inc.
Integrated supplementary information
Supplementary Figure 1 Primer pairs designed for the five categories of NRNR sequence variants for validation by Sanger sequencing.
For the categories “INS > 200” and “DEL > INS”, two or three primer pairs were designed, including at least one for each allele. For “INS < 200”, only a single primer pair was designed, which may amplify both alleles. For “Different contig”, three primer pairs were always designed; for “Singleton”, two.
Supplementary information
Supplementary Text and Figures
Supplementary Figure 1, Supplementary Tables 2, 7 and 8, and Supplementary Note (PDF 1950 kb)
Supplementary Data 1: NRNR sequences anchored by imputed NRNR markers.
Sequences are given in FASTA format. (TXT 2129 kb)
Supplementary Data 2: NRNR sequences anchored by fixed NRNR markers.
Fixed markers are those predicted to have 100% frequency in Iceland. Sequences are given in FASTA format. (TXT 288 kb)
Supplementary Table 1
List of imputed NRNR markers. (XLSX 623 kb)
Supplementary Table 3
Details of Sanger sequencing validation experiments. (XLSX 65 kb)
Supplementary Table 4
List of fixed NRNR markers. (XLSX 48 kb)
Supplementary Table 5
Overlap of NRNR markers with known variants and sequences. (XLSX 358 kb)
Supplementary Table 6
Correlation with the GWAS catalog. (XLSX 84 kb)
Supplementary Table 9
Conversion to GenBank sequences. (XLSX 163 kb)
About this article
Cite this article
Kehr, B., Helgadottir, A., Melsted, P. et al. Diversity in non-repetitive human sequences not found in the reference genome. Nat Genet 49, 588–593 (2017). https://doi.org/10.1038/ng.3801
DOI: https://doi.org/10.1038/ng.3801