这是indexloc提供的服务,不要输入任何密码
Skip to main content

Evolutionary dynamics of predicted G-quadruplexes in human and other great apes

Abstract

Background

G-quadruplexes (G4s) are non-canonical DNA structures that can form at approximately 1% of the human genome. They facilitate genomic instability by increasing point mutations and structural variation. Numerous G4s participate in telomere maintenance and regulating transcription and replication, and evolve under purifying selection. Despite these important functions, G4s have remained under-studied in human and ape genomes due to incomplete assemblies.

Results

Here, we conduct a comprehensive analysis of predicted G4s (pG4s) in the recently released, telomere-to-telomere (T2T) genomes of human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan. We annotate 41,232–174,442 new pG4s in these T2T compared to previous ape genome assemblies (5%–21% increase). Analyzing inter-species whole-genome alignments, we identify pG4s shared across apes (approximately one-third of all pG4s) and thousands of species-specific pG4s. pG4s accumulate and diverge at rates consistent with divergence times between species, following molecular clock. pG4s shared across apes are enriched and hypomethylated at regulatory regions—enhancers, promoters, UTRs, and origins of replication—suggesting their conserved formation and functions. Species-specific pG4s (constituting 11–27% of all pG4s) are located in regulatory regions, potentially contributing to adaptations, and in repeats, likely driving genome expansions.

Conclusions

Our findings illuminate the evolutionary dynamics of G4s, conservation of their role in gene regulation, and their contributions to ape genome evolution. Our study highlights the utility of high-resolution T2T genomes in revealing elusive yet likely functionally relevant genomic features previously hidden by incomplete assemblies.

Peer Review reports

Background

Most genomic DNA exists in B form, i.e., a right-handed double helix with 10–10.4 base pairs per turn [1,2,3]. However, certain regions of the genome can form non-B DNA with different structures [4], which are transient and have lower stability than B DNA. These alternative DNA structures—such as Z-DNA [5], cruciforms [6], H-DNA [7], bent DNA [8], and G-quadruplexes (G4s) [9]—are involved in various biological processes (reviewed in [10]). Among non-B DNA, G4s have gained considerable attention due to their critical role in gene regulation [11,12,13]. Moreover, G4s have emerged as promising therapeutic targets for drug development [14, 15]. Both DNA and RNA [16] can form G4 structures, which may be either intra- or intermolecular. These structures arise due to the formation of Hoogsteen hydrogen bonds [17] among nonconsecutive guanine bases, which assemble into planar G-tetrads. Multiple G-tetrads are stabilized by monovalent cations such as potassium. A G4 structure comprises stems and loops (Fig. 1A): stems consist of runs of three (sometimes two) to five consecutive guanines, whereas loops, which can range from 1 to 12 nucleotides, can include other nucleotides [18]. In the cell, G4s with bulges in their stems (i.e. G-runs interrupted by non-G nucleotides) can also form [19, 20].

G4s affect genome stability and regulate gene expression [21, 22]. G4s are abundant at telomeres and participate in telomere maintenance [23,24,25,26]. G4s have been suggested to be important regulatory elements in DNA replication, albeit in two contrasting ways: (a) G4s are important for replication initiation, and their deletion can impede this process [27, 28]; (b) G4s can act as barriers to replication progression, leading to replication fork stalling [29] and contributing to genomic instability [30, 31]. G4s have also been suggested to contribute to genome instability as mutation hotspots during evolution [22, 32] and in cancer [33, 34]. Similarly, G4s likely play a dual role in transcription regulation: (a) G4s can serve as binding hubs for transcription factors [35] and enhance transcription levels [36]; (b) when present on the template strand, they can stall RNA polymerase, thereby blocking transcription [37, 38]. Consistent with their crucial regulatory functions, G4 structures have been shown to evolve under purifying selection at multiple genomic regions [39].

Dysregulation of G4s has been associated with diseases. These structures have been identified in the promoter regions of several genes linked to cancer, including MYC [40], KRAS [41], BCL2 [42], and KIT [43]. Relatedly, G4s have been used as therapeutic targets for melanoma, leukemia, and pancreatic cancer [14]. Moreover, G4s have been associated with multiple neurodegenerative diseases (reviewed in [44]) such as X-linked dystonia-parkinsonism [45], Alzheimer’s disease [46,47,48], Parkinson's disease [49], fragile X syndrome [50, 51], Amyotrophic Lateral Sclerosis [52], and Progressive Myoclonus Epilepsy Type I [53]. Thus, studies of G4s are clinically relevant.

Despite the significance of G4s for mutagenesis and cellular functions, as well as their clinical relevance, their evolution remains understudied. To fill this gap, we conducted a comprehensive analysis of predicted G4s (pG4s) in the recently-released complete, telomere-to-telomere (T2T) genomes of six great apes [54,55,56,57]—human (Homo sapiens), chimpanzee (Pan troglodytes, diverged from the human lineage ~ 7 million years ago, MYA), bonobo (Pan paniscus, diverged from the chimpanzee lineage ~ 2.5 MYA), gorilla (Gorilla gorilla, diverged from the human lineage ~ 9 million years ago), Sumatran orangutan (Pongo abelii, diverged from the human lineage ~ 17 MYA), and Bornean orangutan (Pongo pygmaeus, diverged from the Sumatran orangutan lineage ~ 1 MYA). The advent of high-quality T2T reference genomes has enabled a detailed analysis of complex and previously inaccessible genomic regions, including telomeres and centromeres. These T2T genomes, assembled with long-read sequencing technologies having low error rates at pG4s [58], can provide accurate sequence information at G4 loci [55, 59]. Importantly, the use of T2T genomes in great ape research can facilitate the identification of previously unmapped G4s, providing a more complete understanding of how G4s have evolved across species.

In this study, we established a complete database of pG4s, delving into the newly resolved regions of the T2T assemblies and providing insights into evolutionary patterns of pG4 occurrence, conservation, and divergence among great apes. We identified shared and species-specific pG4s, offering a comparative view of their evolutionary trajectories. Additionally, we examined G4 enrichment across different genomic contexts. Taking advantage of the availability of methylation data, we were able to predict G4 formation in two cell lines. Our findings pave the way for future studies of the evolutionary impact of G4s, their structural dynamics, and their role in ape evolution.

Results

Primate T2T assemblies provide a complete catalog of predicted G4 structures, including tens of thousands of new occurrences

We predicted G4 motifs (pG4s) in the T2T genome assemblies of human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan [57]. We initially evaluated several computational methods proposed to predict G4 motifs (reviewed in [60]) and used a combination of methods that captured both canonical and non-canonical G4 motifs while minimizing false positives and false negatives. In particular, our pG4 discovery pipeline [61] (https://github.com/makovalab-psu/g4Discovery) (Fig. 1B) utilized both the pqsfinder [62] and the G4Hunter [63] prediction algorithms along with several filtering steps (see Methods for details).

Fig. 1
figure 1

The structural elements of a G4, and the workflow of the pG4 discovery pipeline. A A schematic of a non-canonical G4 structure highlighting bulges, stems, and loops. The three planar structures constitute the three tetrads of this G4. B G4s were predicted in each great ape T2T genome using pqsfinder (score threshold ≥ 30). The resulting G4 predictions had to (1) satisfy either standard or bulged G4 motif regular expression, (2) have a minimum score of 40 and 1.5 for pqsfinder and G4Hunter, respectively, and (3) be non-overlapping and have the highest pqsfinder score in comparison with the other pG4 s in that region

We selected the pqsfinder and G4Hunter algorithms because they (a) had the highest accuracy among multiple algorithms tested with an in vitro validation dataset [60]; and (b) allowed us to leverage different sequence information. Indeed, to compute a score for a pG4, pqsfinder takes into account the number of tetrads, bulges/mismatches, and loop lengths [62], whereas G4Hunter considers G-richness and G-skewness [63]. To determine the thresholds for each scoring algorithm, we first predicted G4s in the human T2T genome with relaxed parameters, i.e. a pqsfinder score ≥ 30 and an absolute G4Hunter score ≥ 0. The distribution of pqsfinder scores exhibited an inflection point at 40 (Additional File 1: Fig. S1 A), and we considered pG4s with score ≥ 40 for subsequent analyses (a score of 52 was previously recommended as a minimum threshold [62]). The distribution of G4Hunter scores was bell-shaped with no clear inflection point (Additional File 1: Fig. S1B). A previous study [63] reported that G4Hunter precision exceeds 90% above the threshold of 1.5 (for 25-bp G4s—the average length of pG4s predicted by pqsfinder), and thus we selected this threshold for our subsequent analysis.

We included pG4s with standard, i.e. [G3+L1−12]3+G3+ [18], and bulged, i.e. [GN0−1GN0−1GL1−3]3+GN0−1GN0−1G [19], motifs, where L  {A, T, C, G} and N  {A, T, C}. We excluded pG4s with uneven motifs, i.e. [G1−2N1−2]7+G1−2 [64], because such motifs can have two tetrads potentially representing an intermediate G4 form [65]; three tetrads were found to be the shortest guanine runs to result in a stable structure formation, in most cases [66]. Additionally, pqsfinder (with a minimum score threshold of 30) could not predict any G4s with uneven motifs.

Using the discovery pipeline described above (Fig. 1B), we annotated 769,184–844,650 pG4s in great ape T2T genome assemblies (Table 1), with the lowest number found in human and the highest in gorilla. Among human pG4s, ~ 69% were standard and ~ 31% were bulged. These proportions were similar for non-human great apes (Table 1). Our final G4 predictions in the human T2T genome contained sequences known to form G4 structures, particularly in the promoter regions of c-MYC [67], TEAD4 [68], TERT [69], NEIL3 [70], ZFP42 [71], and SLC6 A3 [72] genes, and in the body of the SRC gene [73].

Table 1 The number of pG4s annotated in the T2T great ape genomes

Importantly, we predicted an additional 41,232 G4s in the newly resolved regions of the human T2T genome assembly (T2T-CHM13v2.0) compared to the previous assembly (GRCh38). The human T2T assembly unveiled a large number of new pG4s at acrocentric chromosomes—13, 14, 15, 21, and 22—and at chromosomes 8 and 10 (Fig. 2A). As expected, we found most of these new pG4s to be derived from repetitive DNA across chromosomes, as compared to pG4s located in regions resolved in previous assemblies of the human genome (Fig. 2B). The T2T assemblies for the other non-human great apes, for which previous chromosome-level assemblies were available (all but Bornean orangutan), also unveiled a large number of new pG4s (53,843–174,442; Additional File 1: Fig. S2, Table 1). These newly discovered pG4s constituted 5–21% of all pG4s across great ape genomes (Table 1).

Fig. 2
figure 2

Comparison of predicted G4 density in newly resolved regions of the human T2T genome assembly versus hg38, across all chromosomes. A The bars show the number of G4s predicted in the newly resolved regions (per million bases) of the human T2T genome, as compared to the hg38 version, across all chromosomes. The horizontal red line indicates the number of pG4s per million bases across the whole human T2T genome, i.e. the average genome-wide density. The numbers on the top of each bar show the absolute number of pG4s in the regions newly resolved by T2T assemblies on each chromosome. B The plot compares the proportion of pG4s (in percentages) derived from repetitive DNA in the T2T-resolved regions versus the ones present in previously resolved regions of human genome assemblies

Density of predicted G4s is similar across homologous chromosomes and is positively correlated with gene density

Next, we examined the density of pG4s across homologous chromosomes in great apes. We constructed a chromosome homology map displaying the homology between great ape and human chromosomes (Fig. 3A). The homology between chromosomes was estimated from whole-genome pairwise alignments (see Methods for details). We were able to corroborate several previously known homologous relationships. For instance, human chromosome 9 is homologous to chromosome 11 in Pan species (chimpanzee and gorilla) and to chromosome 13 in Pongo species (Sumatran and Bornean orangutans) and gorilla. The telomeric fusion of two ancestral acrocentric chromosomes led to the formation of human chromosome 2 [74, 75], which is homologous to chromosomes 12 and 13 in Pan, and chromosomes 11 and 12 in Pongo and gorilla. In gorilla, a reciprocal translocation occurred between ancestral chromosomes 4 and 19, which are homologous to human chromosomes 5 and 17, respectively [76,77,78].

Fig. 3
figure 3

Homologous relationships and pG4 density across great ape chromosomes, and their correlation with gene density in the human genome. A The map of homologous relationships among great ape chromosomes. Each column represents a homolog, and each row represents a genus or a species. The legend displays the colors used for clades of great apes. Colors that are common to two or more clades indicate shared homology. The numbers over the boxes indicate the chromosome numbers assigned in the T2T assemblies. B The heatmap of pG4 density with homologous chromosomes (human numbering system is used as a base) as rows and species as columns. Homologs of human chromosomes 5 and 17 for gorilla are excluded from this plot because of the translocation event in gorilla. pG4 density for human chromosome 2 is assigned to two rows—2a and 2b. C The relationship between pG4 density and gene density across human chromosomes (p-value = 3.72\(\times\)10–9). D The same as C, with GC-adjusted residuals for pG4 density and gene density (p-value = 0.159)

We observed similar pG4 densities for chromosomes homologous between great apes (Fig. 3B). Human chromosomes 17, 19, and 22, as well as their non-human great ape homologs, had high pG4 density (>0.4/kb) compared to that of other chromosomes. We found a positive correlation (R2 = 0.80) between pG4 density and protein-coding gene density of human chromosomes (Fig. 3C). Genes are known to be GC-rich [79, 80] potentially driving this relationship. While most of the variation in pG4 density across chromosomes was explained by their GC content (Additional File 1: Fig. S3), gene density still explained a small portion of the variation in pG4 density after correcting for GC content (R2 = 0.088), even though such association was not statistically significant (Fig. 3D). To assess multicollinearity, we performed ridge regression alongside ordinary least squares, with GC content as the dependent variable (Additional File 1: Table S3). GC content emerged as the strongest predictor of G4 density, while gene density had a smaller but positive effect. Ridge regression (slightly) reduced GC content’s influence and redistributed the variance, improving the model against multicollinearity.

Presence/absence of predicted G4s follows a molecular clock, with many shared and species-specific variants

Using pairwise inter-species genome alignments (see Methods for details), we identified species-specific pG4s vs. pG4s shared across different groups of species, forming evolutionary groups in our subsequent analyses (Fig. 4A). We found that 271,122 pG4s—approximately one-third of the total number of pG4s in each species—were shared across all species. Closely related congeneric species shared many pG4s as well—198,311 were shared by Bornean and Sumatran orangutans, and 54,923 were shared by bonobo and chimpanzee. Additionally, we found a large number of pG4s shared by Homininae (bonobo, chimpanzee, human, and gorilla sharing 129,762 pG4s) and Hominini (bonobo, chimpanzee, and human sharing 41,601 pG4s). More generally, we found a negative correlation between the number of pG4s shared between any two great ape species and their divergence times (Fig. 4B).

Fig. 4
figure 4

Inter-specific sharing, divergence, and evolutionary dynamics of pG4s in great ape T2T genomes. A The interspecific sharing of pG4s present at homologous locations of great ape T2T genomes (only the top 17 groups are shown, see Additional File 1: Fig. S5 for all the groups). The vertical orange bars represent the number of pG4s shared between different great ape species as shown by the shaded circles, and the vertical blue bars represent the number of species-specific pG4s, with aligned and unaligned pG4s in dark blue and light blue, respectively. The horizontal bars represent the total number of pG4s in each great ape species throughout its genome. B The correlation between the total number of shared pG4s and divergence time. Black dots represent pairs of species, and the red line represents the best fit. C The number of aligned vs. unaligned species-specific pG4s in each repeat family—DNA repeats, LINEs, LTRs, low-complexity regions, retroposons, SINEs, satellites, simple repeats, and non-repetitive regions in gorilla. The percentages inside each bar represent the proportion of aligned/unaligned pG4s in the respective repeat family. D The correlation between the total number of non-shared pG4s and divergence time. Black dots represent pairs of species, and the red line represents the best fit. The Venn diagram at the bottom right is a schematic explaining pG4s that were considered to calculate the distance between any two species, 1 and 2, based on the number of non-shared pG4s. E The phylogram inferred as the most parsimonious relationship from pG4 presence/absence data. The branches are scaled to their lengths and represent the number of pG4s on that branch, i.e. the total number of births and deaths of G4 inferred from the parsimony-informative pG4s

Gorilla exhibited the highest number of species-specific pG4s (224,922, or 27% of all gorilla pG4s), followed by human (110,567, or 14%). Chimpanzee had the smallest number of species-specific pG4s (82,873, or 11%). We observed some variation in pG4 distribution across homologous chromosome groups (Additional File 1: Fig. S5). The most striking difference was observed on chromosome Y, where pG4 sharing among great apes was minimal, with each species exhibiting a high proportion of species-specific pG4s (39–80%). Additionally, human chromosomes 13 and 21, along with their homologs in non-human great apes, displayed similarly high proportions of species-specific pG4s (21–39% and 22–38%, respectively). Furthermore, in gorilla, chromosome 17 (homologous to human chromosome 18) contained a substantial number of species-specific pG4s, whereas its chromosomal homologs in other species did not (Additional File 1: Fig. S5).

Species-specific pG4s were further divided into unaligned and aligned (Table 2) because they might arise via different mechanisms. The sequences for unaligned species-specific pG4s were present in one species only and were absent from any pairwise alignments; we hypothesized that such pG4s might have arisen due to lineage-specific repeat and satellite expansions. The sequences for aligned species-specific pG4s were present in pairwise alignments but were annotated as pG4s in one species only; we hypothesized that such pG4s might have arisen by nucleotide substitutions and/or small insertions and deletions. Many such pG4s were located in regulatory regions (see next section).

Table 2 The total number of species-specific pG4s in each great ape species

Across the great apes, 26,998 to 129,852 unaligned species-specific pG4s were identified, constituting 24–58% of their species-specific pG4s. Gorilla had the highest number of unaligned species-specific pG4s (129,852), while the lowest number was observed for human (26,998). Consistent with our hypothesis, unaligned pG4s were abundant within the repetitive regions of the great ape genomes (Additional File 1: Fig. S6). Among different repeat classes, simple repeats—frequently present at telomeric and subtelomeric regions [81]—contained the highest number of unaligned species-specific pG4s, in all species but human (Fig. 4C, Additional File 1: Fig. S7). Retroposons, particularly SVA elements, represented the next most frequent class across all species studied, except for orangutans (Additional File 1: Fig. S7).

We next investigated whether pG4 presence/absence follows the molecular clock. The molecular clock hypothesis postulates that the genetic distance between any two species is proportional to their divergence time (reviewed in [82]). Here we defined the genetic distance as the number of non-shared pG4s between species, called the pairwise G4 distance henceforth. The pairwise G4 distance was calculated by subtracting two times the number of shared pG4 s from the total number of pG4s present in the genomes of two species. We found a strong correlation (R2 = 0.956) between such distance and divergence time as provided in [56], indicating that pG4 presence/absence does indeed follow the molecular clock (Fig. 4D). When we removed the unaligned species-specific pG4s from this analysis, we observed an even stronger correlation (R2 = 0.997; Additional File 1: Fig. S8 A). Additionally, we constructed the most parsimonious tree using the presence/absence of pG4s at homologous positions across great ape T2T assemblies (see Methods for details). The topology of this tree (Fig. 4E) was consistent with the known species phylogeny [56].

Predicted G4s are enriched and hypomethylated at regulatory regions

We further investigated the enrichment of pG4s for different categories of functional regions in the great ape genomes by computing pG4 densities for each category and evolutionary group and comparing them with the respective genome-wide pG4 densities (see Methods). The functional region categories (later called functional categories) considered were: promoters, 5’UTRs (5’ untranslated regions), protein-coding sequences (CDS), introns, 3’UTRs, enhancers, non-protein coding genes, origins of replication, CpG islands, and repeats (as annotated by RepeatMasker [83]). We added a category comprising non-functional non-repetitive, presumably neutrally evolving, regions (NFNR; see Methods). The genic functional categories—UTRs, protein-coding sequences, introns, and non-protein coding genes–were divided into transcribed and non-transcribed strands. Furthermore, pG4s in each category were divided into evolutionary groups based on their sharing at homologous locations across ape genome assemblies: pG4s shared by great apes, by Homininae, by Hominini, Pan or Pongo genus-specific pG4s, and species-specific pG4s (Fig. 5, Additional File 1: Fig. S9).

Fig. 5
figure 5

pG4 enrichment and methylation across functional categories and evolutionary groups in the human genome. A Enrichment levels of pG4s across different categories of functional regions in the human genome. The y-axis displays fold enrichments for each functional category. The bars are color-coded by evolutionary group (great ape, Homininae, Hominini, and human-specific pG4s). The bar height reflects enrichment values, with numbers atop indicating the total pG4s in each evolutionary group and functional category. B GC-corrected enrichment (positive residuals indicate enrichment and negative residuals indicate depletion) of pG4s across the same functional categories as in (A). The y-axis displays residuals from the regression of pG4 enrichment on GC content, fitted using 5-Mb non-overlapping genome windows. Stars above/below the bars denote the statistical significance of the GC-corrected enrichment, with significance computed using the percentile rank of residuals within the genome-wide residual distribution of GC content in 5-Mb windows in a two-tailed test (Additional File 1: Fig. S10D). C Methylation distributions (kernel density plots) of pG4s with CpG sites across pG4 evolutionary groups and functional categories for the HG002 cell line. Each pie chart shows the fraction of pG4s with CpG sites (red portion). The rows correspond to pG4 evolutionary groups. The columns match the functional categories in (A) and (B). Significance of methylation differences, calculated using a two-tailed test of proportions, is shown above the columns, highlighting hypomethylation differences between pG4s shared by great apes and human-specific pG4s. Red stars indicate stronger hypomethylation in human-specific pG4s, while green stars indicate stronger hypomethylation in pG4s shared by great apes (ns: not significant, *: p-value < 0.05, **: p-value < 0.01, ***: p-value < 0.001)

Notably, we found many human-specific pG4s in regulatory regions—particularly in promoters (2,753) and enhancers (4,772), but also in 5’UTRs (546) and 3’UTRs (1,039). In non-human great apes, we detected a high number of species-specific pG4s in gorillas, as well as both species- and genus-specific pG4s in other groups—Pan-specific pG4s in bonobos and chimpanzees, and Pongo-specific pG4s in orangutans—across promoters, enhancers, and 5'UTRs (Additional File 1: Fig. S9 A, D, G, J, M).

Considering all evolutionary groups, in human, pG4s were highly enriched at promoters (4.11 to 6.71 fold), 5’UTRs (1.99 to 7.34 fold), enhancers (2.51 to 4.08 fold), and origins of replication (3.09 to 4.13 fold), with the lowest and highest values listed for the human-specific pG4s and for the pG4s shared across great apes, respectively (Fig. 5A). We also observed pG4 enrichment at CpG islands (7.11 to 11.93 fold). However, here the highest enrichment was observed for the human-specific pG4s and the lowest for Homininae pG4s (Fig. 5A). Protein-coding sequences displayed both strand- and group-specific enrichment patterns. Specifically, pG4s were sizeably enriched on the transcribed strand for both great ape and Homininae shared pG4s (1.49 and 3.18-fold, respectively) and only moderately enriched on the non-transcribed strand for great ape shared pG4s (1.32-fold; Fig. 5A). In contrast, 3'UTRs exhibited enrichment on both strands, with the greatest enrichment observed in great ape shared and Homininae pG4 evolutionary groups (1.47 to 2.59-fold; Fig. 5A). pG4s at non-protein coding genes, introns, repeats and NFNR were either at the genome-level enrichment or depleted (Fig. 5A). Non-human great apes followed similar trends of pG4 enrichment across functional categories (Additional File 1: Fig. S9 A,D,G,J,M).

Given the correlation between G4 density and GC content (Additional File 1: Fig. S10 A), we corrected for the latter regressing pG4 fold enrichment over GC content genome-wide (Additional File 1: Fig. S10B) and considering residuals, which capture enrichment not explained by GC content (Fig. 5B, Additional File 1: Fig. S9B,E,H,K,N; see Methods for details). After such correction, pG4s shared across great apes were still significantly enriched at promoters, 5’ and 3’ UTRs (both strands), and origins of replication (annotated only for human). Other evolutionary groups of pG4s in some instances switched from enrichment to depletion after the GC correction. All pG4s evolutionary groups were significantly depleted at protein-coding sequences, with particularly strong depletion for the non-transcribed strand. Additionally, after the GC correction, CpG islands were significantly depleted in pG4s, except for species-specific pG4s, which still showed enrichment.

Using the genome-wide methylation data obtained from long-read sequencing of the human genome [84], we evaluated the methylation status of pG4s in a lymphoblastoid cell line (HG002 cell line) and a hydatidiform mole cell line (CHM13). We performed a similar analysis with the cell lines from which the non-human great ape T2T assemblies originated—fibroblast cell lines for bonobo, gorilla, Sumatran orangutan, and Bornean orangutan, and a lymphoblastoid cell line for chimpanzee [57]. This allowed us to predict G4 formation in vivo, as methylated sequences were shown to be less likely to form G4 structures [85, 86]. We examined the 5-methylcytosine methylation profiles of pG4s that contain CpG sites within the functional categories in the cell lines analyzed (Fig. 5C, Additional File 1: Fig. S12 A). Following a previous study [84], we defined hypomethylated and hypermethylated pG4s as having methylation fractions lower than 0.2 and higher than 0.8, respectively. In some functional categories, including introns, 3’UTRs, non-protein coding genes, repeats, and NFNRs, less than half of pG4s had CpG sites (Fig. 5C, Additional File 1: Fig. S9 C,G,I,L,O). This suggests that G4s in these regions are regulated in ways other than 5-methylcytosine methylation.

We made three important observations about hypomethylated pG4s, which are likely to fold in vivo. First, in HG002, G4 methylation levels were low across all pG4 evolutionary groups at 5’UTRs (both transcribed and non-transcribed strands), promoters, enhancers, and CpG islands—suggesting G4 formation in these functional categories. This was also true for other great ape cell lines (Additional File 1: Fig. S9 C,F,I,L,O). Second, across all great apes, pG4s shared by great apes consistently exhibited a significantly higher proportion of hypomethylation (p-value < 0.001, two-tailed test of proportions) compared to species-specific pG4s in most functional categories—except for promoters, UTRs, and protein-coding sequences, which did not always show significant differences (significance stars in Fig. 5C and Additional File 1: Fig. S11). Third, the hypomethylated fraction of pG4s shared among great apes was significantly higher in CHM13 than that in HG002 in all functional categories except for promoters, CpG islands, protein-coding sequences, and 5'UTRs (Additional File 1: Fig. S12 C). This observation is consistent with lower levels of methylation in tumor compared to normal cells [87, 88] and suggests a stronger activation of G4s in the CHM13 cell line than in the HG002 cell line.

We performed KEGG pathway enrichment analysis using gget enrichr [89, 90] on pG4s located in the promoters of human genes, classified by their evolutionary group (i.e., shared by great apes vs. human-specific). This analysis identified 95 significant pathways (Benjamini-Hochberg [91] adjusted p-value < 0.05) in the great ape group and 26 pathways in the human-specific group. We highlight the top 20 most significant KEGG pathways for both the groups in Additional File 1: Fig. S14. pG4s shared by great apes and located in promoters were predominantly enriched in pathways essential for cellular maintenance and regulation, consistent with previous studies demonstrating the regulatory role of promoter G4s in fundamental cellular processes [92, 93]. Notably, one of the top hits, signaling pathways regulating pluripotency of stem cells, corroborates findings from prior research [94], further supporting the functional relevance of these structures in promoters. While many of the pathways significantly enriched in the human-specific group were also present in the great ape group, they did not rank among the most significant hits in the latter. In contrast, the pathways enriched for the human-specific pG4s located in promoters were predominantly associated with neuronal development and regulation, including the neurotrophin signaling pathway, glutamatergic and cholinergic synapse, and circadian entrainment. Moreover, pathways associated with neurodegenerative diseases, such as Alzheimer’s disease and spinocerebellar ataxia, were exclusively significant in the human-specific group, further reinforcing the proposed link between G4s and neurodegenerative diseases (reviewed in [44]).

Discussion

Using recently released great apes T2T genomes, we conducted a detailed analysis of the occurrence of predicted G4s—focusing on their evolution and enrichment in different functional regions. To predict G4s based on genome sequence information, we combined two commonly used G4 prediction algorithms: pqsfinder [62] and G4Hunter [63]. Our robust computational pipeline identified many new pG4s, which were previously unknown likely due to incomplete sequence information—especially in acrocentric chromosomes—and created a comprehensive pG4 catalog for human and non-human great apes. With its focus on great apes, our study complements previous studies of G4 evolution, which considered larger evolutionary distances, e.g., spanning different kingdoms of life [95] or early diverging eukaryotic clades [96]. Previously, pG4 occurrence was studied only for three human T2T chromosomes [97,98,99] and was not analyzed for great ape chromosomes. Our chromosomal-level analysis showed that the density of G4s is linearly correlated with the gene density, and the high GC content of the gene-enriched chromosomes explains most of this pattern. Using pairwise alignments, G4 predictions, and homology information of great ape chromosomes, we identified species-specific pG4s and pG4s shared across different species clades using a graph-based approach. Our analysis provided the first detailed evolutionary framework to study their conservation and divergence and revealed that G4s evolve in accordance with the molecular clock. We also identified differences in pG4 enrichment based on evolutionary sharing, as discussed further below, and adjusted the enrichment using a non-linear model of G4 enrichment and GC content relationship. Moreover, using methylation data as a proxy for G4 formation in the cell, we predicted G4 formation in different types of functional regions, such as promoters and origins of replication. Given that our study relies on genome-wide methylation data obtained from ultralong nanopore reads, our rationale—that hypomethylated sequences tend to form G4s [85, 86, 100]—was strengthened. Additionally, we found that pG4 methylation levels are inversely correlated with evolutionary conservation in most of the genomic regions enriched with pG4s.

Newly discovered pG4s in great apes

The number of pG4s generated by our pipeline for the human genome (769,184) was higher than previous estimates—using first- [101] and second-generation G4-seq [102], and sequencing data-based algorithms, e.g., G4-miner [103]—which reported a total number of G4s between 716,310 and 736,689. This increase is explained by our use of the T2T genome assembly and by our prediction of standard and bulged G4s. We found that the newly resolved regions in the T2T genomes of all great apes were enriched for pG4s, consistent with overall enrichment for non-B DNA in such regions [59]. In human, we detected a particular enrichment of pG4s at the acrocentric chromosomes, emphasizing the importance of complete genome assemblies for uncovering pG4s in structurally complex regions. In non-human great apes, newly resolved G4-dense regions were abundant at the metacentric chromosomes (Additional File 1: Fig. S2).

Evolution of pG4 occurrence in great apes

At the genome level, we found that pG4s evolve following the molecular clock hypothesis (Fig. 4D), indicating a consistent rate of G4 divergence over time. This suggests that, globally, pG4s evolve with similar rates and under similar selective constraints across the great ape lineages. Previous studies suggested that G4s are subject to purifying selection in several functional categories of the human genome, with different levels of constraint depending on the regions considered [39]. Our study suggests that such constraints might be similar across the studied ape species, although a more detailed investigation is warranted. The strong negative correlation we found between the number of pG4s shared between species and the species divergence time (Fig. 4B) also supports their evolution following the molecular clock hypothesis.

Our study also highlighted species-specific pG4s. In particular, we discovered different evolutionary and genomic distribution patterns between aligned and unaligned species-specific pG4s. The aligned species-specific pG4s followed a rather strict molecular clock (Additional File 1: Fig. S8B), consistent with their origination via nucleotide substitutions and/or small insertions/deletions, which are known to follow the molecular clock. In contrast, the unaligned species-specific pG4s followed a more relaxed molecular clock (Additional File 1: Fig. S8 C) and were located predominantly in the repetitive regions of the great ape genomes (which were deciphered in the T2T assemblies), in agreement with a recent study [59]. We found that pG4s are common at simple repeats (including telomeres), satellites, and retroposons (Fig. 4C, Additional File 1: Fig. S7), which were previously suggested to expand rapidly and frequently in a genus- and/or species-specific manner [56, 57, 104, 105].

For the first time, we also showed that for pG4s in different functional categories, there is a significant difference between the proportion of hypomethylated G4s, depending on their level of sharing across the great apes. In particular, in most functional region categories, shared pG4s had higher hypomethylation levels and thus are more likely to form than species-specific pG4s. The opposite was true for human protein-coding genes, where shared pG4s were more methylated than species-specific ones. Additional studies are needed to investigate the potential causes of these relationships.

Functional regions display a bias towards G4s

Our study corroborates a correlation between G4 density and GC-content that was observed previously [96]. Moreover, our functional enrichment study provides evidence that the prevalence of G4s cannot be solely attributed to the GC content of any functional category, in line with previous studies [106]. Some functional categories, such as promoters and origins of replication, show a significant G4 enrichment even after a GC correction. However, some other functional categories are no longer significantly enriched for G4s after such a correction.

Promoters

The enrichment of G4s we found at promoters corroborates previous studies [93, 96, 107]. We also found that G4s located at promoters were hypomethylated, indicating significant G4 activation. This is consistent with the results published by Wu and colleagues (2021), who observed significantly lower methylation levels at G4s located within 2 kb upstream of genes vs. the rest of the genic regions in mammals and insects. These observations agree with the critical role G4s play in the promoter regions during gene regulation (e.g., [107, 108]).

CpG islands

GC-rich CpG islands were highly enriched in G4s—particularly before GC correction and at human-specific CpG islands even after GC correction. CpG islands are often found at promoters, i.e., almost one-fourth of all promoter regions in our study overlapped with CpG islands. Studies have shown regulatory roles of G4s in CpG islands [109]. Since CpG islands are typically hypomethylated, they provide a favorable environment for activated G4s, particularly at promoter regions [85].

In gorilla, we found a notable enrichment of species-specific pG4s at CpG islands after correcting for GC content (Additional File 1: Fig. S9H). Interestingly, unlike human-specific pG4s at CpG islands, these pG4s were hypermethylated (Additional File 1: Fig. S9I). A similarly strong hypermethylation pattern was evident in gorilla repeat regions, suggesting that gorilla may have experienced repeat expansions [56, 57] leading to the formation of new G4 structures (Additional File 1: Fig. S9I). The difference might also be due to different cell lines analyzed for human (lymphoblastoid) vs. gorilla (fibroblast). The increased methylation at these sites could represent an evolutionary response to suppress potential G4 activation, reflecting a regulatory adaptation in gorilla.

Enhancers

Enhancers also showed enrichment and hypomethylation of pG4s, though generally to a lesser degree than observed in promoters, across all great apes. This aligns with previous studies [110], which have proposed that G4s play a role in promoter-enhancer interactions, with G4s forming partly in the promoter and partly in the enhancer. Another study [111] further supported the possibility of these G4-mediated interactions, suggesting that enhancers with long G4-forming sequences could be instrumental in such regulatory contacts. However, after correcting for GC content, the observed pG4 enrichment in enhancers was often reduced, suggesting that GC content largely accounts for the presence of pG4s in these regions. Therefore, while our observations do not directly confirm these models, they suggest that G4s may act as secondary players in enhancer regulation, contingent upon GC-rich contexts.

UTRs

Prior to GC correction, pG4 enrichment was observed in both 5’UTRs and 3’UTRs across all great apes, although the enrichment was less pronounced in 3’UTRs. This observation supports previous studies [112]. However, similar to other functional groups, the enrichment decreased with the level of sharing—going from the most shared evolutionary group of great apes to the human-specific ones. After GC correction, however, enrichment patterns varied by evolutionary group. For both 5’UTRs and 3’UTRs, pG4s shared across all great apes showed significant enrichment, while other evolutionary groups in 5’UTRs showed a depletion of pG4s. This consistent pattern across great apes suggests that while pG4 structures may be functionally relevant within UTRs, pG4s that are highly conserved across species are favored. Further investigation is needed to understand the selective mechanisms underlying this conservation at pG4s shared by great apes.

Whereas G4s at 5’UTRs were mostly hypomethylated, G4s at 3’UTRs showed a dominant hypermethylated fraction. Additionally, nearly three-quarters of G4s in 5’UTRs contained CpG sites, whereas fewer than half of G4s in 3’UTRs did. This disparity, evident across all great apes, suggests that G4s play different roles in 5’ vs. 3’UTR biology. Whereas 5’UTR G4s might influence transcriptional regulation (e.g., they may act as transcriptional barriers, potentially stalling transcription and representing considerable roadblocks for transcriptional machinery), 3’UTR G4s could contribute to post-transcriptional processes. Thus, our study suggests a more nuanced functional landscape for G4s at UTRs, significantly augmenting previous studies [113, 114].

Protein-coding sequences

In line with a previous study [39], we found that pG4 enrichment, albeit modest, had a strand-specific pattern in the protein-coding sequences—with the transcribed strand having a higher enrichment than the non-transcribed strand, for shared pG4s. This aligns with the hypothesis that G4s are not favored at the mRNA level [115]. Although known to be GC-rich, protein-coding sequences displayed a significant G4 depletion after correcting for GC content [115, 116] across all evolutionary groups, and at both transcribed and non-transcribed strands. Thus, G4 formation is disfavored in these regions.

Interestingly, across all great apes, species-specific G4s in the non-transcribed strand of the protein-coding sequences had significantly higher hypomethylation than G4s shared by great apes. Therefore, it can be hypothesized that evolutionarily new G4s have a tendency to be active in the non-transcribed strand, which might result in G4 formation in mRNA. Alternatively, such G4s might not have had enough time to become silenced by methylation. This interesting observation should be investigated further.

Replication origins

Our analysis revealed an enrichment of G4s at human replication origins. While this enrichment decreased after correcting for GC content, a positive trend suggests a connection between G4s and replication origins, particularly in more highly conserved G4s as opposed to human-specific ones. This trend hints at a functional relationship between G4 structures and replication initiation suggested in previous studies [27, 117]. Due to the rapid evolution of origins of replication [118], we were unable to study their enrichment and methylation status in other great apes.

Study limitations and future directions

This study, based exclusively on sequence data, underscores the necessity for further experimental validation to confirm G4 formation in vivo in great apes and to enhance our understanding of their evolutionary dynamics. Our analysis did not consider several factors, which may influence G4 prediction accuracy. These include the loop-base composition [119], the presence of uneven G4 motifs [64], and the impact of adenine repeats on G4 stability [120]. Incorporating these factors into future studies could refine our G4 predictions.

Future research should also revisit some limitations of our study. First, our G4 repertoire was defined as a non-overlapping set, selecting the most stable G4s (based on the highest pqsfinder scores) in each genomic interval. However, multiple G4 conformations can form under physiological conditions, and sequence data alone cannot fully capture this complexity. Experimental approaches (e.g., permanganate/S1 footprinting with direct adapter ligation and sequencing [121]) will be essential to identify these dynamic G4 structures in vivo. Second, for the sharing profile of G4s, we only considered those that fully overlapped with alignment blocks, which may have led to an underestimation or overestimation of shared or species-specific G4s, respectively, among great apes. Third, we used a quadratic regression model to correct for GC content when assessing G4 enrichment. While this worked satisfactorily for our data, a more sophisticated model (e.g., logistic growth model) may better reflect the biological relationship between G4 formation and GC content and provide a more accurate understanding of how G4 enrichment varies with GC content. Fourth, we assumed that unaligned species-specific pG4 s arose independently in each great ape species. However, this assumption implicitly suggests that these regions lack homology with other apes, even though it is well known that alignments often perform poorly in highly repetitive regions. Therefore, additional analysis of G4s in these regions is warranted. Fifth, our results from the KEGG pathway analysis highlight functional differences between shared by great apes and human-specific pG4s located in promoters. However, we acknowledge that statistical significance might be influenced by the large size differences between the two groups (i.e., 7,541 and 2,181 genes in great apes and human-specific groups, respectively). Moreover, using a background set consisting of all genes with a pG4 in their promoter in our enrichr analysis led to the loss of significance in both groups, which needs deeper exploration (Additional File 1: Fig. S14). Finally, our analysis did not consider the predicted stability of G4s. Incorporating such information in future analyses could be informative, as, for instance, a previous study provided evidence of depletion of thermodynamically stable G4s in the genomes of multiple species [119].

Importantly, the findings from the KEGG pathway analysis suggest promising avenues for future research into the regulatory roles of pG4s in neuronal pathways, particularly in the context of human evolution. This underscores the necessity of refining our understanding of how promoter-localized G4s (along with G4s in other regulatory regions) contribute to gene regulation and evolutionary divergence. Investigating the mechanistic roles of these structures in neuronal pathways could provide valuable insights into human-specific traits and disease susceptibilities, particularly in a clinical context. Moreover, with the future availability of epigenetic datasets—such as chromatin accessibility, histone methylation, and acetylation—models such as epiG4 NN [122] can incorporate this information to refine G4 predictions further.

Conclusions

We have built a comprehensive catalog of predicted G4s in completely sequenced genomes of great apes and found that the newly resolved regions of T2T genomes are enriched in predicted G4s. We found strong associations between evolutionary groups of pG4s (shared, clade-, and species-specific), their enrichment in regulatory regions, and their predicted formation based on methylation profiles. Taken together, our whole-genome catalog and analyses of G4s in humans and other great apes open new avenues for studies of G4 evolution and function, including their roles in genome expansions (e.g., likely particularly important in gorilla) and in species-specific adaptations (relevant to all species). Indeed, our analysis revealed that species-specific pG4s are abundant in regions of functional significance, such as promoters and enhancers, and might, for instance, drive novel promoter-enhancer interactions. Exploring these mechanisms further should illuminate how pG4s contribute to regulatory evolution and species-specific traits.

Methods

Predicting G4s using pqsfinder and G4Hunter

Genome FASTA files for six great ape species (version 2.0)—Homo sapiens (GenBank: GCA_009914755.4), Pan troglodytes (GenBank: GCA_028858775.2), Pan paniscus (GenBank: GCA_029289425.2), Gorilla gorilla (GenBank: GCA_029281585.2), Pongo abelii (GenBank: GCA_028885655.2), and Pongo pygmaeus (GenBank: GCA_028885625.2)—were downloaded from NCBI [123]. Each of these FASTA files was divided by chromosome, and only the primary haplotype was used. pqsfinder (dockerized at https://github.com/kxk302/PqsFinder_Docker/tree/main) was employed on all files with the following settings: overlapping=True, maxLength=50 and minScore=30. The pG4 s that satisfied the regular expression motifs of standard and bulged G4 s, contained more than two tetrads, had a pqsfinder score of 40 or higher, and had an absolute G4Hunter score (adapted from https://github.com/AnimaTardeb/G4Hunter/blob/master/G4Hunter.py) greater than 1.5 were retained. Following this, the pG4s in the same region (i.e. between a certain start and end position) were grouped together and only the pG4s with the highest pqsfinder score were selected for further analyses. The final dataset consisted of a non-overlapping set of pG4s on both strands for all the chromosomes of the great apes.

The newly identified regions in the T2T genomes for human were downloaded from https://github.com/marbl/CHM13 [54, 55] and for the non-human great apes were adapted from [57].

pG4 density and distribution across great ape genomes

The pG4 density for the great ape chromosomes was calculated by dividing the number of pG4s by the length of the respective chromosome in each species. Inter-species chromosomal homolog mapping was inferred from the Comparative Genome Viewer from NCBI, which utilizes pairwise alignments between the genomes of the great apes. To study the distribution of pG4s, chromosomes were divided into 100-kb windows, and the number of pG4s in each window was tabulated. The density for each window was then calculated by normalizing the number of pG4s in that window against the window with the highest number of pG4s on the same chromosome.

Pairwise genome alignments using LASTZ between great ape homologous chromosomes

The FASTA files used for the prediction of G4s were also utilized for pairwise alignments of homologous chromosomes among the great apes (see previous section), as illustrated in Fig. 3A. To account for the fusion event in human, all the great ape chromosomes homologous to human chromosome 2a and 2b, were aligned to the whole human chromosome. In addition, pairwise alignments were performed among the non-human great ape chromosomes corresponding to 2a and 2b. Similarly, to account for the translocation event in the gorilla involving the human homologs of chromosomes 5 and 17, the pairwise alignments for these homologs included both gorilla chromosomes 4 and 19. For all human homologs other than 2, 5, and 17, a total of 15 pairwise alignments were performed for each homolog across the six great ape species. Alignments were performed using LASTZ [124] with the settings –notransition and –allocate:traceback=1.5G. The human-chimp.v2 scoring matrix (https://genomewiki.ucsc.edu/index.php/Hg19_conservation_lastz_parameters) was used for these pairwise alignments.

Connected graphs for shared and species-specific G4 s

Predicted G4s were mapped onto the pairwise alignments and classified as shared or species-specific using our custom-designed pipeline—MAP-SEA (Mapping And Prediction of Shared Elements in Alignments–https://github.com/makovalab-psu/mapsea) [125]. The mapping was performed using the BEDTools suite [126, 127]. A pG4 was classified as "shared" if its start and end positions were within 3 base pairs (bp) of each other in the pairwise alignments, allowing for up to 3 bp extensions or truncations at either or both ends. Shared G4s from all pairwise alignments across all chromosomes of the great apes were identified as connecting edges, which were subsequently grouped to form connected graphs of shared pG4s across all species. Constructing connected graphs eliminated the issue of double-counting specific pG4s, a problem that can arise due to repetitive sequences in alignments. pG4s in alignments that showed no sharing and those not included in the alignments were categorized as species-specific pG4s. Each shared group of pG4s, along with species-specific pG4s, was assigned a unique identifier. GNU Parallel [128] was used to parallelize the workflow across multiple chromosomes. The connected graphs were collectively stored as a single dataframe encompassing all pG4s. A diagram of this approach is shown in Additional File 1: Fig. S13.

A presence/absence matrix was calculated from the dataframe above for downstream analyses. To identify whether a pG4 is present or absent in a species, duplicated pG4s in any single species were counted only once. This caused a reduction in total pG4 counts, as shown by the horizontal bars in Fig. 4A. To visualize the sharing profile, the dataframe was represented as an upset plot [129] using the Python package UpSetPlot (https://upsetplot.readthedocs.io/en/stable/index.html).

Defining shared pG4s using a strict start and end within 3 bp extension or truncation criterion has limitations, but alternative approaches might introduce misclassification. The start or end criterion increases ambiguity, while the percentage overlap approach may bias results depending on sequence length. A 3-bp extension or truncation provides a balanced trade-off. We also reran the MAP-SEA [125] pipeline with a 4-bp extension or truncation to assess the robustness of our results, as shown in Additional File 1: Fig. S15. While the quantitative results differed—showing more G4s annotated as shared—the qualitative differences were insufficient to justify changing the initial 3-bp truncation or extension.

Reconstructing phylogeny with maximum parsimony

The pG4s presence/absence data at orthologous locations was used to infer the most parsimonious phylogram for the great apes. The data was converted into a NEXUS file format following the guidelines provided in the PAUP [130] (https://paup.phylosolutions.com/) quick start tutorial. A maximum parsimony heuristic search was conducted designating the orangutans as the monophyletic outgroup. The resulting best-rooted tree was saved in Newick format and subsequently visualized using MEGA [131].

Functional annotations for humans and other great apes

Genome annotation files—in GFF (General Feature Format) format—for six great apes were downloaded from NCBI [123]. Protein-coding genes with the longest corresponding mRNA, CDS, and exon entries were extracted (using the snippet at https://gist.github.com/karolpal-jr/48213a0a65475e44f708d5d815127bc3). Introns were inferred from the mRNA and exon data, representing the non-exonic regions of mRNA. The 5’ and 3’ UTRs were determined using CDS and exon entries, representing the non-CDS regions of exons at the 5’ and 3’ ends, respectively. Promoters were defined as 1 kb upstream of the gene start sites. All the genes, excluding pseudogenes, not encoding an mRNA, were identified as non-protein coding genes. We used repeat annotations generated with RepeatMasker [83].

For the human genome, core and stochastic origins of replication annotations in the hg38 genome were downloaded from [132], and the UCSC Genome Browser's liftOver tool [133] was used to convert the coordinates to the CHM13 assembly. Unmasked CpG island annotations were downloaded from the UCSC table browser [134] for the CHM13v2.0 genome. Human 5-methylcytosine methylation data for CHM13, generated using ONT technology, were downloaded in BED format [55, 84]. All annotations were converted to BED format for further analysis.

CpG annotations for the non-human great apes were obtained with the same algorithm used to calculate CpG island annotations for humans [135, 136]. Enhancer annotations for the non-human great apes were obtained with the UCSC liftOver tool, using chain files derived from wfmash [137] all-to-all alignments of primates and human haplotypes (CHM13, HG002 mat/pat, GRCh38) [57]. Methylation data for human were generated by aligning ultralong nanopore reads from HG002 directly to the CHM13 reference genome using Winnowmap [84]. The data for non-human great apes were taken from [57]. NFNR regions in the human genome were defined as those that do not overlap with genes, enhancers, promoters, origins of replication, and repeats. A similar definition was applied to the non-human great apes; however, due to the lack of annotations for origins of replication in their genomes, and because of the rapid evolution of origins of replication among species [118], their NFNR regions were calculated without excluding overlaps with origins of replication.

G4 enrichment and methylation profiles for functional regions

Using the dataframe generated from the connected graphs, pG4s were grouped based on their sharing across great apes. In each evolutionary group, pG4s entirely located within (f = 1.0) each functional category were identified using bedtools intersect. In addition, to account for the strand-specificity of pG4s in genic elements—introns, protein-coding sequences, and UTRs—pG4s in these functional categories were divided into transcribed and non-transcribed, using the -S and -s flags in bedtools intersect, respectively. Subsequently, the fold enrichment of pG4s for each functional category within an evolutionary group was calculated by dividing the fraction of pG4s in that category for the group by the proportion of the category’s total length relative to the entire genome length. In symbols, for a given functional category \(X\) and evolutionary group \(Y\), the fold enrichment (\(FE\)) was calculated as:

$$FE(X,Y)=\frac{\#pG4s(X,Y)}{\#pG4s(Y)}\times \frac{len(genome)}{len(X)}$$

To calculate GC-corrected G4 enrichment, the human genome was divided into 5-Mb windows using bedtools makewindows. GC content for each window was determined with bedtools nuc. A second-order polynomial regression was then used to model the relationship between G4 fold enrichment and GC content. We tested a range of window sizes—10 kb, 50 kb, 100 kb, 500 kb, 1 Mb, 5 Mb, 10 Mb, 50 Mb, and 100 Mb—and found that the resulting regression coefficients’ estimates were consistent across window sizes (Additional File 1: Fig. S10 C). Using the regression model of G4-enrichment against GC content fitted using 5-Mb windows, we calculated the residuals for each functional category within an evolutionary group based on its GC content and observed G4 fold enrichment. The same procedure was applied to calculate the GC-corrected G4 enrichment for non-human great apes. The significance of enrichment, post-GC correction, was calculated using the percentile rank of residuals within the genome-wide residual distribution in a two-tailed test.

The average methylation fraction for each CpG-containing pG4 was calculated by taking the mean of the methylation fraction—the fraction of samples methylated at a given CpG site—across all CpG sites present in the pG4. A kernel density plot was used to visualize the distribution of the average methylation fraction for CpG-containing pG4s in different functional categories across the genomes of great apes. For inter-group differences across all functional categories, the significance of the difference in the proportions of hypomethylation (methylation fraction < 0.2) between pG4 evolutionary groups was computed using a two-tailed test of proportions. To account for multiple testing relative to each functional category, p-values were Bonferroni-corrected.

Data availability

All source code is available on GitHub at https://github.com/makovalab-psu/GreatApeT2T-G4s [138] under the MIT License, and a snapshot of the code along with all associated data has been archived on Zenodo at https://doi.org/10.5281/zenodo.15007392 [139] under the Creative Commons Attribution 4.0 International License. The pipelines, G4 discovery [61] and MAP-SEA [125], can be found as separate GitHub repositories at https://github.com/makovalab-psu/g4Discovery and https://github.com/makovalab-psu/mapsea, respectively. The genome sequences and annotations were downloaded from NCBI [123].

References

  1. Levitt M. How many base-pairs per turn does DNA have in solution and in chromatin? Some theoretical calculations. Proc Natl Acad Sci U S A. 1978;75(2):640–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Watson JD, Crick FHC. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature. 1953;171(4356):737–8.

    Article  CAS  PubMed  Google Scholar 

  3. Wang JC. Helical repeat of DNA in solution. Proc Natl Acad Sci U S A. 1979;76(1):200–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Kouzine F, Wojtowicz D, Baranello L, Yamane A, Nelson S, Resch W, et al. Permanganate/S1 nuclease footprinting reveals non-B DNA structures with regulatory potential across a mammalian genome. Cell Syst. 2017;4(3):344-56.e7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Mitsui Y, Langridge R, Shortle BE, Cantor CR, Grant RC, Kodama M, et al. Physical and enzymatic studies on poly d(I–C).Poly d(I–C), an unusual double-helical DNA. Nature. 1970;228(5277):1166–9.

    Article  CAS  PubMed  Google Scholar 

  6. Panayotatos N, Fontaine A. A native cruciform DNA structure probed in bacteria by recombinant T7 endonuclease. J Biol Chem. 1987;262(23):11364–8.

    Article  CAS  PubMed  Google Scholar 

  7. Felsenfeld G, Rich A. Studies on the formation of two- and three-stranded polyribonucleotides. Biochim Biophys Acta. 1957;26(3):457–68.

    Article  CAS  PubMed  Google Scholar 

  8. Prosseda G, Falconi M, Giangrossi M, Gualerzi CO, Micheli G, Colonna B. The virF promoter in Shigella: more than just a curved DNA stretch. Mol Microbiol. 2004;51(2):523–37.

    Article  CAS  PubMed  Google Scholar 

  9. Sen D, Gilbert W. Formation of parallel four-stranded complexes by guanine-rich motifs in DNA and its implications for meiosis. Nature. 1988;334(6180):364–6.

    Article  CAS  PubMed  Google Scholar 

  10. Wang G, Vasquez KM. Dynamic alternative DNA structures in biology and disease. Nat Rev Genet. 2022;24(4):211–34.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Pavlova AV, Kubareva EA, Monakhova MV, Zvereva MI, Dolinnaya NG. Impact of G-Quadruplexes on the Regulation of Genome Integrity, DNA Damage and Repair. Biomolecules. 2021 Aug 27;11(9). https://doi.org/10.3390/biom11091284

  12. Ravichandran S, Ahn JH, Kim KK. Unraveling the regulatory G-quadruplex puzzle: lessons from genome and transcriptome-wide studies. Front Genet. 2019;18(10):469392.

    Google Scholar 

  13. Zhang R, Wang Y, Wang C, Sun X, Mergny JL. G-quadruplexes as pivotal components of cis-regulatory elements in the human genome. BMC Biol. 2024;22(1):1–23.

    Article  Google Scholar 

  14. Kosiol N, Juranek S, Brossart P, Heine A, Paeschke K. G-quadruplexes: a promising target for cancer therapy. Mol Cancer. 2021;20(1):1–18.

    Article  Google Scholar 

  15. Monsen RC. Higher-order G-quadruplexes in promoters are untapped drug targets. Front Chem. 2023;7(11):1211512.

    Article  Google Scholar 

  16. Fay MM, Lyons SM, Ivanov P. RNA G-quadruplexes in biology: principles and molecular mechanisms. J Mol Biol. 2017;429(14):2127.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Hoogsteen K. The crystal and molecular structure of a hydrogen-bonded complex between 1-methylthymine and 9-methyladenine. Acta Crystallogr. 1963;16(9):907–16.

    Article  CAS  Google Scholar 

  18. Guédin A, Gros J, Alberti P, Mergny JL. How long is too long? Effects of loop size on G-quadruplex stability. Nucleic Acids Res. 2010;38(21):7858–68.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Mukundan VT, Phan AT. Bulges in G-quadruplexes: broadening the definition of G-quadruplex-forming sequences. J Am Chem Soc. 2013;135(13):5017–28.

    Article  CAS  PubMed  Google Scholar 

  20. Papp C, Mukundan VT, Jenjaroenpun P, Winnerdy FR, Ow GS, Phan AT, et al. Stable bulged G-quadruplexes in the human genome: identification, experimental validation and functionalization. Nucleic Acids Res. 2023;51(9):4148–77.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Teng FY, Jiang ZZ, Guo M, Tan XZ, Chen F, Xi XG, et al. G-quadruplex DNA: a novel target for drug design. Cell Mol Life Sci. 2021;78(19):6557–83.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Guiblet WM, Cremona MA, Harris RS, Chen D, Eckert KA, Chiaromonte F, et al. Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome. Nucleic Acids Res. 2021;49(3):1497–516.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Bryan TM. G-Quadruplexes at Telomeres: Friend or Foe? Molecules. 2020;25(16). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7464828/. Cited 2024 Oct 2.

  24. Lin C, Yang D. Human telomeric G-Quadruplex structures and G-Quadruplex-interactive compounds. Methods Mol Biol. 2017;1587:171.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Moye AL, Porter KC, Cohen SB, Phan T, Zyner KG, Sasaki N, et al. Telomeric G-quadruplexes are a substrate and site of localization for human telomerase. Nat Commun. 2015;6(1):1–12.

    Article  Google Scholar 

  26. Xu Y, Komiyama M. G-Quadruplexes in human telomere: structures, properties, and applications. Molecules. 2023;29(1):174.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Prioleau MN. G-Quadruplexes and DNA replication origin. DNA Replication. 2017;32:273–86.

    Article  Google Scholar 

  28. Valton A, Hassan‐Zadeh V, Lema I, Boggetto N, Alberti P, Saintomé C, et al. G4 motifs affect origin positioning and efficiency in two vertebrate replicators. EMBO J. 2014 Feb 12; Available from: https://www.embopress.org/doi/10.1002/embj.201387506. Cited 2024 Oct 2.

  29. Stein M, Hile SE, Weissensteiner MH, Lee M, Zhang S, Kejnovský E, et al. Variation in G-quadruplex sequence and topology differentially impacts human DNA polymerase fidelity. DNA Repair. 2022;1(119):103402.

    Article  Google Scholar 

  30. Sun D, Hurley LH. Biochemical techniques for the characterization of G-quadruplex structures: EMSA, DMS footprinting, and DNA polymerase stop assay. G-Quadruplex DNA. 2010;608:65–79.

    Article  CAS  Google Scholar 

  31. Lormand JD, Buncher N, Murphy CT, Kaur P, Lee MY, Burgers P, et al. DNA polymerase δ stalls on telomeric lagging strand templates independently from G-quadruplex formation. Nucleic Acids Res. 2013;41(22):10323–33.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Guiblet WM, Cremona MA, Cechova M, Harris RS, Kejnovská I, Kejnovsky E, et al. Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate. Genome research. 2018;28(12). Available from: https://pubmed.ncbi.nlm.nih.gov/30401733/. cited 2024 Oct 29.

  33. Georgakopoulos-Soares I, Morganella S, Jain N, Hemberg M, Nik-Zainal S. Noncanonical secondary structures arising from non-B DNA motifs are determinants of mutagenesis. Genome research. 2018;28(9). Available from: https://pubmed.ncbi.nlm.nih.gov/30104284/. Cited 2024 Oct 29.

  34. Stein M, Eckert KA. Impact of G-Quadruplexes and chronic inflammation on genome instability: additive effects during carcinogenesis. Genes. 2021;12(11):1779.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Spiegel J, Cuesta SM, Adhikari S, Hänsel-Hertsch R, Tannahill D, Balasubramanian S. G-quadruplexes are transcription factor binding hubs in human chromatin. Genome Biol. 2021;22(1):1–15.

    Article  Google Scholar 

  36. Lee CY, McNerney C, Ma K, Zhao W, Wang A, Myong S. R-loop induced G-quadruplex in non-template promotes transcription by successive R-loop formation. Nat Commun. 2020;11(1):1–15.

    Google Scholar 

  37. Broxson C, Beckett J, Tornaletti S. Transcription Arrest by a G Quadruplex Forming-Trinucleotide Repeat Sequence from the Human c-myb Gene. 2011. Available from: https://doi.org/10.1021/bi2002136. Cited 2024 Oct 2.

  38. Smestad JA, Maher LJ. Relationships between putative G-quadruplex-forming sequences, RecQ helicases, and transcription. BMC Med Genet. 2015;16(1):1–14.

    Article  Google Scholar 

  39. Guiblet WM, DeGiorgio M, Cheng X, Chiaromonte F, Eckert KA, Huang YF, et al. Selection and thermostability suggest G-quadruplexes are novel functional elements of the human genome. Genome Res. 2021;31(7):1136–49.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Madden SK, de Araujo AD, Gerhardt M, Fairlie DP, Mason JM. Taking the Myc out of cancer: toward therapeutic strategies to directly inhibit c-Myc. Mol Cancer. 2021;20(1):1–18.

    Article  Google Scholar 

  41. D’Aria F, Pagano B, Petraccone L, Giancola C. KRAS promoter G-quadruplexes from sequences of different length: a physicochemical study. Int J Mol Sci. 2021;22(1):448.

    Article  PubMed  PubMed Central  Google Scholar 

  42. Del Toro M, Bucek P, Aviñó A, Jaumot J, González C, Eritja R, et al. Targeting the G-quadruplex-forming region near the P1 promoter in the human BCL-2 gene with the cationic porphyrin TMPyP4 and with the complementary C-rich strand. Biochimie. 2009;91(7):894–902.

    Article  PubMed  Google Scholar 

  43. Phan AT, Kuryavyi V, Burge S, Neidle S, Patel DJ. Structure of an unprecedented G-quadruplex scaffold in the human c-kit promoter. J Am Chem Soc. 2007;129(14):4386–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Wang E, Thombre R, Shah Y, Latanich R, Wang J. G-Quadruplexes as pathogenic drivers in neurodegenerative disorders. Nucleic Acids Res. 2021;49(9):4816.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Nicoletto G, Terreri M, Maurizio I, Ruggiero E, Cernilogar FM, Vaine CA, et al. G-quadruplexes in an SVA retrotransposon cause aberrant TAF1 gene expression in X-linked dystonia parkinsonis. Nucleic Acids Res. 2024;17:gkae797.

    Google Scholar 

  46. Crenshaw E, Leung BP, Kwok CK, Sharoni M, Olson K, Sebastian NP, et al. Amyloid precursor protein translation is regulated by a 3’UTR Guanine Quadruplex. PLoS ONE. 2015;10(11):e0143160.

    Article  PubMed  PubMed Central  Google Scholar 

  47. Lammich S, Kamp F, Wagner J, Nuscher B, Zilow S, Ludwig AK, et al. Translational repression of the disintegrin and metalloprotease ADAM10 by a stable G-quadruplex secondary structure in its 5’-untranslated region. J Biol Chem. 2011;286(52):45063–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Fisette JF, Montagna DR, Mihailescu MR, Wolfe MS. A G-rich element forms a G-quadruplex and regulates BACE1 mRNA alternative splicing. J Neurochem. 2012;121(5):763–73.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Koukouraki P, Doxakis E. Constitutive translation of human α-synuclein is mediated by the 5’-untranslated region. Open Biol. 2016;6(4):160022.

    Article  PubMed  PubMed Central  Google Scholar 

  50. Zhang Y, Gaetano CM, Williams KR, Bassell GJ, Mihailescu MR. FMRP interacts with G-quadruplex structures in the 3’-UTR of its dendritic target Shank1 mRNA. RNA Biol. 2014;11(11):1364–74.

    Article  PubMed  Google Scholar 

  51. McAninch DS, Heinaman AM, Lang CN, Moss KR, Bassell GJ, Mihailescu MR, et al. Fragile X mental retardation protein recognizes a G quadruplex structure within the survival motor neuron domain containing 1 mRNA 5′-UTR. Mol Biosyst. 2017;13(8):1448–57.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Haeusler AR, Donnelly CJ, Periz G, Simko EAJ, Shaw PG, Kim MS, et al. C9orf72 nucleotide repeat structures initiate molecular cascades of disease. Nature. 2014;507(7491):195–200.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Saha T, Usdin K. Tetraplex formation by the progressive myoclonus epilepsy type-1 repeat: implications for instability in the repeat expansion diseases. FEBS Lett. 2001;491(3):184–7.

    Article  CAS  PubMed  Google Scholar 

  54. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022 Apr 1; Available from: https://www.science.org/doi/10.1126/science.abj6987. Cited 2024 Oct 1.

  55. Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, et al. The complete sequence of a human Y chromosome. Nature. 2023;621:1–11.

  56. Makova KD, Pickett BD, Harris RS, Hartley GA, Cechova M, Pal K, et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature. 2024;630(8016):401–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. 1.Yoo D, Rhie A, Hebbar P, Antonacci F, Logsdon GA, Solar SJ, et al. Complete sequencing of ape genomes. Nature. 2025;641:401–18.

  58. Makova KD, Weissensteiner MH. Noncanonical DNA structures are drivers of genome evolution. Trends Genet. 2023;39(2):109–24.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Smeds L, Kamali K, Kejnovská I, Eduard Kejnovský, Chiaromonte F, Makova KD. Non-canonical DNA in human and other ape telomere-to-telomere genomes. Nucleic Acids Res. 2025;53(7).

  60. Lombardi EP, Londoño-Vallejo A. A guide to computational methods for G-quadruplex prediction. Nucleic Acids Res. 2020;48(3):1603.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Mohanty SK, Chiaromonte F, Makova K. g4Discovery: Scripts for annotating/predicting G-Quadruplexes (G4s) in a genome sequence, combining pqsfinder and G4Hunter. Github; 2024. Available from: https://github.com/makovalab-psu/g4Discovery. Cited 2025 May 14.

  62. Hon J, Martínek T, Zendulka J, Lexa M. pqsfinder: an exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics. 2017;33(21):3373–9.

    Article  CAS  PubMed  Google Scholar 

  63. Bedrat A, Lacroix L, Mergny JL. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 2016;44(4):1746.

    Article  PubMed  PubMed Central  Google Scholar 

  64. Maity A, Winnerdy FR, Chang WD, Chen G, Phan AT. Intra-locked G-quadruplex structures formed by irregular DNA G-rich motifs. Nucleic Acids Res. 2020;48(6):3315–27.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Zhang Z, Dai J, Veliath E, Jones RA, Yang D. Structure of a two-G-tetrad intramolecular G-quadruplex formed by a variant human telomeric sequence in K+ solution: insights into the interconversion of human telomeric G-quadruplex structures. Nucleic Acids Res. 2009;38(3):1009–21.

    Article  PubMed  PubMed Central  Google Scholar 

  66. Bugaut A, Balasubramanian S. A sequence-independent study of the influence of short loop lengths on the stability and topology of intramolecular DNA G-quadruplexes. Biochemistry. 2008;47(2):689–97.

    Article  CAS  PubMed  Google Scholar 

  67. Siddiqui-Jain A, Grand CL, Bearss DJ, Hurley LH. Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription. https://doi.org/10.1073/pnas.182256799. Cited 2024 Aug 20.

  68. Cozzaglio M, Ceschi S, Groaz E, Sturlese M, Sissi C. G-quadruplexes formation within the promoter of TEAD4 oncogene and their interaction with Vimentin. Front Chem. 2022;15(10):1008075.

    Article  Google Scholar 

  69. Pavlova AV, Savitskaya VY, Dolinnaya NG, Monakhova MV, Litvinova AV, Kubareva EA, et al. G-Quadruplex formed by the promoter region of the hTERT Gene: structure-driven effects on DNA mismatch repair functions. Biomedicines. 2022;10(8):1871.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  70. Fleming AM, Omaga A, Burrows CJ. NEIL3 promoter G-quadruplex with oxidatively modified bases shows magnesium-dependent folding that stalls polymerase bypass. Biochimie. 2023;1(214):156–66.

    Article  Google Scholar 

  71. Roy A, Basu D, Bose D, Dutta A, Dastidar SG, Chatterjee S. Identification and characterization of a flexile G-quadruplex in the distal promoter region of stemness gene REX1. Int J Biol Macromol. 2023;15(231):123263.

    Article  Google Scholar 

  72. Nain N, Singh A, Khan S, Kukreti S. G-quadruplex formation at human DAT1 gene promoter: effect of cytosine methylation. Biochemist Biophys Rep. 2023;1(34):101464.

    Google Scholar 

  73. Rodriguez R, Miller KM, Forment JV, Bradshaw CR, Nikan M, Britton S, et al. Small-molecule–induced DNA damage identifies alternative DNA structures in human genes. Nat Chem Biol. 2012;8(3):301–10.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Turleau C, De Grouchy J, Klein M. Chromosomal phylogeny of man and the anthropomorphic primates. (Pan troglodytes, Gorilla gorilla, Pongo pygmaeus). Attempt at reconstitution of the karyotype of the common ancestor. Ann Genet. 1972;15(4):225–40.

    CAS  PubMed  Google Scholar 

  75. Wienberg J, Jauch A, Lüdecke HJ, Senger G, Horsthemke B, Claussen U, et al. The origin of human chromosome 2 analyzed by comparative chromosome mapping with a DNA microlibrary. Chromosome Res. 1994;2(5):405–10.

    Article  CAS  PubMed  Google Scholar 

  76. Dutrillaux B, Rethoré MO, Prieur M, Lejeune J. Analysis of the structure of chromatids of Gorilla gorilla. Comparison with Homo sapiens and Pan troglodytes (author’s transl). Humangenetik. 1973;20(4):343–54.

    Article  CAS  PubMed  Google Scholar 

  77. Stanyon R, Wienberg J, Romagno D, Bigoni F, Jauch A, Cremer T. Molecular and classical cytogenetic analyses demonstrate an apomorphic reciprocal chromosomal translocation in Gorilla gorilla. Am J Phys Anthropol. 1992;88(2):245–50.

    Article  CAS  PubMed  Google Scholar 

  78. Stankiewicz P, Park SS, Inoue K, Lupski JR. The Evolutionary Chromosome Translocation 4;19 in Gorilla gorilla is Associated with Microduplication of the Chromosome Fragment Syntenic to Sequences Surrounding the Human Proximal CMT1A-REP. Genome Res. 2001;11(7):1205–10.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  79. Vinogradov AE. DNA helix: the importance of being GC-rich. Nucleic Acids Res. 2003;31(7):1838–44.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  80. Jaksik R, Rzeszowska-Wolny J. The distribution of GC nucleotides and regulatory sequence motifs in genes and their adjacent sequences. Gene. 2012;492(2):375–81.

    Article  CAS  PubMed  Google Scholar 

  81. Aksenova AY, Mirkin SM. At the beginning of the end and in the middle of the beginning: structure and maintenance of telomeric DNA repeats and interstitial telomeric sequences. Genes. 2019;10(2):118.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  82. Bromham L, Penny D. The modern molecular clock. Nat Rev Genet. 2003;4(3):216–24.

    Article  CAS  PubMed  Google Scholar 

  83. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015. RepeatMasker Open-4.0. Available from: http://www.repeatmasker.org. Cited 2024.

  84. Gershman A, Sauria MEG, Guitart X, Vollger MR, Hook PW, Hoyt SJ, et al. Epigenetic patterns in a complete human genome. Science. 2022 Apr 1; Available from: https://www.science.org/doi/10.1126/science.abj5089. Cited 2024 Aug 27.

  85. Mao SQ, Ghanbarian AT, Spiegel J, Martínez Cuesta S, Beraldi D, Di Antonio M, et al. DNA G-quadruplex structures mold the DNA methylome. Nat Struct Mol Biol. 2018;25(10):951–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  86. Halder R, Halder K, Sharma P, Garg G, Sengupta S, Chowdhury S. Guanine quadruplex DNA structure restricts methylation of CpG dinucleotides genome-wide. Mol Biosyst. 2010;6(12):2439–47.

    Article  CAS  PubMed  Google Scholar 

  87. Hoffmann MJ, Schulz WA. Causes and consequences of DNA hypomethylation in human cancer. Biochemistry and Cell Biology. 2011 Jan 24; Available from: https://cdnsciencepub.com/doi/10.1139/o05-036. Cited 2024 Oct 23.

  88. Besselink N, Keijer J, Vermeulen C, Boymans S, de Ridder J, van Hoeck A, et al. The genome-wide mutational consequences of DNA hypomethylation. Sci Rep. 2023;13(1):1–12.

    Article  Google Scholar 

  89. Xie Z, Bailey A, Kuleshov MV, Clarke DJB, Evangelista JE, Jenkins SL, et al. Gene set knowledge discovery with Enrichr. Curr Protoc. 2021;1(3):e90.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  90. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  91. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995;57(1):289–300.

    Article  Google Scholar 

  92. Huppert JL, Balasubramanian S. G-quadruplexes in promoters throughout the human genome. Nucleic Acids Res. 2007;35(6):2105–2105.

    Article  CAS  PubMed Central  Google Scholar 

  93. Lago S, Nadai M, Cernilogar FM, Kazerani M, Domíniguez Moreno H, Schotta G, et al. Promoter G-quadruplexes and transcription factors cooperate to shape the cell type-specific transcriptome. Nat Commun. 2021;12(1):1–13.

    Article  Google Scholar 

  94. Zyner KG, Simeone A, Flynn SM, Doyle C, Marsico G, Adhikari S, et al. G-quadruplex DNA structures in human stem cells and differentiation. Nat Commun. 2022;13(1):142.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  95. Vannutelli A, Ouangraoua A, Perreault JP. Toward a Better Understanding of G4 Evolution in the 3 Living Kingdoms. Evolutionary Bioinformatics. 2023; Available from: https://journals.sagepub.com/doi/10.1177/11769343231212075. Cited 2024 Oct 23.

  96. Wu F, Niu K, Cui Y, Li C, Lyu M, Ren Y, et al. Genome-wide analysis of DNA G-quadruplex motifs across 37 species provides insights into G4 evolution. Communications Biology. 2021;4(1):1–11.

    Article  Google Scholar 

  97. Brázda V, Bohálová N, Bowater RP. New telomere to telomere assembly of human chromosome 8 reveals a previous underestimation of G-quadruplex forming sequences and inverted repeats. Gene. 2022;810(146058):146058.

    Article  PubMed  Google Scholar 

  98. Bohálová N, Mergny JL, Brázda V. Novel G-quadruplex prone sequences emerge in the complete assembly of the human X chromosome. Biochimie. 2021;191:87–90.

    Article  PubMed  Google Scholar 

  99. Dobrovolná M, Mergny JL, Brázda V. Complete analysis of G-quadruplex forming sequences in the gapless assembly of human chromosome Y. Biochimie. 2024 Oct 9; Available from: https://pubmed.ncbi.nlm.nih.gov/39389449/. Cited 2024 Oct 29.

  100. Niu K, Xiang L, Li X, Li J, Li Y, Zhang C, et al. DNA 5-methylcytosine regulates genome-wide formation of G-quadruplex structures. bioRxiv. 2023. https://doi.org/10.1101/2023.02.16.528796.

  101. Chambers VS, Marsico G, Boutell JM, Di Antonio M, Smith GP, Balasubramanian S. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat Biotechnol. 2015;33(8):877–81.

    Article  CAS  PubMed  Google Scholar 

  102. Marsico G, Chambers VS, Sahakyan AB, McCauley P, Boutell JM, Antonio MD, et al. Whole genome experimental maps of DNA G-quadruplexes in multiple species. Nucleic Acids Res. 2019;47(8):3862–74.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  103. Tu J, Duan M, Liu W, Lu N, Zhou Y, Sun X, et al. Direct genome-wide identification of G-quadruplex structures by whole-genome resequencing. Nat Commun. 2021;12(1):1–9.

    Article  Google Scholar 

  104. Hoyt SJ, Storer JM, Hartley GA, Grady PGS, Gershman A, de Lima LG, et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science. 2022; Available from: https://www.science.org/doi/10.1126/science.abk3112. Cited 2024 Oct 29.

  105. Cechova M, Harris RS, Tomaszkiewicz M, Arbeithuber B, Chiaromonte F, Makova KD. High satellite repeat turnover in great apes studied with short- and long-read technologies. Mol Biol Evol. 2019;36(11):2415–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  106. Vannutelli A, Perreault JP, Ouangraoua A. G-quadruplex occurrence and conservation: more than just a question of guanine-cytosine conten. NAR Genom Bioinform. 2022;4(1):lqac010.

    Article  PubMed  PubMed Central  Google Scholar 

  107. Huppert JL, Balasubramanian S. G-quadruplexes in promoters throughout the human genome. Nucleic acids research. 2007;35(2). Available from: https://pubmed.ncbi.nlm.nih.gov/17169996/ Cited 2024 Oct 24.

  108. Tian T, Chen YQ, Wang SR, Zhou X. G-Quadruplex: a regulator of gene expression and its chemical targeting. Chem. 2018;4(6):1314–44.

    Article  CAS  Google Scholar 

  109. Bay DH, Busch A, Lisdat F, Iida K, Ikebukuro K, Nagasawa K, et al. Identification of G-quadruplex structures that possess transcriptional regulating functions in the Dele and Cdc6 CpG islands. BMC Mol Biol. 2017;18(1):1–8.

    Article  Google Scholar 

  110. Hegyi H. Enhancer-promoter interaction facilitated by transiently forming G-quadruplexes. Sci Rep. 2015;5(1):1–6.

    Article  Google Scholar 

  111. Williams JD, Houserova D, Johnson BR, Dyniewski B, Berroyer A, French H, et al. Characterization of long G4-rich enhancer-associated genomic regions engaging in a novel loop:loop “G4 Kissing” interaction. Nucleic Acids Res. 2020;48(11):5907–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  112. Huppert JL, Bugaut A, Kumari S, Balasubramanian S. G-quadruplexes: the beginning and end of UTRs. Nucleic Acids Res. 2008;36(19):6260–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  113. Qi T, Xu Y, Zhou T, Gu W. The Evolution of G-quadruplex Structure in mRNA Untranslated Region. Evolutionary Bioinformatics. 2021 Jul 21; Available from: https://journals.sagepub.com/doi/10.1177/11769343211035140. Cited 2024 Oct 23.

  114. Bugaut A, Balasubramanian S. 5′-UTR RNA G-quadruplexes: translation regulation and targeting. Nucleic Acids Res. 2012;40(11):4727.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Huppert JL, Balasubramanian S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 2005;33(9):2908–16.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  116. Varizhuk A, Ischenko D, Tsvetkov V, Novikov R, Kulemin N, Kaluzhny D, et al. The expanding repertoire of G4 DNA structures. Biochimie. 2017;1(135):54–62.

    Article  Google Scholar 

  117. Prorok P, Artufel M, Aze A, Coulombe P, Peiffer I, Lacroix L, et al. Involvement of G-quadruplex regions in mammalian replication origin activity. Nat Commun. 2019;10(1):1–16.

    Article  Google Scholar 

  118. Massip F, Laurent M, Brossas C, Fernández-Justel JM, Gómez M, Prioleau MN, et al. Evolution of replication origins in vertebrate genomes: rapid turnover despite selective constraints. Nucleic Acids Res. 2019;47(10):5114.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  119. Puig Lombardi E, Holmes A, Verga D, Teulade-Fichou MP, Nicolas A, Londoño-Vallejo A. Thermodynamically stable and genetically unstable G-quadruplexes are depleted in genomes across species. Nucleic Acids Res. 2019;47(12):6098–113.

    Article  PubMed  PubMed Central  Google Scholar 

  120. Chen J, Guo Y, Zhou J, Ju H. The effect of adenine repeats on G-quadruplex/hemin Peroxidase Mimicking DNAzyme Activity. Chemist A Eur J. 2017;23(17):4210–5.

    Article  CAS  Google Scholar 

  121. Lahnsteiner A, Craig SJC, Kamali K, Weissensteiner B, McGrath B, Risch A, et al. In vivo detection of DNA secondary structures using permanganate/S1 footprinting with direct adapter ligation and sequencing (PDAL-Seq). Methods Enzymol. 2024;23(695):159–91.

    Article  Google Scholar 

  122. Korsakova A, Phan AT. Prediction of G4 formation in live cells with epigenetic data: a deep learning approac. NAR Genom Bioinform. 2023;5(3):lqad071.

    Article  PubMed  PubMed Central  Google Scholar 

  123. O’Leary NA, Cox E, Holmes JB, Anderson WR, Falk R, Hem V, et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Sci Data. 2024;11(1):732.

    Article  PubMed  PubMed Central  Google Scholar 

  124. Harris RS. Improved pairwise alignment of genomic DNA. The Pennsylvania State University; 2007.

  125. Mohanty SK, Chiaromonte F, Makova K. mapsea: Mapping And Prediction of Shared Elements in Alignments. Github; 2024. Available from: https://github.com/makovalab-psu/mapsea/. Cited 2025 May 14.

  126. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  127. Dale RK, Pedersen BS, Quinlan AR. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics. 2011;27(24):3423–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  128. Tange O. GNU Parallel 20250222 (’Grete Tange'). Zenodo; 2025. https://doi.org/10.5281/zenodo.14911163.

  129. Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. UpSet: Visualization of Intersecting Sets. Available from: https://ieeexplore.ieee.org/document/6876017. Cited 2024 Aug 14.

  130. Swofford DL. PAUP. Phylogenetic analysis using parsimony ( and other methods). Version 4. Sinauer Associates, Sunderland. 2003. Available from: https://www.scirp.org/reference/referencespapers?referenceid=1085917. Cited 2024 Aug 20.

  131. Stecher G, Tamura K, Kumar S. Molecular Evolutionary Genetics Analysis (MEGA) for macOS. Mol Biol Evol. 2020;37(4):1237–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  132. Akerman I, Kasaai B, Bazarova A, Sang PB, Peiffer I, Artufel M, et al. A predictable conserved DNA base composition signature defines human core DNA replication origins. Nat Commun. 2020;11(1):4826.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  133. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al. The UCSC genome browser database: update 200. Nucleic Acids Res. 2006;34(Database issue):D590-8.

    Article  CAS  PubMed  Google Scholar 

  134. Karolchi D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. The UCSC table browser data retrieval too. Nucleic Acids Res. 2004;32(Database issue):D493-6.

    Article  Google Scholar 

  135. Miklem G, Hillier L. CpG Islands CpG Islands Track Settings. 2022. CpG Islands. Available from: https://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cpgIslandExt. Cited 2024 Oct 2.

  136. Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987;196(2):261–82.

    Article  CAS  PubMed  Google Scholar 

  137. Guarracino A, Mwaniki N, Marco-Sola S, Garrison E. wfmash: a pangenome-scale aligner. 2021 Sep 9; Available from: https://zenodo.org/records/6949373. Cited 2024 Sep 24.

  138. Mohanty SK, Chiaromonte F, Makova K. Evolutionary Dynamics of G-Quadruplexes across Great Ape Telomere-to-Telomere Genomes. Github; 2024. Available from: https://github.com/makovalab-psu/GreatApeT2T-G4s. Cited 2025 May 14.

  139. Mohanty SK, Chiaromonte F, Makova K. Evolutionary dynamics of G-quadruplexes in human and other great ape telomere-to-telomere genomes. Zenodo; 2024. https://doi.org/10.5281/zenodo.15007392.

Download references

Acknowledgements

We are grateful to Xinru Zhang, Linnéa Smeds, Edmundo Torres-González, Jacob Sieg, and Christian Huber for discussing the results and providing useful suggestions. Kaivan Kamali dockerized the pqsfinder package for G4 prediction using fasta files. Karol Pál provided code for extracting protein-coding genes from the great ape gene annotation files. Yong Hwee Eddie Loh, Soojin Yi, and Ryan Son provided useful insights for methylation data analysis.

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Funding

This research was supported by grants R01GM136684 and R35GM151945 and by the Willaman Chair Endowment Fund from the Eberly College of Science to KDM. Computations were performed at the Penn State Institute of Computational Data Sciences.

Author information

Authors and Affiliations

Authors

Contributions

SKM: conceptualization, data curation, formal analysis, investigation, methodology, software, visualization, writing–original draft. FC: statistical analysis, writing–review & editing. KDM: conceptualization, funding acquisition, methodology, project administration, resources, supervision, validation, writing–review & editing.

Corresponding author

Correspondence to Kateryna D. Makova.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mohanty, S.K., Chiaromonte, F. & Makova, K.D. Evolutionary dynamics of predicted G-quadruplexes in human and other great apes. Genome Biol 26, 161 (2025). https://doi.org/10.1186/s13059-025-03635-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13059-025-03635-1

Keywords