这是indexloc提供的服务,不要输入任何密码
Skip to main content
BMC is moving to Springer Nature Link. Visit this journal in its new home.

Diverse short tandem repeat sequences influence gene regulation in human populations

Abstract

Background

Short tandem repeat (STR) length is a known determinant of pathogenicity in a variety of human disorders. The repeat sequence itself can modulate disease severity and penetrance; however, the broader impact of STR sequence variation on gene expression in the general population remains poorly understood.

Results

Here, we analyze the sequence composition of STRs across two general population cohorts of unrelated individuals (n = 3,150) and report that ~ 7% of STRs exhibit sequence variability, with distinct patterns observed among different ethnic groups. These variable repeats are more prone to expansion and are frequently found in proximity to Alu elements. Notably, STRs with variable motifs are often found near splice junctions of genes involved in brain and neuronal functions. This is supported by the differential expression of genes associated with neuron and cellular projection functions, driven by the presence of distinct STR sequences.

Conclusions

Our findings underscore the previously unrecognized role of STR sequence variability in modulating gene expression and contributing to human phenotypic diversity.

Peer Review reports

Background

Tandem DNA repeats consist of sequence units, or motifs, located adjacent to each other, which account for ~ 7% of the human genome [1]. STRs, defined as tandem repeats with motif size of 1–6 bp, are responsible for over 60 monogenic diseases, including fragile X syndrome and Huntington disease. They are also associated with multiple complex disorders, such as autism spectrum disorder, schizophrenia, cancer, and cardiomyopathies [2,3,4,5,6,7]. Studies of the biological functions of STRs primarily focus on their variability in length, given that most of the known repeat-associated diseases are caused by expansion of repeat size beyond particular thresholds [8, 9]. While variations in STR length play a role in regulating gene expression [10, 11] and splicing [12], sequence variation in STRs can modulate gene expression by influencing the affinity of DNA binding factors at their corresponding domains [13, 14].

While sequence variability is well documented in variable number tandem repeats (VNTRs; tandem repeats with motif size > 6) [15, 16], it remains largely uncharacterized in STRs. This is mainly due to the limited technologies to sequence through expanded STRs, particularly for those with high GC content [17]. The most recognized sequence variation in STRs is the small interrupting sequences in some regions associated with monogenic repeat disorders, which reduce repeat instability [18]. With the advancement of technologies and methods for STR detection, more diseases have been found to be caused by expansion of repeat sequences not present in the reference genome, or “inserted” repeats. This is exemplified by (TTTCA)n repeat insertion in familial adult myoclonic epilepsies (FAMEs) (e.g., type 1 2, 3, 4, 6 and 7), (TGGAA)n repeat insertion in spinocerebellar ataxia (SCA) type 31, and (AAGGG)n repeat insertion in cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS) [19,20,21,22,23,24,25]. Intriguingly, all these diseases involve the expansions of alternative motif sequences from AT-rich repeats, of which most are located proximal to Alu elements [26].

Alu elements are transposable elements that comprise ~ 11% of the human genome. They are ubiquitously inserted in the human genome by retrotransposition, which may lead to DNA insertion, deletion, and recombination [27]. Some events can be advantageous by altering gene expression or splicing, and the associated Alu elements can be fixed through positive selection [28,29,30,31]. The accompanying poly(A) tails will subsequently accumulate mutations and often eventually form STRs [32]. It has thus been suggested that disease-causing STRs with variable sequence composition (variable STRs) may have emerged from the poly(A) tails of Alu elements [33].

To elucidate the characteristics and functions of variable STRs, we leverage genome sequence data of 3,150 individuals from the 1000 Genomes Project (1000G) and the Genotype-Tissue Expression Project (GTEx). Since 1000G samples are derived from lymphoblastoid cell lines, which are known to have elevated rates of somatic mutations [34, 35], we include GTEx samples—derived from human tissues—to mitigate this potential bias and assess the robustness of our findings. We compare the genomic characteristics of variable STRs with non-variable STRs (those with no sequence variation) in the two cohorts and further study the effect of sequence composition changes on expression of nearby genes in 49 human tissues. ExpansionHunter denovo (EHdn) is a catalogue-free method capable of identifying novel repeat expansions in the human genome [36]. Its application on diverse cohorts has revealed extensive polymorphism in repeat size and motif both in patient samples and unaffected controls [5, 7, 37, 38].

Results

STRs with variable sequence composition are detected in the general population

We identified 13,977 STRs on autosomes in GTEx project samples and 7,774 STRs on autosomes in 1000G samples. Among STRs detected in GTEx, 1,023 (7.3%) had more than one motif (i.e., variable STRs), whereas 12,954 (92.7%) had a single motif (i.e., non-variable STRs). Similarly, among STRs detected in 1000G, there were 556 (7.2%) variable STRs and 7,218 (92.8%) non-variable STRs.

For the majority of non-variable STRs, their motifs had a GC-composition around 50%, while in variable STRs, the GC-composition of the dominant motif—defined as the most prevalent repeat unit across all samples—was more broadly distributed (Fig. 1a and b). Overall, the GC-composition of the dominant motifs of variable STRs was significantly lower than of the motifs of non-variable STRs (Fig. 1c and Additional file 1: Fig. S1).

Fig. 1
figure 1

Characteristics of STRs in the general population. a, b Distribution of the GC-composition of the motif in non-variable (a) and variable (b) STRs. The x-axis represents GC-composition, and the y-axis represents frequency. c Boxplot showing the GC-composition in non-variable (green) and variable (red) STRs. P-value reported is of the two-sided Wilcoxon’s test. d Association analysis between the number of non-variable and variable STRs and different genomic features (x-axis) based on 1 kb segments. Odds ratios (y-axis) were derived from logistic regression coefficients of the genomic features with 95% confidence intervals

Non-variable STRs were more often found near fragile sites, but variable STRs were depleted in known fragile sites. Both variable and non-variable STRs were significantly enriched in genomic regions with high GC-content and underrepresented at sequence-conserved regions (Fig. 1d and Additional file 2: Table S1).

Variable DNA sequences in known STR regions and across populations

While both variable and non-variable STRs were significantly enriched in known STR regions (Additional file 2: Table S1), a higher proportion of variable STRs overlapped with known STRs (95% versus 57% for non-variable STRs in 1000G cohort, and 96% versus 60% for non-variable STRs in GTEx cohort). Variable STRs overlapped known STR regions more often as compared to non-variable STRs (OR1000G = 14.2; two-sided Fisher’s exact test p1000G = 1.7 × 10−87 and ORGTEx = 18.6; two-sided Fisher’s exact test pGTEx = 8.5 × 10−158) (Fig. 2a and Additional file 1: Fig. S2a), suggesting that variable sequences in STRs were underrecognized.

Fig. 2
figure 2

Comparison of non-variable and variable STRs. a, b, c Proportion of known STRs (a), expanded STRs (b), and known disease-causing regions (c) among variable STRs (red bars) and non-variable STRs (green bars). d Frequency of alternative motifs (different from the most abundant motif in the variable STR regions) across different populations in the 1000G cohort. Bars represent frequency across populations. Asterisks indicate significance level of p < 0.001

Using our outlier detection method, we detected 356 and 1,306 expansions (0.55 per sample in GTEx and 0.52 per sample in 1000G) among the 646 GTEx samples and 2,504 1000G samples, respectively. Variable STRs were expanded more often than non-variable STRs (Fig. 2b and Additional file 1: Fig. S2b). Variable STRs were also significantly enriched at regions associated with monogenic repeat disorders, compared to non-variable STRs (Fig. 2c and Additional file 1: Fig. S2c) (Additional file 2: Table S2).

The frequency of alternative motifs (those that are different from dominant motif in variable STR regions) was significantly higher in the African population as compared to other ethnic groups (two-sided Wilcoxon’s test p < 0.001) (Fig. 2d). In particular, the frequencies of alternative motifs in STRs associated with monogenic repeat disorders were population-specific. For the repeats in BEAN1, the highest frequency of alternative motifs was found among individuals of South Asian ancestry (Additional file 1: Fig. S3). For the repeats in RFC1, the highest frequency of alternative motifs was found among individuals of East Asian ancestry (Additional file 1: Fig. S4). For the repeats in STARD7, all individuals of East Asian ancestry carried an alternative motif (AAAAT) different from the dominant (AAACT), and individuals of European ancestry carried the most alternative motifs, including AAAAT, AACAT, AACTG, and AAAATAACATAACAT (Additional file 1: Fig. S5). Other notable examples of regions with alternative motifs include CACNB1 (Additional file 1: Fig. S6) and FGF14 (Additional file 1: Fig. S7).

Variable STRs more often locate proximal to recently emerged Alu elements

We found that variable STRs were enriched near mobile elements belonging to SINE and LINE families. The enrichment of variable over non-variable STRs was highest in direct proximity (distance = 0, ORGTEx = 2.96 and false discovery rate (FDR)-corrected pGTEx = 5.5 × 10−60 and OR1000G = 4.1 and FDR-corrected p1000G = 1.4 × 10−54) and within 1kb (ORGTEx = 1.91 and FDR-corrected pGTEx = 7 × 10−18 and OR1000G = 2.23 and FDR-corrected p1000G = 1.2 × 10−14) to SINE (Fig. 3a and Additional file 2: Table S3). The enrichment of variable over non-variable STRs was significant in direct proximity to LINE elements (Fig. 3b and Additional file 2: Table S3), although the enrichment was less pronounced than that observed for SINE elements.

Fig. 3
figure 3

Characteristics of STRs near mobile elements. a, b Enrichment of variable sequence STRs over non-variable sequence STRs in proximity to SINE (a) and LINE (b) elements. The x-axis represents the distance from a transposable element; the y-axis represents the Fisher’s exact test odds ratio. Diamonds represent GTEx cohort samples; circles represent 1000G cohort samples. The color represents FDR. c, d Enrichment of variable STRs over non-variable STRs in Alu (c) and L1 (d) subfamilies. The x-axis represents subfamilies; the y-axis represents Fisher’s test odds ratio. Diamonds represent GTEx cohort samples; circles represent 1000G cohort samples. The color represents FDR. e Distribution of the motif for variable STRs (upper plot, red) and non-variable STRs (lower plot, green) in STRs near the 3′ end of Alu elements in GTEx cohort samples. The x-axis represents motif; the y-axis represents the fraction of the motif among STRs of the corresponding group. Bright red bars with black border indicate motifs significantly enriched in variable STRs over the non-variable STRs; bright green bars with black border indicate motifs significantly enriched in non-variable STRs over the variable STRs. f, g Boxplot showing GC composition of motifs in non-variable STRs, dominant (most frequent) motifs among variable STRs (VADdom), and alternative (less frequent) motifs among variable STRs (VARalt) in GTEx samples (f) and 1000G samples (g). The x-axis represents motif groups; the y-axis represents GC content. Two-sided Wilcoxon’s test p-values are labelled for groups with significantly different GC content

Variable STRs were significantly enriched in all three major Alu subfamilies (AluJ, AluS and AluY) in both cohorts, with the odds ratio being the highest in AluY, the one most recently emerged, followed by AluS and AluJ, the most ancient one (Fig. 3c and Additional file 2: Table S4). The effect was not observed in VNTRs (Additional file 1: Fig. S8b). Among L1 subfamilies, variable STRs were significantly enriched in L1MB, L1ME, and L1PA in 1000G cohort (Fig. 3d and Additional file 2: Table S4). In VNTRs, the two subfamilies with significant enrichment were L1MB and L1MC (Additional file 1: Fig. S8d).

The most frequent motifs among both non-variable and variable STRs near the 3’ end of Alu in GTEx cohort were AAAG, AAGG, AAG, AG, AC, and AAAAT (ordered by descending frequency). Non-variable STRs were significantly enriched in AAGG and AC. Variable STRs were significantly enriched in AAG, AGGG, AGGGG, AGAGG, and AGG (Fig. 3e and Additional file 2: Table S5). Similar patterns are found in 1000G cohorts (Additional file 1: Fig. S9 and Additional file 2: Table S5).

The GC-content was significantly higher in the dominant motifs of variable STRs as compared to the motifs of non-variable STRs near the 3′ end of Alu (two-sided Wilcoxon’s test p = 2.6 × 10−2 in GTEx) (Fig. 3f). The GC-content was significantly higher in the alternative motifs of variable STRs as compared to the GC-content of the motifs of non-variable STRs (two-sided Wilcoxon’s test p = 1.4 × 10−5 in GTEx) near the 3′ end of Alu (Fig. 3f). There was no significant difference in the GC composition between the variable and non-variable STRs in 1000G cohort (Fig. 3g).

Variable STRs are associated with gene splicing and brain functions

Variable STRs were enriched in splice junctions (exonic;splicing) in both cohorts (ORGTEx = 2.17 and FDR-corrected pGTEx = 1.8 × 10−4 and OR1000G = 2.71 and FDR-corrected p1000G = 1.1 × 10−3), and in ncRNA_exonic; ncRNA_splicing regions in GTEx cohort (ORGTEx = 3.18 and FDR-corrected pGTEx = 1.8 × 10−4) (Fig. 4a and Additional file 2: Table S6).

Fig. 4
figure 4

Functional characteristics of STRs. a Burden of variable STRs over non-variable STRs across different genic regions. The x-axis represents genic annotation from ANNOVAR, the y-axis represents regression-based odds ratio. Diamonds represent GTEx samples, circles represent 1000G samples. The color represents FDR. Horizontal dashed line represents odds ratio = 1. b Enrichment of variable STRs over non-variable STRs in regions with significant TR-splicing associations (spl-TR) across human tissues. The x-axis represents tissue type, the y-axis represents the odds ratio of Fisher’s exact test in variable STRs over non-variable STRs. Color represents tissue type. Bright colored bars with black border and asterisks indicate tissues with significant enrichment of variable STRs among spl-TRs. Horizontal dashed line represents odds ratio = 1. c Gene sets enriched in genes with variable STRs. The x-axis represents -log10(q-value), the y-axis represents gene sets significantly enriched in genes with variable STRs in GTEx. The color of the dots represents q-value. The size of the dot represents the ratio of the genes with variable STRs among the genes in the genome. d Network of genes and gene sets enriched in variable STRs

Tandem repeats are known to be associated with gene splicing in different tissues (spl-TR) (12). We compared the overlap of spl-TR between variable and non-variable STRs across 49 human tissues. We found that variable STRs were significantly enriched (two-sided Fisher’s exact test p < 0.05) for spl-TRs in 4 different brain tissues including Hippocampus, Hypothalamus, Nucleus accumbens (basal ganglia), Putamen (basal ganglia), and also enriched in Spleen and Whole Blood (Fig. 4b and Additional file 2: Table S7).

Out of 10 gene sets that were significantly enriched (FDR-correct p or q-value < 0.2) among the genes with variable STRs in GTEx cohort, six were associated with the terms “neuron” and “axon,” including “Main axon,” “Axon,” “Neuron recognition,” “Neuron projection guidance,” “Axonogenesis,” and “Axon ensheathment” (Fig. 4c and Additional file 2: Table S8).

Out of 10 gene sets that were enriched among the genes with variable STRs in 1000G cohort, six were associated with the terms “growth,” including “Developmental growth,” “Developmental cell growth,” “Regulation of cell growth,” “Positive regulation of growth,” “Regulation of developmental growth,” and “Developmental growth involved in morphogenesis” (Additional file 2: Table S8).

Differential gene expression between variable sequences in STR

Comparing the gene expression between samples with dominant motifs and samples with alternative motifs in the same STR region revealed differential expression of 89 genes in 46 tissues. Among 89 genes that were differentially expressed in samples with variable STRs, 42 had a higher expression in the presence of an alternative motif, while 57 had a lower expression in the presence of an alternative motif (Fig. 5a).

Fig. 5
figure 5

STRs motif in gene expression regulation. a Volcano plot showing genes that are differentially expressed in samples that have dominant (most frequent) motif and samples that have alternative (less frequent) motif in the same variable STR. The x-axis represents log fold change (logFC) for expression level between alternative and dominant motif-containing samples. The y-axis represents -log10(p-value). Each dot represents one STR in one tissue, labelled by the nearby gene name. Color represents tissue. Red horizontal dashed line represents significance level = 0.05. Vertical dashed lines represent logFC = ± 1. b Boxplot for GC-composition in dominant motifs (VARdom) as compared to alternative motifs (VARalt) in the variable STRs that are downregulated in the presence of alternative motif. The x-axis represents motif type, the y-axis represents GC-composition. Two-sided Wilcoxon’s test p-value is labelled on top. c Gene sets significantly enriched in genes that are downregulated in the presence of alternative motif. The x-axis represents -log10 of the adjusted p-value. The y-axis represents gene set. The size of the dot represents number of genes in a corresponding gene set. The color represents the gene set group: cellular component (CC), biological process (BP) or molecular function (MF) from GO, or WikiPathways (WP). d Gene sets significantly enriched in genes that are upregulated in the presence of alternative motif. The x-axis represents -log10 of the adjusted p-value. The y-axis represents gene set. The size of the dot represents number of genes in a corresponding gene set. The color represents the gene set group. e, f Top two genes that are upregulated in the presence of an alternative motif: RPSAP58 (e) and ZP3 (f). The x-axis represents logFC for expression level between alternative and dominant motif-containing samples. The y-axis represents -log10(p-value). Grey dots represent all other genes. Only genes with |logFC|> 1, and p-value < 0.05 are shown. Large colored dots represent corresponding gene (RPSAP58 in e, and ZP3 in f). Color represents tissue. Gene name and the dominant motif are labelled inside the box. Red horizontal dashed line represents significance level = 0.05. Vertical dashed lines represent logFC = ± 1

The GC-composition of alternative motifs in STRs associated with downregulated genes was significantly higher than that of the dominant motifs (two-sided Wilcoxon’s p = 1.9 × 10−3) (Fig. 5b). Downregulated genes in the presence of an alternative motif in STRs were significantly associated with functions such as neuron projection, cell periphery, and axon (Fig. 5c and Additional file 2: Table S9). Upregulated genes in the presence of an alternative motif in STR were significantly associated with functions such as cell junction organization, plasma membrane: plasma membrane region, plasma membrane bounded cell projection (Fig. 5d and Additional file 2: Table S9). There was no significant difference in GC-composition between alternative motifs and dominant motifs in upregulated genes (Additional file 1: Fig. S10).

One of the top genes that was upregulated in the presence of alternative motifs was RPSAP58 (RPSA2, Ribosomal Protein SA 2). RPSAP58 had a higher expression in the presence of an AG (alternative motif) as compared to AAGG (reference motif) in 12 tissues, including Heart – Left Ventricle, Heart – Atrial Appendage, Lung, and other functions (Fig. 5e and Additional file 2: Table S10). Other notable genes that were upregulated in the presence of alternative motifs were ROBO2, BEAN1, and NGB (Additional file 1: Fig. S11). One of the top genes that was downregulated in the presence of alternative motifs was ZP3 (Zona Pellucida Glycoprotein 3). ZP3 had lower expression in the presence of an AAAG (alternative motif) as compared to AG (reference motif) in eight tissues, including Adipose—Subcutaneous, Artery—Aorta, Brain—Frontal Cortex (BA9), and other functions (Fig. 5f and Additional file 2: TableS10). Other genes that were downregulated in the presence of an alternative motif were HOXC8, EPHX2, and IL1R2 (Additional file 1: Fig. S12).

Discussion

Tandem repeats with variable motif length have been associated with gene expression [9] and splicing [10], but little is known about the effect of their sequence composition outside of disease contexts. In this study, we examined sequence composition variation in STRs across two general population cohorts and identified key features that suggest their potential origin and function. Our analysis is constrained by the methodology used to define repeat motifs, which captures only the majority motif present in each individual and does not account for sequence interruptions or complex motifs. Thus, we are unable to assess the internal structure or heterogeneity within STR alleles. Additionally, our approach focuses on sequence-level variation rather than repeat length variation, and the relationship between sequence variability and repeat expansion remains unresolved. While we observe that variable STRs are more frequently associated with expansion events, it is unclear whether sequence alterations drive instability or arise as a consequence of it. Future studies leveraging family-based long-read sequencing data will be useful to establish the causal relationships and characterize the sequence complexity within repeats.

We discovered that variable STRs were enriched near SINEs, particularly Alu elements. Intriguingly, while the number of variable STRs near AluS, the most active element among Alu, was the highest, variable STRs were more prominent than non-variable STRs in AluY, the most recent subfamily. This suggests that variable STRs may have been presented in the human genome more recently, after the divergence of the most recent Alu subfamily. Non-variable STRs near the 3’ end of Alu were enriched for A-containing motifs, such as AAAG and AAAAT, while variable STRs were enriched in motifs that had a higher GC content, such as AGGG, AGGGG, and AGAGG. This suggests that such STRs originated from the A-rich 3’ end of the Alu sequence, with subsequent mutations turning them into the more GC-rich motifs, consistent with the findings in most of the known pathogenic pentanucleotide repeat expansions [26]. While the observed co-localization and sequence similarity support the hypothesis that some STRs originated from poly(A) tails of retrotransposons, alternative hypotheses such as co-localization due to shared genomic features or preferential retention in Alu-rich regions due to potential functional roles in gene regulation and chromatin organization [27, 39, 40] may also account for the observed patterns. Additional genomic analyses will be necessary to clarify the causal relationship between Alu mobility and STR variability and to explore alternative hypotheses regarding the evolutionary origin of these STRs.

We found that variable sequence STRs are frequently located at splice junctions of genes linked to molecular functions related to “neuron” “axon,” and “growth” A comparison with the existing data on splicing-associated TRs (spl-TRs) across 49 human tissues revealed that these variable STRs are enriched in spl-TR within four brain regions: hippocampus, hypothalamus, nucleus accumbens, and putamen. These regions are key to motor control, learning, language, reward, cognitive functioning, and addiction. This suggests that variable STRs may influence human brain development and functioning. Furthermore, our analysis identified differential expression of genes, such as ROBO2 and RPSAP58, with the presence of alternative motifs. Most of these genes are involved in neuron and cellular projection processes. Alternative motifs in these genes may thus provide an additional variation and contribute to shaping the diversity of the human genome.

The frequency of alternative motifs in the variable STRs was significantly higher in individuals of African ancestry as compared to individuals of other ancestries. Additionally, we found many undescribed alternative motifs among the STRs linked to monogenic repeat disorders. For the majority of the regions that are known to harbor alternative disease-causing motifs, we did not observe these pathogenic motifs in the general population, likely because the associated diseases are rare. Instead, for regions with known disease-causing alternative motifs—such as BEAN1, MARCHF6, RFC1, and STARD7—we identified multiple other motifs across both cohorts (Additional file 2: Table S11 and Additional file 1: Fig. S3-7). Notably, we found a known pathogenic motif (AAGGG) in RFC1 in a few samples across both cohorts, which aligns with the available long-read sequencing data [41] (Additional file 2: Table S12). To better understand the disease relevance of the variable sequence STRs, further investigation of the sequence heterogeneity for these regions in a larger general population is needed.

Interestingly, in one of the known disease-associated repeat regions, STARD7, the dominant motif (AAACT) was different from the reference sequence (AAAAT) (Additional file 1: Fig. S5). Inability to detect the reference motif could attribute to the complex structure of the repeat region and limitations of the available methods, i.e., short-read sequencing technology and algorithm for tandem repeat detection, since here we only investigated repeat lengths that were longer than the sequence read length. Long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) platforms, enable study of repetitive and complex genomic regions, including long tandem repeat sequences. The use of long-read sequencing technologies facilitated the T2T-CHM13 reference assembly, uncovering an additional 8% of the genome, and allowed detailed characterization of human centromeres and telomeres [42,43,44,45]. The increasing availability of datasets has led to the development of tools for analyzing long-read sequencing data, particularly for tandem repeat detection and sizing [46,47,48]. While long-read sequencing still faces challenges related to data availability and analytical tools, it is steadily becoming a new standard in genomic research and clinical diagnostics. Future studies leveraging long-read sequencing to explore tandem repeat sequence variation may further illuminate its contribution to human phenotypic diversity.

Conclusions

In this study we demonstrated sequence composition variation in STRs across two general population cohorts. We identified key features that suggest their potential origin from the A-rich 3’ end of the Alu sequence, with subsequent mutations turning them into the more GC-rich motifs and function. We found many undescribed alternative motifs among the known disease-causing regions; however, further investigation of the sequence heterogeneity for these regions in a larger general population is needed.

Methods

Genome sequencing data and samples

We used WGS data from 2,504 participants from the 1000 Genomes Project (1000G). We downloaded genome sequencing data for the 1000 Genomes Project via Amazon Web Services (s3://1000genomes/1000G_2504_high_coverage/data). DNA from 1000 Genomes Project samples was extracted from lymphoblastoid cell lines, prepared with a PCR-free library kit, and sequenced on the Illumina NovaSeq 6000 platform (2 × 150-bp paired-end reads. DNA sequence data was aligned to the GRCh38/hg38 human reference genome.

For replication we used WGS data from 646 participants from the Genotype Tissue Expression Project (GTEx). We downloaded genome sequencing data for the GTEx Project via AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space). DNA from GTEx Project samples was extracted from blood, prepared with a PCR-free library kit, and sequenced on the Illumina HiSeqX platform (2 × 150-bp paired-end reads). DNA sequence data was aligned to the GRCh38/hg38 human reference genome.

Transcriptome sequencing data

RNA-seq was performed using the Illumina TruSeq library construction protocol (non-stranded, polyA + selection). Flow cell cluster amplification and sequencing were performed according to the manufacturer’s protocols using either the HiSeq 2000 or HiSeq 2500. Sequencing generated 76 bp paired-end reads and an eight-base index barcode read and was run with a coverage goal of 50 M reads (the median achieved was ~ 82 M total reads). RNA quality was assessed by RIN (RNA Integrity Number, as measured by Agilent Bioanalyzer). All samples with a RIN of 6.0 or higher were considered for RNA sequence analysis.

Detection of tandem repeat expansions

We used ExpansionHunter Denovo (EHdn) to detect tandem repeats that were larger than the sequence read length (150 bp) from short-read WGS data. The definition of the repeat motif is the smallest sequence unit (length 2–20 bases) that, when repeated, best explains the periodic structure of a read. This unit is chosen by detecting significant repeating patterns in the read, correcting for sequencing noise, and standardizing for orientation and offset to ensure consistent labeling [36]. We used EHdn scripts (github/Illumina/ExpansionHunterDenovo) (minCount = 2 parameter in the script compare_anchored_irrs.py) to select regions where at least one sample had C ≥ 2, where C = A * 40/R (A = raw count of anchored in-repeat reads for that region; R = the average read depth of the sample calculated by EHdn).

We applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to identify outliers with repeat size exceeding those in the vast majority of samples (clusters) for that region (defined as a tandem repeat expansion) [49]. We used default DBSCAN parameters: the minimum number of samples in the cluster (minPts) was set to 15 and the maximum allowed distance between the points of the same cluster (epsilon) was set to 2.

Burden analysis

Regions with TREs were assigned to a certain genic feature using ANNOVAR [50]. In the case when one TRE overlapped several genic features, the annotation was prioritized based on the effect in the following order: exonic, splicing (including “splicing” and “exonic;splicing”), UTR5, UTR3, intronic, upstream, downstream, and intergenic. To compare the burden of tandem repeat expansions, we performed a logistic regression analysis by regressing the number of rare tandem repeat expansions in each of the genic features among the STRs with variable sequence composition as compared to STRs with reference sequence composition, with the total number of rare TREs as a covariate. We used a one-sided Wald test assuming higher burden in STRs with variable sequence composition. We used only regions located on autosomes to avoid sex bias.

Gene set enrichment analysis with GeneMANIA

We used GeneMANIA Cytoscape [51,52,53] plugin to analyze genes with variable STRs. The list of genes (n = 450 for GTEx cohort samples and n = 265 genes for 1000G) was loaded to Cytoscape as input, resulting in 10 enriched gene sets in each cohort. To plot the enrichment map, we selected genes that belong to enriched gene sets.

Gene set enrichment analysis with GO terms

We used the terms from Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome (R-HSA) databases, restricting to the gene sets with the number of genes between 100 and 1000 (n = 2,453 entries). For each gene set, we calculated the Fisher’s exact test as the fraction of genes from the set that have variable STRs over the total number of genes with variable STRs, versus the number of genes from the set with non-variable STRs over the total number of genes with non-variable STRs. We corrected the resulting Fisher’s test p-value within each database (GO, KEGG and R-HSA) using the Benjamini–Hochberg method.

Overlap with the mobile elements

To investigate overlap with the mobile elements, we obtained the coordinates of all repetitive sequences from UCSC (n = 5,655,664 entries). We selected repetitives that belong to “LINE” or “SINE” repeat class (repClass), and for “SINE” class we selected those that belong to “Alu” repeat family (repFamily) (n = 2,870,078 entries). We then overlapped the coordinates of the tandem repeats with the coordinates of mobile elements and calculated the Fisher’s exact test as the fraction of variable STRs that overlap mobile elements over the total number of variable STRs, versus the number of non-variable STRs that overlap the mobile elements over the total number of non-variable STRs. We performed this procedure for each class of mobile elements (LINE and SINE), extending the coordinates of the mobile elements by 0, 1,000, 5,000, 10,000, and 20,000 base pairs.

Motif composition near Alu elements

To study the sequence composition of STRs at the 3′ end of Alu, we used the coordinates and the strand information from the UCSC simple repeat table. We then assigned these repeats either “variable” or “non-variable” type based on the overlap with our tandem repeats. We calculated the frequency of repeat motif for those repeats located near the 3′ of Alu, separately for non-variable repeats, variable repeats (using the dominant, or the most frequent motif), and for alternative motifs (all other motifs in the region that were different form the dominant). To assess the enrichment of a particular motif, we calculated Fisher’s exact test for the fraction of a particular motif among variable motifs versus the fraction of the same motif among non-variable motifs. We defined motifs enriched in variable STRs as the motifs with the odds ratio > 1 and p-value < 0.05, and we defined motifs enriched in non-variable STRs as the motifs with the odds ratio < 1 and p-value < 0.05. To define motifs enriched among alternative STRs, we calculated the Fisher’s exact test for the fraction of a particular motif among the alternative motifs versus the fraction of the same motif among all motifs. We defined motifs enriched in alternative STRs as the motifs with the odds ratio > 1 and p-value < 0.05.

Assessment of gene expression

To study the expression of genes containing variable STRs, we obtained gene expression data from GTEx project for each tissue in normalized counts per gene per sample. We then calculated the log2FC of the median TPM and two-sided Wilcoxon’s test p-value for each gene within a single tissue between samples containing dominant (most frequent) motif and alternative (all other motifs in the region that were different form the dominant) motifs. We defined differentially expressed genes as genes with p-value < = 0.05 and |log2FC|> 1. Genes with log2FC > 1 were defined as upregulated, and genes with log2FC < 1 were defined as downregulated.

Enrichment of genes with differential expression

To assess the enrichment of genes with differential expression, we used gProfiler2 R package [54] and ran it separately for genes that were upregulated and genes that were downregulated. We used the resulting adjusted p-value and the number of genes in each gene set to plot the results.

Data availability

The GitHub library with the code for the downstream analysis of EHdn data is available on Zenodo under ‘Tandem repeat expansions in ASD’ doi.org/https://doi.org/10.5281/zenodo.8083691 . The GitHub library with the code for statistical analysis and figures is available on Zenodo under ‘Alternative motifs in healthy individuals’ doi.org/https://doi.org/10.5281/zenodo.14009777 . The data generated in this study are available in the main text or the additional files.

References

  1. Hannan AJ. Expanding horizons of tandem repeats in biology and medicine: why ‘genomic dark matter’ matters. Emerg Top Life Sci. 2023;7:239.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Rajan-Babu I-S, Dolzhenko E, Eberle MA, Friedman JM. Sequence composition changes in short tandem repeats: heterogeneity, detection, mechanisms and clinical implications. Nat Rev Genet. 2024. https://doi.org/10.1038/s41576-024-00696-z.

    Article  PubMed  Google Scholar 

  3. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet. 2018;19:286–98.

    Article  CAS  PubMed  Google Scholar 

  4. Mitra I, et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature. 2021;589:246–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Trost B, et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020;586:80–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Erwin GS, et al. Recurrent repeat expansions in human cancer genomes. Nature. 2023;613:96–102.

    Article  CAS  PubMed  Google Scholar 

  7. Mitina A, et al. Genome-wide enhancer-associated tandem repeats are expanded in cardiomyopathy. EBioMedicine. 2024. https://doi.org/10.1016/j.ebiom.2024.105027.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Gall-Duncan T, Sato N, Yuen RKC, Pearson CE. Advancing genomic technologies and clinical awareness accelerates discovery of disease-associated tandem repeat sequences. Genome Res. 2022;32:1–27.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Depienne C, Mandel J-L. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet. 2021;108:764–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Gymrek M, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48:22–9.

    Article  CAS  PubMed  Google Scholar 

  11. Fotsing SF, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51:1652–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Hamanaka K, et al. Genome-wide identification of tandem repeats associated with splicing variation across 49 tissues in humans. Genome Res. 2023;33:435–47.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Horton CA, et al. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science. 2023;381:eadd1250.

    Article  CAS  PubMed  Google Scholar 

  14. de Boer CG, Taipale J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature. 2024;625:41–50.

    Article  PubMed  Google Scholar 

  15. Di Maio S, et al. Resolving intra-repeat variation in medically relevant VNTRs from short-read sequencing data using the cardiovascular risk gene LPA as a model. Genome Biol. 2024;25:167.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Lu T-Y, Human Genome Structural Variation Consortium, Chaisson MJP. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nat Commun. 2021;12:4250.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Loomis EW, et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res. 2013;23:121–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. López Castel A, Cleary JD, Pearson CE. Repeat instability as the basis for human diseases and as a potential target for therapy. Nat Rev Mol Cell Biol. 2010;11:165–70.

    Article  PubMed  Google Scholar 

  19. Ishiura H, et al. Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy. Nat Genet. 2018;50:581–90.

    Article  CAS  PubMed  Google Scholar 

  20. Corbett MA, et al. Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2. Nat Commun. 2019;10:4920.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Florian RT, et al. Unstable TTTTA/TTTCA expansions in MARCH6 are associated with familial adult myoclonic epilepsy type 3. Nat Commun. 2019;10:4919.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Yeetong P, et al. Tttca repeat insertions in an intron of YEATS2 in benign adult familial myoclonic epilepsy type 4. Brain. 2019;142:3360–6.

    Article  PubMed  Google Scholar 

  23. Seixas AI, et al. A pentanucleotide ATTTC repeat insertion in the non-coding region of DAB1, mapping to SCA37, causes spinocerebellar ataxia. Am J Hum Genet. 2017;101:87–103.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Rafehi H, et al. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am J Hum Genet. 2019;105:151–65.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Cortese A, et al. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019;51:649–58.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Ishiura H, Tsuji S. Advances in repeat expansion diseases and a new concept of repeat motif-phenotype correlation. Curr Opin Genet Dev. 2020;65:176–85.

    Article  CAS  PubMed  Google Scholar 

  27. Deininger P. Alu elements: know the SINEs. Genome Biol. 2011;12:236.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nat Rev Genet. 2002;3:370–9.

    Article  CAS  PubMed  Google Scholar 

  29. Beck CR, et al. LINE-1 retrotransposition activity in human genomes. Cell. 2010;141:1159–70.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Sen SK, et al. Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet. 2006;79:41–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Sulovari A, et al. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc Natl Acad Sci U S A. 2019;116:23243–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Comeaux MS, Roy-Engel AM, Hedges DJ, Deininger PL. Diverse cis factors controlling Alu retrotransposition: what causes Alu elements to die? Genome Res. 2009;19:545–55.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Shortt JA, Ruggiero RP, Cox C, Wacholder AC, Pollock DD. Finding and extending ancient simple sequence repeat-derived regions in the human genome. Mob DNA. 2020;11:11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Mohyuddin A, et al. Genetic instability in EBV-transformed lymphoblastoid cell lines. Biochim Biophys Acta. 2004;1670:81–3.

    Article  CAS  PubMed  Google Scholar 

  35. Caballero M, Koren A. The landscape of somatic mutations in lymphoblastoid cell lines. Cell Genom. 2023;3:100305.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Dolzhenko E, et al. Expansionhunter denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 2020;21:102.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Mojarad BA, et al. Genome-wide tandem repeat expansions contribute to schizophrenia risk. Mol Psychiatry. 2022;27:3692–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Wen J, et al. Rare tandem repeat expansions associate with genes involved in synaptic and neuronal signaling functions in schizophrenia. Mol Psychiatry. 2023;28:475–82.

    Article  CAS  PubMed  Google Scholar 

  39. Steely CJ, Watkins WS, Baird L, Jorde LB. The mutational dynamics of short tandem repeats in large, multigenerational families. Genome Biol. 2022;23:253.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Murat P, Guilbaud G, Sale JE. DNA polymerase stalling at structured DNA constrains the expansion of short tandem repeats. Genome Biol. 2020;21:209.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Gustafson JA, et al. High-coverage nanopore sequencing of samples from the 1000 genomes project to build a comprehensive catalog of human genetic variation. Genome Res. 2024. https://doi.org/10.1101/gr.279273.124.

    Article  PubMed  PubMed Central  Google Scholar 

  42. Nurk S, et al. The complete sequence of a human genome. Science. 2022;376:44–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Logsdon GA, et al. The variation and evolution of complete human centromeres. Nature. 2024. https://doi.org/10.1038/s41586-024-07278-3.

    Article  PubMed  PubMed Central  Google Scholar 

  44. Karimian K, et al. Human telomere length is chromosome end-specific and conserved across individuals. Science. 2024;384:533–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Liao W-W, et al. A draft human pangenome reference. Nature. 2023;617:312–24.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Kaplun L, et al. ONT long-read WGS for variant discovery and orthogonal confirmation of short read WGS derived genetic variants in clinical genetic testing. Front Genet. 2023;14:1145285.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Jam HZ, et al. Genome-wide profiling of genetic variation at tandem repeat from long reads. bioRxiv (2024) https://doi.org/10.1101/2024.01.20.576266.

  48. Dolzhenko E, et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol. 2024. https://doi.org/10.1038/s41587-023-02057-3.

    Article  PubMed  PubMed Central  Google Scholar 

  49. Ester M, Kriegel H-P, Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. https://cse.hkust.edu.hk/~raywong/comp5331/References/ADensityBasedAlgorithmForDiscoveringClustersInLargeSpatialDatabasesWithNoise.pdf.

  50. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164.

    Article  PubMed  PubMed Central  Google Scholar 

  51. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl 1):S4.

    Article  PubMed  PubMed Central  Google Scholar 

  52. Warde-Farley D, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38:W214–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Kolberg L, Raudvere U, Kuzmin I, Vilo J, Peterson H. Gprofiler2 – an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler. F1000Res. 2020;9:709.

    Article  Google Scholar 

Download references

Acknowledgements

We thank The Centre for Applied Genomics, especially B.Thiruvahindrapuram and T. Nalpathamkalam.

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Funding

Canadian Institutes of Health Research, PJT-175329.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: RKCY. Methodology: WE, BT, GP, RKCY. Investigation: WE, AM. Visualization: AM. Data curation: WE, BT, AM. Funding acquisition: RKCY. Supervision: RKCY. Writing—original draft: AM, RKCY. Writing—review & editing: WE, BT, GP, SWS. RKCY and AM have accessed and verified the underlying data. All of the authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Ryan K. C. Yuen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13059_2025_3754_MOESM1_ESM.pdf

Additional file 1: Fig S1-12 Fig S1. Characteristics of STRs in healthy individuals in GTEx cohort. Fig S2. Comparison of non-variable and variable STRs in GTEx cohort. Fig S3. Motif frequency distribution in STR region in BEAN1 gene across ethnic groups in 1000G samples. Fig S4. Motif frequency distribution in STR region in RFC1 gene across ethnic groups in 1000G samples. Fig S5. Motif frequency distribution in STR region in STARD7 gene across ethnic groups in 1000G samples. Fig S6. Motif frequency distribution in STR region in CACNB1 gene across ethnic groups in 1000G samples. Fig S7. Motif frequency distribution in STR region in FGF14 gene across ethnic groups in 1000G samples. Fig S8. Characteristics of TRs near mobile elements. Fig S9. Distribution of the motif for variable sequence STRs (upper plot, red) and non-variable sequence STRs (lower plot, green) near the 3’ end of Alu elements in 1000G cohort samples. Fig S10. Boxplot for GC-composition in dominant motifs (VARdom) as compared to alternative motifs (VARalt) in the variable STRs that are up-regulated in the presence of alternative motif. Fig S11. Genes upregulated in the presence of an alternative motif. Fig S12. Genes downregulated in the presence of an alternative motif.

13059_2025_3754_MOESM2_ESM.xlsx

Additional file 2: Table S1-12 Table S1. Correlation in 1-kb bins across the genome for regions with non-variable and variable STRs. Table S2. Enrichment of variable over non-variable STRs in known, expanded, and disease-causing regions. Table S3. Enrichment of variable over non-variable STRs near mobile elements. Table S4. Enrichment of variable over non-variable STRs near Alu and LINE families. Table S5. Motif frequency for STRs located near 3' end of Alu elements. Table S6. Enrichment of variable over non-variable STRs near genic regions. Table S7. Enrichment of variable STRs in regions with significant TR-splicing associations. Table S8. Gene sets enriched in genes with variable STRs. Table S9. Terms enriched in differentially expressed genes with variable STRs in GTEx cohort. Table S10. Differentially expressed STRs across GTEx tissues. Table S11. Known disease-causing repeat regions. Table S12. Samples with AAGGG repeat in RFC1 gene in 1000G cohort.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mitina, A., Engchuan, W., Trost, B. et al. Diverse short tandem repeat sequences influence gene regulation in human populations. Genome Biol 26, 279 (2025). https://doi.org/10.1186/s13059-025-03754-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • DOI: https://doi.org/10.1186/s13059-025-03754-9

Keywords