Diverse short tandem repeat sequences influence gene regulation in human populations

Mitina, Aleksandra; Engchuan, Worrawat; Trost, Brett; Pellecchia, Giovanna; Scherer, Stephen W.; Yuen, Ryan K. C.

doi:10.1186/s13059-025-03754-9

Research
Open access
Published: 12 September 2025

Diverse short tandem repeat sequences influence gene regulation in human populations

Genome Biology volume 26, Article number: 279 (2025) Cite this article

2394 Accesses
4 Altmetric
Metrics details

Abstract

Background

Short tandem repeat (STR) length is a known determinant of pathogenicity in a variety of human disorders. The repeat sequence itself can modulate disease severity and penetrance; however, the broader impact of STR sequence variation on gene expression in the general population remains poorly understood.

Results

Here, we analyze the sequence composition of STRs across two general population cohorts of unrelated individuals (n = 3,150) and report that ~ 7% of STRs exhibit sequence variability, with distinct patterns observed among different ethnic groups. These variable repeats are more prone to expansion and are frequently found in proximity to Alu elements. Notably, STRs with variable motifs are often found near splice junctions of genes involved in brain and neuronal functions. This is supported by the differential expression of genes associated with neuron and cellular projection functions, driven by the presence of distinct STR sequences.

Conclusions

Our findings underscore the previously unrecognized role of STR sequence variability in modulating gene expression and contributing to human phenotypic diversity.

Peer Review reports

Background

Tandem DNA repeats consist of sequence units, or motifs, located adjacent to each other, which account for ~ 7% of the human genome [1]. STRs, defined as tandem repeats with motif size of 1–6 bp, are responsible for over 60 monogenic diseases, including fragile X syndrome and Huntington disease. They are also associated with multiple complex disorders, such as autism spectrum disorder, schizophrenia, cancer, and cardiomyopathies [2,3,4,5,6,7]. Studies of the biological functions of STRs primarily focus on their variability in length, given that most of the known repeat-associated diseases are caused by expansion of repeat size beyond particular thresholds [8, 9]. While variations in STR length play a role in regulating gene expression [10, 11] and splicing [12], sequence variation in STRs can modulate gene expression by influencing the affinity of DNA binding factors at their corresponding domains [13, 14].

While sequence variability is well documented in variable number tandem repeats (VNTRs; tandem repeats with motif size > 6) [15, 16], it remains largely uncharacterized in STRs. This is mainly due to the limited technologies to sequence through expanded STRs, particularly for those with high GC content [17]. The most recognized sequence variation in STRs is the small interrupting sequences in some regions associated with monogenic repeat disorders, which reduce repeat instability [18]. With the advancement of technologies and methods for STR detection, more diseases have been found to be caused by expansion of repeat sequences not present in the reference genome, or “inserted” repeats. This is exemplified by (TTTCA)_n repeat insertion in familial adult myoclonic epilepsies (FAMEs) (e.g., type 1 2, 3, 4, 6 and 7), (TGGAA)_n repeat insertion in spinocerebellar ataxia (SCA) type 31, and (AAGGG)_n repeat insertion in cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS) [19,20,21,22,23,24,25]. Intriguingly, all these diseases involve the expansions of alternative motif sequences from AT-rich repeats, of which most are located proximal to Alu elements [26].

Alu elements are transposable elements that comprise ~ 11% of the human genome. They are ubiquitously inserted in the human genome by retrotransposition, which may lead to DNA insertion, deletion, and recombination [27]. Some events can be advantageous by altering gene expression or splicing, and the associated Alu elements can be fixed through positive selection [28,29,30,31]. The accompanying poly(A) tails will subsequently accumulate mutations and often eventually form STRs [32]. It has thus been suggested that disease-causing STRs with variable sequence composition (variable STRs) may have emerged from the poly(A) tails of Alu elements [33].

To elucidate the characteristics and functions of variable STRs, we leverage genome sequence data of 3,150 individuals from the 1000 Genomes Project (1000G) and the Genotype-Tissue Expression Project (GTEx). Since 1000G samples are derived from lymphoblastoid cell lines, which are known to have elevated rates of somatic mutations [34, 35], we include GTEx samples—derived from human tissues—to mitigate this potential bias and assess the robustness of our findings. We compare the genomic characteristics of variable STRs with non-variable STRs (those with no sequence variation) in the two cohorts and further study the effect of sequence composition changes on expression of nearby genes in 49 human tissues. ExpansionHunter denovo (EHdn) is a catalogue-free method capable of identifying novel repeat expansions in the human genome [36]. Its application on diverse cohorts has revealed extensive polymorphism in repeat size and motif both in patient samples and unaffected controls [5, 7, 37, 38].

Results

STRs with variable sequence composition are detected in the general population

We identified 13,977 STRs on autosomes in GTEx project samples and 7,774 STRs on autosomes in 1000G samples. Among STRs detected in GTEx, 1,023 (7.3%) had more than one motif (i.e., variable STRs), whereas 12,954 (92.7%) had a single motif (i.e., non-variable STRs). Similarly, among STRs detected in 1000G, there were 556 (7.2%) variable STRs and 7,218 (92.8%) non-variable STRs.

For the majority of non-variable STRs, their motifs had a GC-composition around 50%, while in variable STRs, the GC-composition of the dominant motif—defined as the most prevalent repeat unit across all samples—was more broadly distributed (Fig. 1a and b). Overall, the GC-composition of the dominant motifs of variable STRs was significantly lower than of the motifs of non-variable STRs (Fig. 1c and Additional file 1: Fig. S1).

Non-variable STRs were more often found near fragile sites, but variable STRs were depleted in known fragile sites. Both variable and non-variable STRs were significantly enriched in genomic regions with high GC-content and underrepresented at sequence-conserved regions (Fig. 1d and Additional file 2: Table S1).

Variable DNA sequences in known STR regions and across populations

While both variable and non-variable STRs were significantly enriched in known STR regions (Additional file 2: Table S1), a higher proportion of variable STRs overlapped with known STRs (95% versus 57% for non-variable STRs in 1000G cohort, and 96% versus 60% for non-variable STRs in GTEx cohort). Variable STRs overlapped known STR regions more often as compared to non-variable STRs (OR_1000G = 14.2; two-sided Fisher’s exact test p_1000G = 1.7 × 10⁻⁸⁷ and OR_GTEx = 18.6; two-sided Fisher’s exact test p_GTEx = 8.5 × 10⁻¹⁵⁸) (Fig. 2a and Additional file 1: Fig. S2a), suggesting that variable sequences in STRs were underrecognized.

Using our outlier detection method, we detected 356 and 1,306 expansions (0.55 per sample in GTEx and 0.52 per sample in 1000G) among the 646 GTEx samples and 2,504 1000G samples, respectively. Variable STRs were expanded more often than non-variable STRs (Fig. 2b and Additional file 1: Fig. S2b). Variable STRs were also significantly enriched at regions associated with monogenic repeat disorders, compared to non-variable STRs (Fig. 2c and Additional file 1: Fig. S2c) (Additional file 2: Table S2).

The frequency of alternative motifs (those that are different from dominant motif in variable STR regions) was significantly higher in the African population as compared to other ethnic groups (two-sided Wilcoxon’s test p < 0.001) (Fig. 2d). In particular, the frequencies of alternative motifs in STRs associated with monogenic repeat disorders were population-specific. For the repeats in BEAN1, the highest frequency of alternative motifs was found among individuals of South Asian ancestry (Additional file 1: Fig. S3). For the repeats in RFC1, the highest frequency of alternative motifs was found among individuals of East Asian ancestry (Additional file 1: Fig. S4). For the repeats in STARD7, all individuals of East Asian ancestry carried an alternative motif (AAAAT) different from the dominant (AAACT), and individuals of European ancestry carried the most alternative motifs, including AAAAT, AACAT, AACTG, and AAAATAACATAACAT (Additional file 1: Fig. S5). Other notable examples of regions with alternative motifs include CACNB1 (Additional file 1: Fig. S6) and FGF14 (Additional file 1: Fig. S7).

Variable STRs more often locate proximal to recently emerged Alu elements

We found that variable STRs were enriched near mobile elements belonging to SINE and LINE families. The enrichment of variable over non-variable STRs was highest in direct proximity (distance = 0, OR_GTEx = 2.96 and false discovery rate (FDR)-corrected p_GTEx = 5.5 × 10⁻⁶⁰ and OR_1000G = 4.1 and FDR-corrected p_1000G = 1.4 × 10⁻⁵⁴) and within 1kb (OR_GTEx = 1.91 and FDR-corrected p_GTEx = 7 × 10⁻¹⁸ and OR_1000G = 2.23 and FDR-corrected p_1000G = 1.2 × 10⁻¹⁴) to SINE (Fig. 3a and Additional file 2: Table S3). The enrichment of variable over non-variable STRs was significant in direct proximity to LINE elements (Fig. 3b and Additional file 2: Table S3), although the enrichment was less pronounced than that observed for SINE elements.

Variable STRs were significantly enriched in all three major Alu subfamilies (AluJ, AluS and AluY) in both cohorts, with the odds ratio being the highest in AluY, the one most recently emerged, followed by AluS and AluJ, the most ancient one (Fig. 3c and Additional file 2: Table S4). The effect was not observed in VNTRs (Additional file 1: Fig. S8b). Among L1 subfamilies, variable STRs were significantly enriched in L1MB, L1ME, and L1PA in 1000G cohort (Fig. 3d and Additional file 2: Table S4). In VNTRs, the two subfamilies with significant enrichment were L1MB and L1MC (Additional file 1: Fig. S8d).

The most frequent motifs among both non-variable and variable STRs near the 3’ end of Alu in GTEx cohort were AAAG, AAGG, AAG, AG, AC, and AAAAT (ordered by descending frequency). Non-variable STRs were significantly enriched in AAGG and AC. Variable STRs were significantly enriched in AAG, AGGG, AGGGG, AGAGG, and AGG (Fig. 3e and Additional file 2: Table S5). Similar patterns are found in 1000G cohorts (Additional file 1: Fig. S9 and Additional file 2: Table S5).

The GC-content was significantly higher in the dominant motifs of variable STRs as compared to the motifs of non-variable STRs near the 3′ end of Alu (two-sided Wilcoxon’s test p = 2.6 × 10⁻² in GTEx) (Fig. 3f). The GC-content was significantly higher in the alternative motifs of variable STRs as compared to the GC-content of the motifs of non-variable STRs (two-sided Wilcoxon’s test p = 1.4 × 10⁻⁵ in GTEx) near the 3′ end of Alu (Fig. 3f). There was no significant difference in the GC composition between the variable and non-variable STRs in 1000G cohort (Fig. 3g).

Variable STRs are associated with gene splicing and brain functions

Variable STRs were enriched in splice junctions (exonic;splicing) in both cohorts (OR_GTEx = 2.17 and FDR-corrected p_GTEx = 1.8 × 10⁻⁴ and OR_1000G = 2.71 and FDR-corrected p_1000G = 1.1 × 10⁻³), and in ncRNA_exonic; ncRNA_splicing regions in GTEx cohort (OR_GTEx = 3.18 and FDR-corrected p_GTEx = 1.8 × 10⁻⁴) (Fig. 4a and Additional file 2: Table S6).

Tandem repeats are known to be associated with gene splicing in different tissues (spl-TR) (12). We compared the overlap of spl-TR between variable and non-variable STRs across 49 human tissues. We found that variable STRs were significantly enriched (two-sided Fisher’s exact test p < 0.05) for spl-TRs in 4 different brain tissues including Hippocampus, Hypothalamus, Nucleus accumbens (basal ganglia), Putamen (basal ganglia), and also enriched in Spleen and Whole Blood (Fig. 4b and Additional file 2: Table S7).

Out of 10 gene sets that were significantly enriched (FDR-correct p or q-value < 0.2) among the genes with variable STRs in GTEx cohort, six were associated with the terms “neuron” and “axon,” including “Main axon,” “Axon,” “Neuron recognition,” “Neuron projection guidance,” “Axonogenesis,” and “Axon ensheathment” (Fig. 4c and Additional file 2: Table S8).

Out of 10 gene sets that were enriched among the genes with variable STRs in 1000G cohort, six were associated with the terms “growth,” including “Developmental growth,” “Developmental cell growth,” “Regulation of cell growth,” “Positive regulation of growth,” “Regulation of developmental growth,” and “Developmental growth involved in morphogenesis” (Additional file 2: Table S8).

Differential gene expression between variable sequences in STR

Comparing the gene expression between samples with dominant motifs and samples with alternative motifs in the same STR region revealed differential expression of 89 genes in 46 tissues. Among 89 genes that were differentially expressed in samples with variable STRs, 42 had a higher expression in the presence of an alternative motif, while 57 had a lower expression in the presence of an alternative motif (Fig. 5a).

The GC-composition of alternative motifs in STRs associated with downregulated genes was significantly higher than that of the dominant motifs (two-sided Wilcoxon’s p = 1.9 × 10⁻³) (Fig. 5b). Downregulated genes in the presence of an alternative motif in STRs were significantly associated with functions such as neuron projection, cell periphery, and axon (Fig. 5c and Additional file 2: Table S9). Upregulated genes in the presence of an alternative motif in STR were significantly associated with functions such as cell junction organization, plasma membrane: plasma membrane region, plasma membrane bounded cell projection (Fig. 5d and Additional file 2: Table S9). There was no significant difference in GC-composition between alternative motifs and dominant motifs in upregulated genes (Additional file 1: Fig. S10).

One of the top genes that was upregulated in the presence of alternative motifs was RPSAP58 (RPSA2, Ribosomal Protein SA 2). RPSAP58 had a higher expression in the presence of an AG (alternative motif) as compared to AAGG (reference motif) in 12 tissues, including Heart – Left Ventricle, Heart – Atrial Appendage, Lung, and other functions (Fig. 5e and Additional file 2: Table S10). Other notable genes that were upregulated in the presence of alternative motifs were ROBO2, BEAN1, and NGB (Additional file 1: Fig. S11). One of the top genes that was downregulated in the presence of alternative motifs was ZP3 (Zona Pellucida Glycoprotein 3). ZP3 had lower expression in the presence of an AAAG (alternative motif) as compared to AG (reference motif) in eight tissues, including Adipose—Subcutaneous, Artery—Aorta, Brain—Frontal Cortex (BA9), and other functions (Fig. 5f and Additional file 2: TableS10). Other genes that were downregulated in the presence of an alternative motif were HOXC8, EPHX2, and IL1R2 (Additional file 1: Fig. S12).

Discussion

Tandem repeats with variable motif length have been associated with gene expression [9] and splicing [10], but little is known about the effect of their sequence composition outside of disease contexts. In this study, we examined sequence composition variation in STRs across two general population cohorts and identified key features that suggest their potential origin and function. Our analysis is constrained by the methodology used to define repeat motifs, which captures only the majority motif present in each individual and does not account for sequence interruptions or complex motifs. Thus, we are unable to assess the internal structure or heterogeneity within STR alleles. Additionally, our approach focuses on sequence-level variation rather than repeat length variation, and the relationship between sequence variability and repeat expansion remains unresolved. While we observe that variable STRs are more frequently associated with expansion events, it is unclear whether sequence alterations drive instability or arise as a consequence of it. Future studies leveraging family-based long-read sequencing data will be useful to establish the causal relationships and characterize the sequence complexity within repeats.

We discovered that variable STRs were enriched near SINEs, particularly Alu elements. Intriguingly, while the number of variable STRs near AluS, the most active element among Alu, was the highest, variable STRs were more prominent than non-variable STRs in AluY, the most recent subfamily. This suggests that variable STRs may have been presented in the human genome more recently, after the divergence of the most recent Alu subfamily. Non-variable STRs near the 3’ end of Alu were enriched for A-containing motifs, such as AAAG and AAAAT, while variable STRs were enriched in motifs that had a higher GC content, such as AGGG, AGGGG, and AGAGG. This suggests that such STRs originated from the A-rich 3’ end of the Alu sequence, with subsequent mutations turning them into the more GC-rich motifs, consistent with the findings in most of the known pathogenic pentanucleotide repeat expansions [26]. While the observed co-localization and sequence similarity support the hypothesis that some STRs originated from poly(A) tails of retrotransposons, alternative hypotheses such as co-localization due to shared genomic features or preferential retention in Alu-rich regions due to potential functional roles in gene regulation and chromatin organization [27, 39, 40] may also account for the observed patterns. Additional genomic analyses will be necessary to clarify the causal relationship between Alu mobility and STR variability and to explore alternative hypotheses regarding the evolutionary origin of these STRs.

We found that variable sequence STRs are frequently located at splice junctions of genes linked to molecular functions related to “neuron” “axon,” and “growth” A comparison with the existing data on splicing-associated TRs (spl-TRs) across 49 human tissues revealed that these variable STRs are enriched in spl-TR within four brain regions: hippocampus, hypothalamus, nucleus accumbens, and putamen. These regions are key to motor control, learning, language, reward, cognitive functioning, and addiction. This suggests that variable STRs may influence human brain development and functioning. Furthermore, our analysis identified differential expression of genes, such as ROBO2 and RPSAP58, with the presence of alternative motifs. Most of these genes are involved in neuron and cellular projection processes. Alternative motifs in these genes may thus provide an additional variation and contribute to shaping the diversity of the human genome.

The frequency of alternative motifs in the variable STRs was significantly higher in individuals of African ancestry as compared to individuals of other ancestries. Additionally, we found many undescribed alternative motifs among the STRs linked to monogenic repeat disorders. For the majority of the regions that are known to harbor alternative disease-causing motifs, we did not observe these pathogenic motifs in the general population, likely because the associated diseases are rare. Instead, for regions with known disease-causing alternative motifs—such as BEAN1, MARCHF6, RFC1, and STARD7—we identified multiple other motifs across both cohorts (Additional file 2: Table S11 and Additional file 1: Fig. S3-7). Notably, we found a known pathogenic motif (AAGGG) in RFC1 in a few samples across both cohorts, which aligns with the available long-read sequencing data [41] (Additional file 2: Table S12). To better understand the disease relevance of the variable sequence STRs, further investigation of the sequence heterogeneity for these regions in a larger general population is needed.

Interestingly, in one of the known disease-associated repeat regions, STARD7, the dominant motif (AAACT) was different from the reference sequence (AAAAT) (Additional file 1: Fig. S5). Inability to detect the reference motif could attribute to the complex structure of the repeat region and limitations of the available methods, i.e., short-read sequencing technology and algorithm for tandem repeat detection, since here we only investigated repeat lengths that were longer than the sequence read length. Long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) platforms, enable study of repetitive and complex genomic regions, including long tandem repeat sequences. The use of long-read sequencing technologies facilitated the T2T-CHM13 reference assembly, uncovering an additional 8% of the genome, and allowed detailed characterization of human centromeres and telomeres [42,43,44,45]. The increasing availability of datasets has led to the development of tools for analyzing long-read sequencing data, particularly for tandem repeat detection and sizing [46,47,48]. While long-read sequencing still faces challenges related to data availability and analytical tools, it is steadily becoming a new standard in genomic research and clinical diagnostics. Future studies leveraging long-read sequencing to explore tandem repeat sequence variation may further illuminate its contribution to human phenotypic diversity.

Conclusions

In this study we demonstrated sequence composition variation in STRs across two general population cohorts. We identified key features that suggest their potential origin from the A-rich 3’ end of the Alu sequence, with subsequent mutations turning them into the more GC-rich motifs and function. We found many undescribed alternative motifs among the known disease-causing regions; however, further investigation of the sequence heterogeneity for these regions in a larger general population is needed.

Methods

Genome sequencing data and samples

We used WGS data from 2,504 participants from the 1000 Genomes Project (1000G). We downloaded genome sequencing data for the 1000 Genomes Project via Amazon Web Services (s3://1000genomes/1000G_2504_high_coverage/data). DNA from 1000 Genomes Project samples was extracted from lymphoblastoid cell lines, prepared with a PCR-free library kit, and sequenced on the Illumina NovaSeq 6000 platform (2 × 150-bp paired-end reads. DNA sequence data was aligned to the GRCh38/hg38 human reference genome.

For replication we used WGS data from 646 participants from the Genotype Tissue Expression Project (GTEx). We downloaded genome sequencing data for the GTEx Project via AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space). DNA from GTEx Project samples was extracted from blood, prepared with a PCR-free library kit, and sequenced on the Illumina HiSeqX platform (2 × 150-bp paired-end reads). DNA sequence data was aligned to the GRCh38/hg38 human reference genome.

Transcriptome sequencing data

RNA-seq was performed using the Illumina TruSeq library construction protocol (non-stranded, polyA + selection). Flow cell cluster amplification and sequencing were performed according to the manufacturer’s protocols using either the HiSeq 2000 or HiSeq 2500. Sequencing generated 76 bp paired-end reads and an eight-base index barcode read and was run with a coverage goal of 50 M reads (the median achieved was ~ 82 M total reads). RNA quality was assessed by RIN (RNA Integrity Number, as measured by Agilent Bioanalyzer). All samples with a RIN of 6.0 or higher were considered for RNA sequence analysis.

Detection of tandem repeat expansions

We used ExpansionHunter Denovo (EHdn) to detect tandem repeats that were larger than the sequence read length (150 bp) from short-read WGS data. The definition of the repeat motif is the smallest sequence unit (length 2–20 bases) that, when repeated, best explains the periodic structure of a read. This unit is chosen by detecting significant repeating patterns in the read, correcting for sequencing noise, and standardizing for orientation and offset to ensure consistent labeling [36]. We used EHdn scripts (github/Illumina/ExpansionHunterDenovo) (minCount = 2 parameter in the script compare_anchored_irrs.py) to select regions where at least one sample had C ≥ 2, where C = A * 40/R (A = raw count of anchored in-repeat reads for that region; R = the average read depth of the sample calculated by EHdn).

We applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to identify outliers with repeat size exceeding those in the vast majority of samples (clusters) for that region (defined as a tandem repeat expansion) [49]. We used default DBSCAN parameters: the minimum number of samples in the cluster (minPts) was set to 15 and the maximum allowed distance between the points of the same cluster (epsilon) was set to 2.

Burden analysis

Regions with TREs were assigned to a certain genic feature using ANNOVAR [50]. In the case when one TRE overlapped several genic features, the annotation was prioritized based on the effect in the following order: exonic, splicing (including “splicing” and “exonic;splicing”), UTR5, UTR3, intronic, upstream, downstream, and intergenic. To compare the burden of tandem repeat expansions, we performed a logistic regression analysis by regressing the number of rare tandem repeat expansions in each of the genic features among the STRs with variable sequence composition as compared to STRs with reference sequence composition, with the total number of rare TREs as a covariate. We used a one-sided Wald test assuming higher burden in STRs with variable sequence composition. We used only regions located on autosomes to avoid sex bias.

Gene set enrichment analysis with GeneMANIA

We used GeneMANIA Cytoscape [51,52,53] plugin to analyze genes with variable STRs. The list of genes (n = 450 for GTEx cohort samples and n = 265 genes for 1000G) was loaded to Cytoscape as input, resulting in 10 enriched gene sets in each cohort. To plot the enrichment map, we selected genes that belong to enriched gene sets.

Gene set enrichment analysis with GO terms

We used the terms from Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome (R-HSA) databases, restricting to the gene sets with the number of genes between 100 and 1000 (n = 2,453 entries). For each gene set, we calculated the Fisher’s exact test as the fraction of genes from the set that have variable STRs over the total number of genes with variable STRs, versus the number of genes from the set with non-variable STRs over the total number of genes with non-variable STRs. We corrected the resulting Fisher’s test p-value within each database (GO, KEGG and R-HSA) using the Benjamini–Hochberg method.

Overlap with the mobile elements

To investigate overlap with the mobile elements, we obtained the coordinates of all repetitive sequences from UCSC (n = 5,655,664 entries). We selected repetitives that belong to “LINE” or “SINE” repeat class (repClass), and for “SINE” class we selected those that belong to “Alu” repeat family (repFamily) (n = 2,870,078 entries). We then overlapped the coordinates of the tandem repeats with the coordinates of mobile elements and calculated the Fisher’s exact test as the fraction of variable STRs that overlap mobile elements over the total number of variable STRs, versus the number of non-variable STRs that overlap the mobile elements over the total number of non-variable STRs. We performed this procedure for each class of mobile elements (LINE and SINE), extending the coordinates of the mobile elements by 0, 1,000, 5,000, 10,000, and 20,000 base pairs.

Motif composition near Alu elements

To study the sequence composition of STRs at the 3′ end of Alu, we used the coordinates and the strand information from the UCSC simple repeat table. We then assigned these repeats either “variable” or “non-variable” type based on the overlap with our tandem repeats. We calculated the frequency of repeat motif for those repeats located near the 3′ of Alu, separately for non-variable repeats, variable repeats (using the dominant, or the most frequent motif), and for alternative motifs (all other motifs in the region that were different form the dominant). To assess the enrichment of a particular motif, we calculated Fisher’s exact test for the fraction of a particular motif among variable motifs versus the fraction of the same motif among non-variable motifs. We defined motifs enriched in variable STRs as the motifs with the odds ratio > 1 and p-value < 0.05, and we defined motifs enriched in non-variable STRs as the motifs with the odds ratio < 1 and p-value < 0.05. To define motifs enriched among alternative STRs, we calculated the Fisher’s exact test for the fraction of a particular motif among the alternative motifs versus the fraction of the same motif among all motifs. We defined motifs enriched in alternative STRs as the motifs with the odds ratio > 1 and p-value < 0.05.

Assessment of gene expression

To study the expression of genes containing variable STRs, we obtained gene expression data from GTEx project for each tissue in normalized counts per gene per sample. We then calculated the log2FC of the median TPM and two-sided Wilcoxon’s test p-value for each gene within a single tissue between samples containing dominant (most frequent) motif and alternative (all other motifs in the region that were different form the dominant) motifs. We defined differentially expressed genes as genes with p-value < = 0.05 and |log2FC|> 1. Genes with log2FC > 1 were defined as upregulated, and genes with log2FC < 1 were defined as downregulated.

Enrichment of genes with differential expression

To assess the enrichment of genes with differential expression, we used gProfiler2 R package [54] and ran it separately for genes that were upregulated and genes that were downregulated. We used the resulting adjusted p-value and the number of genes in each gene set to plot the results.

Data availability

The GitHub library with the code for the downstream analysis of EHdn data is available on Zenodo under ‘Tandem repeat expansions in ASD’ doi.org/https://doi.org/10.5281/zenodo.8083691 . The GitHub library with the code for statistical analysis and figures is available on Zenodo under ‘Alternative motifs in healthy individuals’ doi.org/https://doi.org/10.5281/zenodo.14009777 . The data generated in this study are available in the main text or the additional files.

References

Hannan AJ. Expanding horizons of tandem repeats in biology and medicine: why ‘genomic dark matter’ matters. Emerg Top Life Sci. 2023;7:239.
Article CAS PubMed PubMed Central Google Scholar
Rajan-Babu I-S, Dolzhenko E, Eberle MA, Friedman JM. Sequence composition changes in short tandem repeats: heterogeneity, detection, mechanisms and clinical implications. Nat Rev Genet. 2024. https://doi.org/10.1038/s41576-024-00696-z.
Article PubMed Google Scholar
Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet. 2018;19:286–98.
Article CAS PubMed Google Scholar
Mitra I, et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature. 2021;589:246–50.
Article CAS PubMed PubMed Central Google Scholar
Trost B, et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020;586:80–6.
Article CAS PubMed PubMed Central Google Scholar
Erwin GS, et al. Recurrent repeat expansions in human cancer genomes. Nature. 2023;613:96–102.
Article CAS PubMed Google Scholar
Mitina A, et al. Genome-wide enhancer-associated tandem repeats are expanded in cardiomyopathy. EBioMedicine. 2024. https://doi.org/10.1016/j.ebiom.2024.105027.
Article PubMed PubMed Central Google Scholar
Gall-Duncan T, Sato N, Yuen RKC, Pearson CE. Advancing genomic technologies and clinical awareness accelerates discovery of disease-associated tandem repeat sequences. Genome Res. 2022;32:1–27.
Article CAS PubMed PubMed Central Google Scholar
Depienne C, Mandel J-L. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet. 2021;108:764–85.
Article CAS PubMed PubMed Central Google Scholar
Gymrek M, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48:22–9.
Article CAS PubMed Google Scholar
Fotsing SF, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51:1652–9.
Article CAS PubMed PubMed Central Google Scholar
Hamanaka K, et al. Genome-wide identification of tandem repeats associated with splicing variation across 49 tissues in humans. Genome Res. 2023;33:435–47.
Article CAS PubMed PubMed Central Google Scholar
Horton CA, et al. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science. 2023;381:eadd1250.
Article CAS PubMed Google Scholar
de Boer CG, Taipale J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature. 2024;625:41–50.
Article PubMed Google Scholar
Di Maio S, et al. Resolving intra-repeat variation in medically relevant VNTRs from short-read sequencing data using the cardiovascular risk gene LPA as a model. Genome Biol. 2024;25:167.
Article PubMed PubMed Central Google Scholar
Lu T-Y, Human Genome Structural Variation Consortium, Chaisson MJP. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nat Commun. 2021;12:4250.
Article CAS PubMed PubMed Central Google Scholar
Loomis EW, et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res. 2013;23:121–8.
Article CAS PubMed PubMed Central Google Scholar
López Castel A, Cleary JD, Pearson CE. Repeat instability as the basis for human diseases and as a potential target for therapy. Nat Rev Mol Cell Biol. 2010;11:165–70.
Article PubMed Google Scholar
Ishiura H, et al. Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy. Nat Genet. 2018;50:581–90.
Article CAS PubMed Google Scholar
Corbett MA, et al. Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2. Nat Commun. 2019;10:4920.
Article PubMed PubMed Central Google Scholar
Florian RT, et al. Unstable TTTTA/TTTCA expansions in MARCH6 are associated with familial adult myoclonic epilepsy type 3. Nat Commun. 2019;10:4919.
Article PubMed PubMed Central Google Scholar
Yeetong P, et al. Tttca repeat insertions in an intron of YEATS2 in benign adult familial myoclonic epilepsy type 4. Brain. 2019;142:3360–6.
Article PubMed Google Scholar
Seixas AI, et al. A pentanucleotide ATTTC repeat insertion in the non-coding region of DAB1, mapping to SCA37, causes spinocerebellar ataxia. Am J Hum Genet. 2017;101:87–103.
Article CAS PubMed PubMed Central Google Scholar
Rafehi H, et al. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am J Hum Genet. 2019;105:151–65.
Article CAS PubMed PubMed Central Google Scholar
Cortese A, et al. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019;51:649–58.
Article CAS PubMed PubMed Central Google Scholar
Ishiura H, Tsuji S. Advances in repeat expansion diseases and a new concept of repeat motif-phenotype correlation. Curr Opin Genet Dev. 2020;65:176–85.
Article CAS PubMed Google Scholar
Deininger P. Alu elements: know the SINEs. Genome Biol. 2011;12:236.
Article CAS PubMed PubMed Central Google Scholar
Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nat Rev Genet. 2002;3:370–9.
Article CAS PubMed Google Scholar
Beck CR, et al. LINE-1 retrotransposition activity in human genomes. Cell. 2010;141:1159–70.
Article CAS PubMed PubMed Central Google Scholar
Sen SK, et al. Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet. 2006;79:41–53.
Article CAS PubMed PubMed Central Google Scholar
Sulovari A, et al. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc Natl Acad Sci U S A. 2019;116:23243–53.
Article CAS PubMed PubMed Central Google Scholar
Comeaux MS, Roy-Engel AM, Hedges DJ, Deininger PL. Diverse cis factors controlling Alu retrotransposition: what causes Alu elements to die? Genome Res. 2009;19:545–55.
Article CAS PubMed PubMed Central Google Scholar
Shortt JA, Ruggiero RP, Cox C, Wacholder AC, Pollock DD. Finding and extending ancient simple sequence repeat-derived regions in the human genome. Mob DNA. 2020;11:11.
Article CAS PubMed PubMed Central Google Scholar
Mohyuddin A, et al. Genetic instability in EBV-transformed lymphoblastoid cell lines. Biochim Biophys Acta. 2004;1670:81–3.
Article CAS PubMed Google Scholar
Caballero M, Koren A. The landscape of somatic mutations in lymphoblastoid cell lines. Cell Genom. 2023;3:100305.
Article CAS PubMed PubMed Central Google Scholar
Dolzhenko E, et al. Expansionhunter denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 2020;21:102.
Article PubMed PubMed Central Google Scholar
Mojarad BA, et al. Genome-wide tandem repeat expansions contribute to schizophrenia risk. Mol Psychiatry. 2022;27:3692–8.
Article CAS PubMed PubMed Central Google Scholar
Wen J, et al. Rare tandem repeat expansions associate with genes involved in synaptic and neuronal signaling functions in schizophrenia. Mol Psychiatry. 2023;28:475–82.
Article CAS PubMed Google Scholar
Steely CJ, Watkins WS, Baird L, Jorde LB. The mutational dynamics of short tandem repeats in large, multigenerational families. Genome Biol. 2022;23:253.
Article CAS PubMed PubMed Central Google Scholar
Murat P, Guilbaud G, Sale JE. DNA polymerase stalling at structured DNA constrains the expansion of short tandem repeats. Genome Biol. 2020;21:209.
Article CAS PubMed PubMed Central Google Scholar
Gustafson JA, et al. High-coverage nanopore sequencing of samples from the 1000 genomes project to build a comprehensive catalog of human genetic variation. Genome Res. 2024. https://doi.org/10.1101/gr.279273.124.
Article PubMed PubMed Central Google Scholar
Nurk S, et al. The complete sequence of a human genome. Science. 2022;376:44–53.
Article CAS PubMed PubMed Central Google Scholar
Logsdon GA, et al. The variation and evolution of complete human centromeres. Nature. 2024. https://doi.org/10.1038/s41586-024-07278-3.
Article PubMed PubMed Central Google Scholar
Karimian K, et al. Human telomere length is chromosome end-specific and conserved across individuals. Science. 2024;384:533–9.
Article CAS PubMed PubMed Central Google Scholar
Liao W-W, et al. A draft human pangenome reference. Nature. 2023;617:312–24.
Article CAS PubMed PubMed Central Google Scholar
Kaplun L, et al. ONT long-read WGS for variant discovery and orthogonal confirmation of short read WGS derived genetic variants in clinical genetic testing. Front Genet. 2023;14:1145285.
Article CAS PubMed PubMed Central Google Scholar
Jam HZ, et al. Genome-wide profiling of genetic variation at tandem repeat from long reads. bioRxiv (2024) https://doi.org/10.1101/2024.01.20.576266.
Dolzhenko E, et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol. 2024. https://doi.org/10.1038/s41587-023-02057-3.
Article PubMed PubMed Central Google Scholar
Ester M, Kriegel H-P, Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. https://cse.hkust.edu.hk/~raywong/comp5331/References/ADensityBasedAlgorithmForDiscoveringClustersInLargeSpatialDatabasesWithNoise.pdf.
Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164.
Article PubMed PubMed Central Google Scholar
Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl 1):S4.
Article PubMed PubMed Central Google Scholar
Warde-Farley D, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38:W214–20.
Article CAS PubMed PubMed Central Google Scholar
Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.
Article CAS PubMed PubMed Central Google Scholar
Kolberg L, Raudvere U, Kuzmin I, Vilo J, Peterson H. Gprofiler2 – an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler. F1000Res. 2020;9:709.
Article Google Scholar

Download references

Acknowledgements

We thank The Centre for Applied Genomics, especially B.Thiruvahindrapuram and T. Nalpathamkalam.

Peer review information

Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Funding

Canadian Institutes of Health Research, PJT-175329.

Author information

Authors and Affiliations

Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
Aleksandra Mitina, Worrawat Engchuan, Brett Trost, Giovanna Pellecchia, Stephen W. Scherer & Ryan K. C. Yuen
The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada
Worrawat Engchuan, Giovanna Pellecchia, Stephen W. Scherer & Ryan K. C. Yuen
Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
Brett Trost
Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
Brett Trost, Stephen W. Scherer & Ryan K. C. Yuen
McLaughlin Centre, University of Toronto, Toronto, ON, Canada
Stephen W. Scherer

Authors

Aleksandra Mitina
View author publications
Search author on:PubMed Google Scholar
Worrawat Engchuan
View author publications
Search author on:PubMed Google Scholar
Brett Trost
View author publications
Search author on:PubMed Google Scholar
Giovanna Pellecchia
View author publications
Search author on:PubMed Google Scholar
Stephen W. Scherer
View author publications
Search author on:PubMed Google Scholar
Ryan K. C. Yuen
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization: RKCY. Methodology: WE, BT, GP, RKCY. Investigation: WE, AM. Visualization: AM. Data curation: WE, BT, AM. Funding acquisition: RKCY. Supervision: RKCY. Writing—original draft: AM, RKCY. Writing—review & editing: WE, BT, GP, SWS. RKCY and AM have accessed and verified the underlying data. All of the authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Ryan K. C. Yuen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13059_2025_3754_MOESM1_ESM.pdf

Additional file 1: Fig S1-12 Fig S1. Characteristics of STRs in healthy individuals in GTEx cohort. Fig S2. Comparison of non-variable and variable STRs in GTEx cohort. Fig S3. Motif frequency distribution in STR region in BEAN1 gene across ethnic groups in 1000G samples. Fig S4. Motif frequency distribution in STR region in RFC1 gene across ethnic groups in 1000G samples. Fig S5. Motif frequency distribution in STR region in STARD7 gene across ethnic groups in 1000G samples. Fig S6. Motif frequency distribution in STR region in CACNB1 gene across ethnic groups in 1000G samples. Fig S7. Motif frequency distribution in STR region in FGF14 gene across ethnic groups in 1000G samples. Fig S8. Characteristics of TRs near mobile elements. Fig S9. Distribution of the motif for variable sequence STRs (upper plot, red) and non-variable sequence STRs (lower plot, green) near the 3’ end of Alu elements in 1000G cohort samples. Fig S10. Boxplot for GC-composition in dominant motifs (VARdom) as compared to alternative motifs (VARalt) in the variable STRs that are up-regulated in the presence of alternative motif. Fig S11. Genes upregulated in the presence of an alternative motif. Fig S12. Genes downregulated in the presence of an alternative motif.

13059_2025_3754_MOESM2_ESM.xlsx

Additional file 2: Table S1-12 Table S1. Correlation in 1-kb bins across the genome for regions with non-variable and variable STRs. Table S2. Enrichment of variable over non-variable STRs in known, expanded, and disease-causing regions. Table S3. Enrichment of variable over non-variable STRs near mobile elements. Table S4. Enrichment of variable over non-variable STRs near Alu and LINE families. Table S5. Motif frequency for STRs located near 3' end of Alu elements. Table S6. Enrichment of variable over non-variable STRs near genic regions. Table S7. Enrichment of variable STRs in regions with significant TR-splicing associations. Table S8. Gene sets enriched in genes with variable STRs. Table S9. Terms enriched in differentially expressed genes with variable STRs in GTEx cohort. Table S10. Differentially expressed STRs across GTEx tissues. Table S11. Known disease-causing repeat regions. Table S12. Samples with AAGGG repeat in RFC1 gene in 1000G cohort.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Mitina, A., Engchuan, W., Trost, B. et al. Diverse short tandem repeat sequences influence gene regulation in human populations. Genome Biol 26, 279 (2025). https://doi.org/10.1186/s13059-025-03754-9

Download citation

Received: 29 October 2024
Accepted: 21 August 2025
Published: 12 September 2025
Version of record: 12 September 2025
DOI: https://doi.org/10.1186/s13059-025-03754-9

Diverse short tandem repeat sequences influence gene regulation in human populations

Abstract

Background

Results

Conclusions

Background

Results

STRs with variable sequence composition are detected in the general population

Variable DNA sequences in known STR regions and across populations

Variable STRs more often locate proximal to recently emerged Alu elements

Variable STRs are associated with gene splicing and brain functions

Differential gene expression between variable sequences in STR

Discussion

Conclusions

Methods

Genome sequencing data and samples

Transcriptome sequencing data

Detection of tandem repeat expansions

Burden analysis

Gene set enrichment analysis with GeneMANIA

Gene set enrichment analysis with GO terms

Overlap with the mobile elements

Motif composition near Alu elements

Assessment of gene expression

Enrichment of genes with differential expression

Data availability

References

Acknowledgements

Peer review information

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

13059_2025_3754_MOESM1_ESM.pdf

13059_2025_3754_MOESM2_ESM.xlsx

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us