这是indexloc提供的服务,不要输入任何密码
Skip to main content
BMC is moving to Springer Nature Link. Visit this journal in its new home.

Prediction of target genes and functional types of cis-regulatory modules in the human genome reveals their distinct properties

Abstract

Background

Cis-regulatory modules (CRMs) such as enhancers and silencers play critical roles in virtually all biological processes by enhancing and repressing, respectively, the transcription of their target genes in specific cell types. Although numerous CRMs have been predicted in genomes, identifying their target genes remains a challenge due to low quality of the predicted CRMs and the fact that CRMs often do not regulate their closest genes.

Results

We developed a method — correlation and physical proximity (CAPP) by leveraging our recently predicted 1.2 M CRMs in the human genome. CAPP is able to not only predict the CRMs’ target genes but also their functional types using only chromatin accessibility (CA) and RNA-seq data in a panel of cell/tissue types plus Hi-C data in a few cell types. Applying CAPP to a panel of only 107 cell/tissue types with CA and RNA-seq data available, we predict target genes for 14.3% of the 1.2 M CRMs, of which 1.4% are predicted as both enhancers and silencers (dual functional CRMs), 98.2% as exclusive enhancers, and 0.4% as exclusive silencers. Dual functional CRMs tend to regulate more distant genes than exclusive enhancers and silencers. Enhancers tend to cooperate with other enhancers, whereas silencers typically act independently. Silencers preferentially regulate genes expressed across many cell/tissue types, while enhancers are prone to regulate genes expressed in fewer cell/tissue types.

Conclusions

CAPP represents a significant advancement in predicting target genes and functional types of CRMs, especially dual functional CRMs, and different types of CRMs show distinct properties.

Background

Cis-regulatory modules (CRMs), classified as enhancers, silencers, promoters, and insulators play critical roles in virtually all biological processes, from cell differentiation to physiological homeostasis, pathogenesis, and evolution by regulating transcription of genes in various cell/tissue types, thereby rendering their types and functions [1, 2]. Specific binding of transcription factors (TFs) to their cognate binding sites (TFBSs) within enhancers and silencers can facilitate and repress the recruitment of RNA polymerase to the promoters of target genes, resulting in upregulation and downregulation of gene transcription, respectively [3, 4], in a cell/tissue-specific manner. Moreover, binding of CCCTC-binding factor (CTCF) to its cognate sites within insulators establishes boundaries between topologically associating domains (TADs), preventing cross-regulation of CRMs in a TAD [5]. Cellular specificity of CRMs is largely established by unique epigenetic modifications of CRMs in different cell types by altering the accessibility and binding affinity of TFs to their cognate binding sites [6, 7]. Therefore, to fully understand the functions of CRMs, one needs to not only precisely locate their loci in the genome, but also to characterize their functional types (mainly enhancers and silencers), states (active or inactive), and target genes in various cell/tissue types of the organism. However, only a small portion of CRMs (mostly enhancers) in genomes have been fully characterized in all these three aspects due to the difficulty to study them at a large scale using traditional molecular biology methods [8].

The recent availability of enormous functional and epigenomic data in various cell/tissue types in well-studied organisms have provided an unprecedented opportunity to predict loci of enhancers and silencers in the genomes and their functional states in various cell/tissue types of the organisms. Most of these methods attempt to simultaneously predict the loci of CRMs and their functional states in a cell/tissue type using multiple epigenetic marks such as chromatin accessibility (CA) assayed by either DNase I hypersensitive sites sequencing (DNase-seq) [9,10,11] or assay for transposase-accessible chromatin using sequencing (ATAC-seq) [12], and histone marks assayed by chromatin immunoprecipitation sequencing (ChIP-seq) [13]. Although conceptually appealing, these one-step methods result in high false discovery rate (FDR) [14,15,16,17,18,19,20] since CA and histone marks, though informative, are not specific marks for active enhancers or silencers [17, 18, 21].

High-throughput methods like Hi-C [22] and chromatin interaction analysis with paired-end-tag sequencing(ChIA-PET) [23], which examine physical proximity between two linearly distant loci, have been used to map regulatory relationships between enhancers and target genes [24,25,26] in various cell/tissue types. Hi-C technologies have led to the identification of distinct genome compartments at the mega-base scale. Compartment A resides in open, gene-rich euchromatin, while compartment B is composed of closed, gene-poor heterochromatin [27]. Interactions are more frequent within the same compartment than across different compartments [28]. Within the compartments, TADs are formed at the sub-mega-base scale, where interactions are highly enriched within TADs relative to between TADs [29]. At a higher resolution, chromatin loops can establish spatial proximity between specific distant genomic loci through a process called loop extrusion [30,31,32]. However, it is difficult to identify CRM-gene regulations precisely due to the often-low genomic resolution of Hi-C data and that genomic contacts may not guarantee regulation relationships. More recently, clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 (CRISPR-Cas9) technology has been used to identify target genes of putative CRMs in various ways [33]. For example, CRISPR activation (CRISPRa) was used to probe putative enhancers for their potential to regulate nearby genes [34]. Additionally, CRISPR interference (CRISPRi) was employed to identify regulatory relationships by targeting putative enhancers and assessing the effects on the neighboring genes [35,36,37,38]. Nonetheless, these methods are limited to a few CRMs and genes. Consequently, experimental determination of target genes of CRMs on a genome-wide scale remains an ongoing challenge.

In the past few years, several computational methods have been developed for predicting target genes of putative enhancers and silencers. However, these methods are limited, since in the absence of an accurate and comprehensive CRM map in the genome, they aim to predict target genes for specific genomic regions marked by distinct epigenetic modifications. These methods attempt to predict both CRMs and their target genes, along with their functional types if applicable, all in a single step. The predominant one-step approaches for predicting target genes of enhancers include score-based [39, 40], correlation-based [41,42,43], and machine learning methods [43,44,45,46,47].

As the most intuitive and widely used score-based method, the distance-based method uses the genomic distance or some function of the distance, usually defined as the number of base pairs between the potential regulatory region and the transcription start site (TSS) of the candidate gene [39]. The simplest distance-based method assigns the gene whose TSS is closest to either end of a CRM as the CRM’s target gene. While this closest neighbor assignment (CNA) method suits some predictions well, it can overlook distant regulations. This is problematic because a CRM can target genes from hundreds of thousands to a million bp away [48], and a CRM can regulate multiple target genes [49]. Other score-based methods incorporate additional genomic or functional data, consolidating multiple metrics into a composite score that quantifies a putative regulatory region’s potential in regulating a candidate gene. For example, GeneHancer [40] integrates five features associated with enhancers and/or target genes — RNA and eRNA levels, and eQTL, cHi-C, and distance data — to predict target genes of candidate enhancers compiled from four databases.

The correlation-based methods evaluate correlation between the activities of a potential regulatory region and the activities of possible target genes across a panel of cell/tissue types. DNase I hypersensitive sites (DHSs), indicative of DNA accessibility, have been employed as potential regulatory regions, while correlations between DHS signals in these regions and in promoters across a panel of 79 tissues have been used to predict enhancer-gene links [50]. However, it has been shown that the correction between DHS signals alone is inadequate to indicate enhancer-gene regulations [41]. Thus, a variant method was proposed that quantified correlations between DHS signals at potential regulatory regions with the expression levels of possible target genes [41]. Moreover, correlation between DNA methylation levels at potential regulatory regions and the expression levels of possible target genes has also been utilized to pinpoint enhancer-gene regulations [42, 51].

Supervised learning models trained on “gold standard” enhancer-gene links, along with an equivalent number of negative links, have been used to predict unknown links using features such as gene expression levels [45], DHS signals [45, 46], histone marks [43, 45], correlations of these signals [43], sequence compositions [43, 47], and distances [43, 46] between candidate enhancers and the TSSs of potential target genes. However, a significant drawback of these methods is the lack of sufficient experimentally validated “gold standard” positive and negative sets [8], leading different groups to define training sets differently. This divergence in defining training sets can introduce noise and potentially influence prediction accuracy. Moreover, in many of these methods [24, 25, 38], the candidate genes were typically screened within an arbitrarily selected flanking region around a putative CRM, although the true target genes can be located outside of the region.

To overcome the limitation of these existing one-step methods for predicting the loci of enhancers and silencers, we have recently proposed a two-step approach based on the following two observations: (i) TF binding is more informative for locating enhancers than CA and histone marks and (ii) once the loci of enhancers are accurately located, their epigenetic marks are good predictors of their functional states [17,18,19,20,21]. Specifically, we first predict a highly accurate and more complete map of CRMs in the genome, and then predict the functional states of all the CRMs in various cell/tissue types of the organism. For the first step, we have developed the dePCRM2 algorithm [14] that predicts the locations of all possible CRMs in the genome by integrating all available TF ChIP-seq datasets in various cell/tissue types of the organism [14]. Applying dePCRM2 to more than 11 k TF ChIP-seq datasets in various human cell/tissue types, we predicted 1.2 M CRMs in the human genome. For the second step, we have developed two logistic regression (LR) models that can accurately predict functional states of the many CRMs as enhancers and silencers in any cell/tissue types using CA in combination with two active enhancer or silencer histone marks, thereby simultaneously predicting the functional types and states of the CRMs [52, 53]. This two-step approach substantially outperforms existing one-step methods for predicting the loci of CRMs in the genome as well as their functional types and states in cell/tissue types of the organism [52, 53]. Nonetheless, since CA has overwhelmingly higher weights than do the two enhancer or silencer histone marks in both the LR models, they are unable to unambiguously distinguish enhancers and silencers that have high CA signals but weak enhancer or silencer histone marks signals [52, 53]. Moreover, unlike CA data that are widely available in a broader spectrum of cell/tissue types, data of enhancer or silencer histone marks are only available in a small number of even well-studied human and mouse cell/tissue types, limiting the application and statistical power of the models.

Building upon the success of the two-step strategy and by overcoming the limitation of the LR models, we now introduce a new method, Correlation and Physical Proximity (CAPP) for the third step of our strategy to predict the target genes of the predicted CRMs in a genome-wide fashion. CAPP predicts enhancer-gene and silencer-gene links based on the correlation between functional states of the CRMs and the expression levels of potential target genes within the same TAD across a panel of cell/tissue types as well as physical proximity between the CRMs and potential target genes. In this way, CAPP is able to not only predict the CRMs’ target genes, but also to more accurately predict their functional types.

Results

Most of the genome and our predicted CRMs are covered by consensus TADs in various cell types

Since CRM-gene regulations are believed to occur mainly within a TAD [54,55,56,57], to reduce the search space we only consider CRMs and genes in the same TAD for potential regulation relationships. Utilizing Hi-C interaction data in six cell/tissue types, we identified their TADs at five resolutions (5000 bp ~ 100,000 bp). As expected, TADs identified at different resolutions cover varying proportions of the human genome, genes, and our predicted CRMs in each cell/tissue type (Fig. 1A shows the case in the K562 cells). As TADs predicted at a higher resolution tend to be shorter and are often nested inside of larger ones predicted at a lower resolution, we thus merged overlapping TADs predicted at different resolutions to form a certain number of merged TADs (e.g., there are 944 merged TADs in the K562 cells), which cover higher proportions of the genome, genes, and CRMs than TADs predicted at a single resolution (Fig. 1A for the K562 cells). Particularly, the vast majority (1,178,225, or 96.2%) of our 1225 K predicted CRMs in the genome are located within the merged TADs (Fig. 1A). Similar results were obtained in other five cell/tissue types with Hi-C data available (Additional file 1: Fig. S1). As shown in Fig. 1B, 81.12% of the genome can be covered by the merged TADs in all the six cell/tissue types, and 6.48% of the genome remains uncovered by any of the merged TADs in the six cell/tissue types, while 91.44% of the genome are covered by the merged TADs in at least two cell/tissue types, suggesting that TADs from different cell/tissue types are largely invariable, consistent with previous reports [58,59,60]. Therefore, in cases where Hi-C data is not yet available in a cell/tissue, we use TADs from these six cell/tissue types as a substitute. Particularly, if not otherwise noted, the analysis was conducted using TADs derived from the K562 cells.

Fig. 1
figure 1

Coverage analyses on predicted TADs. A Coverage of the genome, CRMs, and genes by TADs identified at various resolutions and by merged TADs in the K562 cells. B Proportion of the genome covered by the merged TADs from different numbers of cell lines. The legends such as “6 TADs 81.12%” means 81.12% of the genome is covered by the merged TADs from the six cell lines

CA alone can accurately predict the functional states of CRMs

As the first step of the CAPP method, it predicts functional states of CRMs in as many as possible cell/tissue types of the organism. To this end, unlike our previous method that used two LR models with CA and two histone marks as the features [52, 53], CAPP employs a single LR model with CA as the sole feature to predict functional states of CRM without first differentiating their types (enhancers and silencers). When trained and evaluated on the 67 human cell/tissue types (mostly cell lines, see Methods) with ten-fold cross-validation, the LR model achieved a median area under the receiver operator characteristic curve (AUROC) of 0.93 (Additional file 1: Fig. S2), which was only slightly lower than those (AUROC = 0.98) of LR models using two additional histone marks [52, 53]. Hence, the LR model using CA as the sole feature is able to accurately predict the functional states of CRMs in a given cell/tissue type, although it cannot differentiate their functional types (enhancers and silencers). Applying the model to the 107 human cell/tissue types (mostly primary cells/tissues, see Methods), we predicted highly varying numbers of active CRMs ranging from 18,995 (1.6%) in SJSA1 cells to 166,236 (14.1%) in motor-neuron cells, with a median of 64,476 (5.5%) in the cell/tissue types (Additional file 2: Table S1). We predicted a total of 7,363,163 active CRMs in the 107 cell/tissue types. A CRM can be predicted to be active in multiple cell/tissue types; thus, we removed such redundancy and ended up with a total of 547,695 (46.5%) non-redundant active CRMs in the 107 cell/tissue types.

Target genes of 14% of CRMs can be predicted using currently available datasets

After the functional states of the CRMs are predicted, CAPP predicts the target genes of the CRMs in TADs by examining the correlation across a panel of cell/tissue types between their functional states and expression levels of genes located within the same TADs, followed by validation of physical proximity between the CRMs and the genes using Hi-C interaction data available in the six cell/tissue types (Methods). To ensure high statistical power of the predictions, we only considered the CRMs that were predicted to be active and inactive in at least five different cell/tissue types, resulting in 260,220 (47.8%) CRMs out of the 547,695 CRMs that were predicted to be active in at least one of the 107 cell/tissue types. We first examined the impact of Hi-C contact thresholds on the predictions. As shown in Additional file 1: Figs. S3A-S3B, at an FDR of 0.1, with the increase in the threshold of normalized Hi-C contact, the numbers of predicted regulations as well as involved enhancers/silencers and genes drop rapidly, followed by a slower decrease after the threshold reaches 20. Thus, CAPP is rather robust when the Hi-C contact threshold is greater than 20. We therefore used the Hi-C contact threshold of 20 for final predictions unless indicated otherwise. Specifically, at the Hi-C contact threshold of 20, we predicted 168,366 (64.7%) CRMs that enhanced the expression of 43,521 genes via 750,058 enhancer-gene regulations. Similarly, we predicted 3049 (1.2%) CRMs repressed 3213 genes via 5247 silencer-gene regulations. In total, CAPP predicted 43,548 (74.7% out of 58,261) target genes for 169,061 (65.0% out of 260,220) CRMs. Thus, CAPP was able to predict target genes for 14% of the 1178 K CRMs within TADs using data available in only 107 cell/tissue types. Of the 169,061 CRMs with predicted target genes, 2354 (1.4%) functioned as both enhancers and silencers (dual CRMs) for different genes, while 166,012 (98.2%) exclusively acted as enhancers (exclusive enhancers), and the remaining 695 (0.4%) exclusively acted as silencers (exclusive silencers). It is evident that obtaining more data from more and diverse cell/tissue types is crucial to predict target genes for more CRMs of all the three types.

A dual functional CRM (hereafter referred to as a dual CRM) tends to regulate the largest number of genes, followed by an exclusive enhancer and an exclusive silencer

We first analyzed the number of genes regulated by the three different types of CRMs, i.e., dual CRMs, exclusive enhancers and exclusive silencers. As shown in Fig. 2A, a dual CRM tends to regulate more genes than an exclusive CRM. Furthermore, an exclusive enhancer tends to regulate more genes than an exclusive silencer (Fig. 2A). Specifically, only 52.8% of dual CRMs regulate no more than 5 genes, contrasting with 79.0% for exclusive enhancers and 95.5% for exclusive silencers. These results indicate that exclusive silencers tend to have narrower effects by regulating few genes, and exclusive enhancers tend to have broader effects by regulating larger numbers of genes, while dual CRMs regulate the largest number of genes due probably to their dual functionality.

Fig. 2
figure 2

Comparisons of different types of CRMs for their predicted target genes and regulation lengths. A Cumulative probability of CRMs regulating the indicated numbers of genes. The inset is a zooming-in view of the region with the number of genes not larger than 10. B Cumulative probability of genes regulated by the indicated numbers of CRMs. The inset is a zooming-in view of the region with the number of CRMs not larger than 10. C Boxplots of regulation lengths (in bp) of the predicted exclusive enhancer-gene, exclusive silencer-gene, dual enhancer-gene and dual silencer-gene regulations. D Cumulative probability of regulation lengths (in bp) of the predicted exclusive enhancer-gene, exclusive silencer-gene, dual enhancer-gene, and dual silencer-gene regulations. E Boxplots of \(\tau\) values of target genes of exclusive enhancers, exclusive silencers, dual enhancers, and dual silencers. F Boxplots of the number of cell/tissue types where target genes of exclusive enhancers, exclusive silencers, dual enhancers, and dual silencers are expressed

To see if the three types of CRMs might be under different evolutionary constraints, we plotted the phyloP conservation scores [61] of the positions of the three types of CRMs. As shown in Additional file 1: Fig. S4, the three types of CRMs exhibit similar distributions of phyloP scores, indicating that they are under similar evolutionary constraints. However, when we examined overlaps of these three types of CRMs with 855,436 expression quantitative trait loci (eQTLs) and 308,613 genome-wide association studies (GWAS) loci (Methods), we observed that dual CRMs were significantly more enriched for eQTLs and GWAS loci than exclusive elements (Fisher’s exact test, p-value < 1e − 5) (Additional file 2: Tables S2-S3).

Enhancers are more cooperative than silencers to regulate target genes

We next analyzed the numbers of enhancers or silencers that regulate a gene. As shown in Fig. 2B, only 2.7% of genes are regulated by more than 5 silencers. In contrast, 64.4% of genes are regulated by more than 5 enhancers. These results suggest that enhancers are more likely to cooperate with one another to regulate a gene than silencers. In other words, multiple enhancers are required to upregulate a gene, while only few silencers are needed to downregulate a gene. This result is in line with a previous report that an assembly of multiple enhancers, often located tens to hundreds of thousands basepairs away from their target genes, tend to collaboratively regulate the common target gene by looping to the gene’s promoter [62].

Dual CRMs tend to regulate more distant genes

We ask whether different types of CRMs have preference to regulate nearby or distant genes. To this end, we compared the distribution of regulation lengths of our predicted CRM-gene regulations. Here the regulation length is defined as the distance in basepairs between the nearer end of a CRM and the TSS of its target gene. If a target gene’s TSS falls within the body of a CRM, we consider the regulation length to be 0. For each dual CRM, we refer to it as a dual enhancer when it functions as an enhancer, and a dual silencer when it functions as a silencer. As illustrated in Fig. 2C, the regulation lengths of dual enhancers and silencers are significantly longer than those of exclusive CRM-gene links. Remarkably, approximately 16.3, 12.8, 8.1, and 8.4% of the dual enhancer-gene, dual silencer-gene, exclusive enhancer-gene, and exclusive silencer-gene links, respectively, exhibit a regulation length exceeding 5 M bp (Fig. 2D). This exceeds the fixed 1 ~ 5 M bp flanking regions used in conventional approaches, which can potentially miss such regulations. Similarly, the number of intervening genes between dual CRMs and their target genes are greater than those between exclusive CRMs and their target genes (Additional file 1: Figs. S5A-S5B).

Enhancers tend to regulate more narrowly expressed genes while silencers more broadly expressed genes

To explore whether enhancers and silencers exhibit preferences for regulating specific types of genes, we analyzed the expression patterns of their predicted target genes. Specifically, we first calculated the \(\tau\) index value [63] for each gene within TADs that were expressed (transcript per million (TPM) > 0) in at least one of the 107 distinct cell/tissue types. The τ index measures gene expression specificity, with a value of 0 indicating ubiquitous expression and a value of 1 indicating expression in a single cell/tissue type [63]. Genes regulated by enhancers tend to have greater \(\tau\) values than those regulated by silencers (Fig. 2E), implying that enhancers tend to regulate more narrowly expressed genes, while silencers tend to regulate more broadly expressed genes. Consistently, genes regulated by enhancers tend to be expressed in fewer cell/tissue types than those regulated by silencers (Fig. 2F). These observations underscore the distinct strategies that enhancers and silencers take to regulate different gene types based on the prevalence of their usages.

Static and active cis-regulatory networks can be built by the predicted CRM-gene links

Based on our 43,548 predicted target genes of the 169,061 CRMs, we constructed static cis-regulatory networks (sCRNs) in which the nodes are the CRMs and their target genes, and the edges are the CRM-gene links between them, regardless of the functional states of the CRMs in cell/tissue types. The sCRNs are made of 1712 connected components distributed across the 23 pairs of chromosomes (Additional file 2: Table S4). The active cis-regulatory networks (aCRNs) in a cell/tissue type can be induced from the sCRNs by the active enhancer-gene and silencer-gene links in the cell/tissue type. As an example, Figs. 3A and 3B show the sub-sCRNs on chromosome 17, composed of 44 connected components and the sub-aCRNs on chromosome 17 in the K562 cells embedded in the sub-sCRNs with the active exclusive enhancer-gene, exclusive silencer-gene, and dual CRM-gene links highlighted in green, red, and orange, respectively. More network examples can be found in the Additional file 3: Fig. S1.

Fig. 3
figure 3

Examples of sub-static and sub-active cis-regulatory networks on chromosome 17. A The sub-static cis-regulatory networks on chromosome 17 are made up of 44 connected components. The sub-active cis-regulatory networks on the chromosome in the K562 cells are embedded in the static cis-regulatory networks and can be induced by active exclusive enhancers, active exclusive silencers and active dual CRMs in the cells. The circles represent CRMs, including inactive CRMs (gray), active exclusive enhancers (green), active exclusive silencers (red), and active dual CRMs (yellow) in the K562 cells. The purple squares represent genes, and their sizes are proportional to the degrees, i.e., the number of their regulating CRMs. The edges represent regulation relationships between CRMs (circles) and genes (squares). The edges linked to active CRMs are colored by the same colors as the type of active CRMs. B The zooming-in view of the connected component from Fig.  3 A in light blue shadow

Our model outperforms the distance-based method CNA

We compared our method CAPP with a commonly used distance-based method, CNA. First, for each of the 168,366 predicted enhancers, we assigned the gene(s) whose TSS(s) was(were) closest to the enhancer as its target gene(s), resulting in 189,675 regulations (a predicted enhancer may have multiple closest genes). We evaluated the possibility of these CRMs being enhancers of the assigned target genes. To this end, we tested whether the expression levels of the putative target genes of each predicted enhancer were significantly higher in the cell/tissue types where the CRM was active than in the other cell/tissue types where the CRM was inactive. At an FDR of 0.1, 106,031 (55.9%) of the 189,675 regulations predicted by CNA exhibited significantly positive regulation, while the rest 44.1% did not show enhancer-gene regulations (Fig. 4A). In contrast, all of our predicted 750,058 enhancer-gene regulations exhibited significantly positive regulations (Fig. 4A). Consistently, 76,083 (40.1%) of the 189,675 regulations predicted by CNA coincided with our 750,058 predicted enhancer-gene links (Fig. 4B), implying that about 40.1% of our enhancers might regulate their closest genes, while the remaining 59.9% regulate distant genes. Figure 4C shows an example of enhancer chrY:20,574,642–20577538 regulating a distant gene TTTY14. On average, our predicted enhancers regulated 4.5 genes while those predicted by CNA only targeted 1.1 genes. Hence, CNA might overlook distant regulations, missing out on the crucial one-to-many regulatory mechanism for enhancers. This is significant as a CRM can target genes located hundreds of thousands to a million bp away [48], and a single CRM might regulate multiple target genes [49].

Fig. 4
figure 4

Comparison of CNA and our method CAPP for predicting target genes of CRMs. A Distribution of FDR-adjusted p-values of enhancer-gene regulations assigned by CNA or predicted by CAPP. The p-values were calculated by one tailed Mann–Whitney U test. B Venn diagram of enhancer-gene regulations assigned by CNA and enhancer-gene regulations predicted by CAPP. C Boxplots of expression levels of gene TTTY14 against the functional states of its predicted distal enhancer chrY:20,574,642–20577538 in the 107 cell/tissue types. p < 1e − 19, one tailed Mann–Whitney U test. D Distribution of FDR-adjusted p-values of silencer-gene regulations assigned by CNA or predicted by CAPP. The p-values were calculated by one tailed Mann–Whitney U test. E Venn diagram of silencer-gene regulations assigned by CNA and silencer-gene regulations predicted by CAPP. F Boxplots of expression levels of gene ARHGAP19 against the functional states of its predicted distal silencer chr10:96,388,444–96,390,765 in the 107 cell/tissue types. p < 1e − 11, one tailed Mann–Whitney U test

Next, we evaluated the possibility of these CRMs being silencers of the assigned target genes. To this end, first, for each of the 3049 predicted silencers, we assigned the gene(s) whose TSS(s) was(were) closest to the silencer as its target gene(s), resulting in 3511 regulations (a predicted silencer may have multiple closest genes). We then tested whether the expression levels of the putative target genes of each predicted silencer were significantly lower in the cell/tissue types where the CRM was active than in the other cell/tissue types where the CRM was inactive. Of the 3511 links, at the same FDR of 0.1, only 318 (9.1%) regulations exhibited significantly negative regulations, while the remaining 90.9% of the regulation did not show significantly negative regulations (Fig. 4D). In contrast, all of our predicted 5247 silencer-gene regulations exhibited significantly negative regulations (Fig. 4D). Consistently, only 99 (2.8%) of the 3511 silencer-gene links predicted by CNA overlapped our 5247 predicted silencer-gene links (Fig. 4E), implying that only about 2.8% of silencers might regulate their closest genes, while the remaining 97.2% regulate distant genes. This result suggests that silencers, compared with enhancers, do not tend to regulate nearby genes. Figure 4F illustrates an example of a silencer chr10:96,388,444–96,390,765 regulating a distant gene ARHGAP19. On average, our predicted silencers regulated 1.7 genes while those predicted by CNA only targeted 1.2 genes. As in the case of enhancers, CNA might neglect distant regulations, leading to a potential oversight of a one-to-many regulatory mechanism of silencers. In summary, CAPP demonstrates superior performance over CNA in predicting the target genes for both enhancers and silencers.

Both CAPP and the activity-by-contact (ABC) model might only capture a small portion of enhancer-gene regulations

We next compared our predicted enhancer-gene regulations with those predicted by the ABC model [38]. The ABC model considers DNA sequences with DNase-seq and/or H3K27ac ChIP–seq signals as regulatory elements (REs), and calculates a score for each RE to regulate a gene within the two 5-Mb flanking regions around the RE. The score is defined as the product of the strength of these epigenetic marks signals (Activity) in the RE and the Hi-C interaction frequency (Contact) between the RE and the gene. Using a predefined threshold of the score, the ABC model predicted a total of 425,350 non-redundant RE-gene (RE-G) links, involving 153,549 REs and 18,752 genes in five human tissue/cell types (GM12878, K562, liver, LNCAP and NCCIT). We consider these regulations as an sCRN predicted by the ABC model, in which on average, each RE was predicted to regulate 2.8 genes, and each gene was regulated by 22.7 enhancers. As we predicted much more enhancer-gene regulations than did the ABC model, for a fair comparison, we compared the ABC model-predicted sCRN with a sub-sCRN predicted by CAPP with the same number of enhancer-gene regulations [425, 350] as in the ABC model-predicted sCRN. Our sub-sCRN involves 77,805 enhancers and 20,003 genes, and on average each enhancer regulates 5.5 genes, and each gene is regulated by 21.3 enhancers. Thus, with the same number [350, 425] of predicted regulations, the ABC model predicts target gene for more REs (153,549 REs vs 77,805 enhancers), while CAPP predicts more (20,003 vs 18,752) target genes for enhancers.

We found that 17.2% REs overlap 16.9% of our enhancers by at least 1 bp (Fig. 5A), and that multiple REs may overlap one of our predicted enhancers, as REs are generally shorter than our enhancers (data not shown). We refer these overlapping REs and enhancers as ERo (enhancer-RE overlapping) REs and enhancers, respectively. The ERo REs and enhancers are involved in 12.7% RE-G links and 19.3% enhancer-G (enhancer-gene) links, respectively. If a pair of overlapping ERo RE and enhancer regulate the same gene, we refer their respective link as an ERo RE-Gm (RE-G link with matched gene) link and an ERo enhancer-Gm (Enhancer-G link with matched gene) link (Fig. 5A); otherwise, we refer their respective link as an ERo RE-Gnm (RE-G link with unmatched gene) link and an ERo enhancer-Gnm (Enhancer-G link with unmatched gene) link (Fig. 5A). This classification yielded 4.0% as ERo RE-Gm links, 8.7% as ERo RE-Gnm links, 2.2% as ERo enhancer-Gm links, and 17.1% as ERo enhancer-Gnm links (Fig. 5A). Thus, the ABC model predicted more RE-Gm links than CAPP predicted enhancer-Gm links, while CAPP predicted more enhancer-Gnm links than the ABC model predicted RE-Gnm links. Moreover, we found that 74.8% REs did not overlap any of the CRMs in our sub-sCRN but overlapped our CRMs not in the sub-sCRN for which we were unable to predict their target genes using the data available to us. We refer these REs as ERno-Co (Enhancer-RE not overlapping, but CRM-RE overlapping) REs, which are involved in 80.8% of RE-G links (Fig. 5A). The remaining 8.0% of REs do not overlap any of the CRMs within TADs, and we refer them as CRno (CRM-RE not overlapping) REs, which are involved in 6.5% of RE-G links (Fig. 5A). To see whether these CRno REs are functional, we plotted the distribution of their nucleotides’ phyloP scores [61]. As shown in Fig. 5B, the density curve of CRno REs is more narrowly distributed with a high peak around 0, compared to those for the ERo REs and ERno-Co REs, indicating that the CRno REs are less likely under natural selection, and thus, at least some of them might not be even parts of authentic enhancers (Fig. 5B). In fact, we found that all of these CRno REs were located in our previously predicted non-CRMs (data not shown); thus, it is not surprising that they have similarly less evolutionary constraints as the non-CRMs [14]. Finally, 83.1% of our 77,805 enhancers do not overlap the REs, and we refer to them as ERno (Enhancer-RE not overlapping) enhancers, which are involved in 80.7% of enhancer-G links (Fig. 5A). These results indicate that only a small proportion of enhancers and their regulatory relationships predicted by the two methods overlap. Additionally, we performed the same comparison using the top-ranked half number (i.e., 212,675) of the ABC model predicted RE-G links and the same number of our top-ranked enhancer-G links, and drew the same conclusion (Additional file 1: Fig. S6).

Fig. 5
figure 5

Comparison of our predicted top-ranked 425,350 enhancer-G links with the same number of RE-G links predicted by the ABC model across different cell/tissue types. A A cartoon showing scenarios of overlaps between the RE-G links and the enhancer-G links. Based on the overlaps between the REs and our enhancers, we divide them into different categories: ERo enhancers and REs overlap each other; ERno-Co REs do not overlap the enhancers but overlap our other CRMs; CRno REs do not overlap any CRMs; and ERno enhancers do not overlap any REs. Accordingly, we also divide the RE-G links and enhancer-G links into different categories: ERo RE-Gm and ERo enhancer-Gm links are those with overlapping enhancers and REs and matched target genes; ERo RE-Gnm and ERo enhancer-Gnm links are those with overlapping enhancers and REs but different target genes; ERno-Co RE-G links are those for REs not overlapping enhancers but overlapping other CRMs; ERno enhancer-G links are those for enhancers that do not overlap REs; and CRno RE-G links are those for REs that do not overlap any CRMs (See in Additional file 2: Table S5 for convenient reference). B Density curves of PhyloP scores of the three categories of REs: ERo, ERno-Co, and CRno

Since H3K27ac is generally considered as a mark for active enhancers, using H3K27ac levels as the activity signal the ABC model is actually aimed to predict target genes of active enhancers in a cell type [38]. We thus also compared ABC model-predicted aCRN (82,810 links) in the K562 cells with a sub-aCRN with the same number [82, 82], 810 of predicted active regulations in the K562 cells, the ABC model predicts target genes for more active REs (34,130 REs vs 24,935 enhancers), while CAPP predicts more (22,542 vs 11,708) target genes for active enhancers. Like the case of sCRNs, we also observed that only a small proportion of the active enhancers and REs in the K562 cells have overlaps (Fig. 6A). Additionally, the REs that do not overlap our CRMs tend to be less likely under evolutionary selection (Fig. 6B) and thus might not be true enhancers. Again, we found that all of these non-overlapping REs were located in our previously predicted non-CRMs (data not shown); hence, it is expected that they have similarly less evolutionary constraints as the non-CRMs [14].

Fig. 6
figure 6

Comparison of our predicted top-ranked 82,810 enhancer-G links with the same number of RE-G links predicted by the ABC model in the K562 cells. A A cartoon showing the scenarios of overlaps between the RE-G links and the enhancer-G links. The definitions are the same as in Fig. 5A (See Additional file 2: Table S5 for convenient reference) for ERo enhancers and REs, ERno-Co REs, CRno REs, and ERno enhancers, as well as ERo RE-Gm and ERo enhancer-Gm links, ERo RE-Gnm and ERo enhancer-Gnm links, ERno-Co RE-G links, ERno enhancer-G links, and CRno RE-G links. B Density curves of PhyloP scores of the three categories of REs: ERo, ERno-Co, and CRno. C Heat maps of ATAC, H3K27ac and H3K4me1 signals of active and inactive REs with ERo RE-Gm links or ERo RE-Gnm links in the K562 cells. D Heat maps of ATAC, H3K27ac and H3K4me1 signals of active and inactive REs with ERno-Co RE-G links or CRno RE-G links in the K562 cells. The heat maps show the mean signal of each 100-bp window in each sequence and the plot above each column of heat maps shows the mean signal of each window position across the sequences sampled in the same categories (Methods). The color code for the categories in the density plot above each column is the same with the heat map legends. E Boxplots of FDR-adjusted p-values for comparing the expression levels of putative target genes of the REs in cell/tissue types where the REs are active with those in other cell/tissue types where the REs are inactive. The p-values were calculated by one tailed Mann–Whitney U test as used in our method (CAPP)

We noted that some REs overlapped our predicted inactive enhancers in the K562 cells, although the REs were identified by their H3K27ac signals. We thus further investigated the issue by applying our LR state predictor to the REs in the K562 cells. Of the 34,130 REs predicted by the ABC model in the K562 cells, while the majority (29,654 or 86.9%) were predicted to be active, the remaining considerable 4476 (13.1%) were predicted to be inactive. Though our LR predictor only used CA as the feature, our predictions of functional states of REs are mostly supported by the signals of the three active enhancer marks (CA, H3K27ac, and H3K4me1) on the predicted (i) active and inactive REs with either ERo RE-Gm links or ERo RE-Gnm links (Fig. 6C) and (ii) active and inactive REs with either ERno-Co RE-G links or CRno RE-G links (Fig. 6D). Specifically, except for the H3K27ac signal for inactive REs with ERo RE-Gm links, in all these cases, the three epigenetic marks are enriched around the centers of the predicted active RE but not around the centers of predicted inactive REs (Figs. 6C,D). Hence, our predictions of functional states for the REs are accurate.

We reason that if the RE-G links are correctly predicted, then like our predicted enhancer-G links (Fig. 4A), the expression levels of the target genes of REs should be significantly higher in certain cell/tissue types where the REs are active than in other certain cell/tissue types where the REs or enhancers are inactive. To ensure statistical power in evaluating the RE-G links, we only consider REs that are active and inactive in at least five cell/tissue types. Interestingly, our reasoning holds for 94.9% of the ERo RE-Gm links (Fig. 6E), suggesting that only 5.1% of ERo RE-Gm links might be false positives. The discrepancy between ERo RE-Gm links and ERo enhancer-Gm arises due to our requirement of a single-bp overlap between REs and enhancers, potentially leading to differing predictions of their functional states. Moreover, of the ERo RE-Gnm links, 54.0% were identified as significant positive regulations, while the remaining 46.0% might be false positives. In addition, 47.8% of the ERno-Co RE-G links and 50.8% of the CRno RE-G links (Fig. 6E) exhibit significant positive regulations (FDR of 0.1). Thus, our predictions might have missed some positive regulations predicted by the ABC model, particularly, the significant predictions in the ERno-Co RE-G links and the CRno RE-G links. However, the ABC model also disregarded ERno enhancers as well as their target genes predicted by our method CAPP. Additionally, we compared the top-ranked half number (i.e., 41,405 regulations) of the ABC model predicted RE-G links with the same number of our top-ranked enhancer-G links in the K562 cells and derived the similar results (Additional file 1: Fig. S7). Considering that the two methods take different approaches to predict CRMs and target genes, the low overlap rate of their predictions indicate that both our method and the ABC model might merely scratch the surface of capturing the extensive landscape of true enhancer-gene regulations.

Our method CAPP outperforms ENCODE-rE2G for recalling CRISPR-determined enhancer-gene regulations

ENCODE project [64] recently developed a supervised-machine-learning model ENCODE-rE2G [65] for predicting CRMs and target genes, trained on 10,411 CRISPR-verified enhancer-gene regulations from three studies [38, 66, 67], over 30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variant-gene pairs. The model has been used to predict target genes of candidate cis-regulatory elements (cCREs) previously predicted by the ENCODE project [64]. The cCREs were defined based on DNase-seq data, and similarly, the REs used in the ABC model were defined using the same approach [38, 65]. The ENCODE-rE2G model incorporates 13 features, one of which is derived from the ABC scores [65]. In this light, the ENCODE-rE2G model might be considered as an extended version of the ABC model. To compare their predictive performance, we examined the top 82,810 RE-gene links predicted by each model in the K562 cells. The ENCODE-rE2G model predicted 41,647 unique REs linked to 13,968 genes, whereas the ABC model predicted 34,130 REs associated with 11,708 genes. As shown in Additional file 1: Fig. S8A, 51.6% of the ENCODE-rE2G REs overlap with those predicted by the ABC model, while 79.8% of ABC REs overlap with those from ENCODE-rE2G. Furthermore, 40.9% of ENCODE-rE2G RE-gene links are shared with the ABC predictions, and 51.0% of ABC RE-gene links are shared with ENCODE-rE2G. These relatively high levels of overlap between the two sets of REs are expected, as both models rely on similar method and data to define cCREs and REs. Likewise, the shared predictions of RE-gene links are not surprising, considering that the ABC model contributes directly to the ENCODE-rE2G model [65].

We next compared the top-ranked 82,810 enhancer-gene linked predicted by our method with the same number of top-ranked RE-gene predicted by the ENCODE-rE2G model in K562 cells (Additional file 1: Fig. S8B). As mentioned earlier, our predicted enhancer-gene links involve 24,935 enhancers and 22,542 genes. We found that both our predicted enhancers (25.5%) and enhancer-gene links (3.9%) show a greater degree of overlap with those predicted by the ENCODE-rE2G model (Additional file 1: Fig. S8B) than with those (12.8 and 2.0%, respectively) predicted by the ABC model (Fig. 6A), suggesting that our method is more closely aligned with the extended and potentially more comprehensive rE2G model. Nevertheless, the overall overlap remains relatively low, highlighting that even the most advanced current models—including ours and ENCODE-rE2G—may capture only a portion of the true complexity underlying enhancer-gene regulatory interactions.

Finally, we compared the abilities of CAPP, the ABC model, and the ENCODE-rE2G model of recalling the 10,411 CRISPR-verified enhancer-gene regulations. As shown in the Additional file 1: Fig. S9, among the top 10,000 predicted regulation from the three methods, our method CAPP includes more CRISPR-verified links than the ENCODE-rE2G model but fewer than the ABC model. However, for the top 6000 predictions or fewer, CAPP captures an equal or greater number of CRISPR-verified links compared to the other two methods.

CAPP outperforms PECA in accuracy for predicting silencer-gene regulations

SilencerDB [68] contains non-redundant 86,035 RE-G links predicted by paired expression and chromatin accessibility (PECA) [69], involving 79,604 silencers and 10,195 target genes. We consider these regulations as an sCRN predicted by PECA. We then compared the same number [35, 86] of our predicted top-ranked silencer-G (silencer-gene) links involving 38,677 silencers and 20,467 target genes with those (hereafter referred as RE-G links) predicted by PECA. Among the RE-G links, 8104 have their 7412 (9.3%) REs overlapping 3928 (10.2%) our silencers with predicted target genes. Like enhancers, we refer these overlapping silencers and REs as SRo (Silencer-RE overlapping) REs and SRo silencers, respectively (Fig. 7A). Out of the 8104 SRo RE-G links, 1028 (1.2%) also target the same gene as overlapping silencers (SRo RE-Gm links), while 7076 (8.2%) target different genes than overlapping silencers (SRo RE-Gnm links). The 3928 (10.2%) SRo silencers are involved in 11,015 (12.8%) SRo silencer-G links, of which 580 target the same genes as overlapping REs and are involved in 592 (0.7%) SRo silencer-Gm (Silencer-G link with matched gene) links, while 3765 target different genes than overlapping REs and are involved in 10,423 (12.1%) SRo silencer-Gnm (Silencer-G link with unmatched gene) links. It is important to note that an SRo REs and an SRo silencer can be involved in both types of links. Among the RE-G links whose REs do not overlap our silencers, 50,322 (58.5%) have their REs overlapping our CRMs for which we were unable to predict their target genes, and we refer them as SRno-Co (Silencer-RE not overlapping, but CRM-RE overlapping) REs; the remaining 27,609 (32.2%) have their REs not overlapping with our CRMs, and we refer them as CRno REs (Fig. 7A). As expected, in contrast to the REs that overlap with our silencers or other CRMs, the CRno REs are more likely selectively neutral (Fig. 7B), suggesting that most of CRno REs might be false positives (Fig. 7B). Finally, 34,749 (89.8%) of our 38,677 silencers do not overlap the REs, and we refer them as SRno (Silencer-RE not overlapping) silencers, which are involved in 75,020 (87.2%) silencer-G links (Fig. 7A).

Fig. 7
figure 7

Comparison of our predicted top-ranked 86,035 silencer-gene links with same number of RE-gene links predicted by the PECA in silencerDB. A A cartoon showing the overlaps between the RE-G links and our silencer-G links. Based on the overlaps between the REs and our silencers, we divide them into different categories: SRo silencers and REs that overlap each other; SRno-Co REs that do not overlap our silencers but overlap our other CRMs; CRno REs that do not overlap with any CRMs; and SRno silencers that do not overlap REs. Accordingly, we also divide the RE-G links and silencer-G links into different categories: SRo silencer-Gm and SRo RE-Gm links are those with overlapping silencers and REs and matched target genes; SRo RE-Gnm and SRo silencer-Gnm links are those with overlapping silencers and REs but different target genes; SRno-Co RE-G links are those for REs not overlapping silencers but overlapping other CRMs, and SRno silencer-G links are those for silencers not overlapping REs; CRno RE-G links are those for REs that do not overlap any CRMs (See in Additional file 2: Table S5 for convenient reference). B Density curves of PhyloP scores of the three categories of REs: SRo, SRno-Co, and CRno

We then compared our predicted silencer-gene links of which the silencers were predicted to be active in the K562 cells with those links predicted in the K562 cells by PECA (Fig. 8). PECA predicted 1935 RE-gene regulations in the K562 cells, involving 1788 REs and 1434 genes. On average, each RE regulated 1.1 genes, and each gene was targeted by 1.3 REs. For a fair comparison, we selected the same number (1935) of our top-ranked active links in the K562 cells, involving 1203 active silencers and 1423 genes. On average, each silencer regulated 1.6 genes, and each gene was targeted by 1.4 silencers. We observed that, similar to Fig. 7A, only a small proportion of the silencers and REs overlapped (Fig. 8A) and these REs were more likely under natural selection than the REs not overlapping our silencers (Fig. 8B).

Fig. 8
figure 8

Comparison of our predicted top-ranked 1935 silencer-gene links with the same number of RE-gene links predicted by the PECA from silencerDB. A A cartoon showing the overlaps between the RE-G links and our silencer-G links. The definitions are the same as in Fig. 7A (See Additional file 2: Table S5 for convenient reference) for SRo silencers and REs, SRno-Co REs, CRno REs, SRno silencers, SRo silencer-Gm and SRo RE-Gm links, SRo RE-Gnm and SRo silencer-Gnm links, SRno-Co RE-G links, SRno silencer-G links and CRno RE-G links. B Density curves of PhyloP scores of the three categories of REs: SRo, SRno-Co, and CRno. C Heat maps of ATAC and H3K27me3 signals of active and inactive REs with SRo RE-Gm links and SRo RE-Gnm links in the K562 cells. D Heat maps of ATAC and H3K27me3 signals of active and inactive REs with SRno-Co RE-G links or CRno RE-G links in the K562 cells. The heat maps show the mean signal of each 100-bp window in each sequence and the plot above each column of heat maps shows the mean signal of each window position across the sequences sampled in the same categories (see Methods). The color code for the categories in the density plot above each column is the same with the heat map legends. E Boxplots of FDR-adjusted p-values for comparing the expression levels of putative target genes of the REs in cell/tissue types where the REs are active with those in cell/tissue types where the REs are inactive. The p-values were calculated by one tailed Mann–Whitney U test as used in our method (CAPP)

As the functional states of the 1788 REs with target genes predicted by PECA are unknown in the K562 cells, we predicted them by our LR model using CA as the sole feature. We found that 715 (40.0%) were predicted active while the remaining 1073 (60.0%) inactive in the cells. Due to the limited number of REs that overlap our predicted silencers (Fig. 8C), we focus our discussion on REs that do not overlap our predicted silencers (Fig. 8D). As expected, the predicted active REs are enriched for the CA mark, while the predicted inactive REs are depleted of the CA in the cells (Fig. 8D). Furthermore, the predicted active REs exhibit higher H3K27me3 signals compared to the inactive ones (Fig. 8D), suggesting that H3K27me3 may serve as an informative marker for silencers, consistent with previous reports [70, 71].

We reason that if the RE-G links were correctly predicted, then, like our silencer-G links (Fig. 4D), the expression levels of the target genes should be lower in some cell/tissue types where the REs are active than in other cell/tissue types where the REs are inactive. To ensure statistical power in evaluating the RE-G links, we only considered REs that were active and inactive in at least 5 cell/tissue types. The vast majority of the four types of links do not show significantly negative regulation at an FDR of 0.1 (Fig. 8E). This suggests that few of these RE-G links are true regulations. Notably, our silencers target an average of 1.6 genes, while REs are linked to just 1.1 genes on average. It is likely that PECA might overlook certain regulations, as a single silencer, on average, tends to regulate multiple target genes [72]. Notably, an overwhelming majority (99.8%) of PECA’s predictions deviate from the expected behavior of silencers, suggesting that they may be false positives. In this regard, CAPP demonstrates higher accuracy than PECA.

Discussion

In this paper, we introduced a new method CAPP to predict the target genes of CRMs. Leveraging on our previously predicted map of 1.2 M CRMs in the human genome and functional states (active and inactive) of the CRMs, CAPP identifies target genes of the predicted CRMs within TADs by finding genes within the same TADs, whose expression levels are correlated with the functional states of the CRMs across a panel of cell/tissue types, followed by checking whether the CRMs and the genes are in close physical proximity as measured by Hi-C data in multiple cell/tissue types. There are a few merits of our method. Firstly, CAPP can potentially enable us to predict the target genes of all CRMs in the genome when data is available from a sufficiently large number of diverse cell/tissue types. Even using data in only 107 cell/tissue types in this study, we are able to predict target genes for around 14% of the 1.2 M CRMs. It is important to note that evaluating the correlation between the functional states of a CRM and the expression levels of its potential target genes becomes unreliable when the CRM is active or inactive in only a small number of tested cell/tissue types. To ensure meaningful correlation analysis, we included only those CRMs that are active in at least \(t\) cell/tissue types and inactive in at least \(t\) other types. In this study, we set \(t=5\), which corresponds to approximately 5% of the 107 cell/tissue types analyzed. This threshold represents a trade-off between two opposing considerations. A less stringent threshold (e.g., \(t=2\)) would include more CRMs but may lead to an increased false positive rate due to reduced statistical power. Conversely, a more stringent threshold (e.g., \(t=10\)) may reduce false positives but risks excluding a greater number of true CRM-gene links. As a result, a substantial number of CRMs were excluded from the correlation analysis, despite exhibiting PhyloP score distributions similar to those of the selected CRMs (Additional file 1: Fig. S10), suggesting that they are subject to comparable evolutionary constraints and are likely functional. Clearly, a much larger number of diverse cell/tissue types with the minimally required data available are needed in the future to predict target genes for most, if not all, of the CRMs.

Secondly, in addition to Hi-C data in a few cell types, CAPP only needs CA and RNA-seq data in a panel of cell types for the prediction, thus is highly cost-effective. Thirdly, in addition to target genes of CRMs, CAPP also is able to predict the functional types of CRMs as enhancers and/or silencers, the first of its kind, to the best of our knowledge. Fourthly, CAPP predicts the functional types of CRMs based on whether their functional states exhibit positive or negative correlation with the expression level of their target genes, respectively, in a panel of cell/tissue types, thus, overcoming a drawback of our previous LR models using three epigenetic marks [53]. Fifthly, CAPP evaluates every pair of CRM and gene within the same TAD without a fixed distance constraint, allowing it to predict target genes located more than 5 M bp away from the regulating CRM. This is the first of its kind, to our best knowledge, as previous methods predict target genes of candidate enhancers or silencers only in a fixed flanking genomic range. In fact, about 8.4% of our predicted enhancer-gene regulations have a regulation distance longer than 5 M bp, which could be missed by existing methods that use a fixed distance constraint, typically 1 ~ 5 M bp. Sixthly, although CAPP is aimed to predict cell type agnostic sCRNs encoded in the genome, an aCRN in any cell/tissue type can be readily induced from the sCRNs by the active CRMs predicted using only CA data in the cell/tissue type. Seventhly, CAPP predicts CRM-gene links with higher accuracy than existing methods evaluated. The expression levels of our predicted target genes are significantly positively and negatively correlated with the functional states of the predicted regulating enhancers and silencers, respectively. However, such correlations do not hold for a considerable proportion of target genes of enhancers and silencers predicted by the state-of-the-art methods, namely the ABC model and PECA, respectively. Finally, it is remarkable that top-ranked 6000 predictions by CAPP outperform the ABC model and ENCODE-rE2G, even though the latter two models were trained using CRISPR-determined enhancer-gene regulations while our method was not. However, in the absence of a large gold standard set for enhancer-gene and silencer-gene regulations, it is still difficult to calculate the sensitivity and specificity of each method.

Since the candidate enhancers used in the ABC model and the ENCODE-rE2G model were predicted in similar approaches, and since the ABC models is incorporated in the ENCODE-rE2G model, it is not very surprising that both their predicted enhancers and regulations have considerable overlaps. Notably, both our predicted enhancers and enhancer-gene links exhibit greater overlap with those from the more comprehensive ENCODE-rE2G model than with those from the ABC model. This suggests that our method is capable of identifying enhancers and regulatory links that may be missed by the ABC model. However, the overall degree of overlap between our predictions and those of the ENCODE-rE2G model—particularly for enhancer-gene links—remains relatively low. This highlights the intrinsic difficulty of accurately predicting enhancer-gene interactions and underscores the continued need for methodological improvements and experimental validation.

Notably, we identified far fewer silencer-gene links than enhancer-gene links. While this discrepancy may partly reflect limitations in our method, it is unlikely that CAPP systematically fails to detect most silencers while successfully identifying enhancers. This is because CAPP predicts the functional type of each CRM—enhancer and/or silencer—using the same input features: chromatin accessibility and gene expression levels across a panel of cell/tissue types, within the same TAD. These features are evaluated through correlation analysis, followed by assessment of physical interactions. If any bias against silencer detection exists, it may stem from our assumption that regulatory interactions occur within the same TAD, as defined by Hi-C contact data. Should silencers frequently act over longer distances or via contact-independent mechanisms, they may be underrepresented in our predictions. Additionally, due to the lack of gold-standard silencer-gene interaction datasets, we applied the same Hi-C contact threshold used for enhancers to filter potential indirect links. If silencers and enhancers follow different spatial constraints, this uniform threshold could further contribute to the under-detection of silencer-gene associations. However, even prior to applying the Hi-C contact filter, the number of predicted enhancer-gene links [8, 57], 880 at an FDR of 0.1, indicating that physical interaction filtering alone does not account for the observed disparity. This observation is consistent with current understanding: the number of silencers in the human genome is believed to be smaller than the number of enhancers [73,74,75]. While this difference may partly reflect technical and methodological challenges in silencer identification, it is estimated that the human genome contains between several hundreds of thousands to over one million enhancers [14, 76, 77], whereas no comparable estimates exist for silencers, to our knowledge. We therefore conclude that the substantially smaller number of predicted silencers more likely reflects inherent biological differences between enhancers and silencers rather than methodological shortcomings.

Based on our predicted enhancer-gene and silencer-gene regulations, we revealed distinct properties of various CRM types. Firstly, dual CRMs tend to regulate the greatest number of genes, followed by exclusive enhancers and exclusive silencers. Secondly, enhancers display a higher degree of cooperation in gene regulation than silencers. Thirdly, dual CRMs tend to regulate more distant genes than exclusive enhancers and exclusive silencers. Fourthly, enhancers prefer to regulate narrowly expressed genes, whereas silencers tend to regulate more broadly expressed genes. Finally, dual CRMs are more likely to harbor GWAS and eQTL marks than exclusive enhancers and silencers, suggesting that the former type of CRMs are more likely than the latter two types to regulate trait-altering genes.

Conclusions

In this study, we proposed a new correlation and physical proximity method CAPP to predict both the functional types and target genes of CRMs simultaneously. Using CAPP, we have identified hundreds of thousands of enhancer-gene regulations and thousands of silencer-gene regulations. These findings highlight the diverse characteristics of different types of CRMs, their target genes, and their regulation links, providing insights into the distinct regulatory behaviors of exclusive enhancers, exclusive silencers, and dual CRMs. Importantly, our method surpasses or complement with existing state-of-the-art methods, representing a notable advancement in predicting target genes and functional types of CRMs.

Methods

Data sets

We obtained a comprehensive set of 1,225,115 predicted CRMs in the human genome from our prior work [14]. We downloaded Hi-C data of six cell lines from the ENCODE data portal [64] or 4D Nucleome [78] data portals (Additional file 2: Table S6). We downloaded DNase-seq, ATAC-seq, and TF ChIP-seq data of the 67 human cell/tissue types for training from Cistrome Datasets Browser [79, 80] (Additional file 2: Tables S7-S8). We downloaded ATAC-seq data and RNA-seq data of the 107 cell/tissue types for predicting from ENCODE data portal [64] (Additional file 2: Table S9). We downloaded enhancer-gene links predicted by the ABC model [38] from https://osf.io/uhnb4/, and obtained non-redundant silencer-gene links predicted by PECA [69] from SilencerDB [68] at https://health.tsinghua.edu.cn/silencerdb/. We downloaded the ENCODE-rE2G predictions and a set of 10,411 CRISPR-verified enhancer-gene regulations compiled from three studies [38, 66, 67] using the accession IDs ENCSR030MBR and ENCSR998YDI from the ENCODE data portal at https://www.encodeproject.org. We downloaded the GWAS data from GWAS Catalog [81] at https://www.ebi.ac.uk/gwas/docs/file-downloads and the eQTL from GTEx portal [82] at https://www.gtexportal.org/home/downloads/adult-gtex/qtl.

Generation of TADs

TADs were generated in each cell line at various resolutions (5 K bp, 10 K bp, 15 K bp, 25 K bp, 50 K bp, and 100 K bp) using the Arrowhead algorithm of Juicer tools [83] version 2.17.00 with KR normalization method. We subsequently merged overlapping TADs from different resolutions into larger domains in each cell line using the bedtools2/2.29.0 merge command.

Identifications of CRMs within TADs

We identified the CRMs that overlap at least 1 bp with the merged TADs using the bedtools2/2.29.0 intersect command and ended up with 1,178,225 CRMs for further analysis.

CA feature score

For a sequence \(q\), we define its raw CA feature score as: 

$${F}_{raw}\left(q\right)={\sum }_{i=1}^{\text{N}}{r}_{i}{s}_{i}$$

where N is the number of peaks of CA mapping to \(q\) at least 50% of either one, \({r}_{i}\) the ratio of overlapping length between \(q\) and the \({i}_{th}\) peak of CA over the length of the \({i}_{th}\) peak of CA, \({s}_{i}\) the signal of the \({i}_{th}\) peak of CA quantified by MACS2 [84, 85]. We then normalized the raw feature score in each cell/tissue type by the min–max normalization, i.e.,

$$F\left(q\right)=\frac{{F}_{raw}\left(q\right)-{min(F}_{raw}\left(Q\right))}{max\left({F}_{raw}\left(Q\right)\right) -min({F}_{raw}\left(Q\right))},$$

where \(Q\) denotes all candidate sequences in the genome, \({min(F}_{raw}\left(Q\right))\) and \({max(F}_{raw}\left(Q\right))\) the minimum and maximum raw score of CA over Q in the cell/tissue type.

Quantification of gene expression levels

We computed log(TPM + 1) for a gene as its expression level in a cell/tissue type. If multiple RNA-seq datasets were available for a cell/tissue type, we first computed the average expression level (mean(TPM)) of the gene across all the datasets and then computed the log(mean(TPM) + 1).

Prediction of functional states of sequences

We employed a simple LR model using CA as the only feature to predict the functional state of the 1.2 M CRMs within the TADs in each of the 107 cell/tissue types. A CRM in a cell/tissue type is considered active if its activation probability exceeds 0.5.

Construction of positive and negative sets

In each of the 67 cell/tissue types with the required data available, we selected the CRMs as the positive set that overlap TF binding peaks. At the same time, we randomly selected predicted non-CRM candidates with matched numbers of the positive set as the negative set in the cell/tissue type. We pooled the positive and negative sets in all the cell/tissue types to construct a comprehensive positive set and a negative set. The resulting positive set contains 1,784,345 CRMs and the negative set contains the same numbers of non-CRM candidates. Thus, the positive sets and negative sets are well-balanced.

Model training and evaluation

Ten-fold cross-validation was conducted to train and assess model performance. The models were implemented using sci-kit learn v.0.24.2 and the code is available at https://github.com/sisyyuan/Target-Gene-Prediction.

Prediction of target genes for CRMs

We predict target genes for the CRMs and at the same time identify their functional types as enhancers and/or silencers by using a one-tailed Mann–Whitney U test, with Benjamini-Hochberg (B-H) multiple hypothesis correction, followed by validation of physical proximity between the CRMs and the target genes using Hi-C interaction data in any of the six cell/tissue types (Additional file 2: Table S6). Assuming that the regulation of a gene by a CRM remains consistent across various cell/tissue types, then when the CRM is activated in a cell/tissue type, the expression level of the gene will elevate if the CRM functions as an enhancer, or will decrease if the CRM functions as a silencer. We adopted a statistical approach to assess the significance of such correlation.

Step 1: Test correlation between CRM activity and gene expression using Mann–Whitney U test

For each pair of a CRM \(c\) and a gene \(g\) in a TAD \(D\), we perform two one-tailed Mann–Whitney U tests based on two datasets \({Exp}_{\text{active}}\left(g,c\right)\) and \({Exp}_{\text{inactive}}\left(g,c\right),\) where \({Exp}_{active}\) is the expression levels of \(g\) in a minimum of \(t\) cell/tissue types where \(c\) is active (group A), and \({Exp}_{\text{in}active}\) is the expression levels of \(g\) in a minimum of \(t\) cell/tissue types where \(c\) is inactive (group B). The first test is to evaluate whether \(c\) function as an enhancer of \(g\). Thus, the null hypothesis \({H}_{0}\) is: the median of group A is the same as or smaller than the median of group B, and the alternative hypothesis \({H}_{1}\) is: the median of group A is greater than the median of group B. The second test is to evaluate whether \(c\) function as a silencer of \(g\). Thus, the null hypothesis \({H}_{0}\) is: the median of group A is the same as or greater than that of group B, and the alternative hypothesis \({H}_{1}\) is: the median of group A is smaller than the median of group B. The Mann–Whitney U test function “mannwhitneyu” from the scipy.stat library in python3 was used to conduct the tests. Here, to ensure robust statistical analysis, we only consider the CRMs that are predicted to be active and inactive in at least \(t\) cell/tissue types. We applied the B-H procedure “fdr_bh” from the “statsmodels.stats.multitest” library in python3 to correct the p-values with an FDR of 0.1.

In this study, we performed the tests based on the predicted functional states of the CRMs and experimentally measured expression levels of the genes in the 107 cell/tissue types, and we set \(t\)=5 to ensure robust statistical analysis. In other words, we only considered the CRMs that were active and inactive in a minimum of five cell/tissue types.

Step 2: Test for physical contact

As correlation does not guarantee causal regulation relationship, for each significant CRM-gene correlation, we checked whether the CRM and the gene have physical contact. The Hi-C contact matrices were generated using the Straw algorithm [86] from the hicstraw library in python3 with SCALE normalization method at a resolution of 2000 bp. We define a CRM and a gene to have physical contact, if both the CRM and target gene can be mapped to the respective ends of varying numbers of pairs of Hi-C reads from any of the six cell/tissue types (Additional file 2: Table S6), with at least 1 bp overlap.

Closest neighbor assignment to CRMs

For each CRM under consideration, we assign the gene whose TSS is linearly closest to either end of the CRM as its CNA target gene. In cases where a CRM overlaps a gene’s TSS, we consider the gene as the CRM’s target genes. Obviously, when TSSs of multiple genes are located within a CRM, multiple target genes will be assigned to the CRM.

Predictions by the ABC model

For enhancers and target genes predicted by the ABC model, we lifted over their coordinates from hg19 to hg38 and then converted gene symbols to gene stable IDs. It is important to note that the ABC model includes only distal (non-promoter) elements in its predictions. In our comparison, we included proximal elements, as these can also function as distal enhancers for other genes.

The \(\tau\) index

We used the \(\tau\) index [63, 87] to measure the cell/tissue specificity of a gene based on its expression profiles across a panel of cell/tissue types, defined as:

$$\tau (g)=\frac{\sum_{i=1}^{N}(1-{{e}{\prime}}_{i})}{N-1} ,{e{\prime}}_{i}=\frac{{e}_{i}}{\genfrac{}{}{0pt}{}{\text{max}}{1\le j\le N}{e}_{j}},$$

where \({e}_{i}\) denotes the expression level of gene \(g\) in cell type \(i\), \(e^{\prime}_i\) the expression level of \(g\) normalized by the maximum of the expression levels of \(g \text{in}\) \(N\) cell/tissue types.

Heat maps of epigenetic marks

Heat maps of epigenetic mark signals were generated using the “EnrichedHeatmap” package [88] within R version 4.2.2. For a mark on each element in a set of sequences, we considered a 2 k bp extension on either side of the element, and computed the mean signal of the mark in a 100-bp sliding window along the element (lower heat maps). For a mark in a set of sequences, we also computed the mean of the mean signal of the mark in the 100-bp sliding windows across all the sequences in the set (upper line annotations). The elements in each set of sequences were organized based on their CA signal in descending order.

Enrichment analysis of eQTL and GWAS variants within different types of CRMs

The genomic positions of GWAS variants were first converted from hg19 to hg38 using the UCSC liftOver tool. Subsequently, both GWAS and eQTL variants were mapped to dual CRMs and pooled exclusive enhancers and silencers (exclusive elements). For each CRM type, the number of variant positions was counted, and enrichment was quantified as the number of variant positions per billion base pairs. Statistical significance of the differences in overlap between CRM types was assessed using Fisher’s exact test.

Data availability

All data generated or analysed during this study are included in this published article, its supplementary information files and publicly available repositories: Zenodo (https://zenodo.org/records/15628122)[89] and GitHub (https://github.com/sisyyuan/Target-Gene-Prediction)[90]. The predicted CRMs in human are available at http://cci-bioinfo.uncc.edu. The Hi-C data of six cell lines are available at the ENCODE data portal[64] with the accession IDs ENCFF685BLG [91] and ENCFF080DPJ [92] or 4D Nucleome[78] data portal with the accession IDs 4DNFI1UEG1HD [93], 4DNFIQYQWPF5[94], 4DNFICEGAHRC [95] and 4DNFICSTCJQZ [96]. The DNase-seq, ATAC-seq and TF ChIP-seq data of the 67 human cell/tissue types for training are available at http://cistrome.org/db. The ATAC-seq data and RNA-seq data of the 107 cell/tissue types for predicting are available at https://www.encodeproject.org/. The enhancer-gene links predicted by the ABC model [38] are available at https://osf.io/uhnb4/, and the non-redundant silencer-gene links predicted by PECA[69] are available at https://health.tsinghua.edu.cn/silencerdb/. The ENCODE-rE2G predictions and the 10,411 CRISPR-verified enhancer-gene regulations are available at https://www.encodeproject.org with the accession IDs ENCSR030MBR [97] and ENCSR998YDI [98]. The GWAS data are available at https://www.ebi.ac.uk/gwas/docs/file-downloads and the eQTL data are available at https://www.gtexportal.org/home/downloads/adult-gtex/qtl.

Abbreviations

CRM:

Cis-regulatory modules

CAPP:

Correlation and Physical Proximity

CA:

Chromatin accessibility

TF:

Transcription factor

TFBS:

Transcription factor binding site

CTCF:

CCCTC-binding factor

TAD:

Topologically associating domain

DNase-seq:

DNase I hypersensitive sites sequencing

ATAC-seq:

Assay for transposase-accessible chromatin using sequencing

ChIP-seq:

Chromatin immunoprecipitation sequencing

FDR:

False discovery rate

ChIA-PET:

Chromatin interaction analysis with paired-end-tag sequencing

CRISPR:

Clustered regularly interspaced short palindromic repeats

CRISPRa:

CRISPR activation

CRISPRi:

CRISPR interference

TSS:

Transcription start site

CNA:

Closest neighbor assignment

DHS:

DNase I hypersensitive site

LR:

Logistic regression

AUROC:

Area under the receiver operator characteristic curve

eQTL:

Expression quantitative trait loci

GWAS:

Genome-wide association studies

TPM:

Transcript per million

sCRN:

Static cis-regulatory network

aCRN:

Active cis-regulatory network

ABC:

Activity-by-Contact

RE:

Regulatory element

RE-G:

RE-gene

ERo:

Enhancer-RE overlapping

Enhancer-G:

Enhancer-gene

RE-Gm:

RE-G link with matched gene

Enhancer-Gm:

Enhancer-G link with matched gene

RE-Gnm:

RE-G link with unmatched gene

Enhancer-Gnm:

Enhancer-G link with unmatched gene

ERno-Co:

Enhancer-RE not overlapping, but CRM-RE overlapping

CRno:

CRM-RE not overlapping

ERno:

Enhancer-RE not overlapping

cCRE:

Candidate cis-regulatory element

PECA:

Paired expression and chromatin accessibility

Silencer-G:

Silencer-gene

SRo:

Silencer-RE overlapping

Silencer-Gm:

Silencer-G link with matched gene

Silencer-Gnm:

Silencer-G link with unmatched gene

SRno-Co:

Silencer-RE not overlapping, but CRM-RE overlapping

SRno:

Silencer-RE not overlapping

B-H:

Benjamini-Hochberg

References

  1. Huang D, Ovcharenko I. Enhancer-silencer transitions in the human genome. Genome Res. 2022;32(3):437–48.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Erceg J, Pakozdi T, Marco-Ferreres R, Ghavi-Helm Y, Girardot C, Bracken AP, et al. Dual functionality of cis-regulatory elements as developmental enhancers and Polycomb response elements. Gene Dev. 2017;31(6):590–602.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Hardison RC, Taylor J. Genomic approaches towards finding cis-regulatory modules in animals. Nat Rev Genet. 2012;13(7):469–83.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol. 2015;4(2):59–84.

    Article  CAS  PubMed  Google Scholar 

  5. Chen D, Lei EP. Function and regulation of chromatin insulators in dynamic genome organization. Curr Opin Cell Biol. 2019;58:61–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Liu L, Jin G, Zhou X. Modeling the relationship of epigenetic modifications to transcription factor binding. Nucleic Acids Res. 2015;43(8):3873–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Wang M, Zhang K, Ngo V, Liu C, Fan S, Whitaker JW, et al. Identification of DNA motifs that regulate DNA methylation. Nucleic Acids Res. 2019;47(13):6753–68.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Hoellinger T, Mestre C, Aschard H, Le Goff W, Foissac S, Faraut T, et al. Enhancer/gene relationships: Need for more reliable genome-wide reference sets. Front Bioinform. 2023;3:1092853.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132(2):311–22.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee BK, et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 2011;21(10):1757–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006;16(1):123–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10(12):1213–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129(4):823–37.

    Article  CAS  PubMed  Google Scholar 

  14. Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform. 2021;3(2):lqab052.

  15. Catarino RR, Stark A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 2018;32(3–4):202–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Kheradpour P, Ernst J, Melnikov A, Rogov P, Wang L, Zhang X, et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 2013;23(5):800–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Dogan N, Wu W, Morrissey CS, Chen KB, Stonestrom A, Long M, et al. Occupancy by key transcription factors is a more accurate predictor of enhancer activity than histone modifications or chromatin accessibility. Epigenetics Chromatin. 2015;8:16.

    Article  PubMed  PubMed Central  Google Scholar 

  18. Arbel H, Basu S, Fisher WW, Hammonds AS, Wan KH, Park S, et al. Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy. Proc Natl Acad Sci U S A. 2019;116(3):900–8.

    Article  CAS  PubMed  Google Scholar 

  19. Kwasnieski JC, Fiore C, Chaudhari HG, Cohen BA. High-throughput functional testing of ENCODE segmentation predictions. Genome Res. 2014;24(10):1595–602.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Kleftogiannis D, Kalnis P, Bajic VB. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res. 2015;43(1): e6.

    Article  PubMed  Google Scholar 

  21. Podsiadlo A, Wrzesien M, Paja W, Rudnicki W, Wilczynski B. Active enhancer positions can be accurately predicted from chromatin marks and collective sequence motif data. BMC Syst Biol. 2013;7(Suppl 6):S16.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods. 2012;58(3):268–76.

    Article  CAS  PubMed  Google Scholar 

  23. Li G, Cai L, Chang H, Hong P, Zhou Q, Kulakova EV, et al. Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics. 2014;15 Suppl 12(Suppl 12):S11.

  24. Moore JE, Pratt HE, Purcaro MJ, Weng Z. A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods. Genome Biol. 2020;21(1):17.

    Article  PubMed  PubMed Central  Google Scholar 

  25. O’Connor T, Grant CE, Boden M, Bailey TL. T-Gene: improved target gene prediction. Bioinformatics. 2020;36(12):3902–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Ron G, Globerson Y, Moran D, Kaplan T. Promoter-enhancer interactions identified from Hi-C data using probabilistic models and hierarchical topological domains. Nat Commun. 2017;8(1):2237.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Fortin J-P, Hansen KD. Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data. Genome Biol. 2015;16(1):180.

    Article  PubMed  PubMed Central  Google Scholar 

  29. Szabo Q, Bantignies F, Cavalli G. Principles of genome folding into topologically associating domains. Sci Adv. 2019;5(4).

  30. Davidson IF, Bauer B, Goetz D, Tang W, Wutz G, Peters JM. DNA loop extrusion by human cohesin. Science. 2019;366(6471):1338–45.

    Article  CAS  PubMed  Google Scholar 

  31. Fudenberg G, Abdennur N, Imakaev M, Goloborodko A, Mirny LA. Emerging Evidence of Chromosome Folding by Loop Extrusion. Cold Spring Harb Symp Quant Biol. 2017;82:45–55.

    Article  PubMed  Google Scholar 

  32. Roayaei Ardakany A, Gezer HT, Lonardi S, Ay F. Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation. Genome Biol. 2020;21(1):256.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Canver MC, Bauer DE, Orkin SH. Functional interrogation of non-coding DNA through CRISPR genome editing. Methods. 2017;121–122:118–29.

    Article  PubMed  PubMed Central  Google Scholar 

  34. Joung J, Engreitz JM, Konermann S, Abudayyeh OO, Verdine VK, Aguet F, et al. Genome-scale activation screen identifies a lncRNA locus regulating a gene neighbourhood. Nature. 2017;548(7667):343–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Liu SJ, Horlbeck MA, Cho SW, Birk HS, Malatesta M, He D, et al. CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells. Science. 2017;355(6320).

  36. Stuart WD, Guo M, Fink-Baldauf IM, Coleman AM, Clancy JP, Mall MA, et al. CRISPRi-mediated functional analysis of lung disease-associated loci at non-coding regions. NAR Genom Bioinform. 2020;2(2):lqaa036.

  37. Tian R, Gachechiladze MA, Ludwig CH, Laurie MT, Hong JY, Nathaniel D, et al. CRISPR Interference-Based Platform for Multimodal Genetic Screens in Human iPSC-Derived Neurons. Neuron. 2019;104(2):239–55 e12.

  38. Fulco CP, Nasser J, Jones TR, Munson G, Bergman DT, Subramanian V, et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. NatGenet. 2019;51(12):1664–9.

    CAS  Google Scholar 

  39. Sikora-Wohlfeld W, Ackermann M, Christodoulou EG, Singaravelu K, Beyer A. Assessing computational methods for transcription factor target gene identification based on ChIP-seq data. PLoS Comput Biol. 2013;9(11): e1003342.

    Article  PubMed  PubMed Central  Google Scholar 

  40. Fishilevich S, Nudel R, Rappaport N, Hadar R, Plaschkes I, Iny Stein T, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford). 2017;2017.

  41. Sheffield NC, Thurman RE, Song L, Safi A, Stamatoyannopoulos JA, Lenhard B, et al. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Res. 2013;23(5):777–88.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Silva TC, Coetzee SG, Gull N, Yao L, Hazelett DJ, Noushmehr H, et al. ELMER vol 2: an R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles. Bioinformatics. 2019;35(11):1974–7.

    Article  CAS  PubMed  Google Scholar 

  43. He B, Chen C, Teng L, Tan K. Global view of enhancer-promoter interactome in human cells. Proc Natl Acad Sci U S A. 2014;111(21):E2191–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Hariprakash JM, Ferrari F. Computational Biology Solutions to Identify Enhancers-target Gene Pairs. Comput Struct Biotechnol J. 2019;17:821–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Kielbasa SM, Bluthgen N, Fahling M, Mrowka R. Targetfinder.org: a resource for systematic discovery of transcription factor target genes. Nucleic Acids Res. 2010;38(Web Server issue):W233–8.

  46. Zhao C, Li X, Hu H. PETModule: a motif module based approach for enhancer target gene prediction. Sci Rep. 2016;6:30043.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Hafez D, Karabacak A, Krueger S, Hwang YC, Wang LS, Zinzen RP, et al. McEnhancer: predicting gene expression via semi-supervised assignment of enhancers to target genes. Genome Biol. 2017;18(1):199.

    Article  PubMed  PubMed Central  Google Scholar 

  48. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. Enhancers: five essential questions. Nat Rev Genet. 2013;14(4):288–95.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Ferretti E, Cambronero F, Tumpel S, Longobardi E, Wiedemann LM, Blasi F, et al. Hoxb1 enhancer and control of rhombomere 4 expression: Complex interplay between PREP1-PBX1-HOXB1 binding sites. Mol Cell Biol. 2005;25(19):8541–52.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489(7414):75–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Aran D, Sabato S, Hellman A. DNA methylation of distal regulatory sites characterizes dysregulation of cancer genes. Genome Biol. 2013;14(3):R21.

    Article  PubMed  PubMed Central  Google Scholar 

  52. Ni P, Moe J, Su Z. Accurate prediction of functional states of cis-regulatory modules reveals common epigenetic rules in humans and mice. BMC Biol. 2022;20(1):221.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Yuan S, Ni P, Su Z. Simultaneous Prediction of Functional States and Types of <em>cis</em>-regulatory Modules Reveals Their Prevalent Dual Uses as Enhancers and Silencers. bioRxiv. 2024:2024.05.07.592879.

  54. Acemel RD, Maeso I, Gomez-Skarmeta JL. Topologically associated domains: a successful scaffold for the evolution of gene regulation in animals. Wiley Interdiscip Rev Dev Biol. 2017;6(3).

  55. Bolt CC, Duboule D. The regulatory landscapes of developmental genes. Development. 2020;147(3).

  56. Furlong EEM, Levine M. Developmental enhancers and chromosome topology. Science. 2018;361(6409):1341–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Symmons O, Uslu VV, Tsujimura T, Ruf S, Nassari S, Schwarzer W, et al. Functional and topological characteristics of mammalian regulatory domains. Genome Res. 2014;24(3):390–400.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Krefting J, Andrade-Navarro MA, Ibn-Salem J. Evolutionary stability of topologically associating domains is associated with conserved gene regulation. Bmc Biology. 2018;16.

  59. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping (vol 159, pg 1665, 2014). Cell. 2015;162(3):687–8.

    Article  CAS  Google Scholar 

  61. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20(1):110–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Schoenfelder S, Fraser P. Long-range enhancer-promoter contacts in gene expression control. Nat Rev Genet. 2019;20(8):437–55.

    Article  CAS  PubMed  Google Scholar 

  63. Kryuchkova-Mostacci N, Robinson-Rechavi M. A benchmark of gene expression tissue-specificity metrics. Brief Bioinform. 2017;18(2):205–14.

    CAS  PubMed  Google Scholar 

  64. Luo Y, Hitz BC, Gabdank I, Hilton JA, Kagda MS, Lam B, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020;48(D1):D882–9.

    Article  CAS  PubMed  Google Scholar 

  65. Gschwind AR, Mualim KS, Karbalayghareh A, Sheth MU, Dey KK, Jagoda E, et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv. 2023.

  66. Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, et al. Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593(7858):238–43.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  67. Gasperini M, Hill AJ, McFaline-Figueroa JL, Martin B, Kim S, Zhang MD, et al. A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens. Cell. 2019;176(6):1516.

    Article  CAS  PubMed  Google Scholar 

  68. Zeng W, Chen S, Cui X, Chen X, Gao Z, Jiang R. SilencerDB: a comprehensive database of silencers. Nucleic Acids Res. 2021;49(D1):D221–8.

    Article  CAS  PubMed  Google Scholar 

  69. Duren Z, Chen X, Jiang R, Wang Y, Wong WH. Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci U S A. 2017;114(25):E4914–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  70. Cai Y, Zhang Y, Loh YP, Tng JQ, Lim MC, Cao Z, et al. H3K27me3-rich genomic regions can function as silencers to repress gene expression via chromatin interactions. Nat Commun. 2021;12(1):719.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  71. Huang D, Petrykowska HM, Miller BF, Elnitski L, Ovcharenko I. Identification of human silencers by correlating cross-tissue epigenetic profiles and gene expression. Genome Res. 2019;29(4):657–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. Riethoven JJ. Regulatory regions in DNA: promoters, enhancers, silencers, and insulators. Methods Mol Biol. 2010;674:33–42.

    Article  CAS  PubMed  Google Scholar 

  73. Doni Jayavelu N, Jajodia A, Mishra A, Hawkins RD. Candidate silencer elements for the human and mouse genomes. Nat Commun. 2020;11(1):1061.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Halfon MS. Silencers, Enhancers, and the Multifunctional Regulatory Genome. Trends Genet. 2020;36(3):149–51.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  75. Pang BX, van Weerd JH, Hamoen FL, Snyder MP. Identification of non-coding silencer elements and their regulation of gene expression. Nat Rev Mol Cell Bio. 2023;24(6):383–95.

    Article  CAS  Google Scholar 

  76. Consortium EP, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583(7818):699–710.

    Article  Google Scholar 

  77. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  78. Dekker J, Belmont AS, Guttman M, Leshyk VO, Lis JT, Lomvardas S, et al. The 4D nucleome project. Nature. 2017;549(7671):219–26.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  79. Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, et al. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res. 2017;45(D1):D658–62.

    Article  CAS  PubMed  Google Scholar 

  80. Zheng R, Wan C, Mei S, Qin Q, Wu Q, Sun H, et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 2019;47(D1):D729–35.

    Article  CAS  PubMed  Google Scholar 

  81. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 2025;53(D1):D998–1005.

    Article  PubMed  Google Scholar 

  82. Consortium GT. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.

    Article  Google Scholar 

  83. Durand NC, Shamim MS, Machol I, Rao SS, Huntley MH, Lander ES, et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 2016;3(1):95–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  84. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.

    Article  PubMed  PubMed Central  Google Scholar 

  85. Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP-seq enrichment using MACS. Nat Protoc. 2012;7(9):1728–40.

    Article  CAS  PubMed  Google Scholar 

  86. Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016;3(1):99–101.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  87. Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005;21(5):650–9.

    Article  CAS  PubMed  Google Scholar 

  88. Gu Z, Eils R, Schlesner M, Ishaque N. EnrichedHeatmap: an R/Bioconductor package for comprehensive visualization of genomic signal associations. BMC Genomics. 2018;19(1):234.

    Article  PubMed  PubMed Central  Google Scholar 

  89. Yuan S. Data for “Prediction of target genes and functional types of cis-regulatory modules in the human genome reveals their distinct properties”. Zenodo https://zenodoorg/records/15628122.2025.

  90. Yuan S. Data for “Prediction of target genes and functional types of cis-regulatory modules in the human genome reveals their distinct properties”. Github https://githubcom/sisyyuan/Target-Gene-Prediction.2025.

  91. Paul Sud JMC. ENCFF685BLG (hic). ENCODE Processing Pipeline https://wwwencodeprojectorg/files/ENCFF685BLG/. 2021.

  92. Paul Sud JMC. ENCFF080DPJ (hic). ENCODE Processing Pipeline https://wwwencodeprojectorg/files/ENCFF080DPJ/. 2021.

  93. Aiden EL. in situ Hi-C on gm12878 with MboI and bio-dATP. 4DN Data Portal https://data4dnucleomeorg/files-processed/4DNFI1UEG1HD/.2018.

  94. Dekker J. in situ Hi-C on H1-hESC (Tier 1) with DpnII. 4DN Data Portal https://data4dnucleomeorg/files-processed/4DNFIQYQWPF5.2018.

  95. Dekker J. HiC experiment done on HeLa-S3. 4DN Data Portal https://data4dnucleomeorg/files-processed/4DNFICEGAHRC. 2018.

  96. Dekker J. HiC experiment done on HepG2. 4DN Data Portal https://data4dnucleomeorg/files-processed/4DNFICSTCJQZ. 2018.

  97. Engreitz J. ENCODE-rE2G predictions of enhancer-gene regulatory interactions for K562; Donor: ENCDO000AAD. ENCODE https://wwwencodeprojectorg/annotations/ENCSR030MBR/.2023.

  98. Engreitz J. Combined CRISPR dataset created from harmonizing data from Nasser et al., 2021, Gasperini et al., 2019 and Schraivogel et al., 2020. ENCODE https://wwwencodeprojectorg/annotations/ENCSR998YDI/.2023.

Download references

Acknowledgements

We thank the ENCODE, 4DN, CISTROME, SilencerDB, GWAS Catalog, and GTEx projects for providing data used in this paper.

Funding

The work was supported by the US National Science Foundation (DBI-1661332). The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

National Science Foundation,DBI-1661332,DBI-1661332,DBI-1661332

Author information

Authors and Affiliations

Authors

Contributions

ZS and SY designed the study. SY performed the bioinformatic analysis with contributions from PN and ZS. SY and ZS wrote the manuscript with input from all co-authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhengchang Su.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

12915_2025_2313_MOESM1_ESM.docx

Additional file 1. Figure S1. Coverage of the genome, CRMs and genes by TADs identified at various resolutions and by merged TADs in different cell lines. Figure S2. ROC curves of the LR model with CA as the sole feature. Figure S3. Effect of different Hi-C contact thresholds on the predictions of CAPP. Figure S4. Density curves of PhyloP scores of the three categories of CRMs predicted by CAPP: dual CRMs, exclusive enhancers sand exclusive silencers. Figure S5. Numbers of Intervening genes between CRMs and their target genes. Figure S6. Comparison of our predicted top-ranked 425,350 enhancer-G links with the same number of RE-G links predicted by the ABC model across different cell/tissue types. Figure S7. Comparison of our predicted top-ranked 41,405 enhancer-G links with the same number of RE-G links predicted by the ABC model in the K562 cells. Figure S8. Comparison of the predicted top-ranked 82,810 RE-G links by the rE2G with the same number of links predicted by the ABC modeland our model CAPPin the K562 cells. Figure S9. Numbers of CRISPR-verified predictions of the top-ranked predictions of the ABC model, ENCODE-rE2G and CAPP. Figure S10. Density curves of PhyloP scores of selected CRMs that were predicted active and inactive in at least five cell/tissue types and the remaining unselected CRMs

12915_2025_2313_MOESM2_ESM.xlsx

Additional file 2. Table S1. Number of Predicted Active CRMs in Each of the 107 Human Cell/Tissue Types. Table S2. Overlaps between eQTL variants and the different types of CRMs. Table S3. Overlaps between GWAS SNPs and the different types of CRMs. Table S4. Number of TADs and Regulatory Network Components on Each Human Chromosome. Table S5. Definitions and abbreviations of the terms for regulatory elements and regulation links. Table S6. Summary of Hi-C Data in Six Cell Lines. Table S7. Summary of Chromatin Accessibility Data from 67 Cell/Tissue Types for Cistrome for Training and Evaluation. Table S8. Summary of Transcription Factor ChIP-seq Data from Cistrome for Training and Evaluation. Table S9. Summary of ATAC-seq and RNA-seq Data in 107 Cell/Tissue Types from the Encode Portal

Additional file 3. Figure S1. The sub-static and sub-active cis-regulatory networks on each chromosome.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yuan, S., Ni, P. & Su, Z. Prediction of target genes and functional types of cis-regulatory modules in the human genome reveals their distinct properties. BMC Biol 23, 211 (2025). https://doi.org/10.1186/s12915-025-02313-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • DOI: https://doi.org/10.1186/s12915-025-02313-9

Keywords