A spectrum of nonsense-mediated mRNA decay efficiency along the degree of mutational constraint

Kim, Young-gon; Kang, Hyunju; Lee, Beomki; Jang, Hyeok-Jae; Park, Jong-ho; Ha, Changhee; Park, Hogun; Kim, Jong-Won

doi:10.1038/s42003-024-07136-y

Download PDF

Article
Open access
Published: 07 November 2024

A spectrum of nonsense-mediated mRNA decay efficiency along the degree of mutational constraint

Communications Biology volume 7, Article number: 1461 (2024) Cite this article

5929 Accesses
5 Citations
Metrics details

Subjects

Abstract

Despite its importance for regulating gene expression, nonsense-mediated mRNA decay (NMD) remains poorly understood. Here, we extend the findings of a previous landmark study that proposed several factors associated with NMD efficiency using matched genome and transcriptome data from The Cancer Genome Atlas Program (TCGA) by incorporating additional data including Genotype-Tissue Expression (GTEx), gnomAD, and metrics for mutational constraints. Factors affecting NMD efficiency are analyzed using an allele-specific expression (ASE)-based measure to reduce noise caused by random variations. Additionally, by combining our data with the updated allele frequency database of gnomAD, we demonstrate the spectrum of NMD efficiency according to the degree of gene-level mutational constraints (degree of a gene-tolerating loss-of-function variants). The NMD prediction model, trained on TCGA data, shows that gene-level mutational constraint is an important predictor of NMD efficiency. Findings of this study suggest the potential role of NMD on shaping the landscape of mutational constraints.

Cellular variability of nonsense-mediated mRNA decay

Article Open access 10 December 2021

Nonsense-mediated RNA decay: an emerging modulator of malignancy

Article 27 May 2022

Global impact of unproductive splicing on human gene expression

Article Open access 02 September 2024

Introduction

Nonsense-mediated mRNA decay (NMD) is a biological mechanism that selectively degrades mRNAs harboring premature termination codons (PTCs)^1,2. The exon junction complex (EJC) model has long been accepted as the major mechanism underlying NMD initiation. Based on this model, two canonical rules, namely the “last-exon rule” and the “50 nt rule”, were widely accepted to define the conditions of escaping NMD. According to these rules, PTCs in the last exon of a gene or in the last 50 nt of the penultimate exon are not expected to trigger NMD.

Some studies have aimed to further analyze the mechanisms of NMD efficiency, and a landmark study was published by Lindeboom et al.^3,4. Based on the assumption that the human cancer genome and its matched transcriptome provide a closer look into the true nature of NMD due to the abundance and random distribution of PTC-inducing variants in cancer, they utilized somatic variants in The Cancer Genome Atlas (TCGA) to model NMD efficiency based on multiple parameters. In these studies, NMD efficiency was quantified as the ratio of the mRNA expression level of a PTC-bearing transcript to the median mRNA expression level of transcripts without PTCs in samples of the same tumor type. The expression level was measured as transcripts per million mapped reads (TPM). They incorporated all potential factors into the random forest regression (RFR) model and identified factors affecting NMD efficiency, such as the proximity of the PTC to the start codon and the length of the PTC-containing exon.

In two other studies by Rivas et al.⁵ and Teran et al.⁶, the NMD efficiency of germline nonsense variants in the Genotype-Tissue Expression (GTEx) data was measured using allele-specific expression (ASE). ASE has been more widely used in recent studies on NMD than has the TPM-based method described above^{5,6,7,8,9,10,11,12}. According to Rivas et al.⁵, the tissue specificity of NMD efficiency was suggested based on the varying NMD efficiency observed across tissues. However, in a study performed by Teran et al.⁶, NMD efficiency was consistent across tissues, and allele frequency (AF) was suggested as a parameter for NMD efficiency. Another recent study that utilized single-molecule imaging suggested that additional factors affect NMD efficiency, such as the distance from the PTC to the EJC, the number of EJCs upstream and downstream of the PTC, and the coding sequences around PTCs.

As hypothesized by Lindeboom et al.³, it is reasonable to assume that a database of somatic variants could provide a more abundant and pure source for NMD investigation, as the landscape of PTC-inducing variants in a germline database must have been shaped by evolutionary pressure¹³. As the previous study that trained the NMD prediction model from somatic variants used a TPM-based NMD measurement³, the first purpose of this study was to train a more robust prediction model for NMD efficiency using an ASE-based method that has been more widely used for measuring the degree of NMD.

Second, based on the reported differences in NMD efficiency according to AF in GTEx variants, we investigated the effect of mutational constraints on NMD efficiency. Mutational constraints are defined as the depletion of variants through natural selection. A gene is regarded as highly constrained for loss-of-function (LoF) variants if the frequency of LoF variants is considerably lower than expected due to its detrimental effect on survival^13,14. Examples of measures for mutational constraints include the probability of being LoF-intolerant (pLI)¹⁵ and the LoF-observed/expected upper-bound fraction (LOEUF)¹³. A higher pLI, or equivalently, lower LOEUF of a gene indicates a higher mutational constraint of the gene, thus suggesting an adverse biological effect of the gene. In this study, NMD efficiency was compared across AF-stratified and LOEUF-stratified subsets of TCGA and GTEx. A low LOEUF (close to 0) indicates depletion (high constraint) of LoF variants in the gene in the general population, whereas a high LOEUF (close to 1) indicates abundance (low constraint) of LoF variants.

Third, the characteristics of known factors of NMD efficiency, including the last exon rule, start proximity rule, 3’UTR length, and tissue specificity, were reviewed using the NMD efficiency model constructed in this study by evaluating the importance of each feature for predicting NMD efficiency.

Results

The spectrum of NMD efficiency and the capability of NMD rules for predicting NMD efficiency in somatic variants, rare germline variants, and common germline variants

From the TCGA data, 4257 nonsense variants were collected from 9235 samples of 32 tumor types (Supplementary Data 1). The NMD efficiency distribution of the nonsense variants in TCGA is presented in Fig. 1. Taking only the canonical rules of NMD (the “last exon rule” and the “last 50nt of the penultimate exon rule”) into account, the NMD efficiency of the predicted NMD-sensitive variants exhibited wide variability, while that of the predicted NMD-escaping variants was concentrated around zero (Fig. 1A–C). This is contrary to the previous notion that deviations from canonical rules are common in both predicted NMD-escaping and -sensitive regions. The results of a previous study using TCGA data revealed a wider distribution of NMD efficiency, both in predicted NMD-escaping variants and predicted NMD-sensitive variants, and this was likely due to the high random variation caused by using a TPM-based measure of NMD efficiency³. The benefit of incorporating two additional rules (the “start proximal rule” and the “long exon rule,” Fig. 1D–F) was marginal in terms of the discriminative power that is the capability of separating NMD-escaping and NMD-sensitive variants compared to that of the combination of the two canonical rules.

**Fig. 1: Distribution plots of NMD efficiency observed from TCGA variants.**

The final GTEx dataset was comprised of 18,055 variants from 54 tissue types (Supplementary Data 2). In GTEx variants, the overall NMD efficiency was lower than that in TCGA variants, and all NMD rules exhibited low discriminative power (Fig. 2). In particular, in a subset of common germline variants (AF [0.01, ∞)), the overall NMD efficiency was the lowest (Supplementary Fig. 1). Additionally, in this subset, the NMD efficiency distributions of predicted NMD-escaping variants and predicted NMD-sensitive variants exhibited an almost complete overlap by all four rules (the “last exon,” “last 50nt of the penultimate exon,” “start proximal,” and “long exon” rules), indicating zero discriminative power of the NMD rules. In contrast, NMD efficiency was higher, and NMD rules exhibited higher discriminative power in the AF [0] group of GTEx (Supplementary Fig. 2). In short, the performance of NMD rules for predicting NMD efficiency was the highest for somatic variants, and this was followed by rare germline variants and common germline variants.

**Fig. 2: Distribution plots of NMD efficiency observed from GTEx variants.**

A spectrum of NMD efficiencies along the gene-level mutational constraint

Figure 3A–D presents the NMD efficiency in the subsets of TCGA and GTEx variants grouped according to either AF or LOEUF. Among the four combinations created using the two different datasets and two different grouping strategies, the increasing trend of NMD efficiency with increasing mutational constraints (decreasing AF and LOEUF) was most evident in the NMD efficiency of TCGA variants across the LOEUF deciles (Fig. 3C). In the TCGA dataset, nonsense variants from genes with lower LOEUF exhibited higher NMD efficiency, whereas those from genes with higher LOEUF exhibited diminished NMD efficiency (Fig. 3C). Conversely, the tendency for higher NMD efficiency with a lower LOEUF decile was not evident in the GTEx dataset (Fig. 3D). This can be explained by the depletion of germline nonsense variants with high NMD efficiency and low LOEUF that would have been located in the upper right section of Fig. 3D if not depleted. The depletion of low-LOEUF GTEx variants is evident from the number of variants presented in Fig. 3D.

**Fig. 3: The effect of AF and gene-level mutational constraint (LOEUF) on NMD efficiency.**

As LOEUF was derived from the same database as AF, to avoid the potential confounding effects of AF on LOEUF the AF [0] subsets of TCGA and GTEx (Fig. 3E and F) were additionally analyzed to reveal the sole effect of LOEUF on NMD. The relationship between LOEUF and NMD efficiency was reproduced in the AF [0] subset of TCGA variants (Fig. 3E). Supplementary Fig. 3 and 4 present NMD efficiency distribution plots of the low- and high-LOEUF TCGA subsets, respectively. When rules to predict NMD escape were applied to divide the dataset into two groups, a difference in NMD efficiency was more clearly observed in the low-LOEUF subset.

The correlation coefficients between LOEUF and NMD efficiency measured in TCGA and MMRF-TARGET data are presented in Supplementary Data 3. Although the correlation was significant in TCGA data across all tumor DOC thresholds applied, in the MMRF-TARGET data, the correlation was generally weaker and significant only when a higher threshold for tumor DOC was applied.

Based on the speculation that LOEUF may possess a lower accuracy for measuring gene-level mutational constraints for smaller genes¹⁶, the correlation between LOEUF and NMD efficiency was additionally analyzed after excluding variants from smaller genes. However, the degree of correlation was not enhanced by filtering variants from smaller genes (Supplementary Data 4). Additionally, the GeneBayes metrics for gene-level mutational constraints provided in the publication mentioned above¹⁶ did not exhibit a higher correlation with NMD efficiency than that of LOEUF in our TCGA data (Supplementary Data 5).

ASE-based model explaining the greater portion of the total variance of NMD efficiency compared to that of the TPM-based model

Based on the observation that NMD efficiency is more distinguishable according to the NMD rules in TCGA data, an RFR model was constructed using TCGA data. Factors known to affect NMD efficiency, including canonical and recently reported rules, were collected from existing studies^3,4,17. Along with the factors affecting NMD efficiency, LOEUF was incorporated as a predictor variable in the RFR model. The final RFR model incorporated 14 different features (Table 1) and explained 43.7% of the total variance. The non-random variance calculated in the TCGA dataset was 46% (Supplementary Fig. 5), thus making the proportion of explained variance 95% (43.7/46.0).

Table 1 Selected features for the model construction and their relevance to NMD efficiency

Full size table

The downstream exon count possessed the highest importance, whereas the importance of the last exon was relatively low, and this was likely due to redundancy between these two features (the last exons have zero downstream exon count). The importance of the Last 50nt penultimate exon was also low, and this could be attributed to the small number of variants belonging to this category³. The 3’UTR length exhibited positive correlation with NMD efficiency and the 4th highest importance in the model. The importance of LOEUF was eighth among the 14 features, and this was higher than that of length of transcript, mRNA half-life, 5’UTR length, upstream exon count, AF, and last 50nt penultimate exon.

ASE-based and TPM-based models were directly compared using the same datasets and combinations of filtering parameters (Supplementary Data 6). Deriving the TPM-based NMD efficiency according to the methods described in a previous study³ required an additional filtering process to reduce the number of variants. The explainable variance was significantly higher in the ASE-based model for all parameter combinations.

Comparison between predicted and observed NMD efficiencies

In addition to the observed NMD efficiency measured with ASE, the NMD efficiency can be predicted from the RFR model using TCGA data. The observed and predicted NMD efficiencies of the variants in different datasets are presented in Fig. 4A (Supplementary Data 7). The predicted values were considerably higher in all AF subsets of the GTEx data, thus suggesting the depletion of variants with high NMD efficiency in the general population and preferential negative selection of variants with higher than predicted NMD efficiency. The difference was smallest in the AF [0] subset of GTEx, and this was likely due to the lesser effect of negative selection, as the variants in this subset were younger. This result is in agreement with those of previous studies that reported higher NMD efficiencies in lower AF subsets of GTEx⁶. However, the predicted and observed NMD efficiencies of the rarest variants in GTEx were lower than those of the somatic variants in TCGA, thus suggesting a non-negligible cumulative effect of evolutionary pressure on these variants.

**Fig. 4: Observed (from RNAseq) and predicted (by the RFR model trained in this study) NMD efficiency.**

The observed NMD efficiency of an independent set of somatic variants (MMRF-TARGET data) was significantly lower than predicted. However, the correlation between the predicted and observed NMD efficiencies was much higher than that of the GTEx data (Fig. 4B–D). The overall NMD efficiency of variants in the MMRF-TARGET data was lower than that of the TCGA variants, and all NMD rules exhibited low discriminative power (Supplementary Figs. 6 and 7).

A set of curated de novo variants, the de novo DB, was used as an additional set of variants free from evolutionary pressure. In contrast to the AF [0] subset of GTEx, the predicted NMD efficiency of the de novo variants was as high as that of the somatic variants in TCGA, presumably due to the lack of influence of natural selection. The observed NMD efficiency could not be derived from these data due to the lack of RNA-seq results.

Reduced NMD efficiency in start proximal PTCs and the potential effect of translation re-initiation

The NMD efficiency, according to the PTC-bearing regions (<150nt to the start codon, the rest, the last 50nt of the penultimate exon, and the last exon), is presented in Supplementary Fig. 8. NMD efficiency according to TCGA data indicated a higher degree of discrimination across regions than that of GTEx. Supplementary Fig. 9 presents NMD efficiency trends in the proximal region of the start codon. A gradual increase in NMD efficiency from the start codon to 200nt was generally observed in both TCGA and GTEx data, and this is consistent with previous findings, although unexpected patterns were observed in common germline variants, likely due to the limited number of samples used for drawing trends in the 200nt range (n = 419 and 517 for Supplementary Fig. 9B and C, respectively).

Although other mechanisms, including transcript stabilization by the interaction between poly(A)-binding protein C1 (PABPC1) and translation termination machinery, have been proposed, translation re-initiation from the downstream start codon is considered the main mechanism that explains the reduced NMD efficiency of proximal start PTCs. To determine the effect of translation re-initiation capability on NMD efficiency, we compared the NMD efficiency of start proximal nonsense variants (<150nt from the start codon) according to the presence of downstream in-frame start codons. Although statistically significant differences were observed only in several subsets of GTEx (Supplementary Fig. 10B, D), transcripts with downstream AUG generally exhibited diminished NMD efficiency, including TCGA data.

Cancer type in TCGA is not important for predicting NMD efficiency

The NMD efficiencies for each cancer type in TCGA and each tissue type in GTEx are presented in Fig. 5 and Supplementary Data 8 and 9, respectively. To exclude non-significant outliers, cancers, and tissue types consisting of fewer than 100 variants were excluded (Fig. 5). Considering the effect of evolutionary pressure on variants present in GTEx, TCGA results may provide a better estimate of NMD efficiency across tissues, which has not been used in previous studies. Adding the cancer type feature to our RFR model that was based on TCGA data did not improve model performance (R² from 43.7% to 43.1%). Thus, cancer type does not appear to be an efficient parameter for predicting NMD efficiency.

**Fig. 5: NMD efficiencies of each cancer type in TCGA and tissue type in GTEx that consist of more than 100 variants.**

Discussion

In this study, the landscapes of nonsense variants in TCGA and GTEx were compared, and the factors affecting NMD efficiency were analyzed using both datasets. The profound effect of natural selection on GTEx was clearly revealed, indicating that TCGA is a better source for studying NMD independent of evolutionary pressure. The relationship between NMD efficiency and LOEUF observed in TCGA was absent in GTEx, and this is plausible considering the selective depletion of variants with high NMD efficiency in GTEx during natural selection.

Using an ASE-based measure of NMD efficiency, we discovered a spectrum of NMD efficiencies along gene-level mutational constraints represented as LOEUF. In contrast, the effect of AF on the NMD efficiency was not as profound. The relationship between LOEUF and NMD efficiency remained observable, even after removing the potential confounding effect of AF on LOEUF. Although we observed a difference in NMD efficiency according to gene-level mutational constraints, causality may be the opposite. NMD efficiency may have shaped the landscape of gene-level mutational constraints. A gene for which NMD is less efficient would be less constrained due to the production of a partially functional protein that would make pLoF variants more tolerable. Therefore, gene-level differences in NMD efficiency could be a valuable area for future studies to uncover the drivers that shape gene-level mutational constraints.

A list of studies that analyzed NMD efficiency using public data and the data sources used in each study are provided in Supplementary Data 10. Teran et al.⁶ utilized ASE, represented by VAF_RNA, to analyze NMD efficiency across various tissue types and constructed a LASSO regression model using GTEx. Rhee et al.¹⁸ utilized the allele fraction difference (VAF_RNA–VAF_DNA) of a variant observed from whole-exome sequencing (WES) and RNA-seq data from the TCGA (five major cancer types) for univariate analysis of previously known NMD-modulating factors without multiparameter NMD efficiency model construction. Previous studies that trained multivariate NMD efficiency models using TCGA data relied on a TPM-based approach^3,4. Compared to those studies, in the current study, the explainable variance expressed by non-random variance was significantly improved using the ASE.

Before Teran et al.⁶ argued that NMD is highly stable across different tissues, most studies focused on the same topic reported differences in NMD efficiency across different tissues^{5,19,20,21,22,23}. As variants in TCGA were less influenced by selective pressure, the conclusion deduced from our analysis using TCGA that cancer types exert a limited effect on NMD efficiency supports the results of Teran et al.⁶.

This study possesses some limitations. First, although the use of ASE reduced random noise compared to the TPM-based method, the measure of NMD efficiency based on VAF was still an indirect estimation that was bound to random variance. Future studies investigating NMD efficiency are warranted to utilize higher quality data with higher DOC or to utilize direct measures of NMD efficiency, such as the single-molecule imaging utilized by Hoek et al.¹⁷. Second, in the validation using independent MMRF-TARGET data, the predicted values were considerably higher than the observed values despite the correlation between the observed and predicted NMD efficiencies. We assume that this discrepancy was caused by the high expression of truncated transcripts in amplified genetic regions, as the variants in the MMRF-TARGET data could not be restricted to those from copy-neutral regions due to the lack of copy number information in those datasets. The effect of NMD could have been compensated for by amplification, as gene amplification is prevalent in cancer²⁴. Third, frameshift variants were not utilized due to their potential for high mapping bias. When utilizing nonsense variants, while the effect of mapping bias could be minimized using a WASP-filtered version of the ASE tables in GTEx, the potential effect of mapping bias was not considered in TCGA and MMRF-TARGET datasets. The mapping bias may have added some degree of inaccuracy to the analysis of TCGA and MMRF-TARGET data. However, given the high degree of random variance mentioned above, we assumed that the effect of inaccuracy introduced by mapping bias caused by single-nucleotide variants may have been minor. Fourth, although we assumed a smaller effect of evolutionary pressure on somatic variants than that on germline variants (as assumed in a previous study³), considering the process of clonal evolution during tumorigenesis, driver mutations may not have been free from selective pressure. The results of a previous study suggested enrichment of NMD-sensitive variants in tumor-suppressor genes and NMD-insensitive variants in oncogenes and essential genes, thus indicating selective pressure acting on these classes of genes in cancer cells³. We assume that the effect of the selective pressure on this study was minor considering the limited number of variants from oncogenes, tumor-suppressor genes, and essential genes reported in the previous study³ and the relative abundance of passenger mutations in the cancer genome²⁵. However, future studies utilizing a pure set of passenger mutations in the cancer database would be helpful for revealing the true nature of NMD regardless of the selective pressure.

In conclusion, the characteristics of NMD were more clearly revealed in somatic variants than they were in germline variants. Considering the observed relationship between NMD efficiency and mutational constraints, NMD may play a more vital role in natural selection than was previously thought.

Methods

TCGA data preprocessing

The data preprocessing procedure is illustrated in Supplementary Fig. 11. From the GDC portal (https://portal.gdc.cancer.gov/, accessed 29 Oct 2023), whole-exome sequencing (WES) annotation files, copy number (CN) analysis results, and expression data from 9235 tumor samples from 32 tumor types were downloaded from the GDC portal. The CN analysis results were merged with the WES annotation files from which the nonsense variants of copy-neutral genes (CN = 2) were extracted. Among these nonsense variants, all variants supported by fewer than five sequencing reads were excluded. Thereafter, variants with a matched-normal variant allele frequency (VAF) of greater than 0 but less than 0.3 were also excluded considering the possibility of systematic false variants at the locus (the VAF of matched-normal tissue is expected to be 0 in the case of somatic variants and approximately 0.5 in the case of germline heterozygous variants). When a gene in a sample contained more than two variants that could trigger NMD (nonsense, canonical splice site, or frameshift indel variants), it was not included in the analysis, as the effect of one variant could potentially confound that of the other variants in the same gene. The initial filtering process yielded 50,591 nonsense variants.

Genomic locations of exon-intron boundaries downloaded from the Table Browser of the UCSC genome browser²⁶ were used to annotate features that included last exon, last 50 nt penultimate exon, dist PTC to start codon, upstream exon count, downstream exon count, PTC-containing exon length, dist PTC to downstream EJ, dist PTC to the normal stop codon, and length of the transcript. AF and LOEUF were annotated using gnomAD v4.0¹³, and the mRNA half-life was annotated using previously published data²⁷.

The RNA-seq data were downloaded from the TCGA GDC API from 22 August 2022 to 23 August 2022 using the request (version 2.22.0) library in Python (version 3.8.10). We requested BAM files containing the regions of interest based on the chromosomal location of each PTC. Variant calling was performed using Mutect2 from GATK (version 4.2.6.1) with reference to hg38. The VAF of the variants resulting in PTC were collected from the vcf files. The list of nonsense variants generated from the RNA-seq data with annotations such as DOC and VAF was merged with the list of variants obtained from the DNA data.

The VAF observed from RNA-seq (VAF_RNA) reflects the ASE. The VAF of the same variant observed in tumor DNA sequencing (VAF_DNA) could be an estimate of VAF_RNA in the absence of NMD. The VAF_DNA and VAF_RNA values from the merged data were used to calculate the NMD efficiency of each nonsense variant in TCGA data using the following equation:

$${{{\boldsymbol{NMD\; efficiency}}}}=-{{{{\bf{log }}}}}_{{{{\bf{2}}}}}\frac{{{{{\boldsymbol{VAF}}}}}_{{{{\boldsymbol{RNA}}}}}}{{{{{\boldsymbol{VAF}}}}}_{{{{\boldsymbol{DNA}}}}}}$$

(1)

In this formula, the efficiency of no NMD (VAF_RNA = VAF_DNA) is zero, and that of complete NMD (VAF_RNA = 0) is infinity.

To calculate TPM-based NMD efficiency, the same formula and data-filtering procedures were adopted from a previous study³. Expression data containing TPM values were downloaded along with WES and CN analysis files. The median TPM was calculated from the NMF clusters defined in GDAC Firehose (https://gdac.broadinstitute.org/; accessed 01 Sep 2023).

To determine the final filtering parameters for tumor VAF, tumor DOC, and RNA DOC, 18 combinations were generated using two tumor VAFs (0.2 and 0.3), three tumor DOC values (10, 20, and 30), and three RNA DOC values (10, 20, and 30). The model performance in terms of R^2, was compared (Supplementary Data 6). The selected parameters were tumor VAF ≥ 0.3, tumor DOC ≥ 30, and RNA DOC ≥ 30. Applying these conditions resulted in 4366 variants, and the final dataset comprised 4257 variants after collapsing those that appeared in multiple samples in an NMF cluster using the mean values.

GTEx data and de novo variant data preprocessing

From the ASE data of 844 individuals in GTEx V8, 131,584 nonsense variants were extracted. A WASP-filtered version of the ASE tables was used, as it provides a better estimate of the true ASE by reducing the effect of mapping bias. For each ASE table, the nonsense variants were extracted using the condition VARIANT_ANNOTATION = stop_gained. Consistent with TCGA data, variants with an RNA-seq DOC < 30 were excluded. Variants appearing in multiple tissue samples were collapsed to mean values. The ensemble variant effect predictor (VEP, release 109) was used to annotate the coding DNA-based variant descriptions. The annotation of potential factors affecting NMD efficiency was performed using the same methods used for TCGA data. The NMD efficiency was calculated using the same principle as that for TCGA data. As all variants included in the ASE tables were germline heterozygous variants, VAF_DNA was replaced with 0.5:

$${{{\boldsymbol{NMD\; efficiency}}}}=-{{{{\bf{log }}}}}_{{{{\bf{2}}}}}\frac{{{{{\boldsymbol{VAF}}}}}_{{{{\boldsymbol{RNA}}}}}}{{{{\bf{0.5}}}}}$$

(2)

A collection of curated germline de novo variants was downloaded from the denovo-db website (https://denovo-db.gs.washington.edu/denovo-db/, accessed Oct 14, 2024). A repository of non-Simons Simplex Collection samples consisting of collected de novo variants from various published studies, including developmental delay, autism, intellectual disability, and congenital heart disease cohorts, was downloaded. Among the 415,516 variants, there were 1,817 nonsense variants. After annotation using VEP to remove duplicate descriptions of a single variant, variant descriptions from non-canonical transcripts were removed, ultimately resulting in 864 nonsense variants in the final list (Supplementary Data 11).

MMRF and TARGET data preprocessing

To validate our NMD model, two independent datasets were used, including the MMRF (Multiple Myeloma Research Foundation) CoMMpass dataset and the TARGET (Therapeutically Applicable Research to Generate Effective Treatments) dataset. The preprocessing of these datasets followed the same procedures as those applied to the TCGA dataset. WES and RNA-seq data were downloaded using the GDC API from 29 July 29 2024 to 02 August 2024. First, the annotation files for WES were used to extract nonsense variants that resulted in PTC. Second, the RNA-seq BAM files of regions with a nonsense variant were downloaded. Third, the variants and their VAF were extracted from the RNA-seq data using Mutect2. Finally, the NMD efficiency of the nonsense variants in these datasets was calculated using the same formula as that used for TCGA dataset analysis (Supplementary Data 12). Notably, due to the lack of comprehensive CN analysis in these datasets, we were unable to restrict the analysis to nonsense variants of copy-neutral genes.

Modeling NMD using RFR

RFR was selected in the same manner as that used in previous studies that investigated NMD using TCGA data. RFR is a suitable model for NMD modeling not only for comparison to previous studies but also for robust modeling, as it is less affected by outliers in the dataset. Variants for which LOEUF was not available in the TCGA data (n = 130, 3.05%) and GTEx data (n = 953, 5.28%) were excluded from the training and prediction of the RFR. The NMD efficiency was the target variable in the regression problem. Feature selection was performed based on univariate and correlation analyses of previously reported parameters^3,4,17,28 and on LOEUF. Table 1 lists the features of these models. RFR was implemented with 10,000 trees under the constraint of a maximum of three features on the TCGA dataset resulting from hyperparameter analysis. The trained model was evaluated using k-fold cross-validation to resolve sampling bias by setting k to five. To plot the NMD efficiency predicted by the RFR model (Fig. 4), TCGA data were randomly divided into a training set (n = 3301) and a test set (n = 826). An evaluation metric (R-squared) was used to examine the explanatory power of the input variables for predicting the NMD efficiency. Python (version 3.8.5) was used for the analysis with libraries that included Scikit-learn (version 1.2.2), Pandas (version 1.1.3), and SciPy (version 1.5.2).

Statistics and reproducibility

The Wilcoxon signed-rank test was used to compare means. Pearson’s correlation was used to analyze the correlation between two numerical variables when the normality assumption was satisfied. Otherwise, Spearman’s correlation analysis was used. The normality assumption was tested using the Shapiro-Wilk test. All tests were performed using R software (version 4.2.2).

To reproduce the datasets used in this study from the ground up, source data used in this study can be downloaded from the public database described in the Data Availability section below. Following the filtering process of each dataset described above in each subsection of Methods, the same final dataset used in this study can be obtained. Alternatively, the processed version of data can be used from the Supplementary Data. Correlation analysis, RFR training, validation, and test case regression can be reproduced using the Python code and running environment provided in the Code Availability section below.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All source data used in this study are available from public data sources. TCGA, MMRF, and TARGET data are available from GDC portal (https://portal.gdc.cancer.gov/) and GTEx data are available from GTEx Portal (https://gtexportal.org/home/). De novo variants from denovo-db are available from its website (https://denovo-db.gs.washington.edu/denovo-db/). All processed data are available in this published article (and its Supplementary Data).

Code availability

The code used in this study is available at https://doi.org/10.5281/zenodo.13918125²⁹ or https://github.com/hjkng/nmdeff. The version of software is v1.0 at the time of final submission of this manuscript.

References

Lykke-Andersen, S. & Jensen, T. H. Nonsense-mediated mRNA decay: an intricate machinery that shapes transcriptomes. Nat. Rev. Mol. Cell Biol. 16, 665–677 (2015).
Article CAS PubMed Google Scholar
Tan, K., Stupack, D. G. & Wilkinson, M. F. Nonsense-mediated RNA decay: an emerging modulator of malignancy. Nat. Rev. Cancer 22, 437–451 (2022).
Article CAS PubMed PubMed Central Google Scholar
Lindeboom, R. G., Supek, F. & Lehner, B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, 1112–1118 (2016).
Article CAS PubMed PubMed Central Google Scholar
Lindeboom, R. G. H., Vermeulen, M., Lehner, B. & Supek, F. The impact of nonsense-mediated mRNA decay on genetic disease, gene editing and cancer immunotherapy. Nat. Genet. 51, 1645–1651 (2019).
Article CAS PubMed PubMed Central Google Scholar
Rivas, M. A. et al. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).
Article CAS PubMed PubMed Central Google Scholar
Teran, N. A. et al. Nonsense-mediated decay is highly stable across individuals and tissues. Am. J. Hum. Genet. 108, 1401–1408 (2021).
Article CAS PubMed PubMed Central Google Scholar
McKean, D. M. et al. Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nat. Commun. 7, 12824 (2016).
Article CAS PubMed PubMed Central Google Scholar
Tournier, I. et al. Analysis of the allele-specific expression of the mismatch repair gene MLH1 using a simple DHPLC-based method. Hum. Mutat. 23, 379–384 (2004).
Article CAS PubMed Google Scholar
Karam, R. et al. The NMD mRNA surveillance pathway downregulates aberrant E-cadherin transcripts in gastric cancer cells and in CDH1 mutation carriers. Oncogene 27, 4255–4260 (2008).
Article CAS PubMed Google Scholar
Pirinen, M. et al. Assessing allele-specific expression across multiple tissues from RNA-seq read data. Bioinformatics 31, 2497–2504 (2015).
Article CAS PubMed PubMed Central Google Scholar
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
Article CAS PubMed PubMed Central Google Scholar
MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central Google Scholar
Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
Article CAS PubMed Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article CAS PubMed PubMed Central Google Scholar
Zeng, T., Spence, J. P., Mostafavi, H. & Pritchard, J. K. Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat. Genet. 1–12 https://doi.org/10.1038/s41588-024-01820-9 (2024).
Hoek, T. A. et al. Single-molecule imaging uncovers rules governing nonsense-mediated mRNA decay. Mol. Cell 75, 324–339.e311 (2019).
Article CAS PubMed PubMed Central Google Scholar
Rhee, J. K., Lee, S., Park, W. Y., Kim, Y. H. & Kim, T. M. Allelic imbalance of somatic mutations in cancer genomes and transcriptomes. Sci. Rep. 7, 1653 (2017).
Article PubMed PubMed Central Google Scholar
Mendell, J. T., Medghalchi, S. M., Lake, R. G., Noensie, E. N. & Dietz, H. C. Novel Upf2p orthologues suggest a functional link between translation initiation and nonsense surveillance complexes. Mol. Cell. Biol. 20, 8944–8957 (2000).
Article CAS PubMed PubMed Central Google Scholar
Bateman, J. F., Freddi, S., Nattrass, G. & Savarirayan, R. Tissue-specific RNA surveillance? Nonsense-mediated mRNA decay causes collagen X haploinsufficiency in Schmid metaphyseal chondrodysplasia cartilage. Hum. Mol. Genet. 12, 217–225 (2003).
Article CAS PubMed Google Scholar
Zetoune, A. B. et al. Comparison of nonsense-mediated mRNA decay efficiency in various murine tissues. BMC Genet. 9, 83 (2008).
Article PubMed PubMed Central Google Scholar
Huang, L. & Wilkinson, M. F. Regulation of nonsense-mediated mRNA decay. Wiley Interdiscip. Rev. RNA 3, 807–828 (2012).
Article CAS PubMed Google Scholar
Dyle, M. C., Kolakada, D., Cortazar, M. A. & Jagannathan, S. How to get away with nonsense: mechanisms and consequences of escape from nonsense-mediated RNA decay. Wiley Interdiscip. Rev. RNA 11, e1560 (2020).
Article CAS PubMed Google Scholar
Albertson, D. G. Gene amplification in cancer. TRENDS Genet. 22, 447–455 (2006).
Article CAS PubMed Google Scholar
Bozic, I. et al. Accumulation of driver and passenger mutations during tumor progression. Proc. Natl Acad. Sci. USA 107, 18545–18550 (2010).
Article CAS PubMed PubMed Central Google Scholar
Karolchik, D. et al. The UCSC table browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).
Article CAS PubMed PubMed Central Google Scholar
Friedel, C. C., Dolken, L., Ruzsics, Z., Koszinowski, U. H. & Zimmer, R. Conserved principles of mammalian transcriptional regulation revealed by RNA half-life. Nucleic Acids Res. 37, e115 (2009).
Article PubMed PubMed Central Google Scholar
Supek, F., Lehner, B. & Lindeboom, R. G. H. To NMD or not to NMD: nonsense-mediated mRNA decay in cancer and other genetic diseases. Trends Genet. 37, 657–668 (2021).
Article CAS PubMed Google Scholar
Kang, H. & Kim, Y.-G. NMDeff v1.0 https://doi.org/10.1016/j.tig.2020.11.002 (2024).

Download references

Acknowledgements

The results published here are in part based on data generated by TCGA Research Network: https://www.cancer.gov/tcga. This study was supported by a Medical Scientist Training Program from the Ministry of Science & ICT of Korea. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No.1711180477, 2022R1A2C2091571). This work has also been supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No.2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)).

Author information

These authors contributed equally: Young-gon Kim, Hyunju Kang, Beomki Lee.

Authors and Affiliations

Department of Laboratory Medicine and Genetics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
Young-gon Kim, Beomki Lee, Changhee Ha & Jong-Won Kim
Department of Artificial Intelligence, Sungkyunkwan University College of Computing and Informatics, Suwon, Republic of Korea
Hyunju Kang & Hogun Park
Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
Beomki Lee
Samsung Advanced Institute for Health Sciences & Technology (SAIHST), Seoul, Republic of Korea
Hyeok-Jae Jang
Clinical Genomics Center, Samsung Medical Center, Seoul, Republic of Korea
Jong-ho Park

Authors

Young-gon Kim
View author publications
Search author on:PubMed Google Scholar
Hyunju Kang
View author publications
Search author on:PubMed Google Scholar
Beomki Lee
View author publications
Search author on:PubMed Google Scholar
Hyeok-Jae Jang
View author publications
Search author on:PubMed Google Scholar
Jong-ho Park
View author publications
Search author on:PubMed Google Scholar
Changhee Ha
View author publications
Search author on:PubMed Google Scholar
Hogun Park
View author publications
Search author on:PubMed Google Scholar
Jong-Won Kim
View author publications
Search author on:PubMed Google Scholar

Contributions

Y-g.K. conceived the study and wrote the manuscript. H.K. and B.L. performed the analyses and participated in writing the manuscript. H-J.J., J-h.P., and C.H. performed data extraction, filtering, and annotation. H.P. and J.-W.K. supervised the study. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Hogun Park or Jong-Won Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Stephen Montgomery and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Laura Rodríguez Perez and Kaliya Georgieva. [A peer review file is available.]

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1-12

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Kim, Yg., Kang, H., Lee, B. et al. A spectrum of nonsense-mediated mRNA decay efficiency along the degree of mutational constraint. Commun Biol 7, 1461 (2024). https://doi.org/10.1038/s42003-024-07136-y

Download citation

Received: 07 April 2024
Accepted: 24 October 2024
Published: 07 November 2024
Version of record: 07 November 2024
DOI: https://doi.org/10.1038/s42003-024-07136-y

This article is cited by

Variable efficiency of nonsense-mediated mRNA decay across human tissues, tumors and individuals
- Guillermo Palou-Márquez
- Fran Supek
Genome Biology (2025)