Abstract
Despite its importance for regulating gene expression, nonsense-mediated mRNA decay (NMD) remains poorly understood. Here, we extend the findings of a previous landmark study that proposed several factors associated with NMD efficiency using matched genome and transcriptome data from The Cancer Genome Atlas Program (TCGA) by incorporating additional data including Genotype-Tissue Expression (GTEx), gnomAD, and metrics for mutational constraints. Factors affecting NMD efficiency are analyzed using an allele-specific expression (ASE)-based measure to reduce noise caused by random variations. Additionally, by combining our data with the updated allele frequency database of gnomAD, we demonstrate the spectrum of NMD efficiency according to the degree of gene-level mutational constraints (degree of a gene-tolerating loss-of-function variants). The NMD prediction model, trained on TCGA data, shows that gene-level mutational constraint is an important predictor of NMD efficiency. Findings of this study suggest the potential role of NMD on shaping the landscape of mutational constraints.
Similar content being viewed by others
Introduction
Nonsense-mediated mRNA decay (NMD) is a biological mechanism that selectively degrades mRNAs harboring premature termination codons (PTCs)1,2. The exon junction complex (EJC) model has long been accepted as the major mechanism underlying NMD initiation. Based on this model, two canonical rules, namely the “last-exon rule” and the “50 nt rule”, were widely accepted to define the conditions of escaping NMD. According to these rules, PTCs in the last exon of a gene or in the last 50 nt of the penultimate exon are not expected to trigger NMD.
Some studies have aimed to further analyze the mechanisms of NMD efficiency, and a landmark study was published by Lindeboom et al.3,4. Based on the assumption that the human cancer genome and its matched transcriptome provide a closer look into the true nature of NMD due to the abundance and random distribution of PTC-inducing variants in cancer, they utilized somatic variants in The Cancer Genome Atlas (TCGA) to model NMD efficiency based on multiple parameters. In these studies, NMD efficiency was quantified as the ratio of the mRNA expression level of a PTC-bearing transcript to the median mRNA expression level of transcripts without PTCs in samples of the same tumor type. The expression level was measured as transcripts per million mapped reads (TPM). They incorporated all potential factors into the random forest regression (RFR) model and identified factors affecting NMD efficiency, such as the proximity of the PTC to the start codon and the length of the PTC-containing exon.
In two other studies by Rivas et al.5 and Teran et al.6, the NMD efficiency of germline nonsense variants in the Genotype-Tissue Expression (GTEx) data was measured using allele-specific expression (ASE). ASE has been more widely used in recent studies on NMD than has the TPM-based method described above5,6,7,8,9,10,11,12. According to Rivas et al.5, the tissue specificity of NMD efficiency was suggested based on the varying NMD efficiency observed across tissues. However, in a study performed by Teran et al.6, NMD efficiency was consistent across tissues, and allele frequency (AF) was suggested as a parameter for NMD efficiency. Another recent study that utilized single-molecule imaging suggested that additional factors affect NMD efficiency, such as the distance from the PTC to the EJC, the number of EJCs upstream and downstream of the PTC, and the coding sequences around PTCs.
As hypothesized by Lindeboom et al.3, it is reasonable to assume that a database of somatic variants could provide a more abundant and pure source for NMD investigation, as the landscape of PTC-inducing variants in a germline database must have been shaped by evolutionary pressure13. As the previous study that trained the NMD prediction model from somatic variants used a TPM-based NMD measurement3, the first purpose of this study was to train a more robust prediction model for NMD efficiency using an ASE-based method that has been more widely used for measuring the degree of NMD.
Second, based on the reported differences in NMD efficiency according to AF in GTEx variants, we investigated the effect of mutational constraints on NMD efficiency. Mutational constraints are defined as the depletion of variants through natural selection. A gene is regarded as highly constrained for loss-of-function (LoF) variants if the frequency of LoF variants is considerably lower than expected due to its detrimental effect on survival13,14. Examples of measures for mutational constraints include the probability of being LoF-intolerant (pLI)15 and the LoF-observed/expected upper-bound fraction (LOEUF)13. A higher pLI, or equivalently, lower LOEUF of a gene indicates a higher mutational constraint of the gene, thus suggesting an adverse biological effect of the gene. In this study, NMD efficiency was compared across AF-stratified and LOEUF-stratified subsets of TCGA and GTEx. A low LOEUF (close to 0) indicates depletion (high constraint) of LoF variants in the gene in the general population, whereas a high LOEUF (close to 1) indicates abundance (low constraint) of LoF variants.
Third, the characteristics of known factors of NMD efficiency, including the last exon rule, start proximity rule, 3’UTR length, and tissue specificity, were reviewed using the NMD efficiency model constructed in this study by evaluating the importance of each feature for predicting NMD efficiency.
Results
The spectrum of NMD efficiency and the capability of NMD rules for predicting NMD efficiency in somatic variants, rare germline variants, and common germline variants
From the TCGA data, 4257 nonsense variants were collected from 9235 samples of 32 tumor types (Supplementary Data 1). The NMD efficiency distribution of the nonsense variants in TCGA is presented in Fig. 1. Taking only the canonical rules of NMD (the “last exon rule” and the “last 50nt of the penultimate exon rule”) into account, the NMD efficiency of the predicted NMD-sensitive variants exhibited wide variability, while that of the predicted NMD-escaping variants was concentrated around zero (Fig. 1A–C). This is contrary to the previous notion that deviations from canonical rules are common in both predicted NMD-escaping and -sensitive regions. The results of a previous study using TCGA data revealed a wider distribution of NMD efficiency, both in predicted NMD-escaping variants and predicted NMD-sensitive variants, and this was likely due to the high random variation caused by using a TPM-based measure of NMD efficiency3. The benefit of incorporating two additional rules (the “start proximal rule” and the “long exon rule,” Fig. 1D–F) was marginal in terms of the discriminative power that is the capability of separating NMD-escaping and NMD-sensitive variants compared to that of the combination of the two canonical rules.
Mean NMD efficiency and number of variants in each distribution are noted on the plot. A Last-exon nonsense variants versus other nonsense variants. B Nonsense variants in the last 50 nt of the penultimate exon versus other nonsense variants. C Nonsense variants located in the last exon or last 50nt of the penultimate exon versus other nonsense variants. D Nonsense variants closer than 150nt from the start codon versus other nonsense variants. E Nonsense variants located in the exon longer than 407 nt versus other nonsense variants. F Nonsense variants located in the last exon, the last 50 nt of the penultimate exon, closer than 150 nt from the start codon, or the exon longer than 407 nt versus other nonsense variants.
The final GTEx dataset was comprised of 18,055 variants from 54 tissue types (Supplementary Data 2). In GTEx variants, the overall NMD efficiency was lower than that in TCGA variants, and all NMD rules exhibited low discriminative power (Fig. 2). In particular, in a subset of common germline variants (AF [0.01, ∞)), the overall NMD efficiency was the lowest (Supplementary Fig. 1). Additionally, in this subset, the NMD efficiency distributions of predicted NMD-escaping variants and predicted NMD-sensitive variants exhibited an almost complete overlap by all four rules (the “last exon,” “last 50nt of the penultimate exon,” “start proximal,” and “long exon” rules), indicating zero discriminative power of the NMD rules. In contrast, NMD efficiency was higher, and NMD rules exhibited higher discriminative power in the AF [0] group of GTEx (Supplementary Fig. 2). In short, the performance of NMD rules for predicting NMD efficiency was the highest for somatic variants, and this was followed by rare germline variants and common germline variants.
Mean NMD efficiency and number of variants in each distribution are noted on the plot. A Last-exon nonsense variants versus other nonsense variants. B Nonsense variants in the last 50nt of the penultimate exon versus other nonsense variants. C Nonsense variants located in the last exon or last 50nt of the penultimate exon versus other nonsense variants. D Nonsense variants closer than 150nt from the start codon versus other nonsense variants. E Nonsense variants located in the exon longer than 407nt versus other nonsense variants. F Nonsense variants located in the last exon, the last 50nt of the penultimate exon, closer than 150nt from the start codon, or the exon longer than 407nt versus other nonsense variants.
A spectrum of NMD efficiencies along the gene-level mutational constraint
Figure 3A–D presents the NMD efficiency in the subsets of TCGA and GTEx variants grouped according to either AF or LOEUF. Among the four combinations created using the two different datasets and two different grouping strategies, the increasing trend of NMD efficiency with increasing mutational constraints (decreasing AF and LOEUF) was most evident in the NMD efficiency of TCGA variants across the LOEUF deciles (Fig. 3C). In the TCGA dataset, nonsense variants from genes with lower LOEUF exhibited higher NMD efficiency, whereas those from genes with higher LOEUF exhibited diminished NMD efficiency (Fig. 3C). Conversely, the tendency for higher NMD efficiency with a lower LOEUF decile was not evident in the GTEx dataset (Fig. 3D). This can be explained by the depletion of germline nonsense variants with high NMD efficiency and low LOEUF that would have been located in the upper right section of Fig. 3D if not depleted. The depletion of low-LOEUF GTEx variants is evident from the number of variants presented in Fig. 3D.
A NMD efficiency of AF-stratified subsets of TCGA variants and B GTEx variants. C NMD efficiency of TCGA variants binned by LOEUF decile. D NMD efficiency of GTEx variants binned by LOEUF decile. E NMD efficiency of TCGA variants (AF = 0) binned by LOEUF decile. F NMD efficiency of GTEx variants (AF = 0) binned by LOEUF decile. AF, allele frequency; LOEUF, loss-of-function observed/expected upper-bound fraction.
As LOEUF was derived from the same database as AF, to avoid the potential confounding effects of AF on LOEUF the AF [0] subsets of TCGA and GTEx (Fig. 3E and F) were additionally analyzed to reveal the sole effect of LOEUF on NMD. The relationship between LOEUF and NMD efficiency was reproduced in the AF [0] subset of TCGA variants (Fig. 3E). Supplementary Fig. 3 and 4 present NMD efficiency distribution plots of the low- and high-LOEUF TCGA subsets, respectively. When rules to predict NMD escape were applied to divide the dataset into two groups, a difference in NMD efficiency was more clearly observed in the low-LOEUF subset.
The correlation coefficients between LOEUF and NMD efficiency measured in TCGA and MMRF-TARGET data are presented in Supplementary Data 3. Although the correlation was significant in TCGA data across all tumor DOC thresholds applied, in the MMRF-TARGET data, the correlation was generally weaker and significant only when a higher threshold for tumor DOC was applied.
Based on the speculation that LOEUF may possess a lower accuracy for measuring gene-level mutational constraints for smaller genes16, the correlation between LOEUF and NMD efficiency was additionally analyzed after excluding variants from smaller genes. However, the degree of correlation was not enhanced by filtering variants from smaller genes (Supplementary Data 4). Additionally, the GeneBayes metrics for gene-level mutational constraints provided in the publication mentioned above16 did not exhibit a higher correlation with NMD efficiency than that of LOEUF in our TCGA data (Supplementary Data 5).
ASE-based model explaining the greater portion of the total variance of NMD efficiency compared to that of the TPM-based model
Based on the observation that NMD efficiency is more distinguishable according to the NMD rules in TCGA data, an RFR model was constructed using TCGA data. Factors known to affect NMD efficiency, including canonical and recently reported rules, were collected from existing studies3,4,17. Along with the factors affecting NMD efficiency, LOEUF was incorporated as a predictor variable in the RFR model. The final RFR model incorporated 14 different features (Table 1) and explained 43.7% of the total variance. The non-random variance calculated in the TCGA dataset was 46% (Supplementary Fig. 5), thus making the proportion of explained variance 95% (43.7/46.0).
The downstream exon count possessed the highest importance, whereas the importance of the last exon was relatively low, and this was likely due to redundancy between these two features (the last exons have zero downstream exon count). The importance of the Last 50nt penultimate exon was also low, and this could be attributed to the small number of variants belonging to this category3. The 3’UTR length exhibited positive correlation with NMD efficiency and the 4th highest importance in the model. The importance of LOEUF was eighth among the 14 features, and this was higher than that of length of transcript, mRNA half-life, 5’UTR length, upstream exon count, AF, and last 50nt penultimate exon.
ASE-based and TPM-based models were directly compared using the same datasets and combinations of filtering parameters (Supplementary Data 6). Deriving the TPM-based NMD efficiency according to the methods described in a previous study3 required an additional filtering process to reduce the number of variants. The explainable variance was significantly higher in the ASE-based model for all parameter combinations.
Comparison between predicted and observed NMD efficiencies
In addition to the observed NMD efficiency measured with ASE, the NMD efficiency can be predicted from the RFR model using TCGA data. The observed and predicted NMD efficiencies of the variants in different datasets are presented in Fig. 4A (Supplementary Data 7). The predicted values were considerably higher in all AF subsets of the GTEx data, thus suggesting the depletion of variants with high NMD efficiency in the general population and preferential negative selection of variants with higher than predicted NMD efficiency. The difference was smallest in the AF [0] subset of GTEx, and this was likely due to the lesser effect of negative selection, as the variants in this subset were younger. This result is in agreement with those of previous studies that reported higher NMD efficiencies in lower AF subsets of GTEx6. However, the predicted and observed NMD efficiencies of the rarest variants in GTEx were lower than those of the somatic variants in TCGA, thus suggesting a non-negligible cumulative effect of evolutionary pressure on these variants.
A Observed and predicted NMD efficiencies of AF-stratified subsets of GTEx variants, TCGA variants, MMRF-TARGET variants, and de novo variants (observed NMD efficiency from de novo DB was unavailable due to a lack of RNAseq data) are presented. For this experiment, TCGA data were randomly separated into a training set (n = 3301) and a test set (n = 826). B Correlation between observed and predicted NMD efficiency in the AF [0.01, ∞) subset of GTEx. C Correlation between observed and predicted NMD efficiency in the AF [0] subset of GTEx. D Correlation between observed and predicted NMD efficiency in MMRF-TARGET data. Pearson’s test was performed.
The observed NMD efficiency of an independent set of somatic variants (MMRF-TARGET data) was significantly lower than predicted. However, the correlation between the predicted and observed NMD efficiencies was much higher than that of the GTEx data (Fig. 4B–D). The overall NMD efficiency of variants in the MMRF-TARGET data was lower than that of the TCGA variants, and all NMD rules exhibited low discriminative power (Supplementary Figs. 6 and 7).
A set of curated de novo variants, the de novo DB, was used as an additional set of variants free from evolutionary pressure. In contrast to the AF [0] subset of GTEx, the predicted NMD efficiency of the de novo variants was as high as that of the somatic variants in TCGA, presumably due to the lack of influence of natural selection. The observed NMD efficiency could not be derived from these data due to the lack of RNA-seq results.
Reduced NMD efficiency in start proximal PTCs and the potential effect of translation re-initiation
The NMD efficiency, according to the PTC-bearing regions (<150nt to the start codon, the rest, the last 50nt of the penultimate exon, and the last exon), is presented in Supplementary Fig. 8. NMD efficiency according to TCGA data indicated a higher degree of discrimination across regions than that of GTEx. Supplementary Fig. 9 presents NMD efficiency trends in the proximal region of the start codon. A gradual increase in NMD efficiency from the start codon to 200nt was generally observed in both TCGA and GTEx data, and this is consistent with previous findings, although unexpected patterns were observed in common germline variants, likely due to the limited number of samples used for drawing trends in the 200nt range (n = 419 and 517 for Supplementary Fig. 9B and C, respectively).
Although other mechanisms, including transcript stabilization by the interaction between poly(A)-binding protein C1 (PABPC1) and translation termination machinery, have been proposed, translation re-initiation from the downstream start codon is considered the main mechanism that explains the reduced NMD efficiency of proximal start PTCs. To determine the effect of translation re-initiation capability on NMD efficiency, we compared the NMD efficiency of start proximal nonsense variants (<150nt from the start codon) according to the presence of downstream in-frame start codons. Although statistically significant differences were observed only in several subsets of GTEx (Supplementary Fig. 10B, D), transcripts with downstream AUG generally exhibited diminished NMD efficiency, including TCGA data.
Cancer type in TCGA is not important for predicting NMD efficiency
The NMD efficiencies for each cancer type in TCGA and each tissue type in GTEx are presented in Fig. 5 and Supplementary Data 8 and 9, respectively. To exclude non-significant outliers, cancers, and tissue types consisting of fewer than 100 variants were excluded (Fig. 5). Considering the effect of evolutionary pressure on variants present in GTEx, TCGA results may provide a better estimate of NMD efficiency across tissues, which has not been used in previous studies. Adding the cancer type feature to our RFR model that was based on TCGA data did not improve model performance (R2 from 43.7% to 43.1%). Thus, cancer type does not appear to be an efficient parameter for predicting NMD efficiency.
The NMD efficiency did not exhibit conspicuous differences across cancer or tissue types A Cancer types in TCGA data. BLCA bladder urothelial carcinoma. BLCA bladder urothelial carcinoma, BRCA breast invasive carcinoma, CESC cervical squamous cell carcinoma and endocervical adenocarcinoma, COAD colon adenocarcinoma, LGG brain lower grade glioma, READ rectum adenocarcinoma, SKCM skin cutaneous melanoma, STAD stomach adenocarcinoma, UCEC uterine corpus endometrial carcinoma). B Tissue types in GTEx data. ADPSBQ adipose-subcutaneous, aDPVSC adipose-visceral (Omentum), ADRNLG adrenal gland, ARTAORT artery-aorta, ARTCRN artery-coronary, ARTTBL artery-tibial, BREAST breast, BRNACC brain-anterior cingulate cortex (BA24), BRNAMY brain-amygdala, BRNCDT brain-caudate (basal ganglia), BRNCHA brain-cerebellum, BRNCHB brain-cerebellar Hemisphere, BRNCTXA brain-Cortex, BRNCTXB brain-frontal cortex (BA9), BRNHPP brain-hippocampus, BRNHPT brain-hypothalamus, BRNNCC brain-nucleus accumbens (basal ganglia), BRNPTM brain-putamen (basal ganglia), BRNSNG brain-substantia nigra, BRNSPC brain-spinal cord (cervical c-1), CLNSGM colon-sigmoid, CLNTRN colon-transverse, ESPGEJ esophagus-gastroesophageal junction, ESPMCS esophagus-mucosa, ESPMSL esophagus-muscularis, FIBRBLS cell-cultured fibroblasts, HRTAA heart-atrial appendage, HRTLV heart-left ventricle, KDNCTX kidney-cortex, LCL cell-EBV-transformed lymphocytes, LIVER liver, LUNG lung, MSCLSK muscle-skeletal, NERVET nerve-tibial, OVARY ovary, PNCREAS pancreas, PRSTTE prostate, PTTARY pituitary, SKINNS skin-not Sun Exposed (suprapubic), SKINS skin-sun exposed (lower leg), SLVRYG minor salivary gland, SNTTRM small intestine-terminal ileum, SPLEEN spleen, STMACH stomach, TESTIS testis, THYROID thyroid, UTERUS uterus, VAGINA vagina, WHLBLD whole blood.
Discussion
In this study, the landscapes of nonsense variants in TCGA and GTEx were compared, and the factors affecting NMD efficiency were analyzed using both datasets. The profound effect of natural selection on GTEx was clearly revealed, indicating that TCGA is a better source for studying NMD independent of evolutionary pressure. The relationship between NMD efficiency and LOEUF observed in TCGA was absent in GTEx, and this is plausible considering the selective depletion of variants with high NMD efficiency in GTEx during natural selection.
Using an ASE-based measure of NMD efficiency, we discovered a spectrum of NMD efficiencies along gene-level mutational constraints represented as LOEUF. In contrast, the effect of AF on the NMD efficiency was not as profound. The relationship between LOEUF and NMD efficiency remained observable, even after removing the potential confounding effect of AF on LOEUF. Although we observed a difference in NMD efficiency according to gene-level mutational constraints, causality may be the opposite. NMD efficiency may have shaped the landscape of gene-level mutational constraints. A gene for which NMD is less efficient would be less constrained due to the production of a partially functional protein that would make pLoF variants more tolerable. Therefore, gene-level differences in NMD efficiency could be a valuable area for future studies to uncover the drivers that shape gene-level mutational constraints.
A list of studies that analyzed NMD efficiency using public data and the data sources used in each study are provided in Supplementary Data 10. Teran et al.6 utilized ASE, represented by VAFRNA, to analyze NMD efficiency across various tissue types and constructed a LASSO regression model using GTEx. Rhee et al.18 utilized the allele fraction difference (VAFRNA–VAFDNA) of a variant observed from whole-exome sequencing (WES) and RNA-seq data from the TCGA (five major cancer types) for univariate analysis of previously known NMD-modulating factors without multiparameter NMD efficiency model construction. Previous studies that trained multivariate NMD efficiency models using TCGA data relied on a TPM-based approach3,4. Compared to those studies, in the current study, the explainable variance expressed by non-random variance was significantly improved using the ASE.
Before Teran et al.6 argued that NMD is highly stable across different tissues, most studies focused on the same topic reported differences in NMD efficiency across different tissues5,19,20,21,22,23. As variants in TCGA were less influenced by selective pressure, the conclusion deduced from our analysis using TCGA that cancer types exert a limited effect on NMD efficiency supports the results of Teran et al.6.
This study possesses some limitations. First, although the use of ASE reduced random noise compared to the TPM-based method, the measure of NMD efficiency based on VAF was still an indirect estimation that was bound to random variance. Future studies investigating NMD efficiency are warranted to utilize higher quality data with higher DOC or to utilize direct measures of NMD efficiency, such as the single-molecule imaging utilized by Hoek et al.17. Second, in the validation using independent MMRF-TARGET data, the predicted values were considerably higher than the observed values despite the correlation between the observed and predicted NMD efficiencies. We assume that this discrepancy was caused by the high expression of truncated transcripts in amplified genetic regions, as the variants in the MMRF-TARGET data could not be restricted to those from copy-neutral regions due to the lack of copy number information in those datasets. The effect of NMD could have been compensated for by amplification, as gene amplification is prevalent in cancer24. Third, frameshift variants were not utilized due to their potential for high mapping bias. When utilizing nonsense variants, while the effect of mapping bias could be minimized using a WASP-filtered version of the ASE tables in GTEx, the potential effect of mapping bias was not considered in TCGA and MMRF-TARGET datasets. The mapping bias may have added some degree of inaccuracy to the analysis of TCGA and MMRF-TARGET data. However, given the high degree of random variance mentioned above, we assumed that the effect of inaccuracy introduced by mapping bias caused by single-nucleotide variants may have been minor. Fourth, although we assumed a smaller effect of evolutionary pressure on somatic variants than that on germline variants (as assumed in a previous study3), considering the process of clonal evolution during tumorigenesis, driver mutations may not have been free from selective pressure. The results of a previous study suggested enrichment of NMD-sensitive variants in tumor-suppressor genes and NMD-insensitive variants in oncogenes and essential genes, thus indicating selective pressure acting on these classes of genes in cancer cells3. We assume that the effect of the selective pressure on this study was minor considering the limited number of variants from oncogenes, tumor-suppressor genes, and essential genes reported in the previous study3 and the relative abundance of passenger mutations in the cancer genome25. However, future studies utilizing a pure set of passenger mutations in the cancer database would be helpful for revealing the true nature of NMD regardless of the selective pressure.
In conclusion, the characteristics of NMD were more clearly revealed in somatic variants than they were in germline variants. Considering the observed relationship between NMD efficiency and mutational constraints, NMD may play a more vital role in natural selection than was previously thought.
Methods
TCGA data preprocessing
The data preprocessing procedure is illustrated in Supplementary Fig. 11. From the GDC portal (https://portal.gdc.cancer.gov/, accessed 29 Oct 2023), whole-exome sequencing (WES) annotation files, copy number (CN) analysis results, and expression data from 9235 tumor samples from 32 tumor types were downloaded from the GDC portal. The CN analysis results were merged with the WES annotation files from which the nonsense variants of copy-neutral genes (CN = 2) were extracted. Among these nonsense variants, all variants supported by fewer than five sequencing reads were excluded. Thereafter, variants with a matched-normal variant allele frequency (VAF) of greater than 0 but less than 0.3 were also excluded considering the possibility of systematic false variants at the locus (the VAF of matched-normal tissue is expected to be 0 in the case of somatic variants and approximately 0.5 in the case of germline heterozygous variants). When a gene in a sample contained more than two variants that could trigger NMD (nonsense, canonical splice site, or frameshift indel variants), it was not included in the analysis, as the effect of one variant could potentially confound that of the other variants in the same gene. The initial filtering process yielded 50,591 nonsense variants.
Genomic locations of exon-intron boundaries downloaded from the Table Browser of the UCSC genome browser26 were used to annotate features that included last exon, last 50 nt penultimate exon, dist PTC to start codon, upstream exon count, downstream exon count, PTC-containing exon length, dist PTC to downstream EJ, dist PTC to the normal stop codon, and length of the transcript. AF and LOEUF were annotated using gnomAD v4.013, and the mRNA half-life was annotated using previously published data27.
The RNA-seq data were downloaded from the TCGA GDC API from 22 August 2022 to 23 August 2022 using the request (version 2.22.0) library in Python (version 3.8.10). We requested BAM files containing the regions of interest based on the chromosomal location of each PTC. Variant calling was performed using Mutect2 from GATK (version 4.2.6.1) with reference to hg38. The VAF of the variants resulting in PTC were collected from the vcf files. The list of nonsense variants generated from the RNA-seq data with annotations such as DOC and VAF was merged with the list of variants obtained from the DNA data.
The VAF observed from RNA-seq (VAFRNA) reflects the ASE. The VAF of the same variant observed in tumor DNA sequencing (VAFDNA) could be an estimate of VAFRNA in the absence of NMD. The VAFDNA and VAFRNA values from the merged data were used to calculate the NMD efficiency of each nonsense variant in TCGA data using the following equation:
In this formula, the efficiency of no NMD (VAFRNA = VAFDNA) is zero, and that of complete NMD (VAFRNA = 0) is infinity.
To calculate TPM-based NMD efficiency, the same formula and data-filtering procedures were adopted from a previous study3. Expression data containing TPM values were downloaded along with WES and CN analysis files. The median TPM was calculated from the NMF clusters defined in GDAC Firehose (https://gdac.broadinstitute.org/; accessed 01 Sep 2023).
To determine the final filtering parameters for tumor VAF, tumor DOC, and RNA DOC, 18 combinations were generated using two tumor VAFs (0.2 and 0.3), three tumor DOC values (10, 20, and 30), and three RNA DOC values (10, 20, and 30). The model performance in terms of R2, was compared (Supplementary Data 6). The selected parameters were tumor VAF ≥ 0.3, tumor DOC ≥ 30, and RNA DOC ≥ 30. Applying these conditions resulted in 4366 variants, and the final dataset comprised 4257 variants after collapsing those that appeared in multiple samples in an NMF cluster using the mean values.
GTEx data and de novo variant data preprocessing
From the ASE data of 844 individuals in GTEx V8, 131,584 nonsense variants were extracted. A WASP-filtered version of the ASE tables was used, as it provides a better estimate of the true ASE by reducing the effect of mapping bias. For each ASE table, the nonsense variants were extracted using the condition VARIANT_ANNOTATION = stop_gained. Consistent with TCGA data, variants with an RNA-seq DOC < 30 were excluded. Variants appearing in multiple tissue samples were collapsed to mean values. The ensemble variant effect predictor (VEP, release 109) was used to annotate the coding DNA-based variant descriptions. The annotation of potential factors affecting NMD efficiency was performed using the same methods used for TCGA data. The NMD efficiency was calculated using the same principle as that for TCGA data. As all variants included in the ASE tables were germline heterozygous variants, VAFDNA was replaced with 0.5:
A collection of curated germline de novo variants was downloaded from the denovo-db website (https://denovo-db.gs.washington.edu/denovo-db/, accessed Oct 14, 2024). A repository of non-Simons Simplex Collection samples consisting of collected de novo variants from various published studies, including developmental delay, autism, intellectual disability, and congenital heart disease cohorts, was downloaded. Among the 415,516 variants, there were 1,817 nonsense variants. After annotation using VEP to remove duplicate descriptions of a single variant, variant descriptions from non-canonical transcripts were removed, ultimately resulting in 864 nonsense variants in the final list (Supplementary Data 11).
MMRF and TARGET data preprocessing
To validate our NMD model, two independent datasets were used, including the MMRF (Multiple Myeloma Research Foundation) CoMMpass dataset and the TARGET (Therapeutically Applicable Research to Generate Effective Treatments) dataset. The preprocessing of these datasets followed the same procedures as those applied to the TCGA dataset. WES and RNA-seq data were downloaded using the GDC API from 29 July 29 2024 to 02 August 2024. First, the annotation files for WES were used to extract nonsense variants that resulted in PTC. Second, the RNA-seq BAM files of regions with a nonsense variant were downloaded. Third, the variants and their VAF were extracted from the RNA-seq data using Mutect2. Finally, the NMD efficiency of the nonsense variants in these datasets was calculated using the same formula as that used for TCGA dataset analysis (Supplementary Data 12). Notably, due to the lack of comprehensive CN analysis in these datasets, we were unable to restrict the analysis to nonsense variants of copy-neutral genes.
Modeling NMD using RFR
RFR was selected in the same manner as that used in previous studies that investigated NMD using TCGA data. RFR is a suitable model for NMD modeling not only for comparison to previous studies but also for robust modeling, as it is less affected by outliers in the dataset. Variants for which LOEUF was not available in the TCGA data (n = 130, 3.05%) and GTEx data (n = 953, 5.28%) were excluded from the training and prediction of the RFR. The NMD efficiency was the target variable in the regression problem. Feature selection was performed based on univariate and correlation analyses of previously reported parameters3,4,17,28 and on LOEUF. Table 1 lists the features of these models. RFR was implemented with 10,000 trees under the constraint of a maximum of three features on the TCGA dataset resulting from hyperparameter analysis. The trained model was evaluated using k-fold cross-validation to resolve sampling bias by setting k to five. To plot the NMD efficiency predicted by the RFR model (Fig. 4), TCGA data were randomly divided into a training set (n = 3301) and a test set (n = 826). An evaluation metric (R-squared) was used to examine the explanatory power of the input variables for predicting the NMD efficiency. Python (version 3.8.5) was used for the analysis with libraries that included Scikit-learn (version 1.2.2), Pandas (version 1.1.3), and SciPy (version 1.5.2).
Statistics and reproducibility
The Wilcoxon signed-rank test was used to compare means. Pearson’s correlation was used to analyze the correlation between two numerical variables when the normality assumption was satisfied. Otherwise, Spearman’s correlation analysis was used. The normality assumption was tested using the Shapiro-Wilk test. All tests were performed using R software (version 4.2.2).
To reproduce the datasets used in this study from the ground up, source data used in this study can be downloaded from the public database described in the Data Availability section below. Following the filtering process of each dataset described above in each subsection of Methods, the same final dataset used in this study can be obtained. Alternatively, the processed version of data can be used from the Supplementary Data. Correlation analysis, RFR training, validation, and test case regression can be reproduced using the Python code and running environment provided in the Code Availability section below.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All source data used in this study are available from public data sources. TCGA, MMRF, and TARGET data are available from GDC portal (https://portal.gdc.cancer.gov/) and GTEx data are available from GTEx Portal (https://gtexportal.org/home/). De novo variants from denovo-db are available from its website (https://denovo-db.gs.washington.edu/denovo-db/). All processed data are available in this published article (and its Supplementary Data).
Code availability
The code used in this study is available at https://doi.org/10.5281/zenodo.1391812529 or https://github.com/hjkng/nmdeff. The version of software is v1.0 at the time of final submission of this manuscript.
References
Lykke-Andersen, S. & Jensen, T. H. Nonsense-mediated mRNA decay: an intricate machinery that shapes transcriptomes. Nat. Rev. Mol. Cell Biol. 16, 665–677 (2015).
Tan, K., Stupack, D. G. & Wilkinson, M. F. Nonsense-mediated RNA decay: an emerging modulator of malignancy. Nat. Rev. Cancer 22, 437–451 (2022).
Lindeboom, R. G., Supek, F. & Lehner, B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, 1112–1118 (2016).
Lindeboom, R. G. H., Vermeulen, M., Lehner, B. & Supek, F. The impact of nonsense-mediated mRNA decay on genetic disease, gene editing and cancer immunotherapy. Nat. Genet. 51, 1645–1651 (2019).
Rivas, M. A. et al. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).
Teran, N. A. et al. Nonsense-mediated decay is highly stable across individuals and tissues. Am. J. Hum. Genet. 108, 1401–1408 (2021).
McKean, D. M. et al. Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nat. Commun. 7, 12824 (2016).
Tournier, I. et al. Analysis of the allele-specific expression of the mismatch repair gene MLH1 using a simple DHPLC-based method. Hum. Mutat. 23, 379–384 (2004).
Karam, R. et al. The NMD mRNA surveillance pathway downregulates aberrant E-cadherin transcripts in gastric cancer cells and in CDH1 mutation carriers. Oncogene 27, 4255–4260 (2008).
Pirinen, M. et al. Assessing allele-specific expression across multiple tissues from RNA-seq read data. Bioinformatics 31, 2497–2504 (2015).
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Zeng, T., Spence, J. P., Mostafavi, H. & Pritchard, J. K. Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat. Genet. 1–12 https://doi.org/10.1038/s41588-024-01820-9 (2024).
Hoek, T. A. et al. Single-molecule imaging uncovers rules governing nonsense-mediated mRNA decay. Mol. Cell 75, 324–339.e311 (2019).
Rhee, J. K., Lee, S., Park, W. Y., Kim, Y. H. & Kim, T. M. Allelic imbalance of somatic mutations in cancer genomes and transcriptomes. Sci. Rep. 7, 1653 (2017).
Mendell, J. T., Medghalchi, S. M., Lake, R. G., Noensie, E. N. & Dietz, H. C. Novel Upf2p orthologues suggest a functional link between translation initiation and nonsense surveillance complexes. Mol. Cell. Biol. 20, 8944–8957 (2000).
Bateman, J. F., Freddi, S., Nattrass, G. & Savarirayan, R. Tissue-specific RNA surveillance? Nonsense-mediated mRNA decay causes collagen X haploinsufficiency in Schmid metaphyseal chondrodysplasia cartilage. Hum. Mol. Genet. 12, 217–225 (2003).
Zetoune, A. B. et al. Comparison of nonsense-mediated mRNA decay efficiency in various murine tissues. BMC Genet. 9, 83 (2008).
Huang, L. & Wilkinson, M. F. Regulation of nonsense-mediated mRNA decay. Wiley Interdiscip. Rev. RNA 3, 807–828 (2012).
Dyle, M. C., Kolakada, D., Cortazar, M. A. & Jagannathan, S. How to get away with nonsense: mechanisms and consequences of escape from nonsense-mediated RNA decay. Wiley Interdiscip. Rev. RNA 11, e1560 (2020).
Albertson, D. G. Gene amplification in cancer. TRENDS Genet. 22, 447–455 (2006).
Bozic, I. et al. Accumulation of driver and passenger mutations during tumor progression. Proc. Natl Acad. Sci. USA 107, 18545–18550 (2010).
Karolchik, D. et al. The UCSC table browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).
Friedel, C. C., Dolken, L., Ruzsics, Z., Koszinowski, U. H. & Zimmer, R. Conserved principles of mammalian transcriptional regulation revealed by RNA half-life. Nucleic Acids Res. 37, e115 (2009).
Supek, F., Lehner, B. & Lindeboom, R. G. H. To NMD or not to NMD: nonsense-mediated mRNA decay in cancer and other genetic diseases. Trends Genet. 37, 657–668 (2021).
Kang, H. & Kim, Y.-G. NMDeff v1.0 https://doi.org/10.1016/j.tig.2020.11.002 (2024).
Acknowledgements
The results published here are in part based on data generated by TCGA Research Network: https://www.cancer.gov/tcga. This study was supported by a Medical Scientist Training Program from the Ministry of Science & ICT of Korea. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No.1711180477, 2022R1A2C2091571). This work has also been supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No.2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)).
Author information
Authors and Affiliations
Contributions
Y-g.K. conceived the study and wrote the manuscript. H.K. and B.L. performed the analyses and participated in writing the manuscript. H-J.J., J-h.P., and C.H. performed data extraction, filtering, and annotation. H.P. and J.-W.K. supervised the study. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Stephen Montgomery and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Laura Rodríguez Perez and Kaliya Georgieva. [A peer review file is available.]
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, Yg., Kang, H., Lee, B. et al. A spectrum of nonsense-mediated mRNA decay efficiency along the degree of mutational constraint. Commun Biol 7, 1461 (2024). https://doi.org/10.1038/s42003-024-07136-y
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s42003-024-07136-y