+
ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article
Revised

DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics

[version 2; peer review: 2 approved]
PUBLISHED 06 Dec 2016
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the RPackage gateway.

This article is included in the Bioconductor gateway.

Abstract

There are many instances in genomics data analyses where measurements are made on a multivariate response. For example, alternative splicing can lead to multiple expressed isoforms from the same primary transcript. There are situations where differences (e.g. between normal and disease state) in the relative ratio of expressed isoforms may have significant phenotypic consequences or lead to prognostic capabilities. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL) will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) has provided an attractive toolbox to carefully unravel alternative splicing outcomes and recently, fast and accurate methods for transcript quantification have become available. We propose a statistical framework based on the Dirichlet-multinomial distribution that can discover changes in isoform usage between conditions and SNPs that affect relative expression of transcripts using these quantifications. The Dirichlet-multinomial model naturally accounts for the differential gene expression without losing information about overall gene abundance and by joint modeling of isoform expression, it has the capability to account for their correlated nature. The main challenge in this approach is to get robust estimates of model parameters with limited numbers of replicates. We approach this by sharing information and show that our method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq or where beta-binomial models have been applied (e.g., differential DNA methylation). Our method is available as a Bioconductor R package called DRIMSeq.

Keywords

DRIMSeq, genomics, single nucleotide polymorphism, RNA-seq, splicing, statistical framework

Revised Amendments from Version 1

In version 2 of the manuscript, we have reworded sections in the Introduction to clarify the scope of existing methods, with respect to the term 'differential splicing'. We have added additional analyses for differential splicing analyses, to better understand how the null P-value distributions compare across different simulation scenarios and dispersion estimators. For the detected tuQTLs, we added an analysis with respect to enrichment of splicing-related features.

See the authors' detailed response to the review by Alejandro Reyes
See the authors' detailed response to the review by Robert Castelo

Introduction

With the development of digital high-throughput sequencing technologies, the analysis of count data in genomics has become an important theme motivating the investigation of new, more powerful and robust approaches that handle complex overdispersion patterns while accommodating the typical small numbers of experimental units.

The basic distribution for modeling univariate count responses is the Poisson distribution, which also approximates the binomial distribution. One important limitation of the Poisson distribution is that the mean is equal to the variance, which is not sufficient for modeling, for example, gene expression from RNA sequencing (RNA-seq) data where the variance is higher than the mean due to technical sources and biological variability15. A natural extension of the Poisson distribution that accounts for overdispersion is the negative-binomial distribution, which has been extensively studied in the small-sample situation and has become an essential tool in genomics applications13.

Analogously, the fundamental distribution for modeling multivariate count data is the multinomial distribution, which models proportions across multiple features. To account for overdispersion, the multinomial can be extended to the Dirichlet-multinomial (DM) distribution6. Because of its flexibility, the DM distribution has found applications in forensic genetics7, microbiome data analysis8, the analysis of single-cell data9 and for identifying nucleosome positions10. Another extension of the multinomial is the Dirichlet negative multinomial distribution11, which allows modeling of correlated count data and was applied in the analysis of clinical trial recruitment12. Notably, the beta-binomial distribution, such as those used in differential methylation from bisulphite sequencing data1315, represent a special case of the DM.

Genes may express diverse transcript isoforms (mRNA variants) as a consequence of alternative splicing or due to the differences in transcription start sites and polyadenylation sites16. Hence, gene expression can be viewed as a multivariate expression of transcripts or exons and such a representation allows the study of not only the overall gene expression, but also the expressed variant composition. Differences in the relative expression of isoforms can have significant phenotypic consequences and aberrations are associated with disease17,18. Thus, biologists are interested in using RNA-seq data to discover differences in transcript usage between conditions or to study the specific molecular mechanisms that mediate these changes, for example, alternative splice site usage. In general terms, we collect all these together under the term “differential splicing” (DS)19.

Alternative splicing is a process regulated by complex protein-RNA interactions that can be altered by genetic variation. Knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, known as splicing quantitative trait loci (sQTL), can help to characterize this layer of regulation.

In this article, we propose the DM distribution to model relative usage of isoforms. The DM model treats transcript expression as a multivariate response and allows for flexible small-sample estimation of overdispersion. We address the challenge of obtaining robust estimates of the model parameters, especially dispersion, when only a small number of replicates is available by applying an empirical Bayes approach to share information, similar to those proven successful in negative-binomial frameworks1,20. In particular, weighted likelihood is used to moderate the gene-wise dispersion toward a common or trended value.

The Dirichlet-multinomial framework, implemented as a Bioconductor R package called DRIMSeq, is applicable to both differential transcript usage (DTU) analysis between conditions and transcript usage quantitative trait loci (tuQTL) analysis. It has been evaluated and compared to the current best methods in extensive simulations and in real RNA-seq data analysis using transcript and exon counts, highlighting that DRIMSeq performs best with transcript counts. Furthermore, the framework can be applied to other types of emerging multivariate genomic data, such as PolyA-seq where the collection of polyadenylated sites for a given gene are measured21 and to settings where the beta-binomial is already applied (e.g., differential methylation, allele-specific differential gene expression).

Approaches to DS and sQTL analyses

RNA-seq has provided an attractive toolbox to unravel alternative splicing outcomes. There are various methods designed explicitly to detect DS based on samples from different experimental conditions19,22,23. Independently, a set of methods was developed for detecting genetic variation associated with changes in splicing (sQTLs). While sQTL detection represents a different application, it is essentially DS between groups defined by genotypes. In the following overview, we do not distinguish between applications but rather between the general concepts used to detect differences in splicing.

DS can be studied in three main ways: as differential transcript usage (DTU) or, in a more local context, as differential exon or exon junction usage (DEU) or as specific splicing events (e.g., exon skipping), and all have their advantages and disadvantages. A survey of the main methods can be found in Table S1 (Supplementary File). From the quantification perspective, exon-level abundance estimation is straightforward since it is based on counting read-region overlaps (e.g., featureCounts24). Exons from different isoforms may have different boundaries, thus the authors of DEXSeq25 quantify with HTSeq26 non-overlapping windows defined by projecting all exons to the linear genome. However, this strategy does not utilize the full information from junction reads. Such reads are counted multiple times (in all exons that they overlap with), artificially increasing the total number of counts per gene and ignoring that junction reads support the isoforms that explicitly contain the combinations of exons spanned by these reads. This issue is captured in Altrans27, which quantifies exon-links (exon junctions) or in MISO28, rMATS29, SUPPA30 and SGSeq31, all of which calculate splicing event inclusion levels expressed as percentage spliced in (PSI). Such events capture not only cassette exons but also alternative 3’ and 5’ splice sites, mutually exclusive exons or intron retention. GLiMMPS32 and Jia et al.33, with quantification from PennSeq34, use event inclusion levels for detecting SNPs that are associated with differential splicing. However, there are (hypothetical) instances where changes in splicing pattern may not be captured by exon-level quantifications (Figure 1A in the paper by Monlog et al.35). Furthermore, detection of more complex transcript variations remains a challenge for exon junction or PSI methods (see Figure S5 in the paper by Ongen et al.27). Soneson et al.23 considered counting which accommodates various types of local splicing events, such as exon paths traced out by paired reads, junction counts or events that correspond to combinations of isoforms; in general, the default exon-based counting resulted in strongest performance for DS gene detection.

f017ffc9-0878-46c9-9886-361ee2e6db75_figure1.gif

Figure 1. Results of two-group (3 versus 3 samples) DS analyses on data simulated from the DM null model.

In the first scenario, all genes have the same (common) dispersion, and in the second one, each gene has a different (genewise) dispersion. All genes have expression equal to 1000 and 3 or 10 features with the same proportions estimated from kallisto counts from Kim et al. data set. For each of the scenarios, common, genewise, with and without moderation to common dispersion is estimated with maximum likelihood using the dirmult R package, the raw profile likelihood and the Cox-Reid APL. A: Absolute error of concentration γ+ estimates. B: False positive (FP) rate for the p-value threshold of 0.05 of the null two-group comparisons based on the likelihood ratio statistics. Dashed line indicates the 0.05 level. C: Median raw error of γ+ estimates. D: Distributions of p-values of the null two-group comparisons based on the likelihood ratio statistics. Additionally, results when true concentration estimates are used are indicated with the gray color.

The above methods allow for detection of differential usage of local splicing features, which can serve as an indicator of differential transcript usage but often without knowing specifically which isoforms are differentially regulated. This can be a disadvantage in cases where knowing the isoform ratio changes is important, since isoforms are the ultimate determinants of proteins. Moreover, exons are not independent transcriptional units but building blocks of transcripts. Thus, the main alternative is to make a calculation of DS using isoform-level quantitations. A vast number of methods is available for gene isoform quantification, such as MISO28, BitSeq36, casper37, Cufflinks38, RSEM39, FlipFlop40 and more recent, extremely fast pseudoalignment-based methods, such as Sailfish41, kallisto42 and Salmon43. Additionally, Cufflinks, casper and FlipFlop allow for de novo transcriptome assembly. Recently, performance of various methods was extensively studied44,45, including a webtool45 to allow further comparisons. Regardless of this progress, it remains a complex undertaking to quantify isoform expression from short cDNA fragments since there is a high degree of overlap between transcripts in complex genes; this is a limitation of the technology, not the algorithms. In the case of incomplete transcript annotation, local approaches may be more robust and can detect differential changes due to transcripts that are not in the catalog23,27. Nevertheless, DS at the resolution of isoforms is the ultimate goal within the DRIMSeq framework, and with the emergence of longer reads (fragments), transcript quantifications will become more accurate and methods for multivariate transcript abundances will be needed.

Whether the differential analysis is done at the transcript or local level, modeling and testing independently each transcript46,47 or exon ratio48 ignores the correlated structure of these quantities (e.g., proportions must sum to 1). Similarly, separate modeling and testing of exon junctions (Altrans27) or splicing events (rMATS29, GLiMMPS32, Jia et al.33, Montgomery et al.49) of a gene leads to non-independent statistical tests, although the full effect of this on calibration (e.g., controlling the rate of false discoveries) is not known. Nevertheless, with the larger number of tests, the multiple testing correction becomes more extreme. In sQTL analyses, this burden is even larger since there are many SNPs tested for each gene. There, the issue of multiple comparisons is usually accounted for by applying a permutation scheme in combination with the false discovery rate (FDR) estimation27,32,35,46,4850.

DEXSeq and voom-diffSplice4,5 undertake another approach, where the modeling is done per gene. DEXSeq fits a generalized linear model (GLM), assuming that (exonic) read counts follow the negative-binomial distribution. A bin is deemed differentially used when its corresponding group-bin interaction is significantly different. The exact details of voom-diffSplice are not published. Nevertheless, exons are again treated as independent in the gene-level model.

In contrast, MISO28, Cuffdiff38,51 and sQTLseekeR35 model alternative splicing as a multivariate response. MISO is designed for DS analyses only between two samples and does not handle replicates. Variability among replicates is captured within Cuffdiff via the Jensen-Shannon divergence metric on probability distributions of isoform proportions as a measure of changes in isoform relative abundances between samples. sQTLseekeR tests for the association between genotype and transcript composition, using an approach similar to a multivariate analysis of variance (MANOVA) without assuming any probabilistic distribution and Hellinger distance as a dissimilarity measure between transcript ratios. Very recently, LeafCutter52 gives intron usage quantifications that can be used for both DS analyses (also using the DM model) and sQTL analyses via a correlation-based approach with FastQTL50.

sQTLseekeR, Altrans, LeafCutter and other earlier methods for the sQTL analysis35,4648 employ feature ratios to account for the overall gene expression. A potential drawback of this approach is that feature ratios do not take into account whether they are based on high or low expression, while the latter have more uncertainty in them. DRIMSeq naturally builds this in via the multinomial model.

Dirichlet-multinomial model for relative transcript usage

In the application of the DM model to DS, we refer to features of a gene. These features can be transcripts, exons, exonic bins or other multivariate measurable units, which for DS, contain information about isoform usage and can be quantified with (estimated) counts.

Assume that a gene has q features with relative expression defined by a vector of proportions π = (π1,…,πq), and the feature counts Y = (Y1, …, Yq) are random variables. Let y = (y1, …, yq) be the observed counts and m=j=1qyj. Here, m is treated as an ancillary statistic since it depends on the sequencing depth and gene expression, but not on the model parameters. The simplest way to model feature counts is with the multinomial distribution with probability function defined as:

fM(y ;π)=(my)j=1qπjyj,(1)
where the mean and the covariance matrix of Y are 𝔼(Y) = mπ and 𝕍(Y) = diag(π) –ππT, respectively.

To account for overdispersion due to true biological variation between experimental units as well as technical variation, such as library preparation and errors in transcript quantification, we assume the feature proportions, Π, follow the (conjugate) Dirichlet distribution, with density function:

fD(π;γ)=Γ(γ+)j=1qΓ(γj)j=1qπjγj1,(2)
where γj, j = 1, …, q are the Dirichlet parameters and γ+=j=1qγj. The mean and covariance matrix of random proportions Π are 𝔼(Π) = γ/γ+ = π and 𝕍()={γ+diag(γ)γγT}/{γ+2(γ++1)}, respectively. We can see that proportions Π are proportional to γ and their variance is inversely proportional to γ+, which is called the concentration or precision parameter. As γ+ gets larger, the proportions are more concentrated around their means.

We can derive the marginal distribution of Y by multiplying densities (1) and (2) and integrating over π. Then, feature counts Y follow the DM distribution6 with probability function defined as:

fDM(y;γ)=πfM(y;π)fD(π;γ)dπ=(my)Γ(γ+)Γ(m+γ+)j=1qΓ(yj+γj)Γ(γj).(3)

The mean of Y is unchanged at 𝔼(Y) = 𝔼{𝔼(Y|Π)} = 𝔼(mΠ) = mγ/γ+ = mπ, while the covariance matrix of Y is given by 𝕍(Y) = cm{diag(π) − ππT}, where c = (m+γ+)/(1+γ+) is an additional factor when representing the Dirichlet-multinomial covariance to the ordinary multinomial covariance. c depends on concentration parameter γ+ which controls the degree of overdispersion and is inversely proportional to variance of Y.

We can represent the DM distribution using an alternative parameterization: π = γ/γ+ and θ = 1/(1 + γ+); then, the covariance of Y can be represented as 𝕍(Y) = n{diag(π) − ππT} {1 + θ(n − 1)}, where θ can be interpreted as a dispersion parameter. When θ grows (γ+ gets smaller), the variance becomes larger. From the knowledge of the gamma function, xΓ(x) = Γ(x + 1), we can write Γ(α+x)Γ(α)=r=1x{α+(r1)}. Hence, the DM density function becomes:

fDM(y;π,θ)=(my)j=1qr=1yj{πj(1θ)+(r1)θ}r=1m{1θ+(r1)θ},(4)
such that for θ = 0, DM reduces to multinomial.

Detecting DTU and tuQTLs with the Dirichlet-multinomial model

Within DRIMSeq, the DM method can be used to detect the differential usage of gene features between two or more conditions. For simplicity, suppose that features of a gene are transcripts and the comparison is done between two groups. The aim is to determine whether transcript ratios of a gene are different in these two conditions. Formally, we want to test the hypothesis H0 : π1 = π2 against the alternative H1 : π1π2. For the convenience of parameter estimation, we decide to use the DM parameterization with precision parameter γ+, which can take any non-negative value, instead of dispersion parameter θ, which is bounded to values between 0 and 1. Because our goal is to compare the proportions from two groups, γ+ is a nuisance parameter that gets estimated in the first step (see the following Section). Let l(π1, π2, γ+) be the joint log-likelihood function. Assuming γ+=γ^+, the maximum likelihood (ML) estimates of π1, π2 are the solution of dld(π1,π2)(π1,π2,γ+=γ^+)=0. Under the hypothesis H1 : π1 = π2 = π, the ML estimate of π is the solution of dldπ(π1=π,π2=π,γ+=γ^+)=0. We test the null hypothesis using a likelihood ratio statistic of the form

D=2l(π1=π^1,π2=π^2,γ+=γ^+)2l(π1=π^,π2=π^,γ+=γ^+),(5)
which asymptotically follows the chi-squared distribution χq12 with q − 1 degrees of freedom. In comparisons across c groups, the number of degrees of freedom is (c − 1) × (q − 1). After all genes are tested, p-values can be adjusted for multiple comparisons with the Benjamini-Hochberg method.

In a DTU analysis, groups are defined by the design of an experiment and are the same for each gene. In tuQTL analyses, the aim is to find nearby (bi-allelic) SNPs associated with transcript usage of a gene. Model fitting and testing is performed for each gene-SNP pair, and grouping of samples is defined by the genotype, typically translated into the number of minor alleles (0, 1 or 2). Thus, tuQTL analyses are similar to DTU analyses with the difference that multiple models are fitted and tested for each gene. Additional challenges to be handled in tuQTL analyses include a large number of tests per gene with highly variable allele frequencies (models) and linkage disequilibrium, which can be accounted for in the multiple testing corrections. As in other sQTL studies35,49,50, we apply a permutation approach to empirically assess the null distribution of associations and use it for the adjustment of nominal p-values (see Supplementary Note 2 in Supplementary File). For computational efficiency, SNPs within a given gene that exhibit the same genotypes are grouped into blocks. In this way, blocks define unique models to be fit, reducing computation and the degree of multiple testing correction.

Dispersion estimation with adjusted profile likelihood and moderation

Accurate parameter estimation is a challenge when only a small number of replicates is available. Following the edgeR strategy1,2,53, we propose multiple approaches for dispersion estimation, all based on the maximization and adjustment of the profile likelihood, since standard maximum likelihood (ML) is known to produce biased estimates as it tends to underestimate variance parameters by not allowing for the fact that other unknown parameters are estimated from the same data54,55.

In the DM model parameterization of our choice, we are interested in estimating the precision (concentration) parameter, γ+ (inverse proportional to dispersion θ). Hence, at this stage, proportions π1 and π2 can be considered nuisance parameters and the profile log-likelihood (PL) for γ+ can be constructed by maximizing the log-likelihood function with respect to proportions π1 and π2 for fixed γ+:

PL(γ+;π^1,π^2,y)=maxπ1,π2l(π1,π2,γ+,y).(6)

The profile likelihood is then treated as an ordinary likelihood function for estimation and inference about parameters of interest. Unfortunately, with large numbers of nuisance parameters, this approach can produce inefficient or even inconsistent estimates54,55. To correct for that, one can apply an adjustment proposed by Cox and Reid56 and obtain an adjusted profile likelihood (APL):

APL(γ+;π^1,π^2,y)=PL(γ+;π^1,π^2,y)12log(detmI),(7)
where det denotes determinant and I is the observed information matrix for π1 and π2. The interpretation of the correction term in APL is that it penalizes values of γ+ for which the information about π1 and π2 is relatively large. When data consists of many samples, one can use gene-wise dispersion estimates, i.e., the dispersion is estimated for each gene g = 1,…,G separately:

argmax{APLg(γ+g)}=argmax{AP L(γ+g;π^1g,π^2g,yg)}.(8)

These estimates become more unstable as the sample size decreases. At the other extreme, one can assume a common dispersion for all genes and use all genes to estimate it:

argmax{1Gg=1GAP Lg(γ+g)}.(9)

Common dispersion estimates are more stable but the assumption of a single dispersion for all genes is rather strong, given that some genes are under tighter regulation than others57,58. Thus, moderated dispersion is a trade-off between gene-wise and common dispersion and estimates are calculated with an empirical Bayes approach, which uses a weighted combination of the common and individual likelihood:

argmax{AP Lg(γ+g)+W.1Gg=1GAPLg(γ+g)}.(10)

If a dispersion-mean trend is present (see Figure S16, Figure S17, Figure S28 and Figure S29 in Supplementary File), as commonly observed in gene-level differential expression analyses1,3, one can apply shrinkage towards this trend instead of to the common dispersion:

argmax{APLg(γ+g)+W.1|C|gCAPLg(γ+g)},(11)

where C is a set of genes that have similar gene expression as gene g and W is a weight defining the strength of moderation (see Supplementary Note 1 for further details).

Estimation and inference: simulations from the Dirichlet-multinomial model

We first investigated the performance of the DM model and the approach for parameter estimation and inference in the case where only few replicates are available. We performed simulations that correspond to a two-group comparison with no DTU (i.e. null model) where feature counts were generated from the DM distribution with identical parameters in both groups. Simulations were repeated 50 times for 1000 genes. In these simulations, we can vary the overall expression (m), number of features (q), proportions (prop) and sample size in one condition (n). Proportions follow a uniform or decaying distribution or are estimated based on kallisto transcripts or HTSeq exon counts from Kim et al. and Brooks et al. data (more details on these datasets below). In the first case, all genes have the same (common) dispersion, and in the second one, each gene has different (genewise) dispersion. Simulations for evaluating the dispersion moderation are intended to better resemble a real dataset. For these instances (repeated 25 times for 5000 genes), genes have expression, dispersion and proportions that were estimated from the real data. See Supplementary Note 3 for the additional details.

Figure 1A and Figure S1 confirm that using the Cox-Reid adjustment (CR) improves the estimation (in terms of median absolute error and extreme error values) of the concentration parameter γ+ in comparison to raw profile likelihood (PL) estimates. Additionally, the median error of concentration estimates for Cox-Reid APL is always lower than for PL or maximum likelihood (ML) used in the dirmult package7 (Figure 1C, Figure S2). This translates directly into the inference performance where the CR approach leads to lower false positive (FP) rate than other approaches (Figure 1B, Figure S3).

Accurate estimates of dispersion do not always lead to expected control of FP rate. Notably, using the true concentration parameters in genes with many features (with decaying proportions) results in higher than expected nominal FP rates (Figure 1B, Figure S3 and Figure S6A). Meanwhile, for genes with uniform proportions, even with many features, the FP rate for true dispersion is controlled (Figure S3 and Figure S6B). Also, the Cox-Reid adjustment tends to underestimate the concentration (overestimate dispersion) for genes with many features and decaying proportions, especially for very small sample size (Figure 1C, Figure S2, Figure S5A, Figure S5E), which leads to accurate FP rate control not achieved even with the true dispersion (Figure S6A).

As expected, common dispersion estimation is effective when all genes indeed have the same dispersion, though this cannot be generally assumed in most real RNA-seq datasets (see results of simulations in the following section). In contrast, pure gene-wise estimates of dispersion lead to relatively high estimation error in small sample sizes (Figure 1A, Figure S1 and Figure S8). Thus, sharing information about concentration (dispersion) between genes by moderating the gene-wise APL is applied. This improves concentration estimation in terms of median error (Figure 1C and Figure S8) and by shrinking extremely large values (on the boundary of the parameter space, see Figure S7) toward common or trended concentration. Therefore, moderated gene-wise estimates lead to better control of the nominal FP rate (Figure 1B and Figure S10).

In these simulations, the overall best performance of the DM model is achieved when dispersion parameters are estimated with the Cox-Reid APL and the dispersion moderation is applied. This strategy leads to p-value distributions that in most of the cases are closer to the uniform distribution (Figure 1D, Figure S4 and Figure S11).

Comparison on simulations that mimic real RNA-seq data

Next, we use the simulated data from Soneson et al.23, where RNA-seq reads were generated such that 1000 genes had isoform switches between two conditions of the two most abundant transcripts. For each condition three replicates were simulated resulting in 3 versus 3 comparisons. Altogether, we summarize results for three scenarios: i) Drosophila melanogaster with no differential gene expression; ii) Homo sapiens without differential gene expression; iii) Homo sapiens with differential gene expression.

The aim of these analyses is to compare the performance of DRIMSeq against DEXSeq, which emerged among the top performing methods for detection of DTU from RNA-seq data23. For DRIMSeq, we consider different dispersion estimates: common, gene-wise with no moderation and with moderation-to-common and to-trended dispersion. We use the exonic bin counts provided by HTSeq (same input to the DEXSeq pipeline), and transcript counts obtained with kallisto. Additionally, we use HTSeq and kallisto counts that are re-estimated after the removal of lowly expressed transcripts (less than 5% in all samples) from the gene annotation (pre-filtering) as proposed by Soneson et al.23 and kallisto filtered counts that exclude the lowly expressed transcripts (also less than 5% in all samples). DRIMSeq returns a p-value per gene. To make results comparable, we used the module within DEXSeq that summarizes exon-level p-values to a gene-level adjusted p-value.

As expected, common dispersion estimates lead to worse performance (lower power and higher FDR) compared to gene-wise dispersions. DRIMSeq achieves the best performance with moderated gene-wise dispersion estimates, while the difference in performance between moderating to common or to trended dispersion is quite small, with moderated-to-trend dispersion estimates being slightly more conservative (Figure 2 and Figure S15).

f017ffc9-0878-46c9-9886-361ee2e6db75_figure2.gif

Figure 2. True positive rate (TPR) versus achieved false discovery rate (FDR) for three FDR thresholds (0.01, 0.05 and 0.1) obtained by DEXSeq and DRIMSeq.

DRIMSeq was run with different dispersion estimation strategies: common dispersion and genewise dispersion with no moderation (genewise_grid_none), moderation to common dispersion (genewise_grid_common) and moderation to trended dispersion (genewise_grid_trended). Results presented for Drosophila melanogaster and Homo sapiens simulations with DTU (nonull) and no differential gene expression (node) using transcript counts from kallisto and exonic counts from HTSeq. Additionally, filtered counts (kallistofiltered5, htseqprefiltered5) are used. When the achieved FDR is smaller than the threshold, circles are filled with the corresponding color, otherwise, they are white.

As noted by Soneson et al.23, detecting DTU in human is harder than in fruit fly due to the more complex transcriptome of the first one; all methods have much smaller false discovery rate (FDR). Nevertheless, none of the methods manages to control the FDR at a given threshold in either of the simulations.

Annotation pre-filtering, suggested as a solution to mitigate high FDRs23, affects DEXSeq and DRIMSeq in a different way. For DEXSeq, it strongly reduces the FDR. For DRIMSeq, it increases power without a strong reduction of FDR. Moreover, the results for kallisto filtered and pre-filtered are almost identical (Figure S15 and Figure S24), which means that the re-estimation step based on the reduced annotation is not necessary for kallisto when used with DRIMSeq or DEXSeq. Additionally, we have considered how other filtering approaches affect DTU detection.

From Figure S24, we can see that DS analysis based on transcript counts are more robust to different variations of filtering and indeed some filtering improves the inference. For exonic counts, filtering should be less stringent and the pre-filtering approach is the best performing strategy.

DRIMSeq performs well when coupled with transcript counts from kallisto. In the case when no filtering is applied to the data, it outperforms DEXSeq. When transcript counts are pre-filtered, both methods have very similar performance (Figure S15). For both differential engines, the performance decreases substantially with increasing number of transcripts per gene, with DRIMSeq having slightly more power when genes have only a few transcripts (Figure S17). DRIMSeq has poor performance for the exonic counts in the human simulation, where achieved FDRs of more than 50% are observed for an expected 5%; consequently, we recommend the use of DRIMSeq on transcript counts only. On the other hand, the concordance of DRIMSeq and DEXSeq top-ranked genes is quite high and similar even for exonic counts (Figure S16).

The p-value distributions highlight a better fit of the DM model to transcript counts compared to exonic counts (it is more uniform with a sharp peak close to zero). Similarly, dispersion estimation gives better results for transcript counts (Figure S19 and Figure S20). In particular, for exonic counts, a large number of genes have concentration parameter estimates at the boundary of the parameter space, unlike the situation for transcript counts (Figure S19 and Figure S20).

DS analyses on real datasets

To compare the methods further, we consider two public RNA-seq data sets. The first is the pasilla dataset59 (Brooks et al.). The aim was to identify genes regulated by pasilla, the Drosophila ortholog of mammalian splicing factors NOVA1 and NOVA2. In this experiment, libraries were prepared from seven biologically independent samples: four control samples and three samples in which pasilla was knocked down. Libraries were sequenced using a mixture of single-end and paired-end reads as well as different read lengths. The second data set is from matched human lung normal and adenocarcinoma samples from six Korean female nonsmoking patients60, using paired-end reads (Kim et al.).

Both datasets have a more complex design than those used in the simulations; in addition to the grouping variable of interest, there are additional covariates to adjust for (e.g., library layout for the fruit fly data, patient identifier for the paired human study). In order to account for such effects, one should rather use a regression approach, which currently is not supported by DRIMSeq, but can be applied within DEXSeq’s GLM framework. To make the comparison fair, we fit multiple models. For the pasilla dataset, we compare four control samples versus three pasilla knock-down samples without taking into account the library layout (model full) as well as compare only the paired-end samples, which removes the covariate. To not diminish DEXSeq for its ability to fit more complex models, we run it using a model that does the four control versus three knock-down comparison with library layout as an additional covariate (model full 2). For the adenocarcinoma data, we do a two-group comparison of six normal versus six cancer samples (model full) and for DEXSeq, we fit an extra model that takes into account patient effects (model full 2). Additionally, we do so-called "mock" analyses where samples from the same condition are compared (model null), and the expectation is to detect no DS since it is a within-condition comparison (see Supplementary Note 5 for the exact definition of these null models).

In the full comparisons with transcript counts, DRIMSeq calls similar or fewer DS genes than DEXSeq, and a majority of them are contained within the DEXSeq calls (Figure S26, Figure S27) showing high concordance between DRIMSeq and DEXSeq and slightly more conservative nature of DRIMSeq. Accounting for covariates in DEXSeq (model full 2) or performing the analysis on a subgroup without covariates (model full paired) results in more DS genes detected (Figure S28, Figure S29 and Figure S30).

In the "mock" analyses, as expected, both methods detect considerably fewer DS genes, except in two cases. First, for the pasilla data (model null 3), where the two versus two control samples were from single-end library in one group and from paired-end library in the second group, leading to a comparison between batches in which both of the methods found more DS genes than in the comparison of control versus knock-down showing that the "batch" effect is very strong. Second, in the adenocarcinoma data (model null normal 1), where the two groups of individuals (each consisting of three women) happened to be very distinct (Figure S25). Therefore, we do not include these two cases when referring to the null models.

Overall, in the full comparisons, there are more DS genes detected based on differential transcript usage than differential exon usage (Figure S26). For DEXSeq, this is also the case in the null comparisons, which shows that DEXSeq works better with exonic counts than with transcript counts. DRIMSeq, on the other hand, has better performance on transcript counts, for which it calls less DS genes in the null analysis than with exon counts. In particular, the p-value distributions under the null indicate that DM fits better to transcript counts than exon counts (Figure S14, Figure S31 and Figure S32).

Method comparisons based on real data are very challenging as the truth is simply not known. In this sense the pasilla data is very precious, as the authors of this study have validated alternative usage of exons in 16 genes using RT-PCR. Of course, these validations represent an incomplete truth, and ideally, large-scale independent validation would be needed to comprehensively compare the DTU detection methods. In Figure 3, Figure S33, Figure S34 and Figure S35 again it is shown that DRIMSeq is slightly more conservative than DEXSeq. DRIMSeq performs poorly on exon-level but returns strong performance on transcript-level quantification (e.g., kallisto) and even outperforms DEXSeq when the sample size is very small (model full paired).

f017ffc9-0878-46c9-9886-361ee2e6db75_figure3.gif

Figure 3. Results of DS analysis on the pasilla dataset showing how many of the 16 validated genes are called by DRIMSeq and DEXSeq using different counting strategies and different models.

On each curve, "X" indicates the number of DS genes detected for the FDR of 0.05. Model full - comparison of 4 control samples versus 3 knock-down. Model full paired - comparison of 2 versus 2 paired-end samples. Model full 2 - as model full but including the information about library layout (no results for DRIMSeq because currently, it is not able to fit models with multiple covariates).

tuQTL analyses

To demonstrate the application of DRIMSeq to tuQTL analysis, we use the data from the GEUVADIS project46 where 465 RNA-seq samples from lymphoblastoid cell lines were sequenced, 422 of which were sequenced in the 1000 Genomes Project Phase 1. Here, we present the analysis of 91 samples corresponding to the CEU population and 89 samples from the YRI population. Expected transcript counts (obtained with Flux Capacitor) and genotype data were downloaded from the GEUVADIS project website. We choose to compare the performance of DRIMSeq with sQTLseekeR, because it is a very recent tool that performs well35, can be directly applied to transcript count data and models transcript usage as a multivariate outcome.

For both of the methods, we investigate only the bi-allelic SNPs with a minor allele present in at least five samples (minor allele frequency approximately equal to 5%) and at least two alleles present in a population. Given a gene, we keep the SNPs that are located within 5 Kb upstream or downstream of the gene. We use the same pre-filtered counts in DRIMSeq and sQTLseekeR to have the same baseline for the comparison of the statistical engines offered by these packages. We keep the protein coding genes that have at least 10 counts in 70 or more samples and at least two transcripts left after the transcript filtering, which keeps the one that has at least 10 counts and proportion of at least 5% in 5 or more samples. The numbers of tested and associated genes and tuQTLs are indicated in Figure 4, Figure S38 and Figure S39.

f017ffc9-0878-46c9-9886-361ee2e6db75_figure4.gif

Figure 4. Results of the tuQTL analysis on the CEU population from the GEUVADIS data.

A: Concordance between sQTLseekeR and DRIMSeq. "X" indicates number of tuQTLs for FDR = 0.05. Panel B, C and D show characteristics of tuQTLs and genes detected by sQTLseekeR or DRIMSeq for FDR = 0.05. Values in brackets indicate numbers of tuQTLs or genes in a given set. Dark gray line corresponds to tuQTLs or genes that were identified by both of the methods (overlap). B: Distance to the closest exon of intronic tuQTLs. The light gray line (non_sqtl) corresponds to intronic tuQTLs that were not called by any of the methods. C: Distribution of mean gene expression for genes that are associated with tuQTLs. D: Distribution of the number of expressed transcripts for genes that are associated with tuQTLs. The light gray lines (all_genes) represent corresponding features for all the analyzed genes.

In Figure 4A and Figure S40 we can see that the concordance between DRIMSeq and sQTLseekeR is quite high and reaches 75%. Nevertheless, there is considerable difference between the number and type of genes that are uniquely identified by each method. sQTLseekeR finds more genes with alternative splicing associated to genetic variation (Figure S38 and Figure S39), but these genes have fewer transcripts expressed and lower overall expression in comparison to genes detected by DRIMSeq (Figure 4C, Figure 4D, Figure S40C and Figure S40D). To further investigate characteristics of detected tuQTLs, we measured enrichment of splicing-related features as used in a previous comparison35. This includes the location of tuQTLs within exons, within splice sites, in the surrounding of GWAS SNPs and distance to the closest exon. tuQTLs detected by DRIMSeq show higher enrichment for all splicing related features (Table 1 and Figure 4B), than sQTLseekeR tuQTLs, suggesting that by accounting for the overall gene expression, one can detect more meaningful associations.

Table 1. Enrichment in splicing related features for tuQTLs detected by DRIMSeq and sQTLseekeR in CEU and YRI populations for FDR = 0.05.

% within
exons
% within
splice sites
% within 1Kb
of a GWAS
CEUYRICEUYRICEUYRI
DRIMSeq26.0935.8919.7621.4212.7515.43
sQTLseekeR20.9525.4313.5217.410.2210.09
Overlap26.8540.5816.1725.3613.4218.14
Non tuQTLs5.255.241.751.531.150.97

Discussion

We have created a statistical framework called DRIMSeq based on the Dirichlet-multinomial distribution to model alternative usage of transcript isoforms from RNA-seq data. We have shown that this framework can be used for detecting differential isoform usage between experimental conditions as well as for identifying tuQTLs. In principle, the framework is suitable for differential analysis of any type of multinomial-like responses. From our simulations and real data analyses towards DS and sQTL analyses, DRIMSeq seems better suited to model transcript counts rather than exonic counts.

Overall, there are many tradeoffs to be made in DS analyses. For example, deriving transcript abundances from RNA-seq data is more difficult (e.g., complicated overlapping genes at medium to low expression levels) than directly counting exon inclusion levels of specific events. On the other hand, local splicing events may be not able to capture biologically interesting splice changes (e.g., switching between two different transcripts) but have ultimately more ability to detect DS in case when the transcript catalog is incomplete. Despite these tradeoffs and given the results observed here, DRIMSeq finds its place as a method to make downstream calculations on transcript quantifications. With emerging technologies that sequence longer DNA fragments (either truly or synthetically), we may see in the near future more direct counting of full-length transcripts, making transcript-level quantification more robust and accurate. Even with current standard RNA-seq data, ultrafast and lightweight methods make transcript counting more accessible and users will want to make comparative analyses directly from these estimates.

In principle, existing DS methods that allow multiple group comparisons could be adapted to the sQTL framework and vice versa; DRIMSeq is one of few tools that bridge these two applications. In particular, parameter estimation with DRIMSeq is suited for a situation where only a few replicates are available per group (common in DS analysis) as well as analyses over larger samples sizes (typical in sQTL analysis). For small sample sizes, accurate dispersion estimation is especially challenging. Thus, we incorporate estimation techniques analogous to those used in negative binomial frameworks, such as Cox-Reid APL; perhaps not surprisingly, raw profile likelihood or standard maximum likelihood approaches do not perform as well in our tests of estimation performance. In addition, as with many successful genomics modeling frameworks, sharing information across genes leads to more stable and accurate estimation and therefore better inference (e.g., better control of nominal FP rates).

In comparison to other available methods, DRIMSeq seems to be more conservative than both DEXSeq (using transcript counts) and sQTLseekeR, identifying fewer DTU genes and tuQTLs, respectively. On the other hand, DEXSeq is known to be somewhat liberal23. Moreover, the sQTL associations detected by DRIMSeq have more enrichment in splicing-related features than sQTLseekeR tuQTLs, which could be due to the fact that DRIMSeq accounts for the higher uncertainty of lowly expressed genes by using transcript counts instead of transcript ratios.

Our developed DM framework is general enough that it can be applied to other genomic data with multivariate count outcomes. For example, PolyA-seq data quantifies the usage of multiple RNA polyadenylation sites. During polyadenylation, poly(A) tails can be added to different sites and thus more than one transcript can be produced from a single gene (alternative polyadenylation); comparisons between groups of replicates can be conducted with DRIMSeq. As mentioned, the DM distribution is a multivariate generalization of the beta-binomial distribution, as the binomial and beta distributions are univariate versions of the multinomial and Dirichlet distributions, respectively. Although untested here, the DRIMSeq framework could be applied to analyses where the beta-binomial distribution are used with the advantage of naturally accommodating small-sample datasets. Interesting beta-binomial-based analyses include differential methylation using bisulphite sequencing data, where counts of methylated and unmethylated cytosines (a bivariate outcome) at specific genomic loci are compared, or allele-specific gene expression, where the expression of two alleles (again, a bivariate outcome) are compared across experimental groups.

One particularly important future enhancement is a regression framework, which would allow direct analysis of more complex experimental designs. For example, covariates such as batch, sample pairing or other factors could be adjusted for in the model. In the tuQTL analysis, it would allow studying samples from the pooled populations, with the subpopulation as a covariate, allowing larger sample sizes and increased power to detect interesting changes. Another potential limitation is that DRIMSeq treats transcript estimates as fixed, even though they have different uncertainty, depending on the read coverage and complexity of the set of transcripts within a gene. Although untested here, propagation of this uncertainty could be achieved by incorporating observational weights that are inversely proportional to estimated uncertainties or, in case of fast quantification methods like kallisto, by making effective use of bootstrap samples. At this stage, there is no consensus on how these approaches will perform and ultimately may require considerable additional computation.

Software availability

The Dirichlet-multinomial framework described in this paper is implemented within an R package called DRIMSeq. In addition to the user friendly workflow for the DTU and tuQTL analyses, it provides plotting functions that generate diagnostic figures such as the dispersion versus mean gene expression figures and histograms of p-values. User can also generate figures of the observed proportions and the DM estimated ratios for the genes of interest to visually investigate their individual splicing patterns.

The release version of DRIMSeq is available on Bioconductor http://bioconductor.org/packages/DRIMSeq, and the latest development version can be found on GitHub https://github.com/markrobinsonuzh/DRIMSeq.

Data availability

Data for simulations that mimic real RNA-seq was obtained from Soneson et al.23, where all the details on data generation and accessibility are available.

Differential splicing analyses were performed on the publicly available pasilla dataset, which was downloaded from the NCBI’s Gene Expression Omnibus (GEO) under the accession number GSE18508, and adenocarcinoma dataset under the accession number GSE37764.

Data for the tuQTL analyses was downloaded from the GEUVADIS project website.

All the details about data availability and preprocessing are described in the Supplementary Materials.

Archived source code as at the time of publication

DRIMSeq analyses for this paper were done with version 0.3.3 available on Zenodo https://zenodo.org/record/5308461 and Bioconductor release 3.2. Source code used for the analyses in this paper is available on Zenodo https://zenodo.org/record/16730562.

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 13 Jun 2016
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Nowicka M and Robinson MD. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics [version 2; peer review: 2 approved]. F1000Research 2016, 5:1356 (https://doi.org/10.12688/f1000research.8900.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 06 Dec 2016
Revised
Views
56
Cite
Reviewer Report 20 Dec 2016
Robert Castelo, Department of Experimental and Health Sciences, Pompeu Fabra University, Barcelona, Spain 
Approved
VIEWS 56
I appreciate that the authors have made an effort to address my comments and I'm particularly happy to see that my suggestion to check the overlap of tuQTLs with splice site binding sites reveals an improved enrichment by DRIMSeq. I ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Castelo R. Reviewer Report For: DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics [version 2; peer review: 2 approved]. F1000Research 2016, 5:1356 (https://doi.org/10.5256/f1000research.11139.r18253)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 13 Jun 2016
Views
108
Cite
Reviewer Report 06 Jul 2016
Robert Castelo, Department of Experimental and Health Sciences, Pompeu Fabra University, Barcelona, Spain 
Approved with Reservations
VIEWS 108
This article introduces a new statistical method, called DRIMSeq and implemented in a R/Bioconductor package of the same name, to detect isoform expression changes between two conditions from RNA-seq data. The same method can be used to search for significant ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Castelo R. Reviewer Report For: DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics [version 2; peer review: 2 approved]. F1000Research 2016, 5:1356 (https://doi.org/10.5256/f1000research.9577.r14580)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 06 Dec 2016
    Mark Robinson, University of Zurich, Switzerland
    06 Dec 2016
    Author Response
    Thank you for taking the time to read and review our paper.

    DEXSeq is a package designed for the differential exon usage (DEU) and returns exon-level p-values, which can ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 06 Dec 2016
    Mark Robinson, University of Zurich, Switzerland
    06 Dec 2016
    Author Response
    Thank you for taking the time to read and review our paper.

    DEXSeq is a package designed for the differential exon usage (DEU) and returns exon-level p-values, which can ... Continue reading
Views
105
Cite
Reviewer Report 24 Jun 2016
Alejandro Reyes, Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany 
Approved
VIEWS 105
Nowicka and Robinson propose a novel method, called DRIMSeq, to test for differential transcript usage between groups of samples using RNA-seq. The method is based on the Dirichlet-multinomial distribution.
The authors evaluate different existing approaches to estimate the parameters ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Reyes A. Reviewer Report For: DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics [version 2; peer review: 2 approved]. F1000Research 2016, 5:1356 (https://doi.org/10.5256/f1000research.9577.r14338)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 06 Dec 2016
    Mark Robinson, University of Zurich, Switzerland
    06 Dec 2016
    Author Response
    Thank you for taking the time to read and review our paper.

    As per your suggestion, we have now stressed that DRIMSeq can be applied to differential transcript usage ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 06 Dec 2016
    Mark Robinson, University of Zurich, Switzerland
    06 Dec 2016
    Author Response
    Thank you for taking the time to read and review our paper.

    As per your suggestion, we have now stressed that DRIMSeq can be applied to differential transcript usage ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 13 Jun 2016
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载