Introduction

Many psychologically interesting research questions involve comparing competing theories: Does sleep deprivation cause attentional lapses? Does alcohol impair the speed of information processing or reduce cautiousness, or both? Does the forgetting curve follow a power or exponential function? In many cases, the competing theories can be represented as a set of quantitative models that are applied (“fitted”) to the observed data. We can then estimate a metric that quantifies the degree to which each model accounts for the patterns observed in data balanced against its flexibility. Model flexibility is often defined as the range of data patterns that a model can predict, which includes patterns that were observed as well as patterns that were not observed. Flexibility is an important consideration in model choice as it assesses the degree to which a theory is suitably constrained to predict the observed data and that it does not simply predict many possible patterns that were not observed; the latter form of model is theoretically non-informative as it “predicts” almost any pattern that could be observed. Many methods exist that attempt to quantitatively account for the flexibility of models, and hence inform model choice, including likelihood ratio tests, various information criteria (e.g., Akaike, Bayesian and Deviance Information Criteria; AIC, BIC, and DIC, respectively), minimum description length, cross validation and others, with varying degrees of success. The emerging gold standard in quantitative psychology is the marginal likelihood; the commonly cited Bayes factor is the ratio of the marginal likelihoods of two models.

The marginal likelihood of a model is the average likelihood of the model given the data, where the average is taken over the prior distribution of the model parameters. By integrating the likelihood across the prior predictive space of a model, the marginal likelihood has the attractive property that it inherently accounts for the flexibility of the model. This removes the need for post-hoc complexity corrections to goodness of fit measures, as implemented in the “information criteria” metrics (e.g., AIC, BIC, DIC), because it provides a guarantee that a model which accounts for the observed patterns in data and few other possible though unobserved patterns will be quantitatively favored over a competing model which accounts for the observed data but also provides the capacity to account for many other patterns that were not observed. This occurs because the latter model would have low likelihood across the (many) regions of the prior space where data were not observed, which lowers the marginal likelihood.

Although theoretically attractive, estimating the marginal likelihood for psychologically interesting models poses a number of practical challenges. The likelihood function is the density function of the data, viewed as a function of the group-level parameters, with the participant-level parameters (also called random effects) integrated out. It is usually analytically or computationally intractable for psychological and statistical models because it is an integral that cannot be evaluated. This can be the case even in conceptually very simple models. For example, in generalized linear mixed models, where random effects (e.g., model parameters for individual participants) account for the dependence between observations from the same individual (Fitzmaurice, Laird, & Ware, 2011), the likelihood is often intractable because it is an integral over the random effects. Recent advances in particle Markov chain Monte Carlo (P-MCMC) methods extend the scope of Bayesian inference to cases where the likelihood is intractable (see Andrieu and Roberts (2009), Andrieu et al. (2010), Chopin et al. (2013), and Tran et al. (2016)). However, some basic problems, such as estimating the marginal likelihood and the standard error of this estimator, remain challenging, because the marginal likelihood is the density of the data with both the group-level parameters and the random effects integrated out. “Marginal likelihood estimation for hierarchical models” discusses some existing approaches for addressing this challenge.

Our article proposes an importance sampling (IS) approach for estimating the marginal likelihood when the likelihood is intractable but can be estimated unbiasedly. The method is implemented by first running a sampling scheme such as MCMC, whose draws are used to form an importance distribution for the fixed parameters. We then estimate the marginal likelihood using an unbiased estimate of the likelihood combined with this importance distribution. We call this approach importance sampling squared \(\left (\text {IS}^{2}\right )\), as it is itself an IS procedure and we can often estimate the likelihood by IS. We claim that it has several advantages over competing methods for estimating the marginal likelihood. The three most important are that: a) it produces an unbiased estimate of the marginal likelihood; b) the method is robust against potential issues with the samples used to form the proposals; and c) it provides a cheap and easily computed estimate of the variability of the marginal likelihood estimator. Unbiasedness means that we can approach the true value, and easily lower the variance of the estimator, by taking more samples. Robustness means that the marginal likelihood estimator is simulation consistent even when the MCMC sampler has not yet converged to the posterior distribution of interest; the MCMC may target a slightly perturbed posterior, or it may target the correct posterior but not yet have converged. The \(\text {IS}^{2}\) method is robust because its importance weights correct for deficient importance samples. Estimating the variability of the estimator directly means that we do not have to replicate the algorithm to obtain reliable estimates of the standard error of the estimator.
These observations in turn suggest that it may be desirable to run the initial MCMC in parallel, without rigorously requiring each chain to converge before forming the proposals for IS. Sections “Marginal likelihood estimation for hierarchical models” and “General discussion and future work”, together with Appendix A, expand on these points.

The rest of the paper is organized as follows. “Marginal likelihood estimation for hierarchical models” introduces the general model under consideration, reviews several competing approaches for estimating the marginal likelihood, and briefly compares them to our approach. “Estimating the marginal likelihood by IS2” gives a detailed description of the IS2 method. “Applying the IS2 method to data” illustrates the IS2 method in two applications, which show that the method gives accurate estimates of the marginal likelihood and its standard error in reasonable computing time. “General discussion and future work” concludes and discusses future work on speeding up the computation. There are five appendices. Appendix A outlines how our results on estimating the marginal likelihood extend to estimating posterior expectations. Appendix B studies the effect on IS of using an estimated likelihood. Appendix C provides further estimation details for the two applications in “Applying the IS2 method to data” and Appendix D provides additional applications of the method. Finally, Appendix E contains the proofs of all the results in the paper.

Marginal likelihood estimation for hierarchical models

This section defines the class of hierarchical models for which the density function of the observations (individual participant data), conditional on the group-level parameters 𝜃 and a vector of individual-level parameters α (random effects), is available, but the likelihood is intractable because it is an integral over the individual random effects that cannot be computed. We also briefly review competing approaches.

Let yj be the vector of observations (responses) for the jth subject and define \({y}={y}_{1:S}=\left ({y}_{1},...,{y}_{S}\right )\) as the vector of observations for all S subjects. Let \({\alpha }_{j}\in {\mathbb {R}^{d_{\alpha }}}\), where \(\mathbb {R}^{d}\) denotes d-dimensional Euclidean space, be the vector of random effects (subject-level parameters) for subject j and define \(p\left ({\alpha }_{j}|{\theta }\right )\) as its density. We define \({\alpha }={\alpha }_{1:S}=\left ({\alpha }_{1},...,{\alpha }_{S}\right )\) as the vector of all random effects in the model (i.e., all subject-level parameters). Let \({\theta }\in {\mathbb {R}^{d_{\theta }}}\) be the vector of unknown group-level parameters and let \(p\left ({\theta }\right )\) be the prior for 𝜃. Assuming that the yj,αj, j = 1,...,S are independent given 𝜃, the joint density of the random effects and the observations is

$$ p\left( {y},{\alpha}|{\theta}\right)=\prod\limits_{j=1}^{S}p\left( {y}_{j},{\alpha}_{j}|{\theta}\right)=\prod\limits_{j=1}^{S}p\left( {y}_{j}|{\alpha}_{j},{\theta}\right)p\left( {\alpha}_{j}|{\theta}\right). $$
(1)

We assume the densities \(p\left ({y}|{\alpha },{\theta }\right )\), \(p\left ({\alpha }|{\theta }\right )\), and \(p\left ({\theta }\right )\) are available analytically or numerically (e.g., by numerical integration). However, even when these densities are separately available, the likelihood of the hierarchical model is typically analytically intractable because it is an integral over the random effects:

$$ p\left( {y}|{\theta}\right)=\prod\limits_{j=1}^{S}p\left( {y}_{j}|{\theta}\right) \quad\text{with}\quad p\left( {y}_{j}|{\theta}\right)=\int p\left( {y}_{j}|{\alpha}_{j},{\theta}\right)p\left( {\alpha}_{j}|{\theta}\right)d{\alpha}_{j}. $$
(2)

By Bayes’ rule, we can express the joint posterior density of 𝜃 and α as

$$ \pi\left( {\theta},{\alpha}\right):= p\left( {y}|{\alpha},{\theta}\right)p\left( {\alpha}|{\theta}\right)p\left( {\theta}\right)/p\left( {y}\right), $$
(3)

where

$$ \begin{array}{@{}rcl@{}} p\left( {y}\right) & =\int\int p\left( {y}|{\alpha},{\theta}\right)p\left( {\alpha}|{\theta}\right)p\left( {\theta}\right)d{\alpha}d{\theta} \end{array} $$
(4)

is the marginal likelihood.

The main goal of the article is to develop an efficient method for estimating the marginal likelihood, given samples from the posterior distribution over the group-level parameters and individual-level random effects. These samples will have been obtained through generic sampling routines (e.g., JAGS, STAN) or custom samplers (e.g., differential evolution-MCMC (DE-MCMC) in Turner, Sederberg, Brown, & Steyvers, 2013). Even custom samplers still suffer from the problems of high autocorrelation between successive iterates of samples from the posterior, and of slow or uncertain convergence in some problems; for example, when the vector of random effects is high dimensional. These problems can lead to unreliable or biased posterior samples, in which case model selection metrics derived from those posterior samples can also be dramatically wrong. This issue has the potential to influence the reliability of some modern developments in estimating the marginal likelihood for cognitive models, including bridge sampling (Gronau et al., 2017; Gronau, Wagenmakers, Heck, & Matzke, 2019) and thermodynamic integration (Evans & Annis, 2019). Both approaches have the same structure: they begin with posterior samples, and post-process these samples to estimate the marginal likelihood. Both bridge sampling and thermodynamic integration are therefore sensitive to the accuracy of the sampling method: when the posterior samples are biased in some way, the estimated marginal likelihood will also be biased. An alternative approach is to bypass posterior samples altogether. For example, Evans and Brown (2018) proposed estimating the marginal likelihood directly by Monte Carlo integration over the prior. This “direct” method works well in small problems, but quickly becomes computationally infeasible in high-dimensional problems or with hierarchically specified models.

Our paper proposes a new approach. The basic idea is that we first obtain samples of the model parameters (𝜃; group level), but not the random effects (α; individual participants). These samples are used to form efficient proposal densities for an importance sampling algorithm on the model parameters. The IS2 method therefore does not suffer from the same potential drawbacks as the bridge sampling and thermodynamic integration methods, because the samples are only used to form the proposals for 𝜃. Unreliable samples of 𝜃 may lead to inefficient proposal densities, which can decrease the efficiency of the IS2 method, but will not bias the estimate of the marginal likelihood. In fact, the IS2 method yields unbiased and simulation-consistent estimators of the marginal likelihood without requiring that the underlying Markov chain(s) have converged, although they do need to produce reasonable estimates of the posterior. Section “General discussion and future work” discusses the robustness property of the IS2 estimator and how to use it to speed up the estimation of the marginal likelihood and posterior moments.

Estimating the marginal likelihood by IS2

The likelihood of hierarchical models of the form in Eq. 2 can be analytically intractable, though it can be estimated unbiasedly using importance sampling (other hierarchical models such as state space models may require a particle filter). Let \(\left \{ m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right );j=1,...,S\right \} \) be a family of proposal densities that we use to approximate the conditional posterior densities \(\left \{ \pi \left ({\alpha }_{j}|{\theta }\right );j=1,...,S\right \}\). Let \(g_{\text {IS}}\left ({\theta }\right )\) be a proposal density for the group-level parameters, 𝜃. We note that many different methods can be used to obtain efficient and reliable proposals \(g_{\text {IS}}\left ({\theta }\right )\) for the group-level parameters and \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) for the random effects αj, for each subject j = 1,...,S. A simple frequently used approach takes the proposal as a multivariate Student t distribution with a small number of degrees of freedom, whose location is a mode of the log-likelihood and whose scale matrix is the inverse of minus the Hessian matrix at this mode. Our article constructs the proposal distribution by first running MCMC, even just a short MCMC, to obtain samples from the posterior distribution. We then construct a proposal distribution in the IS2 procedure by fitting a mixture of normal or Student t distributions to these samples, as outlined in “Applying the IS2 method to data” and Appendix C.
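As a concrete illustration of forming a proposal from posterior draws, the sketch below fits a single heavy-tailed multivariate Student t to (synthetic stand-in) MCMC output. This is a simplification of the article's approach, which fits mixtures of normal or Student t distributions; all names and settings here are our own.

```python
import numpy as np
from scipy.stats import multivariate_t

def fit_proposal(theta_draws, df=5, inflate=1.2):
    """Fit a heavy-tailed proposal g_IS(theta) to posterior draws: a
    multivariate Student t centred at the sample mean, with a slightly
    inflated sample covariance as scale matrix so that the proposal's
    tails dominate the posterior's. (A single-component stand-in for the
    mixture fits described in the text.)"""
    loc = theta_draws.mean(axis=0)
    shape = inflate * np.cov(theta_draws, rowvar=False)
    return multivariate_t(loc=loc, shape=shape, df=df)

# Synthetic stand-in for MCMC output over three group-level parameters
rng = np.random.default_rng(0)
draws = rng.normal(size=(5000, 3))

g_is = fit_proposal(draws)
theta_new = g_is.rvs(size=10, random_state=1)  # proposal draws for IS
log_g = g_is.logpdf(theta_new)                 # proposal log-density
```

The low degrees of freedom and inflated scale are one simple way to satisfy the fatter-tails requirement discussed below Eq. 5.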

The density \(p\left ({y}_{j}|{\theta }\right )\) is estimated unbiasedly by

$$ \widehat{p}\left( {y}_{j}|{\theta}\right)=\frac{1}{N}\sum\limits_{i=1}^{N}w\left( {\alpha}_{j}^{\left( i\right)},{\theta}\right), \quad\text{where}\quad w\left( {\alpha}_{j}^{\left( i\right)},{\theta}\right)=\frac{p\left( {y}_{j}|{\alpha}_{j}^{\left( i\right)},{\theta}\right) p\left( {\alpha}_{j}^{\left( i\right)}|{\theta}\right)}{m_{j}\left( {\alpha}_{j}^{\left( i\right)}|{\theta},{y}_{j}\right)}, $$
(5)

where \({\alpha }_{j}^{\left (i\right )}\overset {iid}{\sim }m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) and N is the number of importance samples used in the likelihood estimate, which we will refer to as the number of “particles”. We note that for importance sampling to be effective, it is necessary for the support of \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) to contain the support of its corresponding conditional density (i.e., the tails of the proposal should be fatter than the tails of the target). This condition for importance sampling is usually easily satisfied using the defensive sampling approach in Hesterberg (1995). Appendix C discusses the defensive sampling approach and the proposal \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\). Hence,

$$ \widehat{p}_{N}\left( {y}|{\theta}\right)=\prod\limits_{j=1}^{S}\widehat{p}\left( {y}_{j}|{\theta}\right) $$
(6)

is an unbiased estimator of the likelihood \(p\left ({y}|{\theta }\right )\). To choose an optimal number of particles N, discussed in Appendix B1, we use the delta method to estimate the variance

$$ {\mathbb V}\left( \log\widehat{p}_{N}\left( {y}|{\theta}\right)\right)=\sum\limits_{j=1}^{S}{\mathbb V}\left( \log\widehat{p}\left( {y}_{j}|{\theta}\right)\right) $$
(7)

of \(\log \widehat {p}_{N}\left ({y}|{\theta }\right )\). This gives the estimate

$$ \hat {\mathbb V}\left( \log\widehat{p}\left( {y}_{j}|{\theta}\right)\right)\approx\frac{{\sum}_{i=1}^{N}w\left( {\alpha}_{j}^{\left( i\right)},{\theta}\right)^{2}}{\left( {\sum}_{i=1}^{N}w\left( {\alpha}_{j}^{\left( i\right)},{\theta}\right)\right)^{2}}-\frac{1}{N}. $$
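To make the estimators above concrete, the following minimal sketch applies Eq. 5 and the weight-based variance estimate to a hypothetical conjugate toy model (one subject, normal data, normal random effect), where the exact \(p(y_j|\theta)\) is available for comparison. The model, proposal choice, and all settings are illustrative, not the article's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)

# Toy single-subject model (hypothetical): y_i ~ N(alpha_j, 1), alpha_j ~ N(0, 1)
y_j = rng.normal(0.5, 1.0, size=20)

N = 1000                                 # number of particles
alpha = rng.normal(0.0, 1.0, size=N)     # proposal m_j = prior (valid, if crude)

# Log importance weights from Eq. 5; with m_j equal to the prior, the prior and
# proposal densities cancel, leaving the conditional log-density of the data
log_w = norm.logpdf(y_j[:, None], loc=alpha, scale=1.0).sum(axis=0)

# Unbiased estimate of p(y_j | theta), computed stably on the log scale
c = log_w.max()
log_phat = c + np.log(np.mean(np.exp(log_w - c)))

# Weight-based estimate of V(log p_hat(y_j|theta)) from the display above;
# the common factor exp(c) cancels in the ratio of sums
w = np.exp(log_w - c)
var_log_phat = (w ** 2).sum() / w.sum() ** 2 - 1.0 / N

# Exact log p(y_j | theta) for this conjugate toy model, for comparison:
# marginally y_j ~ N(0, I + 11')
exact = multivariate_normal.logpdf(
    y_j, mean=np.zeros(20), cov=np.eye(20) + np.ones((20, 20)))
```

In this toy setting the estimate typically lands within a few hundredths of the exact value, and `var_log_phat` is far below 1, the target discussed in Appendix B.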

We now rewrite Eq. 4 to show how it can be estimated by IS. Let u consist of all random variables used to construct \(\hat {p}_{N}(y|\theta )\) for a given value of 𝜃, with pN(u|𝜃) the density of u. In practice, as will be clear shortly, it is unnecessary to know pN(u|𝜃). We also equivalently write \(\hat {p}_{N}(y|\theta )\) as \(\hat {p}_{N}(y|\theta ,u)\). Then \(\hat {p}_{N}(y|\theta )\) is unbiased if

$$ \begin{array}{@{}rcl@{}} p(y|\theta) & = {\int}_{U} \hat{p}_{N}(y|\theta,u) p_{N}(u|\theta) du . \end{array} $$

Let gIS(𝜃) be the proposal for 𝜃 obtained by MCMC or otherwise. Then, the marginal likelihood is given by

$$ \begin{array}{@{}rcl@{}} p(y) & =& {\int}_{\Theta} {\int}_{U} \hat{p}_{N}(y|\theta,u) p_{N}(u|\theta) p(\theta) du d\theta , \\ & =& {\int}_{\Theta} {\int}_{U} \frac{\hat{p}_{N}(y|\theta,u) p_{N}(u|\theta)p(\theta)} {g_{\text{IS}}(\theta)p_{N}(u|\theta)} g_{\text{IS}}(\theta)p_{N}(u|\theta) du d\theta, \\ & =& {\int}_{\Theta} {\int}_{U} w(\theta, u) g_{IS}(\theta) p_{N}(u|\theta) du d\theta , \quad \text{where}\\ w(\theta,u) &=& \frac{\hat{p}_{N}(y|\theta,u)p(\theta) } { g_{\text{IS}}(\theta)} . \end{array} $$
(8)

This leads to the IS2 estimator,

$$ \begin{array}{@{}rcl@{}} \hat{p}_{\text{IS}^{2}}(y) & =\frac1M {\sum}_{i=1}^{M} \widetilde{w}(\theta_{i}) \quad \text{where}\quad \widetilde{w}\left( {\theta}_{i}\right)=w(\theta_{i},u_{i}), i=1,...,M . \end{array} $$
(9)

Algorithm 1 outlines the IS2 algorithm for obtaining an estimate of the marginal likelihood. On a practical note, the algorithm is well-suited to efficient parallelization, by processing particles (importance samples) independently in steps (1) and (2).

[Algorithm 1: The IS\(^{2}\) algorithm for estimating the marginal likelihood.]
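The structure of the algorithm can be sketched end-to-end for a hypothetical toy hierarchical model (normal data, normal random effects, a single group-level mean), with a fixed normal density standing in for a proposal \(g_{\text{IS}}\) fitted to MCMC output. Because the toy model is all-normal, the exact marginal likelihood is available for comparison; every name and setting below is illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)

# Toy hierarchical model (hypothetical): y_ij ~ N(alpha_j, 1),
# alpha_j ~ N(mu, 1); group-level parameter theta = mu with prior N(0, 1).
S, T = 5, 10
y = rng.normal(0.3, 1.4, size=(S, T))      # synthetic data

def log_phat_likelihood(mu, N=500):
    """Inner IS: unbiased estimate of p(y | mu) on the log scale (Eqs. 5-6)."""
    total = 0.0
    for j in range(S):
        alpha = rng.normal(mu, 1.0, size=N)                   # m_j = prior
        log_w = norm.logpdf(y[j][:, None], alpha, 1.0).sum(axis=0)
        total += logsumexp(log_w) - np.log(N)
    return total

# Outer IS over theta: a normal g_IS stands in for a proposal fitted to MCMC
M = 200
theta = rng.normal(0.2, 0.8, size=M)                          # draws from g_IS
log_wtilde = (np.array([log_phat_likelihood(t) for t in theta])
              + norm.logpdf(theta, 0.0, 1.0)                  # prior p(theta)
              - norm.logpdf(theta, 0.2, 0.8))                 # g_IS(theta)

log_ml = logsumexp(log_wtilde) - np.log(M)                    # log of Eq. 9

# Exact log p(y) for this all-normal toy model, for comparison
C = (np.eye(S * T) + np.kron(np.eye(S), np.ones((T, T)))
     + np.ones((S * T, S * T)))
exact = multivariate_normal.logpdf(y.ravel(), mean=np.zeros(S * T), cov=C)
```

The two inner loops (over particles, and over the \(M\) proposal draws) are the independent computations referred to in the parallelization remark above.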

The usual way to estimate the standard error of the marginal likelihood estimator is by replication, i.e., by computing the marginal likelihood estimate a number of times and then taking the standard deviation of these estimates. However, reliable estimation by replication can take a long time. An important advantage of our method is that it is straightforward to estimate the standard error of the marginal likelihood estimator \(\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\) directly. This affords researchers greater confidence in applying the estimated marginal likelihood, and also permits a simpler investigation of the tradeoff between efficiency and bias. The variance estimator is

$$ \widehat{{\mathbb{V}}}\left( \widehat{p}_{\text{IS}^{2}}\left( {y}\right)\right)=\frac{1}{M}\widehat{\sigma}^{2}_{p_{\text{IS}^{2}}(y)}, \quad\text{where}\quad \widehat{\sigma}^{2}_{p_{\text{IS}^{2}}(y)}=\frac{1}{M}\sum\limits_{i=1}^{M}\left( \widetilde{w}\left( {\theta}_{i}\right)-\widehat{p}_{\text{IS}^{2}}\left( {y}\right)\right)^{2}. $$
(12)
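Eq. 12 is stated on the natural scale, but in code one typically works with log-weights. One convenient reformulation (ours, not necessarily the authors' implementation) divides the variance by \(\widehat {p}_{\text {IS}^{2}}\left ({y}\right )^{2}\), which by the delta method yields the standard error of the log marginal likelihood, the quantity usually reported:

```python
import numpy as np
from scipy.special import logsumexp

def log_ml_and_se(log_wtilde):
    """IS2 estimate (Eq. 9) and its standard error (Eq. 12), computed
    stably on the log scale from the log-weights log w~(theta_i)."""
    M = log_wtilde.size
    log_phat = logsumexp(log_wtilde) - np.log(M)
    r = np.exp(log_wtilde - log_phat)          # w~_i / p_hat
    sigma2_rel = np.mean((r - 1.0) ** 2)       # sigma^2 / p_hat^2
    se_log = np.sqrt(sigma2_rel / M)           # delta-method SE of log p_hat
    return log_phat, se_log

# Hypothetical log-weights, as produced by a run of Algorithm 1
rng = np.random.default_rng(3)
lw = rng.normal(-100.0, 0.5, size=10_000)
log_ml, se = log_ml_and_se(lw)
```

Working with normalized ratios `r` avoids underflow when the log-weights are large and negative, as in the example above.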

Under mild conditions, the IS2 estimator is unbiased, simulation consistent, and asymptotically normal, because it is an IS estimator. Theorem 1 formally states these results and Appendix E gives its proof.

Theorem 1

Let M be the number of samples for 𝜃 and N the number of particles for estimating the likelihood. Under the assumptions in Theorem A1 in Section A.1 of the Appendix:

  • (i) \(\mathbb {E}\left (\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\right )=p\left ({y}\right )\) and \(\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\overset {a.s.}{\rightarrow }p\left ({y}\right )\) as \(M\rightarrow \infty \) for any N ≥ 1.

  • (ii) \(\sqrt {M}\left (\widehat {p}_{\text {IS}^{2}}\left ({y}\right )-p\left ({y}\right )\right )\overset {d}{\rightarrow }N\left (0,{\sigma }^{2}_{p_{\text {IS}^{2}}(y)}\right )\) and \(\widehat {\sigma }^{2}_{p_{\text {IS}^{2}}(y)}\overset {a.s.}{\rightarrow }{\sigma }^{2}_{p_{\text {IS}^{2}}(y)}=\mathbb {V}(\widetilde {w}\left ({\theta }\right ))\) as \(M\rightarrow \infty \) for any N ≥ 1.

There is a trade-off between the statistical precision of the IS2 estimator (in terms of the variance from Eq. 12) and the computational cost of obtaining the estimate. A large number of particles N gives a more accurate estimate of the likelihood at greater computational cost, while a small N results in a likelihood estimator \(\widehat p_{N}(y|\theta )\) that is cheap to evaluate but has larger variance. Appendix B provides theoretical and practical guidelines on this tradeoff. Proposition A1 in Appendix B.2 shows that the optimal N in the tradeoff between accuracy and computational cost is such that the variance of the log-likelihood estimator \(\mathbb {V}(\log \widehat p(y|\theta ))\) is approximately 1.
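One simple, hypothetical way to operationalize this guideline is to increase N until the weight-based estimate of \(\mathbb {V}(\log \widehat p(y|\theta ))\) falls to about 1, as in the sketch below for a toy single-subject model (all names and settings illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Toy single-subject setting (hypothetical): y_i ~ N(alpha, 1), alpha ~ N(0, 1)
y = rng.normal(0.4, 1.0, size=30)

def var_log_phat(N):
    """Weight-based estimate of V(log p_hat(y|theta)) for N particles,
    as in Eq. 7 and the display that follows it."""
    alpha = rng.normal(0.0, 1.0, size=N)                  # proposal = prior
    log_w = norm.logpdf(y[:, None], alpha, 1.0).sum(axis=0)
    w = np.exp(log_w - log_w.max())
    return (w ** 2).sum() / w.sum() ** 2 - 1.0 / N

# Double N until the variance of the log-likelihood estimator is about 1
N = 8
while var_log_phat(N) > 1.0:
    N *= 2
```

In practice this tuning would be done at a representative value of 𝜃 (e.g., the posterior mean from the pilot MCMC) before running the full IS2 procedure.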

Applying the IS2 method to data

This section describes two applications of the IS2 method to real data. The applications are chosen from quite different domains: the first focuses on choice in a health context, and the second on an experimental manipulation in cognition. We chose these applications because they are challenging cases for estimating marginal likelihoods, and thus represent good test cases for illustrating the power of our method. They also emphasize that the IS2 method is not restricted to a particular domain of application or model class. It is a general method for estimating the marginal likelihood and the standard error of that estimate, and can be applied in a variety of other contexts where quantitative models are estimated from data in a hierarchical framework, including the study of attention, decision making, categorization, and memory.

Individual differences in health-related decisions

The first application explores individual differences in preferences for health decision making, by examining choices for hypothetical appointments with a health practitioner. A common approach to understanding individual differences in applied contexts is to assume that different individuals have different utility parameters, where the utility represents the subjective value placed on the attributes that describe the products and services on offer. In this sense, the utility is a quantitative estimate of what people like and dislike. A common way to model individual differences in choices is then to assume individual differences in utility parameters, commonly described as “taste heterogeneity”. The mixed logit model, which assumes that the utility parameters are multivariate normally distributed in the population, is the most common framework for modeling individual differences in utilities (Train, 2009). However, more recent research suggests that all individuals might hold the same (or sufficiently similar) utilities, except that those utilities can be scaled up or down for different individuals (Fiebig, Keane, Louviere, & Wasi, 2010). This approach assumes that the ratio of the utilities for pairs of items is constant across participants, though the absolute difference between those utilities can differ. This “scale heterogeneity” means that some people make more random choices than others, though their latent preferences, in the limit of very many trials, are similar.

The key model comparison question in this context is whether the choices observed in a population of participants are more consistent with individual differences in latent preferences (taste) or individual differences in choice consistency (scale). We tested this idea using the data reported in Fiebig et al. (2010), from a study in which S = 79 women made choices about T = 32 choice scenarios for a pap smear test, where each scenario was defined by a combination of the values of the K = 5 choice attributes listed in Table 1 (not all unique combinations were presented). For example, one choice scenario might be that the pap smear would be conducted by a female doctor who is unknown to the patient, the test is not due though the doctor recommends it, with a cost of $20. The participant is then asked whether they would take this test or not.

Table 1 Choice attributes for the pap smear data set and their associated parameters

Our goal is to test whether choices in the pap smear test data provide greater support for the presence of individual differences in latent preferences, or the presence of individual differences in latent preferences and choice consistency. For this goal, we estimated different models which instantiated the different hypotheses. We then used the IS2 method to estimate the marginal likelihoods of those models, and used those estimates for model selection via Bayes factors.

Behavioural model

We analyse the choice data with the generalized multinomial logit (GMNL) model (Fiebig et al., 2010), a generalization of the mixed logit model that accounts for individual differences in both latent preferences and choice consistency. We let the observed choice for participant j in scenario i be yij = 1 if the participant chooses to take the test and yij = 0 otherwise. The general form of the GMNL probability that individual j selects the test on trial i is

$$ p(y_{ij} = 1|X_{ij},\beta^{(j)}) = \frac{\exp(\beta_{0j}+{\sum}_{k=1}^{K}\beta_{kj}x_{ijk})}{1+\exp(\beta_{0j}+{\sum}_{k=1}^{K}\beta_{kj}x_{ijk})}, $$
(13)

where β(j) = (β0j,β1j,…,βKj) and Xij = (xij1,…,xijK) are the vectors of utility weights and choice attributes respectively. As is standard in most consumer decision research, the GMNL model assumes that the attributes comprising an option (i.e., a hypothetical appointment) each have an associated utility, and the utility of an option as a whole is the sum of the component attribute utilities. The summed utility probabilistically generates a choice via the Luce choice rule, as shown in Eq. 13. The model also allows for an a priori bias toward accepting or rejecting the test at each trial independent of the feature values of the test, known as an alternative specific constant; β0j = β0 + η0j, with \(\eta _{0j}\sim \mathcal {N}(0,{\sigma _{0}^{2}})\).
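The choice probability in Eq. 13 can be sketched in a few lines; the attribute coding and utility weights below are made up for illustration:

```python
import numpy as np

def choice_prob(beta0_j, beta_j, x_ij):
    """Probability that participant j accepts the test in scenario i (Eq. 13):
    a binary logit in the summed utility of the option's attributes."""
    u = beta0_j + x_ij @ beta_j          # alternative-specific constant + utility
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical K = 5 attribute values and utility weights for one participant
x = np.array([1.0, 0.0, 1.0, 0.0, -20.0])
beta_j = np.array([0.8, -0.3, 0.5, 0.2, 0.05])
p_take = choice_prob(0.1, beta_j, x)     # P(y_ij = 1); P(y_ij = 0) = 1 - p_take
```

The complementary probability of not taking the test follows from the normalization mentioned below, which fixes the coefficients of the rejection option at zero.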

The utilities for individual participants are modeled as

$$ \begin{array}{@{}rcl@{}} \beta_{jk}&=&\lambda_{j}\beta_{k}+\gamma\eta_{jk}+(1-\gamma)\lambda_{j}\eta_{jk},\\ \lambda_{j}&=&\exp(-\delta^{2}/2+\delta\zeta_{j}), \quad k=1,\ldots,K, \end{array} $$

with \(\eta _{jk}\sim {\mathcal {N}}(0,{\sigma _{k}^{2}})\) and \(\zeta _{j}\sim {\mathcal {N}}(0,1)\). The expected value of the scaling coefficients λj is one, implying that \(\mathbb {E}(\beta _{jk})=\beta _{k}\). We set the across-participant variance parameter for the cost attribute to zero (\({\sigma _{5}^{2}}=0\)) as we found no evidence of heterogeneity across individuals for this attribute, beyond the scaling effect. As the sum of the probabilities over the possible choice options in Eq. 13 is 1 (i.e., take the test, or not), to make the model identified we set the coefficients β with respect to yij = 0 (not taking the test) to zero.

The GMNL model accounts for individual differences in latent preferences (taste) and choice consistency (scale) through the parameters δ and γ, respectively. When δ = 0 (so that λj = 1 for all individuals), the GMNL model reduces to the mixed logit model. The mixed logit model captures heterogeneity in consumer preferences by allowing individuals to weight the choice attributes differently (i.e., individual differences in attribute utilities). By introducing taste heterogeneity, the mixed logit model avoids the restrictive independence of irrelevant alternatives property of the standard multinomial logit model.

The GMNL model additionally allows for differing choice consistency across individuals, known as scale heterogeneity, through the random variable λj. This variable changes all attribute weights simultaneously, allowing the model to predict that choices are more random for some consumers than others: smaller values of δ (so that \(\lambda _{j}\rightarrow 1\)) indicate differences in latent preferences, whereas larger values of δ (so that \(\lambda _{j}\rightarrow 0\)) indicate differences in choice consistency. The γ parameter weights the specification between two alternative ways of introducing heterogeneity into the model.
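The taste/scale distinction can be illustrated by simulating participant-level utilities from the GMNL specification above; the parameter values are hypothetical, chosen only to show that δ = 0 recovers the mixed logit and that δ > 0 adds extra dispersion:

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_betas(beta, sigma, delta, gamma, S):
    """Participant-level utilities under the GMNL specification:
    beta_jk = lambda_j*beta_k + gamma*eta_jk + (1 - gamma)*lambda_j*eta_jk,
    with lambda_j = exp(-delta^2/2 + delta*zeta_j), so E[lambda_j] = 1."""
    zeta = rng.normal(size=S)
    lam = np.exp(-delta ** 2 / 2 + delta * zeta)[:, None]
    eta = rng.normal(scale=sigma, size=(S, beta.size))
    return lam * beta + gamma * eta + (1 - gamma) * lam * eta

beta = np.array([0.8, -0.3, 0.5])        # hypothetical mean utilities
sigma = np.array([0.4, 0.4, 0.4])        # hypothetical taste-heterogeneity SDs

# delta = 0 forces lambda_j = 1 for everyone: the model reduces to mixed logit
mixed = draw_betas(beta, sigma, delta=0.0, gamma=1.0, S=100_000)
gmnl = draw_betas(beta, sigma, delta=0.5, gamma=0.3, S=100_000)
```

Both specifications have the same mean utilities (since E[λj] = 1), but the GMNL draws show greater spread, reflecting the added scale heterogeneity.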

The parameter vector is \(\theta =(\beta _{0},\beta _{1},\ldots ,\beta _{K},{\sigma _{0}^{2}},{\sigma _{1}^{2}},\ldots ,{\sigma _{K}^{2}},\delta ^{2},\gamma )'\), while the vector of random effects for each individual is αj = (η0j,ηj1,…,ηjK,λj). The likelihood is therefore

$$ \begin{array}{@{}rcl@{}} p(y|\theta)=\prod\limits_{j=1}^{S}p(y_{j}|\theta)=\prod\limits_{j=1}^{S}\left[\int \left( \prod\limits_{i=1}^{T} p(y_{ij}|\alpha_{j},\theta)\right) p(\alpha_{j}|\theta) d\alpha_{j}\right], \end{array} $$
(14)

where yij is the observed choice, y = (y11,…,y1T,…,yj1,…,yST) and p(yij|αj,𝜃) is given by the choice probability in Eq. 13. We use diffuse priors specified as \(\beta _{0}\sim {\mathcal {N}}(0,100)\), \(p(\sigma _{0}) \propto (1+{\sigma _{0}^{2}})^{-1}\), \(\beta _{k}\sim {\mathcal {N}}(0,100)\), \(p(\sigma _{k}) \propto (1+{\sigma _{k}^{2}})^{-1}\), for k = 1,…,K, \(p(\delta ) \propto (1+(\delta /0.2)^{2})^{-1}\), and \(\gamma \sim \text {U}(0,1)\). The standard deviation parameters have half-Cauchy priors (Gelman, 2006). Appendix C provides complete details of the estimation procedure.

Results

We estimated the posterior distribution using M = 50,000 importance samples for the parameters, and estimated the Monte Carlo standard errors by bootstrapping the importance samples, because we found that bootstrapping gave the most stable results for the estimator. Appendix C gives implementation details. The log marginal likelihoods (with standard errors) for the mixed logit and GMNL models were − 981.24 (.003) and − 978.27 (.012), respectively. The Monte Carlo standard errors are very small, indicating that the IS2 method is highly efficient. The estimates of the log marginal likelihood give a Bayes factor of approximately 20 for the GMNL model over the mixed logit model; BFGMNL:mixed = \(\exp (-978.27 -(-981.24))\approx 19.5\).

The larger marginal likelihood for the GMNL model than for the mixed logit model provides some evidence for the presence of scale heterogeneity in this data set. Corroborating evidence comes from the parameter estimates. The posterior mean for γ was .17 (90% credible interval [.014, .452]); recall that the weight on scale heterogeneity in the GMNL model is 1 − γ. This means that assuming individuals only differed in their latent preferences (taste heterogeneity) did not sufficiently capture the trends in the data: some participants also made more variable choices than others (scale heterogeneity).

The speed-accuracy tradeoff in perceptual decisions

Our second application explores the cognitive processes involved in perceptual decisions. We examine decision making in the context of the well-studied speed-accuracy tradeoff: the finding that electing to make faster decisions decreases decision accuracy and, conversely, that electing to increase decision accuracy causes decisions to become slower (see Heitz, 2014 for a review). The speed-accuracy tradeoff is typically attributed to changes in caution – the quantity of evidence required to trigger a response (i.e., the response threshold). Here, we apply the IS2 method to a class of decision making models in the Linear Ballistic Accumulator (LBA; Brown and Heathcote, 2008) framework to confirm that model choice via the marginal likelihood is consistent with existing methods of model choice in the literature.

We draw upon data reported by Forstmann, Dutilh, Brown, Neumann, and von Cramon (2008) from an experiment in which S = 19 participants made difficult motion discrimination decisions under varying degrees of time pressure. The participants made repeated decisions about whether a cloud of semi-randomly moving dots appeared to move to the left or to the right of a display. Participants made these decisions under three conditions that differed in their task instructions: they were asked to respond as accurately as possible (condition 1: “accuracy emphasis”), at their own pace (condition 2: “neutral emphasis”), or as quickly as possible (condition 3: “speed emphasis”). The three conditions were randomized across trials. A visual cue described the decision emphasis on a trial-by-trial basis prior to stimulus onset. Each subject made 280 decisions in each condition for a total of T = 840 trials. See Forstmann et al., (2008) for all remaining details.

Freely available code applying the IS2 method to one of the hierarchical LBA models described here is available at osf.io/xv59c. Appendix D also provides two additional applications of the IS2 method to the hierarchical LBA: one application to the speed-accuracy tradeoff in lexical decisions, and a second application to response bias in lexical decisions. Code for these two examples is also available at the same repository.

Behavioural model

We analyse the choice and response time data with a hierarchical LBA model. In a typical perceptual decision making experiment, the ith observation for the jth participant contains two pieces of information. The first is the response choice, denoted by \(RE_{ij}\in \left \{ 1,...,C\right \} \), where C is the number of response alternatives. The second is the response time (RT), denoted by \(RT_{ij}\in \left (0,\infty \right )\). Brown and Heathcote (2008) derived the density and cumulative distribution function of the finishing time of a single LBA accumulator, from which the joint density over response choice and response time follows. The finishing time distribution for each of the LBA accumulators is determined by five parameters: the mean and standard deviation of the drift rate distribution \(\left (v,s\right )\), the threshold parameter b, the width of the start point distribution A, and the non-decision time τ. It is common to set the standard deviation s of the drift rate distribution to one, i.e., s = 1 (see also Donkin, Brown, & Heathcote, 2009). The probability density of the cth accumulator reaching the threshold at time t, and the other accumulators not reaching the threshold prior to time t, is

$$ \text{LBA}\left( {RE_{ij}}=c,RT_{ij}=t|b,A,v,s,\tau\right)=f_{c}\left( t\right) \underset{k\neq c}{\prod} \left( 1-F_{k}\left( t\right)\right), $$

where fc and Fc are the density and distribution functions for the finishing times of a single LBA accumulator (for details see Brown & Heathcote, 2008; Terry et al., 2015). In a two-choice experiment where response bias is not expected, it is convenient to define the means of the drift rate distributions as \({v}=\left (v^{\left (e\right )},v^{\left (c\right )}\right )\), where \(v^{\left (e\right )}\) is the mean of the drift rate for the accumulator corresponding to the incorrect response and \(v^{\left (c\right )}\) is the mean of the drift rate for the accumulator corresponding to the correct response. Together, these assumptions imply that each subject j, j = 1,...,S, has the vector of random effects

$$ \left\{ b_{j},A_{j},v^{\left( e\right)}_{j},v^{\left( c\right)}_{j},\tau_{j}\right\}. $$

With the usual assumption of independence, the conditional density of all the observations is

$$ p\left( RE,RT|{b},{A},{\tau},{v}\right)=\prod\limits_{j=1}^{S}\prod\limits_{i=1}^{T}\text{LBA}\left( RE_{ij},RT_{ij}|b_{j},A_{j},v^{\left( e\right)}_{j},v^{\left( c\right)}_{j},\tau_{j}\right). $$
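The defective density appearing in the likelihood above can be sketched numerically. The following is a simplified Python implementation of the single-accumulator density and distribution functions of Brown and Heathcote (2008), assuming s = 1 by default and ignoring edge cases (e.g., A → 0, the probability mass on non-terminating trials, and the non-decision time τ, which would simply shift t to t − τ):

```python
import numpy as np
from scipy.stats import norm

def lba_pdf(t, b, A, v, s=1.0):
    """Density of a single LBA accumulator finishing at time t,
    with threshold b, start point uniform on [0, A], drift ~ N(v, s)."""
    z1 = (b - A - t * v) / (t * s)
    z2 = (b - t * v) / (t * s)
    return (1.0 / A) * (-v * norm.cdf(z1) + s * norm.pdf(z1)
                        + v * norm.cdf(z2) - s * norm.pdf(z2))

def lba_cdf(t, b, A, v, s=1.0):
    """Distribution function of a single LBA accumulator finishing by time t."""
    z1 = (b - A - t * v) / (t * s)
    z2 = (b - t * v) / (t * s)
    return (1.0 + ((b - A - t * v) / A) * norm.cdf(z1)
            - ((b - t * v) / A) * norm.cdf(z2)
            + (t * s / A) * norm.pdf(z1)
            - (t * s / A) * norm.pdf(z2))

def lba_defective_density(c, t, b, A, v, s=1.0):
    """Density that accumulator c finishes at t while all other
    accumulators (drift means in the sequence v) finish later."""
    out = lba_pdf(t, b, A, v[c], s)
    for k in range(len(v)):
        if k != c:
            out *= 1.0 - lba_cdf(t, b, A, v[k], s)
    return out
```

Evaluating `lba_defective_density` at each observed (choice, RT) pair and taking products over trials and subjects yields the conditional density above.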

We follow the notation and assumptions in Gunawan, Hawkins, Tran, Kohn, and Brown (2019). For each subject j = 1,...,S, let the vector of random effects

$$ {\alpha}_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right), $$

where \(\alpha _{b_{j}}=\log \left (b_{j}\right )\), \(\alpha _{A_{j}}=\log \left (A_{j}\right )\), \(\alpha _{v_{j}^{\left (e\right )}}=\log \left (v_{j}^{\left (e\right )}\right )\), \(\alpha _{v_{j}^{\left (c\right )}}=\log \left (v_{j}^{\left (c\right )}\right )\) and \(\alpha _{\tau _{j}}=\log \left (\tau _{j}\right )\). We take the following prior densities (as in Gunawan et al., 2019): αj is \(N\left ({\alpha }_{j};{\mu }_{\alpha },{\Sigma }_{\alpha }\right )\), \({\mu }_{\alpha }\sim N\left (0,I_{d_{\alpha }}\right )\) where dα is the dimension of αj, and the hierarchical prior for Σα is

$$ \begin{array}{@{}rcl@{}} {\Sigma}_{\alpha}|a_{1},...,a_{d_{\alpha}} & \sim & IW\left( v_{\alpha}+d_{\alpha}-1,\ 2v_{\alpha}\text{diag}\left( 1/a_{1},...,1/a_{d_{\alpha}}\right)\right), \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} a_{1},...,a_{d_{\alpha}} & \sim & IG\left( \frac{1}{2}, \frac{1}{\mathcal{A}_{d}^{2}}\right),\ d=1,...,d_{\alpha}, \end{array} $$
(16)

where vα, \(\mathcal {A}_{1},...,\mathcal {A}_{d_{\alpha }}\) are positive scalars and \(\text {diag}\left (1/a_{1},...,1/a_{d_{\alpha }}\right )\) is the diagonal matrix with diagonal elements \(1/a_{1},...,1/a_{d_{\alpha }}\). This is the marginally non-informative prior of Huang and Wand (2013). Following Gunawan et al., (2019), we set vα = 2 and \(\mathcal {A}_{d}=1\) for all d. These settings lead to half-t marginal prior distributions for the standard deviations of Σα and uniform marginal prior distributions for the off-diagonal correlations implied by Σα.
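Under the settings above (\(v_{\alpha }=2\), \(\mathcal {A}_{d}=1\)), a draw from the hierarchical prior in Eqs. 15–16 can be sketched with SciPy; the shape–scale parameterization of the inverse gamma below is assumed to match Eq. 16:

```python
import numpy as np
from scipy.stats import invgamma, invwishart

def sample_huang_wand_prior(d_alpha, v_alpha=2.0, A_d=1.0, rng=None):
    """Draw Sigma_alpha from the marginally non-informative prior of
    Huang and Wand (2013): a_d ~ IG(1/2, 1/A_d^2) for each dimension,
    then Sigma | a ~ IW(v_alpha + d_alpha - 1, 2 v_alpha diag(1/a))."""
    if rng is None:
        rng = np.random.default_rng()
    a = invgamma.rvs(0.5, scale=1.0 / A_d**2, size=d_alpha, random_state=rng)
    scale = 2.0 * v_alpha * np.diag(1.0 / a)
    df = v_alpha + d_alpha - 1
    return invwishart.rvs(df=df, scale=scale, random_state=rng)
```

For example, `sample_huang_wand_prior(5)` returns a 5 × 5 positive-definite covariance matrix, matching the dimension of the random effect vector in the null model below.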

We tested whether the changes in task instructions across the three experimental conditions caused changes in observed decision behaviour that are consistent with changes in decision caution. We compared four LBA models. For consistency with the notation above, we refer to instructions that emphasize accurate, neutral or speedy decisions as a, n and s, respectively, and to the correct and error accumulators as c and e, respectively.

Model I:

assumes no differences in parameters across the three conditions, i.e., a null model. This model corresponds to the psychological assumption that performance is not influenced by the instructions given to participants. The vector of random effects for subject j is

$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
Model II:

assumes there are two response threshold parameters. One threshold parameter governs decision making in the accuracy- and neutral-emphasis conditions, and the second governs the speed-emphasis condition. This makes the simplifying assumption that performance in two of the three conditions is so similar as to be effectively identically distributed,

$$ \alpha_{j}=\left( \alpha_{b_{j}^{\left( a/n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
Model III:

assumes there are three response threshold parameters, one for each condition. This model assumes that the participant responds to each condition with a different level of caution, though otherwise with identically distributed parameters,

$$ \alpha_{j}=\left( \alpha_{b_{j}^{\left( a\right)}},\alpha_{b_{j}^{\left( n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
Model IV:

assumes there are separate response threshold and non-decision time parameters for each condition. This model assumes that instructions to emphasise speedy, neutral or accurate decisions influence not only cautiousness but also the time required to perceptually encode the stimulus and produce a motor response (non-decision time),

$$ \alpha_{j} = \left( \alpha_{b_{j}^{\left( a\right)}},\alpha_{b_{j}^{\left( n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}^{\left( a\right)}},\alpha_{\tau_{j}^{\left( n\right)}},\alpha_{\tau_{j}^{\left( s\right)}}\!\right). $$

As in the first application, Appendix C provides complete details of the estimation procedure.

Results

Once we obtained samples from the posterior, we ran the IS2 algorithm with M = 10,000 draws to estimate the log of the marginal likelihood, \(\log \ \widehat {p}(y)\), and adaptively set the number of particles N used to estimate the likelihood such that the variance of the log-likelihood estimates did not exceed 1; recall that \(\sigma _{opt}^{2}=1\) is the optimal variance of the log-likelihood (cf. “Estimating the marginal likelihood by IS2”). We initiated this adaptive procedure with N = 250 particles and computed the variance of the log-likelihood given in Eq. 7. If the variance was greater than 1, we increased the number of particles. We again estimated the Monte Carlo standard error of the log marginal likelihood estimates by bootstrapping the importance samples, because bootstrapping gave the most stable results.
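The adaptive rule can be sketched as follows. Here `estimate_loglik(theta, N)` is a stand-in for the user's N-particle estimator of the log-likelihood (Eq. 7), and the doubling schedule and number of pilot replications are illustrative assumptions rather than the exact procedure used in the text:

```python
import numpy as np

def tune_num_particles(estimate_loglik, theta, n_start=250, n_reps=100,
                       target_var=1.0, n_max=64000):
    """Increase the particle count N until the sample variance of repeated
    log-likelihood estimates at theta falls below target_var (~1, the
    optimal variance discussed in the text)."""
    N = n_start
    while N <= n_max:
        loglik_draws = [estimate_loglik(theta, N) for _ in range(n_reps)]
        if np.var(loglik_draws, ddof=1) <= target_var:
            return N
        N *= 2  # double the particle count and try again
    raise RuntimeError("target variance not reached within n_max particles")
```

In practice `theta` would be a central posterior draw (e.g., the posterior mean), so that the tuned N controls the estimator's variance where the importance weights concentrate.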

The log marginal likelihoods (with standard errors) for Models I, II, III and IV are, respectively, 5204.17 (0.11), 7352.75 (0.06), 7453.73 (0.10) and 7521.44 (0.17), with a total computation time of around 80 minutes. As in the first application, the Monte Carlo standard errors are very small, suggesting that IS2 is highly efficient. The marginal likelihoods favor Model IV – allowing a separate response threshold and a separate non-decision time for each instruction condition. This outcome is partially consistent with Forstmann et al., (2008), but also suggests that the non-decision time – the combined time required to perceptually encode a stimulus and produce a motor response – changes with instruction condition.

General discussion and future work

Model comparison is the means by which competing theories are rigorously evaluated, which is fundamental to advancing psychological science. Many quantitative approaches to model comparison in psychology have struggled to appropriately account for model flexibility – the range of data patterns that a model can predict. For example, commonly used methods such as AIC and BIC measure the flexibility of a model simply by counting the number of freely estimated model parameters. This is problematic because not all parameters contribute equally to flexibility: models with the same number of parameters can differ markedly in complexity. In some cases, adding extra parameters to a model can even decrease its complexity, for example when a hierarchical distribution structure is added to constrain otherwise-independent models for individual participants. Model selection via the marginal likelihood is one of the few methods that naturally accounts for model flexibility.

Despite its theoretical advantages, the marginal likelihood has seen limited uptake in practice for all but the simplest of psychological models, owing to its prohibitive computational cost. Although many psychologically interesting models have analytic, numerical or rapidly-simulated solutions for the responses of individual participants (e.g., the density of observed choices and response times of individual participants), the likelihood for hierarchical versions of those same models can be analytically intractable because they contain an integral over the individual random effects. This has proved to be the stumbling block in previous attempts to compute the marginal likelihood for hierarchical implementations of psychological models.

Our article develops a method for Bayesian model comparison when the likelihood (of a hierarchical model) is intractable but can be estimated unbiasedly. We show that when the density of individual observations conditioned on group- and individual-level parameters is available, an importance sampling algorithm can provide an unbiased estimate of the marginal likelihood of hierarchical models. We term this approach importance sampling squared (IS2). The IS2 method can be applied to samples from the posterior distribution over a model’s parameters obtained following any sampling method. Appendix B studies the convergence properties of the IS2 estimator and provides practical guidelines for optimally selecting the number of particles required to estimate the likelihood.

We show how the IS2 method can be used to estimate the marginal likelihood in hierarchical models of health decision making and choice response time models. In both applications, the method estimates the marginal likelihood in a principled way, providing conclusions that are consistent with the current literature. In all cases, the marginal likelihood is estimated with very small Monte Carlo standard errors. To aid researchers in using the IS2 method in their own research, we provide scripts implementing the key elements of the method, as applied to data from Forstmann et al., (2008) and Wagenmakers, Ratcliff, Gomez, and McKoon (2008) – see osf.io/xv59c for more details.

The data applications highlight two important properties of the IS2 method: it is an efficient and unbiased estimator of the marginal likelihood, and it provides a standard error of the estimator. The marginal likelihood is a key quantity in appropriately accounting for model flexibility in quantitative model comparison. The standard error of an estimator is essential for interpreting any model comparison metric, not just marginal likelihoods, as it expresses the variability of the estimated metric. This is as important in quantitatively comparing psychological models as it is in interpreting conventional statistics, yet it is routinely overlooked. No researcher would conclude that the population means of two groups differ solely on the basis of a difference in their sampled group means – we would demand a measure that expresses the magnitude of the mean difference as a function of its variability, like a t-test. The IS2 method provides the researcher with the tools to do precisely this with potentially complex and non-linear psychological models: it estimates the magnitude of differences (marginal likelihoods for different models) as a function of the variability of those estimates (the standard errors).

We emphasize that the IS2 method is general and can be used to estimate the marginal likelihood and its standard error from models where the density of individual observations conditioned on group-level parameters and a vector of individual-level parameters is available. This is quite a general scenario – reaching beyond the models studied here, and decision-making models more generally – that applies to a very large category of psychological models for which hierarchical Bayesian parameter estimation has been used to date. This generality holds great promise for IS2 as a vehicle for performing model comparison via the marginal likelihood in psychological research.

We also note that the IS2 method is robust, by which we mean that the MCMC draws used to form the importance sampling proposals do not have to be draws from the exact posterior. As long as the proposals lie roughly in the same region as the posterior, the marginal likelihood estimates will be simulation consistent. The same remarks hold if the MCMC targets a slightly perturbed posterior.

For faster computation, we conjecture that future work could speed up the estimation of the marginal likelihood as follows: (a) run the MCMC procedure in parallel on K chains without worrying about the precise convergence of each chain; in practice, each chain can be run for far fewer iterations than would be required if Bayesian inference were based on the output of a single chain. (b) Form a proposal density based on the output of all the chains. (c) Use importance sampling to obtain K robust and unbiased estimates of the marginal likelihood, one from each of K processors. (d) Average the K estimates to obtain a final unbiased estimate of the marginal likelihood whose variance is 1/K times the variance of each individual estimator. The robustness and unbiasedness properties of the IS2 estimator are crucial in ensuring that bias does not dominate the variance as K becomes large.
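One detail of step (d) deserves emphasis: the K unbiased estimates must be averaged on the raw marginal likelihood scale, not the log scale (the average of logs is biased downward by Jensen's inequality). A sketch of the combination step, assuming the K per-chain log marginal likelihood estimates are given:

```python
import math

def combine_log_ml_estimates(log_ml_estimates):
    """Average K unbiased marginal likelihood estimates (supplied on the
    log scale) on the raw scale, returning the log of the average.
    Uses the log-sum-exp trick so the exponentials do not underflow."""
    K = len(log_ml_estimates)
    m = max(log_ml_estimates)
    mean_ratio = sum(math.exp(x - m) for x in log_ml_estimates) / K
    return m + math.log(mean_ratio)
```

Because each per-chain estimate is unbiased on the raw scale, this average is itself unbiased, with variance reduced by a factor of K.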

We note in Appendix A that posterior expectations of functions of the parameters can similarly be estimated with IS2, albeit with some very minor bias. We conjecture that such posterior moment estimators will be far more efficient than those obtained from standard MCMC output, but leave to future work such speeding up of the computations of the marginal likelihood and posterior moments.

Finally, we note that the marginal likelihood is sometimes sensitive to the prior, which can make model comparison via the marginal likelihood controversial in some settings. Addressing this issue is beyond the scope of this paper; we focus only on how to estimate the marginal likelihood efficiently when the likelihood function is intractable.