Abstract
Recent advances in Markov chain Monte Carlo (MCMC) extend the scope of Bayesian inference to models for which the likelihood function is intractable. Although these developments allow us to estimate model parameters, other basic problems such as estimating the marginal likelihood, a fundamental tool in Bayesian model selection, remain challenging. This is an important scientific limitation because testing psychological hypotheses with hierarchical models has proven difficult with current model selection methods. We propose an efficient method for estimating the marginal likelihood for models where the likelihood is intractable, but can be estimated unbiasedly. It is based on first running a sampling method such as MCMC to obtain samples for the model parameters, and then using these samples to construct the proposal density in an importance sampling (IS) framework with an unbiased estimate of the likelihood. Our method has several attractive properties: it generates an unbiased estimate of the marginal likelihood, it is robust to the quality and target of the sampling method used to form the IS proposals, and it is computationally cheap to estimate the variance of the marginal likelihood estimator. We also obtain the convergence properties of the method and provide guidelines on maximizing computational efficiency. The method is illustrated in two challenging cases involving hierarchical models: identifying the form of individual differences in an applied choice scenario, and evaluating the best parameterization of a cognitive model in a speeded decision making context. Freely available code to implement the methods is provided. Extensions to posterior moment estimation and parallelization are also discussed.
Introduction
Many psychologically interesting research questions involve comparing competing theories: Does sleep deprivation cause attentional lapses? Does alcohol impair the speed of information processing or reduce cautiousness, or both? Does the forgetting curve follow a power or exponential function? In many cases, the competing theories can be represented as a set of quantitative models that are applied (“fitted”) to the observed data. We can then estimate a metric that quantifies the degree to which each model accounts for the patterns observed in data balanced against its flexibility. Model flexibility is often defined as the range of data patterns that a model can predict, which includes patterns that were observed as well as patterns that were not observed. Flexibility is an important consideration in model choice as it assesses the degree to which a theory is suitably constrained to predict the observed data and that it does not simply predict many possible patterns that were not observed; the latter form of model is theoretically non-informative as it “predicts” almost any pattern that could be observed. Many methods exist that attempt to quantitatively account for the flexibility of models, and hence inform model choice, including likelihood ratio tests, various information criteria (e.g., Akaike, Bayesian and Deviance Information Criteria; AIC, BIC, and DIC, respectively), minimum description length, cross validation and others, with varying degrees of success. The emerging gold standard in quantitative psychology is the marginal likelihood; the commonly cited Bayes factor is the ratio of the marginal likelihoods of two models.
The marginal likelihood of a model is the average likelihood of the model given the data, where the average is taken over the prior distribution of the model parameters. By integrating the likelihood across the prior predictive space of a model, the marginal likelihood has the attractive property that it inherently accounts for the flexibility of the model. This removes the need for post-hoc complexity corrections to goodness of fit measures, as implemented in the “information criteria” metrics (e.g., AIC, BIC, DIC), because it provides a guarantee that a model which accounts for the observed patterns in data and few other possible though unobserved patterns will be quantitatively favored over a competing model which accounts for the observed data but also provides the capacity to account for many other patterns that were not observed. This occurs because the latter model would have low likelihood across the (many) regions of the prior space where data were not observed, which lowers the marginal likelihood.
Although theoretically attractive, estimating the marginal likelihood for psychologically interesting models poses a number of practical challenges. The likelihood function is the density function of the data with the participant-level parameters (also called random effects) integrated out, when viewed as a function of the group-level parameters. It is usually analytically or computationally intractable for psychological and statistical models because it is an integral that cannot be evaluated. This can be the case even in conceptually very simple models. For example, in generalized linear mixed models where random effects (e.g., model parameters for individual participants) account for the dependence between observations from the same individual (Fitzmaurice, Laird, & Ware, 2011), the likelihood is often intractable because it is an integral over the random effects. Recent advances in particle Markov chain Monte Carlo (P-MCMC) methods extend the scope of the applications of Bayesian inference to cases where the likelihood is intractable (see Andrieu and Roberts (2009), Andrieu et al., (2010), Chopin et al., (2013), and Tran et al., (2016)). However, some basic problems such as estimating the marginal likelihood and the standard error of this estimator remain challenging, because the marginal likelihood is the density of the data with both the group-level parameters and random effects integrated out. “Marginal likelihood estimation for hierarchical models” discusses some existing approaches for addressing this challenge.
Our article proposes an importance sampling (IS) approach for estimating the marginal likelihood when the likelihood is intractable but can be estimated unbiasedly. The method is implemented by first running a sampling scheme such as MCMC, whose draws are used to form an importance distribution for the fixed parameters. We then estimate the marginal likelihood using an unbiased estimate of the likelihood combined with this importance distribution. We call this approach Importance Sampling Squared \(\left (\text {IS}^{2}\right )\), as it is itself an importance sampling (IS) procedure and we can often estimate the likelihood by IS. We claim that it has several advantages over competing methods for estimating the marginal likelihood. The three most important ones are that a) it produces an unbiased estimate of the marginal likelihood; b) the method is robust against potential issues with the samples used to form the proposals; and c) it provides an easily computable and cheap estimate of the variability of the marginal likelihood estimator. Unbiasedness means that we can approach the true value and easily lower the variance of the estimator by taking more samples. Robustness means that the marginal likelihood estimator is simulation consistent even when the MCMC sampler has not yet converged to the posterior distribution of interest; this covers both the case where the MCMC targets a slightly perturbed posterior and the case where it targets the correct posterior but has not yet converged. The reason that IS2 is robust is that it uses importance weights to correct for deficiencies in the importance samples. Estimating the variability of the estimator directly means that we do not have to replicate the algorithm to obtain reliable estimates of the standard error of the estimator. These observations in turn suggest that it may be desirable to run the initial MCMC in parallel, without rigorously requiring each chain to converge before forming the proposals for IS. Sections “Marginal likelihood estimation for hierarchical models” and “General discussion and future work” and Appendix A expand on these points.
The rest of the paper is organized as follows. “Marginal likelihood estimation for hierarchical models” introduces the general model under consideration, reviews several competing approaches for estimating the marginal likelihood, and briefly compares them to our approach. “Estimating the marginal likelihood by IS2” gives a detailed description of the IS2 method. “Applying the IS2 method to data” illustrates the IS2 method in two applications, which show that the method gives accurate estimates of the marginal likelihood and its standard error in reasonable computing time. “General discussion and future work” concludes and discusses future work on speeding up the computation. There are five appendices. Appendix A outlines how our results on estimating the marginal likelihood extend to estimating posterior expectations. Appendix B studies the effect on IS of using an estimated likelihood. Appendix C provides further estimation details for the two applications in “Applying the IS2 method to data” and Appendix D provides additional applications of the method. Finally, Appendix E contains the proofs of all the results in the paper.
Marginal likelihood estimation for hierarchical models
This section defines the class of hierarchical models for which the density function of the observations (individual participant data), conditional on the group-level parameters 𝜃 and a vector of individual-level parameters α (random effects), is available, but the likelihood is intractable because it is an integral over the individual random effects that cannot be computed. We also briefly review competing approaches.
Let \(y_{j}\) be the vector of observations (responses) for the jth subject and define \({y}={y}_{1:S}=\left ({y}_{1},...,{y}_{S}\right )\) as the vector of observations for all S subjects. Let \({\alpha }_{j}\in {\mathbb {R}^{d_{\alpha }}}\), where \(\mathbb {R}^{d}\) denotes d-dimensional Euclidean space, be the vector of random effects (subject-level parameters) for subject j and define \(p\left ({\alpha }_{j}|{\theta }\right )\) as its density. We define \({\alpha }={\alpha }_{1:S}=\left ({\alpha }_{1},...,{\alpha }_{S}\right )\) as the vector of all random effects in the model (i.e., all subject-level parameters). Let \({\theta }\in {\mathbb {R}^{d_{\theta }}}\) be the vector of unknown group-level parameters and let \(p\left ({\theta }\right )\) be the prior for 𝜃. Assuming that the \(y_{j},\alpha_{j}\), j = 1,...,S are independent given 𝜃, the joint density of the random effects and the observations is

$$ p\left({y},{\alpha}|{\theta}\right)=\prod\limits_{j=1}^{S}p\left({y}_{j}|{\alpha}_{j},{\theta}\right)p\left({\alpha}_{j}|{\theta}\right). $$
We assume the densities \(p\left ({y}|{\alpha },{\theta }\right )\), \(p\left ({\alpha }|{\theta }\right )\), and \(p\left ({\theta }\right )\) are available analytically or numerically (e.g., by numerical integration). However, even when these densities are separately available, the likelihood of the hierarchical model is typically analytically intractable because it is an integral over the random effects:

$$ p\left({y}|{\theta}\right)=\prod\limits_{j=1}^{S}p\left({y}_{j}|{\theta}\right), \qquad p\left({y}_{j}|{\theta}\right)=\int p\left({y}_{j}|{\alpha}_{j},{\theta}\right)p\left({\alpha}_{j}|{\theta}\right)d{\alpha}_{j}. $$
(2)
By Bayes’ rule, we can express the joint posterior density of 𝜃 and α as

$$ \pi\left({\theta},{\alpha}|{y}\right)=\frac{p\left({y}|{\alpha},{\theta}\right)p\left({\alpha}|{\theta}\right)p\left({\theta}\right)}{p\left({y}\right)}, $$

where

$$ p\left({y}\right)=\int p\left({y}|{\theta}\right)p\left({\theta}\right)d{\theta} $$
(4)

is the marginal likelihood.
The main goal of the article is to develop an efficient method for estimating the marginal likelihood, given samples from the posterior distribution over the group-level parameters and individual-level random effects. These samples will have been obtained through generic sampling routines (e.g., JAGS, STAN) or custom samplers (e.g., differential evolution-MCMC (DE-MCMC) in Turner, Sederberg, Brown, & Steyvers, 2013). Even custom samplers still suffer from the problems of high autocorrelation between successive iterates of samples from the posterior, and of slow or uncertain convergence in some problems; for example, when the vector of random effects is high dimensional. These problems can lead to unreliable or biased posterior samples, in which case model selection metrics derived from those posterior samples can also be dramatically wrong. This issue has the potential to influence the reliability of some modern developments in estimating the marginal likelihood for cognitive models, including bridge sampling (Gronau et al., 2017; Gronau, Wagenmakers, Heck, & Matzke, 2019) and thermodynamic integration (Evans & Annis, 2019). Both of those approaches have the same structure: they begin with posterior samples, and post-process these samples to estimate the marginal likelihood. Both bridge sampling and thermodynamic integration are therefore sensitive to the accuracy of the sampling method – when the posterior samples are biased in some way, the estimated marginal likelihood will also be biased. An alternative approach is to bypass posterior samples altogether. For example, Evans and Brown (2018) proposed estimating the marginal likelihood directly by Monte Carlo integration over the prior. This “direct” method works well in small problems, but it can be highly inefficient and quickly becomes computationally infeasible in high-dimensional problems or with hierarchically-specified models.
Our paper proposes a new approach. The basic idea is that we first obtain samples of the model parameters (𝜃; group level), but not the random effects (α; individual participants). These samples are used to form efficient proposal densities for an importance sampling algorithm on the model parameters. The IS2 method therefore does not suffer from the same potential drawbacks as the bridge sampling and thermodynamic integration methods, because the samples are only used to form the proposals for 𝜃. This means that unreliable samples of 𝜃 may lead to inefficient proposal densities, which can decrease the efficiency of the IS2 method, but will not bias the marginal likelihood estimate. Indeed, the IS2 method makes it possible to obtain unbiased and simulation-consistent estimators of the marginal likelihood without ensuring that the underlying Markov chain(s) have converged, although the chains do need to produce a reasonable approximation of the posterior. Section “General discussion and future work” discusses the robustness property of the IS2 estimator and how to use it to speed up the estimation of the marginal likelihood and posterior moments.
Estimating the marginal likelihood by IS2
The likelihood of hierarchical models of the form in Eq. 2 can be analytically intractable, though it can be estimated unbiasedly using importance sampling (other hierarchical models such as state space models may require a particle filter). Let \(\left \{ m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right );j=1,...,S\right \} \) be a family of proposal densities that we use to approximate the conditional posterior densities \(\left \{ \pi \left ({\alpha }_{j}|{\theta }\right );j=1,...,S\right \}\). Let \(g_{\text {IS}}\left ({\theta }\right )\) be a proposal density for the group-level parameters, 𝜃. We note that many different methods can be used to obtain efficient and reliable proposals \(g_{\text {IS}}\left ({\theta }\right )\) for the group-level parameters and \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) for the random effects αj, for each subject j = 1,...,S. A simple frequently used approach takes the proposal as a multivariate Student t distribution with a small number of degrees of freedom, whose location is a mode of the log-likelihood and whose scale matrix is the inverse of minus the Hessian matrix at this mode. Our article constructs the proposal distribution by first running MCMC, even just a short MCMC, to obtain samples from the posterior distribution. We then construct a proposal distribution in the IS2 procedure by fitting a mixture of normal or Student t distributions to these samples, as outlined in “Applying the IS2 method to data” and Appendix C.
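To make this concrete, the following sketch fits a simple proposal \(g_{\text {IS}}\left ({\theta }\right )\) to MCMC draws of 𝜃. It is written in Python purely for illustration (the code accompanying the paper is at osf.io/xv59c); the function name and the choice of a single multivariate Student t component, rather than a mixture, are our assumptions.

```python
import numpy as np
from scipy import stats

def build_theta_proposal(theta_draws, df=5):
    """Fit a multivariate Student-t proposal g_IS(theta) to (possibly short-run) MCMC draws.

    theta_draws: array of shape (n_draws, d_theta).
    df: small degrees of freedom, so the proposal has heavier tails than the posterior.
    """
    loc = theta_draws.mean(axis=0)                 # location: estimated posterior mean
    shape = np.cov(theta_draws, rowvar=False)      # scale matrix: estimated posterior covariance
    return stats.multivariate_t(loc=loc, shape=shape, df=df)

# Usage: draw M importance samples for theta and evaluate the proposal density.
# g_is = build_theta_proposal(mcmc_draws)          # mcmc_draws: (n_draws, d_theta)
# theta_samples = g_is.rvs(size=10_000)
# log_g = g_is.logpdf(theta_samples)
```

Replacing the single Student t with a fitted mixture of normal or Student t components follows the construction described above, at the cost of a slightly more involved fitting step.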
The density \(p\left ({y}_{j}|{\theta }\right )\) is estimated unbiasedly by

$$ \widehat{p}_{N}\left({y}_{j}|{\theta}\right)=\frac{1}{N}\sum\limits_{i=1}^{N}\frac{p\left({y}_{j}|{\alpha}_{j}^{\left(i\right)},{\theta}\right)p\left({\alpha}_{j}^{\left(i\right)}|{\theta}\right)}{m_{j}\left({\alpha}_{j}^{\left(i\right)}|{\theta},{y}_{j}\right)}, $$
where \({\alpha }_{j}^{\left (i\right )}\overset {iid}{\sim }m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) and N is the number of importance samples used in the likelihood estimate, which we will refer to as the number of “particles”. We note that for importance sampling to be effective, it is necessary for the support of \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\) to contain the support of its corresponding conditional density (i.e., the tails of the proposal should be fatter than the tails of the target). This condition for importance sampling is usually easily satisfied using the defensive sampling approach in Hesterberg (1995). Appendix C discusses the defensive sampling approach and the proposal \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\). Hence,

$$ \widehat{p}_{N}\left({y}|{\theta}\right)=\prod\limits_{j=1}^{S}\widehat{p}_{N}\left({y}_{j}|{\theta}\right) $$
is an unbiased estimator of the likelihood \(p\left ({y}|{\theta }\right )\). To choose an optimal number of particles N, discussed in Appendix B.1, we use the delta method to estimate the variance

$$ {\sigma^{2}_{N}}\left({\theta}\right):=\mathbb{V}\left[\log\widehat{p}_{N}\left({y}|{\theta}\right)\right] $$

of \(\log \widehat {p}_{N}\left ({y}|{\theta }\right )\). This gives the estimate

$$ \widehat{\sigma}^{2}_{N}\left({\theta}\right)=\sum\limits_{j=1}^{S}\frac{\widehat{\mathbb{V}}\left(\widehat{p}_{N}\left({y}_{j}|{\theta}\right)\right)}{\widehat{p}_{N}\left({y}_{j}|{\theta}\right)^{2}}, \qquad \widehat{\mathbb{V}}\left(\widehat{p}_{N}\left({y}_{j}|{\theta}\right)\right)=\frac{1}{N\left(N-1\right)}\sum\limits_{i=1}^{N}\left(w_{j}^{\left(i\right)}-\bar{w}_{j}\right)^{2}, $$

where \(w_{j}^{(i)}=p\left({y}_{j}|{\alpha}_{j}^{(i)},{\theta}\right)p\left({\alpha}_{j}^{(i)}|{\theta}\right)/m_{j}\left({\alpha}_{j}^{(i)}|{\theta},{y}_{j}\right)\) are the importance weights and \(\bar{w}_{j}\) is their sample mean.
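As a concrete illustration, the sketch below estimates \(p\left ({y}_{j}|{\theta }\right )\) for a single subject together with the delta-method variance of its logarithm. It is an illustrative Python sketch, not the authors' implementation; the arguments log_p_y_given_alpha, log_p_alpha and proposal stand for model-specific functions that the user supplies.

```python
import numpy as np
from scipy.special import logsumexp

def estimate_subject_log_likelihood(log_p_y_given_alpha, log_p_alpha, proposal, N, rng):
    """Importance-sampling estimate of log p(y_j | theta) for one subject.

    log_p_y_given_alpha(alpha): log p(y_j | alpha_j, theta) for each row of alpha, shape (N, d_alpha).
    log_p_alpha(alpha):         log p(alpha_j | theta) for each row of alpha.
    proposal:                   m_j(alpha_j | theta, y_j), assumed to provide .rvs(N, rng) and .logpdf(alpha).
    Returns the log estimate and the delta-method estimate of Var[log p_hat_N(y_j | theta)].
    """
    alpha = proposal.rvs(N, rng)
    log_w = log_p_y_given_alpha(alpha) + log_p_alpha(alpha) - proposal.logpdf(alpha)
    log_phat = logsumexp(log_w) - np.log(N)        # log of the average importance weight
    w = np.exp(log_w - log_phat)                   # weights scaled by the estimate itself
    var_log_phat = np.var(w, ddof=1) / N           # delta method: Var(p_hat) / p_hat^2
    return log_phat, var_log_phat

# The full log-likelihood estimate is the sum of log_phat over subjects, and the variance
# of the log-likelihood estimate is the corresponding sum of var_log_phat over subjects.
```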
We now rewrite Eq. 4 to show how it can be estimated by IS. Let u consist of all random variables used to construct \(\hat {p}_{N}(y|\theta )\) for a given value of 𝜃, with \(p_{N}(u|\theta)\) the density of u. In practice, as will be clear shortly, it is unnecessary to know \(p_{N}(u|\theta)\). We also equivalently write \(\hat {p}_{N}(y|\theta )\) as \(\hat {p}_{N}(y|\theta ,u)\). Then \(\hat {p}_{N}(y|\theta )\) is unbiased if

$$ \int\hat{p}_{N}\left(y|\theta,u\right)p_{N}\left(u|\theta\right)du=p\left(y|\theta\right). $$
Let \(g_{\text{IS}}(\theta)\) be the proposal for 𝜃 obtained by MCMC or otherwise. Then, the marginal likelihood is given by

$$ p\left(y\right)=\int p\left(y|\theta\right)p\left(\theta\right)d\theta=\int\int\frac{\hat{p}_{N}\left(y|\theta,u\right)p\left(\theta\right)}{g_{\text{IS}}\left(\theta\right)}\,g_{\text{IS}}\left(\theta\right)p_{N}\left(u|\theta\right)\,d\theta\,du. $$
This leads to the IS2 estimator,

$$ \widehat{p}_{\text{IS}^{2}}\left({y}\right)=\frac{1}{M}\sum\limits_{i=1}^{M}\widetilde{w}\left({\theta}_{i}\right), \qquad \widetilde{w}\left({\theta}\right):=\frac{\widehat{p}_{N}\left({y}|{\theta}\right)p\left({\theta}\right)}{g_{\text{IS}}\left({\theta}\right)}, $$

where \({\theta}_{i}\overset{iid}{\sim}g_{\text{IS}}\left({\theta}\right)\), i = 1,...,M.
Algorithm 1 outlines the IS2 algorithm for obtaining an estimate of the marginal likelihood. On a practical note, the algorithm is well-suited to efficient parallelization, by processing particles (importance samples) independently in steps (1) and (2).
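Because the algorithm is easiest to read as code, here is a minimal sketch of the IS2 loop, written in Python for illustration only and using the illustrative objects from the sketches above (estimate_log_likelihood stands for the per-subject estimator summed over subjects); it is not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def is2_log_marginal_likelihood(g_is, log_prior, estimate_log_likelihood, M, rng):
    """IS^2 estimate of log p(y) with a delta-method standard error.

    g_is:                    proposal for theta (e.g., fitted to MCMC draws), with .rvs and .logpdf.
    log_prior(theta):        log p(theta).
    estimate_log_likelihood: returns log p_hat_N(y | theta), the unbiased likelihood estimate on the log scale.
    """
    log_w = np.empty(M)
    for i in range(M):                                 # steps (1) and (2): can be run in parallel
        theta = g_is.rvs(size=1, random_state=rng)
        log_like = estimate_log_likelihood(theta, rng)
        log_w[i] = log_like + log_prior(theta) - g_is.logpdf(theta)
    log_p_hat = logsumexp(log_w) - np.log(M)           # step (3): log of the average weight
    w = np.exp(log_w - log_p_hat)                      # weights scaled by the estimate
    se_log_p_hat = np.sqrt(np.var(w, ddof=1) / M)      # delta-method standard error of log p_hat
    return log_p_hat, se_log_p_hat
```

The applications below estimate standard errors by bootstrapping the importance weights; the delta-method standard error shown here is a simpler alternative based on the variance estimator that follows.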
The usual way to estimate the standard error of the marginal likelihood estimator is by replication, i.e., by computing the marginal likelihood estimate a number of times and then computing the standard deviation of these estimates. However, reliable estimation by replication can take a long time. An important advantage of our method is that it is straightforward to estimate the standard error of the marginal likelihood estimator \(\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\). This affords researchers greater confidence in application of the estimated marginal likelihood, and also permits a simpler investigation of the tradeoff between efficiency and bias. The variance estimator is

$$ \widehat{\sigma}^{2}_{p_{\text{IS}^{2}}(y)}=\frac{1}{M}\sum\limits_{i=1}^{M}\left(\widetilde{w}\left({\theta}_{i}\right)-\widehat{p}_{\text{IS}^{2}}\left({y}\right)\right)^{2}, $$

so that the standard error of \(\widehat{p}_{\text{IS}^{2}}\left({y}\right)\) is estimated by \(\sqrt{\widehat{\sigma}^{2}_{p_{\text{IS}^{2}}(y)}/M}\).
It is clear that under mild conditions the IS2 estimator is unbiased, simulation consistent and tends to normality as it is an IS estimator. Theorem 1 formally states these results and Appendix E gives its proof.
Theorem 1
Let M be the number of samples for 𝜃 and N the number of particles for estimating the likelihood. Under the assumptions in Theorem A1 in Section A.1 of the Appendix:
(i) \(\mathbb {E}\left (\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\right )=p\left ({y}\right )\) and \(\widehat {p}_{\text {IS}^{2}}\left ({y}\right )\overset {a.s.}{\rightarrow }p\left ({y}\right )\) as \(M\rightarrow \infty \) for any N ≥ 1.

(ii) \(\sqrt {M}\left (\widehat {p}_{\text {IS}^{2}}\left ({y}\right )-p\left ({y}\right )\right )\overset {d}{\rightarrow }N\left (0,{\sigma }^{2}_{p_{\text {IS}^{2}}(y)}\right )\) and \(\widehat {\sigma }^{2}_{p_{\text {IS}^{2}}(y)}\overset {a.s.}{\rightarrow }{\sigma }^{2}_{p_{\text {IS}^{2}}(y)}=\mathbb {V}(\widetilde {w}\left ({\theta }\right ))\) as \(M\rightarrow \infty \) for any N ≥ 1.
There is a trade-off between the statistical precision of the IS2 estimator (in terms of the variance from Eq. 12) and the computational cost of obtaining the estimate. A large number of particles N gives a more accurate estimate of the likelihood at greater computational cost, while a small N results in a likelihood estimator \(\widehat p_{N}(y|\theta )\) that is cheap to evaluate but has larger variance. Appendix B provides theoretical and practical guidelines on this tradeoff. Proposition A1 in Appendix B.2 shows that the optimal N in the tradeoff between accuracy and computational cost is such that the variance of the log-likelihood estimator \(\mathbb {V}(\log \widehat p(y|\theta ))\) is approximately 1.
Applying the IS2 method to data
This section describes two applications of the IS2 method to real data. The applications are chosen from quite different domains – the first focuses on choice in a health context, the second focuses on an experimental manipulation in cognition. We chose these applications as they are challenging cases for estimating marginal likelihoods, and thus represent good test cases for illustrating the power of our method. They also emphasize that the IS2 method is not restricted to a particular domain of application or model class. It is a general method for estimating the marginal likelihood and the standard error of that estimate and can be applied in a variety of other contexts where quantitative models are estimated from data in a hierarchical framework including the study of attention, decision making, categorization, and memory.
Individual differences in health-related decisions
The first application explores individual differences in preferences for health decision making, by examining choices for hypothetical appointments with a health practitioner. A common approach to understanding individual differences in applied contexts is to assume that different individuals have different utility parameters, where the utility represents the subjective value placed on the attributes that describe the products and services on offer. In this sense the utility is a quantitative estimate of what people like and what they dislike. A common way to model individual differences in choices is to then assume individual differences in utility parameters, commonly described as “taste heterogeneity”. The mixed logit model is the most common framework for modeling individual differences in utilities (Train, 2009), which assumes the utility parameters are multivariate normally distributed in the population. Nevertheless, more recent research suggests that all individuals might hold the same (or sufficiently similar) utilities, except that those utilities can be scaled up or down for different individuals (Fiebig, Keane, Louviere, & Wasi, 2010). This approach assumes that the ratio of the utilities for pairs of items is constant across participants though the absolute difference between those utilities can differ. This “scale heterogeneity” means that some people make more random choices than others, though their latent preferences, in the limit of very many trials, are similar.
The key model comparison question in this context is whether the choices observed in a population of participants are more consistent with individual differences in latent preferences (taste) or individual differences in choice consistency (scale). We tested this idea for the data reported in Fiebig et al., (2010) in a study where S = 79 women made choices about T = 32 choice scenarios for a pap smear test, where each scenario was defined by a combination of the values of the K = 5 choice attributes listed in Table 1 (not all unique combinations were presented). For example, one choice scenario might be that the pap smear would be conducted by a female doctor who is unknown to the patient, the test is not due though the doctor recommends it, with a cost of $20. The participant is then asked whether they would take this test or not.
Our goal is to test whether choices in the pap smear test data provide greater support for the presence of individual differences in latent preferences, or the presence of individual differences in latent preferences and choice consistency. For this goal, we estimated different models which instantiated the different hypotheses. We then used the IS2 method to estimate the marginal likelihoods of those models, and used those estimates for model selection via Bayes factors.
Behavioural model
We analyse the choice data with the generalized multinomial logit (GMNL) model (Fiebig et al., 2010), a generalization of the mixed logit model that accounts for individual differences in both latent preferences and choice consistency. We let the observed choice for participant j in scenario i be \(y_{ij}=1\) if the participant chooses to take the test and \(y_{ij}=0\) otherwise. The general form of the GMNL for the probability of individual j on trial i selecting the test is

$$ p\left(y_{ij}=1|{\alpha}_{j},{\theta}\right)=\frac{\exp\left(\beta_{0j}+{\sum}_{k=1}^{K}\beta_{jk}x_{ijk}\right)}{1+\exp\left(\beta_{0j}+{\sum}_{k=1}^{K}\beta_{jk}x_{ijk}\right)}, $$
(13)
where \(\beta_{j}=\left(\beta_{0j},\beta_{1j},\ldots,\beta_{Kj}\right)'\) and \(X_{ij}=\left(x_{ij1},\ldots,x_{ijK}\right)'\) are the vectors of utility weights and choice attributes, respectively. As is standard in most consumer decision research, the GMNL model assumes that the attributes comprising an option (i.e., a hypothetical appointment) each have an associated utility, and the utility of an option as a whole is the sum of the component attribute utilities. The summed utility probabilistically generates a choice via the Luce choice rule, as shown in Eq. 13. The model also allows for an a priori bias toward accepting or rejecting the test at each trial independent of the feature values of the test, known as an alternative specific constant; \(\beta_{0j}=\beta_{0}+\eta_{0j}\), with \(\eta _{0j}\sim \mathcal {N}(0,{\sigma _{0}^{2}})\).
The utilities for individual participants are modeled as

$$ \beta_{jk}=\lambda_{j}\beta_{k}+\left[\gamma+\lambda_{j}\left(1-\gamma\right)\right]\eta_{jk}, \qquad \lambda_{j}=\exp\left(-\delta^{2}/2+\delta\zeta_{j}\right), \qquad k=1,\ldots,K, $$
with \(\eta _{jk}\sim {\mathcal {N}}(0,{\sigma _{k}^{2}})\) and \(\zeta _{j}\sim {\mathcal {N}}(0,1)\). The expected value of the scaling coefficients λj is one, implying that \(\mathbb {E}(\beta _{jk})=\beta _{k}\). We set the across-participant variance parameter for the cost attribute to zero (\({\sigma _{5}^{2}}=0\)) as we found no evidence of heterogeneity across individuals for this attribute, beyond the scaling effect. As the sum of the probabilities over the possible choice options in Eq. 13 is 1 (i.e., take the test, or not), to make the model identified we set the coefficients β with respect to yij = 0 (not taking the test) to zero.
The GMNL model accounts for individual differences in latent preferences (taste) and choice consistency (scale) through the parameters δ and γ, respectively. When δ = 0 (so that λj = 1 for all individuals), the GMNL model reduces to the mixed logit model. The mixed logit model captures heterogeneity in consumer preferences by allowing individuals to weight the choice attributes differently (i.e., individual differences in attribute utilities). By introducing taste heterogeneity, the mixed logit model avoids the restrictive independence of irrelevant alternatives property of the standard multinomial logit model.
The GMNL model additionally allows for differing choice consistency across individuals, known as scale heterogeneity, through the random variable λj. This variable changes all attribute weights simultaneously, allowing the model to predict choices to be more random for some consumers than others, such that smaller values of δ (so that \(\lambda _{j}\rightarrow 1\)) indicate differences in latent preferences where larger values of δ (so that \(\lambda _{j}\rightarrow 0\)) indicate differences in choice consistency. The γ parameter weights the specification between two alternative ways of introducing heterogeneity into the model.
The parameter vector is \(\theta =(\beta _{0},\beta _{1},\ldots ,\beta _{K},{\sigma _{0}^{2}},{\sigma _{1}^{2}},\ldots ,{\sigma _{K}^{2}},\delta ^{2},\gamma )'\), while the vector of random effects for each individual is \(\alpha_{j}=(\eta_{0j},\eta_{j1},\ldots,\eta_{jK},\lambda_{j})\). The likelihood is therefore

$$ p\left(y|\theta\right)=\prod\limits_{j=1}^{S}\int\left[\prod\limits_{i=1}^{T}p\left(y_{ij}|\alpha_{j},\theta\right)\right]p\left(\alpha_{j}|\theta\right)d\alpha_{j}, $$
where \(y_{ij}\) is the observed choice, \(y=(y_{11},\ldots,y_{1T},\ldots,y_{S1},\ldots,y_{ST})'\) and \(p(y_{ij}|\alpha_{j},\theta)\) is given by the choice probability in Eq. 13. We use diffuse priors specified as \(\beta _{0}\sim {\mathcal {N}}(0,100)\), \(p(\sigma _{0}) \propto (1+{\sigma _{0}^{2}})^{-1}\), \(\beta _{k}\sim {\mathcal {N}}(0,100)\), \(p(\sigma _{k}) \propto (1+{\sigma _{k}^{2}})^{-1}\), for k = 1,…,K, \(p(\delta)\propto\left(1+(\delta/0.2)^{2}\right)^{-1}\), and \(\gamma \sim \text {U}(0,1)\). The standard deviation parameters have half-Cauchy priors (Gelman, 2006). Appendix C provides complete details of the estimation procedure.
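For readers who prefer code, the sketch below evaluates the GMNL choice probabilities for one participant. It is a Python illustration based on our reading of the parameterization above and of Fiebig et al., (2010); in particular, the exact form of \(\lambda_{j}\) and of the γ-weighting is an assumption rather than a transcription of the authors' code.

```python
import numpy as np

def gmnl_choice_probs(beta, sigma, beta0, sigma0, delta, gamma, eta, eta0, zeta, X):
    """Probability that one participant accepts the test in each of T scenarios.

    beta, sigma: (K,) group-level utility means and heterogeneity standard deviations.
    eta, eta0, zeta: standard-normal random effects for this participant.
    X: (T, K) matrix of choice attributes.
    """
    lam = np.exp(-0.5 * delta ** 2 + delta * zeta)            # individual scale, E(lambda_j) = 1
    beta_j = lam * beta + (gamma + lam * (1.0 - gamma)) * (sigma * eta)
    beta_0j = beta0 + sigma0 * eta0                           # alternative specific constant
    utility = beta_0j + X @ beta_j                            # summed attribute utilities
    return 1.0 / (1.0 + np.exp(-utility))                     # logistic choice rule (Eq. 13)
```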
Results
We estimated the marginal likelihood using M = 50,000 importance samples for the parameters, and estimated the Monte Carlo standard errors by bootstrapping the importance samples because we found that bootstrapping gave the most stable estimates. Appendix C gives implementation details. The log marginal likelihoods (with standard errors) for the mixed logit and GMNL models were − 981.24 (.003) and − 978.27 (.012), respectively. The Monte Carlo standard errors are very small, indicating that the IS2 method is highly efficient. The estimates of the log marginal likelihood allow us to calculate a Bayes factor of approximately 20 for the GMNL model over the mixed logit model; \(\text{BF}_{\text{GMNL:mixed}}=\exp (-978.27 -(-981.24))\approx 19.5\).
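As a small worked illustration of how the reported quantities combine, the log Bayes factor is the difference of the two log marginal likelihoods, and, because the two estimates come from independent runs, the standard error of that difference follows by adding the variances (Python, illustrative only).

```python
import numpy as np

log_ml_gmnl, se_gmnl = -978.27, 0.012     # GMNL: estimate and Monte Carlo standard error
log_ml_mixed, se_mixed = -981.24, 0.003   # mixed logit

log_bf = log_ml_gmnl - log_ml_mixed                  # log Bayes factor, GMNL over mixed logit
se_log_bf = np.sqrt(se_gmnl ** 2 + se_mixed ** 2)    # SE of a difference of independent estimates

print(np.exp(log_bf), se_log_bf)                     # approximately 19.5, with SE of the log BF near 0.012
```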
The larger marginal likelihood for the GMNL model than the mixed logit model provides some evidence for the presence of scale heterogeneity for this data set. Corroborating evidence comes from the parameter estimates. The posterior mean for γ was .17 (90% credible interval [.014–.452]), where the weight for scale heterogeneity in the GMNL model was 1 − γ. This means that assuming individuals only differed in their latent preferences (taste heterogeneity) did not sufficiently capture the trends in data. Some participants also made more variable choices than others (scale heterogeneity).
The speed-accuracy tradeoff in perceptual decisions
Our second application explores the cognitive processes involved in perceptual decisions. We examine decision making in the context of the well-studied speed-accuracy tradeoff: the finding that electing to make faster decisions decreases decision accuracy and, conversely, that electing to increase decision accuracy causes decisions to become slower (see Heitz, 2014 for a review). The speed-accuracy tradeoff is typically attributed to changes in caution – the quantity of evidence required to trigger a response (i.e., the response threshold). Here, we apply the IS2 method to a class of decision making models in the Linear Ballistic Accumulator (LBA; Brown and Heathcote, 2008) framework to confirm that model choice via the marginal likelihood is consistent with existing methods of model choice in the literature.
We draw upon data reported by Forstmann, Dutilh, Brown, Neumann, and von Cramon (2008) coming from an experiment in which S = 19 participants were required to make difficult motion discrimination decisions under varying degrees of time pressure. The participants made repeated decisions about whether a cloud of semi-randomly moving dots appeared to move to the left or to the right of a display. Participants made these decisions under three conditions that differed in their task instructions: they were asked to respond as accurately as possible (condition 1: “accuracy emphasis”), at their own pace (condition 2: “neutral emphasis”), or as quickly as possible (condition 3: “speed emphasis”). The three conditions were randomized across trials. A visual cue described the decision emphasis on a trial-by-trial basis prior to stimulus onset. Each subject made 280 decisions in each condition for a total of T = 840 trials. See Forstmann et al., (2008) for all remaining details.
Freely available code applying the IS2 method to one of the hierarchical LBA models described here is available at osf.io/xv59c. Appendix D also provides two additional applications of the IS2 method to the hierarchical LBA: one application to the speed-accuracy tradeoff in lexical decisions, and a second application to response bias in lexical decisions. Code for these two examples is also available at the same repository.
Behavioural model
We analyse the choice and response time data with a hierarchical LBA model. In a typical perceptual decision making experiment, the ith observation for the jth participant contains two pieces of information. The first is the response choice, denoted by \(RE_{ij}\in \left \{ 1,...,C\right \} \), where C is the number of response alternatives. The second is the response time (RT), denoted by \(RT_{ij}\in \left (0,\infty \right )\). Brown and Heathcote (2008) derived the joint density and cumulative distribution function of the finishing time of an LBA accumulator over response choice and response time. The finishing time distribution for each of the LBA accumulators is determined by five parameters: the mean and standard deviation of the drift rate distribution \(\left (v,s\right )\), the threshold parameter b, the width of the start point distribution A, and the non-decision time τ. It is common to set the standard deviation s of the drift rate distribution to one, i.e. s = 1 (see also Donkin, Brown, & Heathcote, 2009). The probability density of the cth accumulator reaching the threshold at time t, and the other accumulators not reaching the threshold prior to time t, is

$$ \text{LBA}_{c}\left(t|b,A,{v},s,\tau\right)=f_{c}\left(t-\tau\right)\prod\limits_{k\neq c}\left[1-F_{k}\left(t-\tau\right)\right], $$
where \(f_{c}\) and \(F_{c}\) are the density and distribution functions for the finishing times of a single LBA accumulator (for details see Brown & Heathcote, 2008; Terry et al., 2015). In a two-choice experiment where response bias is not expected, it is convenient to define the means of the drift rate distributions as \({v}=\left (v^{\left (e\right )},v^{\left (c\right )}\right )\), where \(v^{\left (e\right )}\) is the mean of the drift rate for the accumulator corresponding to the incorrect response and \(v^{\left (c\right )}\) is the mean of the drift rate for the accumulator corresponding to the correct response. Together, these assumptions imply that each subject j, j = 1,...,S, has the vector of random effects

$$ \left(b_{j},A_{j},v_{j}^{\left(e\right)},v_{j}^{\left(c\right)},\tau_{j}\right). $$
With the usual assumption of independence, the conditional density of all the observations is

$$ p\left(y|\alpha,\theta\right)=\prod\limits_{j=1}^{S}\prod\limits_{i=1}^{T}\text{LBA}_{RE_{ij}}\left(RT_{ij}|b_{j},A_{j},{v}_{j},s,\tau_{j}\right). $$
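To make the race structure explicit, the following sketch computes the defective density for a single trial and the log density for one subject's data. It is a Python illustration; lba_pdf and lba_cdf are assumed helper functions implementing the single-accumulator expressions of Brown and Heathcote (2008), which are not reproduced here.

```python
import numpy as np

def lba_defective_density(rt, response, b, A, v_means, s, tau, lba_pdf, lba_cdf):
    """Density that `response` wins the race at time `rt` (one trial).

    v_means: (C,) drift-rate means, one per accumulator (e.g., error and correct).
    lba_pdf(t, b, A, v, s) / lba_cdf(t, b, A, v, s): finishing-time density / CDF of one accumulator.
    """
    t = rt - tau                                     # remove non-decision time
    if t <= 0:
        return 0.0
    dens = lba_pdf(t, b, A, v_means[response], s)    # the chosen accumulator finishes at t
    for c, v in enumerate(v_means):
        if c != response:
            dens *= 1.0 - lba_cdf(t, b, A, v, s)     # the other accumulators have not yet finished
    return dens

def subject_log_density(rts, responses, b, A, v_means, tau, lba_pdf, lba_cdf, s=1.0):
    """Sum of log defective densities over one subject's trials (s fixed to 1)."""
    return float(sum(np.log(lba_defective_density(rt, r, b, A, v_means, s, tau, lba_pdf, lba_cdf))
                     for rt, r in zip(rts, responses)))
```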
We follow the notation and assumptions in Gunawan, Hawkins, Tran, Kohn, and Brown (2019). For each subject j = 1,...,S, let the vector of random effects be

$$ \alpha_{j}=\left(\alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{\left(e\right)}},\alpha_{v_{j}^{\left(c\right)}},\alpha_{\tau_{j}}\right), $$
where \(\alpha _{b_{j}}=\log \left (b_{j}\right )\), \(\alpha _{A_{j}}=\log \left (A_{j}\right )\), \(\alpha _{v_{j}^{\left (e\right )}}=\log \left (v_{j}^{\left (e\right )}\right )\), \(\alpha _{v_{j}^{\left (c\right )}}=\log \left (v_{j}^{\left (c\right )}\right )\) and \(\alpha _{\tau _{j}}=\log \left (\tau _{j}\right )\). We take the following prior densities (as in Gunawan et al., 2019): the prior for \(\alpha_{j}\) is \(N\left ({\alpha }_{j};{\mu }_{\alpha },{\Sigma }_{\alpha }\right )\), \({\mu }_{\alpha }\sim N\left (0,I_{d_{\alpha }}\right )\) where \(d_{\alpha}\) is the dimension of \(\alpha_{j}\), and the hierarchical prior for Σα is

$$ {\Sigma}_{\alpha}\,|\,a_{1},...,a_{d_{\alpha}}\sim IW\left(v_{\alpha}+d_{\alpha}-1,\,2v_{\alpha}\,\text{diag}\left(1/a_{1},...,1/a_{d_{\alpha}}\right)\right), \qquad a_{d}\overset{iid}{\sim}IG\left(\tfrac{1}{2},\tfrac{1}{\mathcal{A}_{d}^{2}}\right),\ d=1,...,d_{\alpha}, $$
where \(v_{\alpha}\), \(\mathcal {A}_{1},...,\mathcal {A}_{d_{\alpha }}\) are positive scalars and \(\text {diag}\left (1/a_{1},...,1/a_{d_{\alpha }}\right )\) is the diagonal matrix with diagonal elements \(1/a_{1},...,1/a_{d_{\alpha }}\) (the \(IW\) and \(IG\) notation is defined in the Notes section). This is the marginally non-informative prior of Huang and Wand (2013). Following Gunawan et al., (2019), we set \(v_{\alpha}=2\) and \(\mathcal {A}_{d}=1\) for all d. These settings lead to half-t marginal prior distributions for the standard deviations of Σα and uniform marginal prior distributions for the off-diagonal correlations implied by Σα.
We tested whether the changes in task instructions across the three experimental conditions caused changes in observed decision behaviour that is consistent with changes in decision caution. We compared four LBA models. For consistency with the notation above, we refer to instructions that emphasize accuracy, neutral or speedy decisions as a, n and s, respectively, and the correct and error accumulators as c and e, respectively.
- Model I: assumes no differences in parameters across the three conditions, i.e., a null model. This model corresponds to the psychological assumption that performance is not influenced by the instructions given to participants. The vector of random effects for subject j is
  $$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
- Model II: assumes there are two response threshold parameters. One threshold parameter defines decision-making in the accuracy- and neutral-emphasis conditions, and the second threshold parameter in the speed-emphasis condition. This makes the simplifying assumption that performance in two of the three conditions is so similar as to be effectively identically distributed,
  $$ \alpha_{j}=\left( \alpha_{b_{j}^{\left( a/n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
- Model III: assumes there are three response threshold parameters, one for each condition. This model assumes that the participant responds to each condition with a different level of caution, though otherwise with identically distributed parameters,
  $$ \alpha_{j}=\left( \alpha_{b_{j}^{\left( a\right)}},\alpha_{b_{j}^{\left( n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}}\right). $$
- Model IV: assumes there are separate response threshold and non-decision time parameters for each condition. This model assumes that instructions to emphasise speedy, neutral or accurate decisions influence not only cautiousness but also the time required to perceptually encode the stimulus and produce a motor response (non-decision time),
  $$ \alpha_{j}=\left( \alpha_{b_{j}^{\left( a\right)}},\alpha_{b_{j}^{\left( n\right)}},\alpha_{b_{j}^{\left( s\right)}},\alpha_{A_{j}},\alpha_{v_{j}^{\left( e\right)}},\alpha_{v_{j}^{\left( c\right)}},\alpha_{\tau_{j}^{\left( a\right)}},\alpha_{\tau_{j}^{\left( n\right)}},\alpha_{\tau_{j}^{\left( s\right)}}\right). $$
As in the first application, Appendix C provides complete details of the estimation procedure.
Results
Once we obtained samples from the posterior, we ran the IS2 algorithm with M = 10,000 draws to estimate the log of the marginal likelihood, \(\log \ \widehat {p}(y)\), and adaptively set the number of particles N that were used to estimate the likelihood such that the variance of the log-likelihood estimates did not exceed 1; recall that \(\sigma _{opt}^{2}=1\) is the optimal variance of the log-likelihood (cf. “Estimating the marginal likelihood by IS2”). We initiated this adaptive procedure by starting with N = 250 particles and computed the variance of the log-likelihood given in Eq. 7. If the variance was greater than 1, we increased the number of particles. We again estimated the Monte Carlo standard error of the log of the marginal likelihood estimates by bootstrapping the importance samples because we found that bootstrapping gave the most stable results.
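The adaptive choice of N just described can be sketched as follows (Python, illustrative; the doubling rule and the cap n_max are our choices, as the paper only requires that the variance of the log-likelihood estimate be brought to at most about 1).

```python
def choose_num_particles(theta, estimate_log_likelihood_and_var, rng,
                         n_start=250, target_var=1.0, n_max=16_000):
    """Increase N until the estimated variance of log p_hat_N(y | theta) is at most target_var."""
    N = n_start
    while True:
        log_like, var_log_like = estimate_log_likelihood_and_var(theta, N, rng)
        if var_log_like <= target_var or N >= n_max:
            return N, log_like, var_log_like
        N *= 2          # the variance of the likelihood estimate scales roughly as 1/N
```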
The log marginal likelihoods (with standard errors) for Models I, II, III and IV are, respectively, 5204.17 (0.11), 7352.75 (0.06), 7453.73 (0.10) and 7521.44 (0.17), with a total computation time of around 80 minutes. As in the first application, the Monte Carlo standard errors are very small, suggesting that IS2 is highly efficient. The marginal likelihoods favor Model IV – allowing a separate response threshold and a separate non-decision time for each instruction condition. This outcome is partially consistent with Forstmann et al., (2008), but also suggests that the non-decision time – the combined time required to perceptually encode a stimulus and produce a motor response – changes with instruction condition.
General discussion and future work
Model comparison is the means by which competing theories are rigorously evaluated and compared, which is fundamental to advancing psychological science. Many quantitative approaches to model comparison in psychology have struggled to appropriately account for model flexibility – the range of data patterns that a model can predict. For example, commonly used methods such as AIC and BIC measure the flexibility of a model simply by counting the number of freely estimated model parameters. This is problematic because not all parameters are equal, in terms of complexity, and models with the same numbers of parameters can be quite different in complexity. In some cases, adding extra parameters to a model can even decrease the complexity, for example when a hierarchical distribution structure is added to constrain otherwise-independent models for individual participants. Model selection via the marginal likelihood is one of the few methods that naturally accounts for model flexibility.
Despite its theoretical advantages, the marginal likelihood has seen limited uptake in practice for all but the simplest of psychological models, owing to its prohibitive computational cost. Although many psychologically interesting models have analytic, numerical or rapidly-simulated solutions for the responses of individual participants (e.g., the density of observed choices and response times of individual participants), the likelihood for hierarchical versions of those same models can be analytically intractable because they contain an integral over the individual random effects. This has proved to be the stumbling block in previous attempts to compute the marginal likelihood for hierarchical implementations of psychological models.
Our article develops a method for Bayesian model comparison when the likelihood (of a hierarchical model) is intractable but can be estimated unbiasedly. We show that when the density of individual observations conditioned on group- and individual-level parameters is available, an importance sampling algorithm can provide an unbiased estimate of the marginal likelihood of hierarchical models. We term this approach importance sampling squared (IS2). The IS2 method can be applied to samples from the posterior distribution over a model’s parameters obtained following any sampling method. Appendix B studies the convergence properties of the IS2 estimator and provides practical guidelines for optimally selecting the number of particles required to estimate the likelihood.
We show how the IS2 method can be used to estimate the marginal likelihood in hierarchical models of health decision making and choice response time models. In both applications, the method estimates the marginal likelihood in a principled way, providing conclusions that are consistent with the current literature. In all cases, the marginal likelihood is estimated with very small Monte-Carlo standard errors. To aid researchers in using the IS2 method in their own research, we provide scripts implementing the key elements of the method, as applied to data from Forstmann et al., (2008) and Wagenmakers, Ratcliff, Gomez, and McKoon (2008) – see osf.io/xv59c for more details.
The data applications highlight two important properties of the IS2 method: it is an efficient and unbiased estimator of the marginal likelihood, and it provides a standard error of the estimator. The marginal likelihood is a key quantity in appropriately accounting for model flexibility in quantitative model comparison. The standard error of an estimator is essential for interpreting any model comparison metric, not just marginal likelihoods, as it expresses the variability of the estimated metric. This is equally important in quantitatively comparing psychological models as it is in the interpretation of conventional statistics, despite being routinely overlooked. No researcher would conclude the population means of two groups differ solely on the basis of a difference in their sampled group means – we would demand a measure that expresses the magnitude of the mean difference as a function of its variability, like a t-test. The IS2 method provides the researcher with the tools to do precisely this with potentially complex and non-linear psychological models: it estimates the magnitude of differences (marginal likelihoods for different models) as a function of the variability of those estimates (the standard errors).
We emphasize that the IS2 method is general and can be used to estimate the marginal likelihood and its standard error from models where the density of individual observations conditioned on group-level parameters and a vector of individual-level parameters is available. This is quite a general scenario – reaching beyond the models studied here, and decision-making models more generally – that applies to a very large category of psychological models for which hierarchical Bayesian parameter estimation has been used to date. This generality holds great promise for IS2 as a vehicle for performing model comparison via the marginal likelihood in psychological research.
We also note that the IS2 method is robust, by which we mean that the MCMC draws do not have to be draws from the exact posterior, because IS2 only uses them to form the importance sampling proposals. Hence, as long as the draws lie roughly in the same region as the posterior, the marginal likelihood estimates will be simulation consistent. The same remarks hold if the MCMC targets a slightly perturbed posterior.
For faster computation, we conjecture that future work could speed up the estimation of the marginal likelihood as follows: a) First, run the MCMC procedure in parallel on K chains without worrying about the precise convergence of each chain. In practice this means we can run each chain for far fewer iterates than would be required for carrying out Bayesian inference if the inference was based only on the output of each chain. b) Second, form a proposal density based on the output of all the chains. c) Third, use IS to obtain K robust and unbiased estimates of the marginal likelihood, one from each of the K processors. d) Finally, average the K estimates to get a final unbiased estimate of the marginal likelihood whose variance is 1/K times the variance of each individual estimator. The robustness and unbiasedness properties of the IS2 estimator are crucial in ensuring that bias does not dominate the variance as K becomes large.
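Because unbiasedness holds on the raw (not log) scale, step d) should average the K estimates before taking logs. A numerically stable way to do this from the K log-scale estimates is sketched below (Python, illustrative).

```python
import numpy as np
from scipy.special import logsumexp

def combine_log_marginal_likelihoods(log_estimates):
    """Log of the average of K independent unbiased marginal-likelihood estimates.

    log_estimates: length-K array of log p_hat_k(y), one per chain/processor.
    Averaging on the raw scale preserves unbiasedness; logsumexp avoids numerical underflow.
    """
    log_estimates = np.asarray(log_estimates)
    return logsumexp(log_estimates) - np.log(len(log_estimates))
```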
We note in Appendix A that posterior expectations of functions of the parameters can similarly be estimated with IS2, albeit with some very minor bias. We conjecture that such posterior moment estimators will be far more efficient than those obtained from standard MCMC output, but leave such speed-ups of the computation of the marginal likelihood and posterior moments to future work.
Finally, we note that the marginal likelihood is sometimes sensitive to the prior, and this might lead to some controversy in using marginal likelihood for model comparison. Addressing this issue is beyond the scope of this paper, and we focus only on how to estimate the marginal likelihood efficiently when the likelihood function is intractable.
Notes
The notation \(IW\left (a,A\right )\) means an inverse Wishart distribution with degrees of freedom a and scale matrix A and the notation \(IG\left (a,b\right )\) means an inverse Gamma distribution with scale parameter a and shape parameter b.
References
Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B, 72, 1–33.
Andrieu, C., & Roberts, G. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37, 697–725.
Brown, S., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Chopin, N., Jacob, P. E., & Papaspiliopoulos, O. (2013). SMC2: An efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society Series B, 75(3), 397–426.
Donkin, C., Brown, S. D., & Heathcote, A. J. (2009). The over-constraint of response time models: Rethinking the scaling problem. Psychonomic Bulletin & Review, 16, 1129–1135.
Doucet, A., Pitt, M. K., Deligiannidis, G., & Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2), 295–313.
Evans, N. J., & Annis, J. (2019). Thermodynamic integration via differential evolution: A method for estimating marginal likelihoods. Behavior Research Methods.
Evans, N. J., & Brown, S. D. (2018). Bayes factors for the linear ballistic accumulator model of decision-making. Behavior Research Methods, 50(2), 589–603.
Fiebig, D. G., Keane, M. P., Louviere, J., & Wasi, N. (2010). The generalized multinomial logit model: accounting for scale and coefficient heterogeneity. Marketing Science, 29(3), 393–421.
Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011) Applied longitudinal analysis, (2nd edn.) New Jersey: Wiley.
Forstmann, B. U., Brown, S., Dutilh, G., Neumann, J., & Wagenmakers, E. -J. (2010). The neural substrate of prior information in perceptual decision making: a model-based analysis. Frontiers in Human Neuroscience, 4, 40.
Forstmann, B. U., Dutilh, G., Brown, S., Neumann, J., & von Cramon, D. Y. (2008). Striatum and pre-SMA facilitate decision-making under time pressure. Proceedings of the National Academy of Sciences, 105, 17538–17542.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3), 515–534.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6), 1317–1339.
Gourieroux, C., & Monfort, A. (1995) Statistics and econometric models Vol. 1. Cambridge: Cambridge University Press.
Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Marsman, M., ..., Steingroever, H. (2017). A tutorial on bridge sampling. Journal of Mathematical Psychology, 81, 80–97.
Gronau, Q. F., Wagenmakers, E. -J., Heck, D. W., & Matzke, D. (2019). A simple method for comparing complex models: Bayesian model comparison for hierarchical multinomial processing tree models using warp-iii bridge sampling. Psychometrika, 84(1), 261–284.
Gunawan, D., Hawkins, G. E., Tran, M. -N., Kohn, R., & Brown, S. D. (2019). New estimation approaches for the linear ballistic accumulator model. Working paper. arXiv:1806.10089.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodology, and behavior. Frontiers in Neuroscience, 8, 150.
Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37, 185–194.
Hoogerheide, L., Opschoor, A., & Van Dijk, H. K. (2012). A class of adaptive importance sampling weighted EM algorithms for efficient and robust posterior and predictive simulation. Journal of Econometrics, 171(2), 101–120.
Huang, A., & Wand, M. P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2), 439–452.
Pitt, M. K., Silva, R. S., Giordani, P., & Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2), 134–151.
Rae, B., Heathcote, A., Donkin, C., Averell, L., & Brown, S. (2014). The hare and the tortoise: Emphasizing speed can change the evidence used to make decisions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(5), 1226–1243.
Richard, J.-F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2), 1385–1411.
Ripley, B. (1987) Stochastic simulation. New York: Wiley.
Terry, A., Marley, A., Barnwal, A., Wagenmakers, E.-J., Heathcote, A., & Brown, S. D. (2015). Generalising the drift rate distribution for linear ballistic accumulators. Journal of Mathematical Psychology, 68, 49–58.
Train, K. E. (2009). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
Tran, M. N., Kohn, R., Quiroz, M., & Villani, M. (2016). Block-wise pseudo marginal Metropolis-Hastings. arXiv:1603.02485v2.
Turner, B. M., Sederberg, P. B., Brown, S. D., & Steyvers, M. (2013). A method for efficiently sampling from distributions with correlated dimensions. Psychological Methods, 18(3), 368–384.
Voss, A., Rothermund, K., & Voss, J. (2004). Interpreting the parameters of the diffusion model: an empirical validation. Memory & Cognition, 32(7), 1206–1220.
Wagenmakers, E. J., Ratcliff, R., Gomez, P., & McKoon, G. (2008). A diffusion model account of criterion shifts in the lexical decision task. Journal of Memory and Language, 58, 140–159.
Acknowledgments
The research of Tran, Gunawan, Kohn and Brown was partially supported by Australian Research Council grant DP180102195. Hawkins was partially supported by Australian Research Council grant DE170100177. Tran and Kohn thank Michael Pitt for some useful discussions that led to the development of Proposition A1.
Open Practices Statement
The two applications cover previously published data sets (Fiebig et al., 2010; Forstmann, Dutilh, Brown, Neumann, & von Cramon, 2008). Data and code are available at osf.io/xv59c.
Appendices
Appendix A: Using the IS2 method to obtain posterior expectations
The motivation for using IS2 so far in this article was to develop an efficient and robust estimator for the marginal likelihood and the standard error of the estimator. This section shows that it is also straightforward to use IS2 to robustly estimate expectations with respect to the posterior distribution, as well as the standard errors of these estimators. We also discuss the convergence properties of these estimators.
Using the same notation as in “Marginal likelihood estimation for hierarchical models”, the posterior expectation of the function φ of 𝜃 is

$$ \mathbb{E}_{\pi}(\varphi)=\frac{\int\varphi(\theta)p(y|\theta)p(\theta)d\theta}{\int p(y|\theta)p(\theta)d\theta}. $$
(17)
When the likelihood is intractable, but can be estimated unbiasedly, we can use IS2 to estimate unbiasedly and robustly both the numerator and denominator on the right side of Eq. 17, similarly to “Estimating the marginal likelihood by IS2”, to obtain

$$ {\hat{\mathbb{E}}}_{\pi}(\varphi)=\frac{{\sum}_{i=1}^{M}\varphi(\theta_{i})\widetilde{w}(\theta_{i})}{{\sum}_{i=1}^{M}\widetilde{w}(\theta_{i})}, \qquad \widetilde{w}(\theta)=\frac{\hat{p}_{N}(y|\theta)p(\theta)}{g_{\text{IS}}(\theta)}, \qquad \theta_{i}\overset{iid}{\sim}g_{\text{IS}}(\theta),\ i=1,\ldots,M. $$
(18)
In Eq. 18, \(\hat {p}_{N}(y|\theta )\) is the unbiased estimate of p(y|𝜃), with N the number of samples or particles used to estimate the likelihood. It is clear that under mild conditions both the numerator and denominator of Eq. 18 will converge to the numerator and denominator respectively of Eq. 17 as \(M \rightarrow \infty \), and hence \({\hat {\mathbb {E}}}_{\pi }(\varphi )\) converges to \(\mathbb {E}_{\pi }(\varphi )\). It is also clear that under very mild conditions the numerator of Eq. 18 becomes normally distributed as \(M \rightarrow \infty \), which then means that \({\hat {\mathbb {E}}}_{\pi }(\varphi )\) also converges to normality. These issues are discussed more rigorously below.
We note that while both the numerator and denominator in Eq. 17 are estimated unbiasedly, \({\hat {\mathbb {E}}}_{\pi }(\varphi )\) is a biased estimator of \(\mathbb {E}_{\pi }(\varphi )\), although the bias goes to 0 as \(M \rightarrow \infty \).
In practice, the variances of both the numerator and denominator in Eq. 18, as well as their covariance, can be estimated by the bootstrap and hence both the variance and bias of \({\hat {\mathbb {E}}}_{\pi }(\varphi )\) can be estimated.
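A minimal sketch of the ratio estimator in Eq. 18 together with the bootstrap assessment of its variability described above (Python, illustrative; the variable names are ours).

```python
import numpy as np

def posterior_expectation(phi_values, log_weights, n_boot=1000, rng=None):
    """Self-normalised IS^2 estimate of E_pi[phi(theta)] with a bootstrap standard error.

    phi_values:  (M,) values of phi evaluated at the importance samples of theta.
    log_weights: (M,) values of log w_tilde(theta_i)
                 = log p_hat_N(y | theta_i) + log p(theta_i) - log g_IS(theta_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.exp(log_weights - log_weights.max())       # rescale for numerical stability
    estimate = np.sum(w * phi_values) / np.sum(w)     # the ratio estimator (Eq. 18)

    M = len(w)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, M, size=M)              # resample the importance samples
        boot[b] = np.sum(w[idx] * phi_values[idx]) / np.sum(w[idx])
    return estimate, boot.std(ddof=1)
```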
A.1 Some technical results
We now give some large-sample (in M) convergence results for \({\hat {\mathbb {E}}}_{\pi }(\varphi )\), under the following assumption.
Assumption A1
\(\mathbb {E}[\hat {p}_{N}(y|\theta )]=p(y|\theta )\) for every 𝜃 ∈Θ, where the expectation is with respect to the random variables generated in the process of estimating the likelihood.
It is useful in the rest of this section and Appendix E to follow Pitt, Silva, Giordani, and Kohn (2012) and write \(\hat {p}_{N}(y|\theta )\) as \(p(y|\theta)e^{z}\), where \(z:=\log \hat {p}_{N}(y|\theta )-\log p(y|\theta )\) is a scalar random variable whose distribution conditional on 𝜃 is governed by the randomness due to estimating the likelihood p(y|𝜃). Thus, the scalar z replaces the multivariate u in “Estimating the marginal likelihood by IS2”. Let \(g_{N}(z|\theta)\) be the density of z conditional on 𝜃. Assumption A1 implies that \(\mathbb {E}(e^{z}|\theta ) = {\int \limits }_{\mathbb {R}} e^{z}g_{N}(z|\theta )dz=1\).
Theorem A1
Suppose that Assumption A1 holds, \(\mathbb {E}_{\pi }(|\varphi |)\) exists and is finite, and \(\text {Supp}(\pi )\subseteq \text {Supp}(g_{\text {IS}})\), where Supp(π) denotes the support of the distribution π.
(i) For any N ≥ 1, \({\hat {\mathbb {E}}}_{\pi }(\varphi )\overset {a.s.}{\longrightarrow }\mathbb {E}_{\pi }(\varphi )\) as \(M\to \infty \).

(ii) If
$$ \int h(\theta)^{2} \left( \frac{\pi(\theta)}{g_{\text{IS}}(\theta)}\right)^{2}\left( \int \exp(2z)g_{N}(z|\theta)dz\right) g_{\text{IS}}(\theta) d\theta $$
(19)
is finite for h(𝜃) = φ(𝜃) and h(𝜃) = 1 for all N, then
$$ \sqrt{M}\left( {\hat{\mathbb{E}}}_{\pi}(\varphi)-\mathbb{E}_{\pi}(\varphi)\right)\overset{d}{\to}{\mathcal{N}}\left( 0,\sigma^{2}_{\text{IS}^{2}}(\varphi)\right), \quad \text{as } M\to\infty, $$
(20)
where the asymptotic variance in M for fixed N is given by
$$ \sigma^{2}_{\text{IS}^{2}}(\varphi)=\mathbb{E}_{\pi} \left\{\left( \varphi(\theta)-\mathbb{E}_{\pi}(\varphi)\right)^{2}\frac{\pi(\theta)}{g_{\text{IS}}(\theta)}\mathbb{E}_{g_{N}}[\exp(2z)]\right\}. $$
(21)

(iii) Define
$$ \widehat{\sigma}^{2}_{\text{IS}^{2}}(\varphi):=\frac{M{\sum}_{i=1}^{M}\left( \varphi(\theta_{i})-\hat{\mathbb{E}}_{\pi}(\varphi)\right)^{2}\widetilde{w}(\theta_{i})^{2}}{\left( {\sum}_{i=1}^{M}\widetilde{w}(\theta_{i})\right)^{2}}. $$
(22)
If the conditions in (ii) hold, then \(\widehat \sigma ^{2}_{\text {IS}^{2}}(\varphi )\stackrel {a.s.}{\longrightarrow }\sigma ^{2}_{\text {IS}^{2}}(\varphi )\) as \(M\to \infty \), for given N.

The proof is in Appendix E.
Although both \(\sigma ^{2}_{\text {IS}^{2}}(\varphi )\) and \(\widehat \sigma ^{2}_{\text {IS}^{2}}(\varphi )\) depend on N, we do not show this dependence explicitly, to simplify the notation. Here, all the probabilistic statements, such as the almost sure convergence, must be understood on the extended probability space that takes into account the extra randomness occurring when estimating the likelihood. When the likelihood can be computed, the analogous results are well known in the literature (see, e.g., Geweke, 1989, p. 1318).
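The variance estimator in Eq. 22 is cheap to evaluate from the IS2 output. A minimal MATLAB sketch follows, again assuming the illustrative vectors phi_vals and w_tilde introduced above; the standard error of \({\hat{\mathbb{E}}}_{\pi}(\varphi)\) is then \(\widehat{\sigma}_{\text{IS}^{2}}(\varphi)/\sqrt{M}\).

M        = numel(w_tilde);
Ehat     = sum(phi_vals .* w_tilde) / sum(w_tilde);                        % Eq. 18
sig2_hat = M * sum((phi_vals - Ehat).^2 .* w_tilde.^2) / sum(w_tilde)^2;   % Eq. 22
se_Ehat  = sqrt(sig2_hat / M);                                             % asymptotic standard error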
Appendix B: The effect on importance sampling of estimating the likelihood
The results in the previous section show that it is straightforward to use importance sampling even when the likelihood is intractable but unbiasedly estimated. This section addresses the question of how much asymptotic efficiency is lost when working with an estimated likelihood. We follow Pitt et al. (2012) and Doucet, Pitt, Deligiannidis, and Kohn (2015) and adopt the following idealized assumption, which makes it possible to develop some intuition and practical guidelines for selecting N. Appendix E contains all the proofs unless stated otherwise.
Assumption A2
- (i) There exists a function γ2(𝜃) such that the density gN(z|𝜃) of z is \({\mathcal {N}}(-\frac {\gamma ^{2}(\theta )}{2N},\frac {\gamma ^{2}(\theta )}{N})\), where \({\mathcal {N}}(a,b^{2})\) is a univariate normal density with mean a and variance b2.
- (ii) For a given σ2 > 0, define \(N_{\sigma ^{2}}(\theta ):=\gamma ^{2}(\theta )/\sigma ^{2}\). Then, \(\mathbb {V}(z|\theta , N =N_{\sigma ^{2}}(\theta ) )\equiv \sigma ^{2}\) for all 𝜃 ∈Θ.
If gN(z|𝜃) is Gaussian, then, by Assumption A1, its mean is \(-\frac 12 \) times its variance because \(\mathbb {E}_{g_{N}}(\exp (z))=1\). Assumption A2 (ii) keeps the variance \(\mathbb {V}(z|\theta , N)\) constant across different values of 𝜃, thus making it easy to associate the IS2 asymptotic variances with σ. Under Assumption A2, the density gN(z|𝜃) depends only on σ and we write it as g(z|σ).
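The link between Assumptions A1 and A2 is easy to verify numerically. The following MATLAB fragment, included purely as an illustration, simulates z from g(z|σ) and checks that \(\mathbb{E}(e^{z})\approx 1\) and \(\mathbb{E}(e^{2z})\approx e^{\sigma^{2}}\).

sigma2 = 1;                                       % variance of the log-likelihood estimator
z = -sigma2/2 + sqrt(sigma2) * randn(1e6, 1);     % draws from g(z|sigma) under Assumption A2
disp([mean(exp(z)),   1]);                        % E[exp(z)] should be close to 1 (Assumption A1)
disp([mean(exp(2*z)), exp(sigma2)]);              % E[exp(2z)] should be close to exp(sigma2)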
Lemma A1
If Assumption A2 holds for a fixed σ2, then Eq. 19 becomes
$$ \exp(\sigma^{2}) \int h(\theta)^{2} \left( \frac{\pi(\theta)}{g_{\text{IS}}(\theta)} \right)^{2} g_{\text{IS}}(\theta)\, d\theta $$
for both h = φ and h = 1.
These are the standard conditions for IS (Geweke, 1989). The proof of this lemma is straightforward and omitted.
Recall that \({\sigma }_{\text {IS}}^{2}(\varphi )\) and \(\sigma ^{2}_{\text {IS}^{2}}(\varphi )\) are, respectively, the asymptotic variances of the IS estimators we would obtain when the likelihood is available and when it is estimated. We refer to the ratio \(\sigma ^{2}_{\text {IS}^{2}}(\varphi )/{\sigma }_{\text {IS}}^{2}(\varphi )\) as the inflation factor, which measures how much the asymptotic variance is inflated when working with an estimated likelihood. Theorem A2 gives an expression for the inflation factor and shows that it is independent of φ, is at least 1, and increases exponentially with σ2, the variance of z.
Theorem A2
Under Assumption A2 and the conditions in Theorem A1,
$$ \frac{\sigma^{2}_{\text{IS}^{2}}(\varphi)}{{\sigma}_{\text{IS}}^{2}(\varphi)} = \exp(\sigma^{2}). $$(24)
B.1 Optimal N for estimating posterior expectations
This section aims to explicitly take into account both the statistical precision of the IS2 estimators and their computational cost. It is apparent that there is a trade-off between these two considerations. A large value of N results in a precise estimator so that σ2 is small and the relative variance \(\mathbb {V}(\hat {\varphi }_{\text {IS}^{2}})/\mathbb {V}(\hat {\varphi }_{\text {IS}})={\sigma ^{2}_{\text {IS}^{2}}(\varphi )}/{{\sigma }_{\text {IS}}^{2}(\varphi )}=\exp (\sigma ^{2})\) given by Eq. 24 is close to 1. However, the cost of such an estimator will be large due to the large number of particles, N, required. Conversely, a small value of N results in an estimator that is cheap to evaluate but has a large value of σ2 and hence the variance of the IS2 estimator relative to the IS estimator will be large. To explicitly trade-off these considerations, we introduce the computational time CT(σ2) which is a product of the relative variance and the computational effort. Minimising CT(σ2) results in an optimal value for σ2 and hence N.
Under Assumption A2, \(N=N_{\sigma ^{2}}(\theta )=\gamma ^{2}(\theta )/\sigma ^{2}\), so that the expected number of particles, over the draws of 𝜃, is \({\overline N}=\overline {{\gamma }^{2}}/\sigma ^{2}\), where \(\overline {{\gamma }^{2}}:=\mathbb {E}_{g_{\text{IS}}}[\gamma ^{2}(\theta )]\). This motivates the assumption that the computational cost per importance sample is proportional to 1/σ2. From Theorems A1 and A2, the variance of the estimator \(\hat {\varphi }_{\text {IS}^{2}}\) based on M importance samples from gIS(𝜃) is approximated by
$$ \mathbb{V}(\hat{\varphi}_{\text{IS}^{2}}) \approx \frac{\sigma^{2}_{\text{IS}^{2}}(\varphi)}{M} = \frac{{\sigma}_{\text{IS}}^{2}(\varphi)\exp(\sigma^{2})}{M}. $$
If κ∗ is the target precision of the estimator \(\hat \varphi _{\text {IS}^{2}}\), then the required number of samples is \(M={{\sigma }_{\text {IS}}^{2}(\varphi )e^{\sigma ^{2}}}/{\kappa ^{*}}\), and the required computational cost is proportional to
$$ M \times \overline{N} = \frac{{\sigma}_{\text{IS}}^{2}(\varphi)\,\overline{\gamma^{2}}}{\kappa^{*}} \times \frac{\exp(\sigma^{2})}{\sigma^{2}} \;\propto\; \frac{\exp(\sigma^{2})}{\sigma^{2}}. $$
Therefore, it makes sense to define the measure of the computing time of the IS2 method (in order to obtain a given precision for the IS2 estimators) as
$$ \text{CT}(\sigma^{2}) := \frac{\exp(\sigma^{2})}{\sigma^{2}}. $$
It is straightforward to check that CT(σ2) is minimized at
$$ {\sigma}_{\text{opt}}^{2} = 1, $$(27)
and the optimal number of particles N is such that \(\mathbb {V}(z|\theta , N)=\mathbb {V}(\log \hat {p}_{N}(y|\theta ))={\sigma }_{\text {opt}}^{2}=1\).
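In practice, this rule can be implemented with a short pilot run of the likelihood estimator at a representative parameter value. The MATLAB sketch below is only illustrative: it assumes a user-supplied function estimate_loglik(theta, N) that returns one unbiased log-likelihood estimate based on N particles, and a representative value theta_bar (e.g., a posterior mean from the sampling stage).

N0 = 100;                                    % pilot number of particles
R  = 50;                                     % number of pilot replications
loglik_reps = zeros(R, 1);
for r = 1:R
    loglik_reps(r) = estimate_loglik(theta_bar, N0);   % hypothetical unbiased log-likelihood estimator
end
gamma2_hat = N0 * var(loglik_reps);          % V(log p_hat) is approx. gamma^2(theta)/N under Assumption A2
sigma2_opt = 1;                              % optimal variance of the log-likelihood estimates (Eq. 27)
N_opt = ceil(gamma2_hat / sigma2_opt);       % number of particles targeting V(log p_hat(y|theta)) = 1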
B.2 Optimal N for estimating the marginal likelihood
Section B.1 derived the optimal N for estimating integrals of the form in Eq. 17. We now show that the optimal N is the same when the main goal is to estimate the marginal likelihood.
The IS2 marginal likelihood estimator with an estimated likelihood is \(\widehat {p}_{\text {IS}^{2}}(y) = {\sum }_{i=1}^{M} {\widetilde {w}}(\theta _{i})/M\), with the weights \({\widetilde {w}}(\theta _{i})\) from Eq. 10. We noted that \( \mathbb {E}_{g_{\text {IS}}}[\omega (\theta )]=\mathbb {E}_{\widetilde {g}_{\text {IS}} }[\widetilde {\omega }(\theta )]=p(y), \) where \(\widetilde {\omega }(\theta )=e^{z}\omega (\theta )\). We again wish to compare the variance of the IS2 estimator to that of the IS estimator. Under Assumption A2, with z Gaussian and independent of 𝜃,
$$ \mathbb{V}_{\widetilde{g}_{\text{IS}}}[\widetilde{\omega}(\theta)] = \mathbb{E}_{g_{N}}[\exp(2z)]\,\mathbb{E}_{g_{\text{IS}}}[\omega(\theta)^{2}] - p(y)^{2} = \exp(\sigma^{2})\,\mathbb{E}_{g_{\text{IS}}}[\omega(\theta)^{2}] - p(y)^{2}. $$
Consequently, the relative variance of the two schemes is given by
$$ \frac{\mathbb{V}_{\widetilde{g}_{\text{IS}}}[\widetilde{\omega}(\theta)]}{\mathbb{V}_{g_{\text{IS}}}[\omega(\theta)]} = \frac{(1+v)\exp(\sigma^{2}) - 1}{v}, $$
where \(v=\mathbb {V}_{g_{\text {IS}}}[\omega (\theta )/p(y)]=\mathbb {V}_{g_{\text {IS}}}[\pi (\theta )/g_{\text {IS}}(\theta )]\). Following the argument of Appendix B.1, the corresponding computing time can be defined as
$$ \text{CT}_{\text{ML}}(\sigma^{2}) := \frac{(1+v)\exp(\sigma^{2}) - 1}{v\,\sigma^{2}}, $$
where the subscript ML indicates that this is for the marginal likelihood estimator.
Let \({\sigma }_{\min \limits }^{2}(v)\) minimize CTML(σ2) for a given v. The following proposition, whose proof is obvious and omitted, summarizes some properties of \({\sigma }_{\min \limits }^{2}(v)\), where \({\sigma }_{\text {opt}}^{2}\) below is given by Eq. 27.
Proposition A1
- (i) For any value of v, CTML(σ2) is a convex function of σ2; therefore \({\sigma }_{\min \limits }^{2}(v)\) is unique.
- (ii) \({\sigma }_{\min \limits }^{2}(v)\) increases as v increases and \({\sigma }_{\min \limits }^{2}(v)\longrightarrow {\sigma }_{\text {opt}}^{2}=1\) in Eq. 27 as \(v\longrightarrow \infty \).
Table 2 illustrates some of the results of Proposition A1 and shows \({\sigma }_{\min \limits }^{2}(v)\) and the ratio \(\text {CT}_{\text {ML}}({\sigma }_{\min \limits }^{2})/\text {CT}_{\text {ML}}({\sigma }_{\text {opt}}^{2})\) for some common values of v used in practice. The table shows that the values of \({\sigma }_{\min \limits }^{2}\) and the ratio are insensitive to v, and \({\sigma }_{\min \limits }^{2}(v)\longrightarrow {\sigma }_{\text {opt}}^{2}=1\) as v increases. From these observations, we advocate using \({\sigma }_{\text {opt}}^{2}=1\) as the optimal value of the variance of the log likelihood estimates in estimating the marginal likelihood.
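The quantities in Table 2 are straightforward to reproduce numerically. The MATLAB sketch below, which assumes the form of CTML(σ2) given above, locates \({\sigma}_{\min}^{2}(v)\) with fminbnd and compares the computing time at the minimum with that at σ2 = 1; the grid of v values is illustrative.

v_grid = [1, 10, 100, 1000];                              % illustrative values of v
CTML   = @(s2, v) ((1 + v) .* exp(s2) - 1) ./ (v .* s2);  % computing time for the marginal likelihood
for v = v_grid
    s2_min = fminbnd(@(s2) CTML(s2, v), 1e-4, 5);         % minimize over sigma^2
    ratio  = CTML(s2_min, v) / CTML(1, v);                % CT at the minimum vs. at sigma^2 = 1
    fprintf('v = %6g: sigma2_min = %.3f, CT ratio = %.4f\n', v, s2_min, ratio);
end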
Appendix C: Estimation details for the two applications
C.1 Individual differences in health-related decisions
We estimate the likelihood (Eq. 14 of the main text) by integrating out the random effects vector for each individual separately, using different approaches for the mixed logit and GMNL models. For the mixed logit model, we combine the efficient importance sampling (EIS) method of Richard and Zhang (2007) with the defensive sampling approach of Hesterberg (1995). The importance density is the two-component defensive mixture
$$ \pi\, h_{\text{EIS}}(\alpha_{j}|y_{j1},\ldots,y_{jT}) + (1-\pi)\, p(\alpha_{j}), $$
where hEIS(αj|yj1,…,yjT) is a multivariate Gaussian importance density obtained using EIS. Following Hesterberg (1995), including the natural sampler p(αj) in the mixture ensures that the importance weights are bounded. We set the mixture weight to π = 0.5. For the GMNL model, we follow Fiebig et al. (2010) and use the model density p(αj) as the importance sampler. We implement this simpler approach for the GMNL model because the occurrence of large values of λj causes the defensive mixture estimates of the log-likelihood to be right skewed in this case.
The repeated questioning of each individual (i.e., T = 32 choice scenarios) implies that the log-likelihood estimates are sums of independent estimates \(\log \widehat {p}(y_{j}|\theta )\) for each individual. To target a variance of σ2 for the full log-likelihood estimator, we chose the number of particles Nj for each individual j and parameter combination 𝜃 such that \(\mathbb {V}(\log \widehat {p}(y_{j}|\theta ))\approx \sigma ^{2}/S\). We implemented this scheme by using a fixed number of initial importance samples, then used the jackknife method to estimate \({\gamma ^{2}_{j}}(\theta )\), the asymptotic variance of \(\log \widehat {p}(y_{j}|\theta )\), and selected \(N_{j}(\theta )=\widehat {{\gamma ^{2}_{j}}}(\theta )S/\sigma ^{2}\). The preliminary number of particles is N = 20 for the mixed logit model and N = 2,500 for the GMNL model.
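A schematic MATLAB version of this allocation step is given below; the helper functions loglik_contributions and jackknife_variance, and the variables subjects and theta, are hypothetical placeholders for the corresponding parts of the implementation.

S      = numel(subjects);                  % number of individuals
N_init = 20;                               % preliminary number of particles (mixed logit case)
sigma2 = 1;                                % target variance of the full log-likelihood estimate
Nj     = zeros(S, 1);
for j = 1:S
    % per-particle log-likelihood contributions for individual j (hypothetical helper)
    logw_j   = loglik_contributions(subjects(j), theta, N_init);
    % jackknife estimate of the asymptotic variance of log p_hat(y_j | theta) (hypothetical helper)
    gamma2_j = jackknife_variance(logw_j);
    Nj(j)    = ceil(gamma2_j * S / sigma2);   % so that V(log p_hat(y_j | theta)) is approx. sigma2 / S
end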
To construct efficient and reliable proposal densities gIS(𝜃) for the parameters, we used the “Mixture of t by Importance Sampling Weighted Expectation Maximization” approach (MitISEM; Hoogerheide, Opschoor, & Van Dijk, 2012). MitISEM constructs a mixture of t densities for approximating the target distribution by minimizing the Kullback–Leibler divergence between the target density and the t mixture, and it can handle target distributions with non-standard shapes such as multimodality and skewness; MitISEM effectively approximates the posteriors of the two models as two-component mixtures of multivariate Student’s t distributions. We write \(\widehat p_{N}(y|\theta )=\widehat p_{N}(y|\theta ,u)\), with u a fixed random number stream for all 𝜃. The target distribution in MitISEM is \(p(\theta )\widehat p_{N}(y|\theta ,u)/p(y)\), which can be considered as the posterior p(𝜃|y,u) conditional on y and the common random numbers u. Our procedure is analogous to using common random numbers u to obtain simulated maximum likelihood estimates of 𝜃 (see, e.g., Gourieroux & Monfort, 1995), except that we obtain a histogram estimate of the “posterior” p(𝜃|y,u) ∝ p(y|𝜃,u)p(𝜃). This “posterior” is a biased approximation to p(𝜃|y), but it is sufficient for obtaining a good proposal density.
We implemented two standard variance reduction methods at each IS stage: stratified mixture sampling and antithetic sampling. The first consists of sampling from each mixture component in exact proportion to the mixture weights. For example, when estimating the likelihood for the mixed logit model we generate exactly πN draws from the efficient importance density hEIS(αj|yj1,…,yjT) and (1 − π)N draws from p(αj). The antithetic sampling method consists of generating pairs of perfectly negatively correlated draws from each mixture component (see, e.g., Ripley, 1987).
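A minimal MATLAB sketch of these two devices for a two-component defensive mixture is given below. It assumes Gaussian components summarized by a mean vector and a Cholesky factor (mu_eis, L_eis for the EIS component and mu_p, L_p for the natural sampler); these names are illustrative and do not correspond to specific objects in the implementation.

piw = 0.5;                                  % defensive mixture weight
N   = 200;                                  % total number of draws (chosen so N1 and N2 are even)
N1  = round(piw * N);                       % exact number of draws from the EIS component (stratification)
N2  = N - N1;                               % exact number of draws from the natural sampler p(alpha_j)
d   = numel(mu_eis);                        % dimension of the random effects

% Antithetic sampling: generate half of the standard normal draws and mirror them.
u1 = randn(d, N1/2);  u1 = [u1, -u1];       % perfectly negatively correlated pairs
u2 = randn(d, N2/2);  u2 = [u2, -u2];

% mu_eis and mu_p are d x 1 columns; implicit expansion adds them to each column of draws.
alpha_eis   = mu_eis + L_eis * u1;          % draws from h_EIS(alpha_j | y_j1, ..., y_jT)
alpha_prior = mu_p   + L_p   * u2;          % draws from the natural sampler p(alpha_j)
alpha_draws = [alpha_eis, alpha_prior];     % stratified sample from the defensive mixture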
C.2 The speed-accuracy tradeoff in perceptual decisions
We used the Particle Metropolis within Gibbs (PMwG) sampler of Gunawan et al. (2019) to obtain 10,000 posterior draws of 𝜃 and α1:S for each of the four models described in the main text. We fitted a mixture of normal distributions to the samples from the posterior of 𝜃 to obtain the proposal density for the group-level parameters,
$$ g_{\text{IS}}(\theta) = \sum\limits_{k=1}^{K} w_{k}^{MIX}\, \phi\left(\theta; \mu_{k}, {\Sigma}_{k}\right), $$
where ϕ(⋅ ;μ,Σ) denotes the multivariate normal density function with mean μ and variance-covariance matrix Σ, and the \(w_{k}^{MIX}\) are the component weights. For practical purposes, we select the number of components K using the Bayesian Information Criterion (BIC), and estimate the mixture of normals via Matlab’s built-in function fitgmdist.
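The following MATLAB sketch shows one way to carry out this step with fitgmdist; theta_draws denotes the matrix of posterior draws of 𝜃 (rows are draws), and the upper bound maxK and the regularization value are illustrative choices rather than settings reported in the text.

maxK = 5;                                   % candidate numbers of mixture components
bic  = inf(maxK, 1);
gms  = cell(maxK, 1);
for K = 1:maxK
    gms{K} = fitgmdist(theta_draws, K, 'RegularizationValue', 1e-6, ...
                       'Options', statset('MaxIter', 1000));
    bic(K) = gms{K}.BIC;
end
[~, Kbest] = min(bic);                      % choose K by BIC
gm = gms{Kbest};                            % fitted mixture-of-normals proposal g_IS(theta)

M            = 10000;                       % number of importance samples (illustrative)
theta_prop   = random(gm, M);               % proposal draws of theta from g_IS
g_IS_density = pdf(gm, theta_prop);         % g_IS(theta) evaluated at the proposal draws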
Proposal densities for the random effects, \(m_{j}\left ({\alpha }_{j}|{\theta },{y}_{j}\right )\), were constructed participant by participant. For the jth participant, we first fitted a normal distribution to the samples of \(\left ({\alpha }_{j},{\theta }\right )\), and then derived the conditional distribution \(g\left ({\alpha }_{j}|{\theta }\right )= N\left ({\alpha }_{j};{\mu }_{j,prop},{\Sigma }_{j,prop}\right )\) for j = 1,…,S. The proposal density for subject j is the two-component defensive mixture
$$ m_{j}\left({\alpha}_{j}|{\theta},{y}_{j}\right) = w_{\alpha}^{MIX}\, N\left({\alpha}_{j};{\mu}_{j,prop},{\Sigma}_{j,prop}\right) + \left(1-w_{\alpha}^{MIX}\right) p\left({\alpha}_{j}|{\theta}\right), $$(30)
with \(p\left ({\alpha }_{j}|{\theta }\right )\) the prior density of αj. The inclusion of the prior \(p\left ({\alpha }_{j}|{\theta }\right )\) ensures that the importance weights in Eq. 5 are bounded (Hesterberg, 1995). We set the mixture weight in Eq. 30 to \(w_{\alpha }^{MIX}=0.95\).
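A sketch of this construction in MATLAB is given below. Here X_j holds the posterior draws of \(({\alpha}_{j},{\theta})\) stacked as rows, ia and it index the positions of αj and 𝜃 within a row, theta is the current (column-vector) proposal draw of 𝜃, and sample_prior_alpha is a hypothetical helper that draws from \(p({\alpha}_{j}|{\theta})\); all names are illustrative.

mu_joint  = mean(X_j, 1)';                  % joint normal fit to the draws of (alpha_j, theta)
Sig_joint = cov(X_j);

% Conditional Gaussian g(alpha_j | theta) implied by the joint normal fit:
mu_a = mu_joint(ia);        mu_t = mu_joint(it);
S_aa = Sig_joint(ia, ia);   S_at = Sig_joint(ia, it);   S_tt = Sig_joint(it, it);
mu_j_prop  = mu_a + S_at * (S_tt \ (theta - mu_t));     % conditional mean given theta
Sig_j_prop = S_aa - S_at * (S_tt \ S_at');              % conditional covariance

% One draw from the two-component defensive mixture in Eq. 30 (mixture weight 0.95):
w_mix = 0.95;
if rand < w_mix
    alpha_j = mvnrnd(mu_j_prop', Sig_j_prop)';          % conditional-normal component
else
    alpha_j = sample_prior_alpha(theta);                % prior component p(alpha_j | theta), hypothetical helper
end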
Appendix D: Additional Applications of the Hierarchical LBA to Data
This Appendix applies the hierarchical LBA model discussed in “The speed-accuracy tradeoff in perceptual decisions” to two additional data sets.
D.1 The speed-accuracy tradeoff in lexical decisions
This section extends the analysis of the speed-accuracy tradeoff covered in the main text to judgments about the identity of strings of letters. Experiment 1 of Wagenmakers et al. (2008) had 17 participants perform a basic lexical decision task that involved repeated decisions regarding whether letter strings were valid words or not (‘non-words’). Prior to each block of trials, participants were given instructions to respond as quickly as possible (condition 1: speed emphasis) or as accurately as possible (condition 2: accuracy emphasis), where the instruction emphasis alternated between blocks. Orthogonally to the speed-accuracy manipulation, word frequency was manipulated across four levels: words of very low frequency (vlf), low frequency (lf) and high frequency (hf), and non-words (nw). Participants completed 20 blocks with 96 trials per block for a total of 1920 lexical decisions per participant. See Wagenmakers et al. (2008) for all remaining details.
Rae, Heathcote, Donkin, Averell, and Brown (2014) re-analysed Wagenmakers et al., (2008)’s data to test whether instructions manipulating the speed-accuracy tradeoff only cause changes in decision caution (response threshold parameters) or decision caution and the speed of information processing (drift rate parameters). Through a large-scale maximum likelihood-based parameter estimation exercise, Rae et al., (2014) found evidence that emphasising the speed of decisions caused participants to lower their response threshold and increase their drift rates for correct and incorrect responses. The latter result suggests that time pressure increased the overall drive to respond (an overall increase in both correct and incorrect drift rates) while decreasing accuracy. We reassessed these findings, through the lens of the marginal likelihood, comparing five LBA models. For clarity in the following, we refer to instructions that emphasise accuracy and speed as a and s, respectively, and the correct and error accumulators as c and e, respectively.
- Model I: assumes there are no behavioural differences across all conditions (i.e., a null model), corresponding to a single set of LBA parameters over conditions. The vector of random effects for subject j is
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}}\right). $$
- Model II: assumes that decision caution differs as a function of the task instructions (speed vs accuracy), thus allowing two response threshold parameters,
$$ \alpha_{j}=\left( \alpha_{b_{j}^{(s)}},\alpha_{b_{j}^{(a)}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}}\right). $$
- Model III: assumes that the speed of information processing for the correct response, but not the incorrect response, differs as a function of task instructions,
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c,s)}},\alpha_{v_{j}^{(c,a)}},\alpha_{\tau_{j}}\right), $$
where \(v_{j}^{(c,s)}\) and \(v_{j}^{(c,a)}\) are the mean drift rates for the accumulators corresponding to the correct response in the speed- and accuracy-emphasis conditions, respectively.
- Model IV: extends Model III to also allow variation in the speed of information processing for the incorrect response across task instructions,
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e,s)}},\alpha_{v_{j}^{(e,a)}},\alpha_{v_{j}^{(c,s)}},\alpha_{v_{j}^{(c,a)}},\alpha_{\tau_{j}}\right), $$
where \(v_{j}^{(e,s)}\) and \(v_{j}^{(e,a)}\) are the mean drift rates for the incorrect accumulators in the speed- and accuracy-emphasis conditions, respectively.
- Model V: combines Models II and IV, thus allowing the response threshold and the drift rates for correct and incorrect responses to vary as a function of task instructions,
$$ \alpha_{j}=\left( \alpha_{b_{j}^{(s)}},\alpha_{b_{j}^{(a)}},\alpha_{A_{j}},\alpha_{v_{j}^{(e,s)}},\alpha_{v_{j}^{(e,a)}},\alpha_{v_{j}^{(c,s)}},\alpha_{v_{j}^{(c,a)}},\alpha_{\tau_{j}}\right). $$
Table 3 shows the log of the marginal likelihood estimates (with standard errors in parentheses) for the five models. The Monte Carlo standard errors for the estimates are again small for all models, suggesting that the IS2 method is very efficient. Unsurprisingly, Model I, the most rigid model that allows no parameter variation across conditions, has the smallest marginal likelihood estimate. Model II, which allows response thresholds to vary with task instruction, provides a better representation of the data than Model III, which only allows the mean correct drift rates to vary with task instruction. Interestingly, Model IV, which allows the correct and incorrect drift rates to vary across speed and accuracy instructions, performs better than a model that only allows the response thresholds to differ over those conditions (i.e., Model II). Overall, the results favour Model V, which allows the response thresholds and the correct and incorrect drift rates to vary with task instructions. This is consistent with the general conclusions of Rae et al., (2014). Relative to instructions emphasising decision accuracy, speed-emphasis instructions cause people to lower their response thresholds and increase their rates of accumulation for both the correct and incorrect response options.
D.2 Biasing lexical decisions
Here we analyse decisions in the presence of response bias. A common method for inducing biased decisions is to manipulate the proportion of stimuli from different response categories; for example, presenting more stimuli from category 1 than category 2 will result in a greater proportion of responses in favour of category 1 over 2. The effect of this form of bias manipulation is most commonly explained as a decision process that is primed to give a response for the more common stimulus on the basis of less evidence than would be required to give a response for the less common stimulus (Voss, Rothermund, & Voss, 2004). The LBA model captures this effect of response-relevant information through differences in the threshold across response options: the more common stimulus has a lower response threshold than the less common stimulus (Brown & Heathcote, 2008), thus requiring less evidence to trigger a response. This parameter change has been confirmed experimentally (Forstmann, Brown, Dutilh, Neumann, & Wagenmakers, 2010).
Here, we investigate whether changing the proportion of words and non-words induces response biases in lexical decisions, as reflected in the parameter estimates of the LBA model. Experiment 2 of Wagenmakers et al. (2008) had 19 participants perform a lexical decision task where the proportion of word vs non-word stimuli alternated across blocks: in a ‘word’ block, 75% of the stimuli were words and 25% were non-words; in a ‘non-word’ block the proportions switched such that 75% of the stimuli were non-words and 25% were words. Each block contained approximately equal numbers of high-, low-, and very-low-frequency words. Participants completed 20 blocks with 96 trials per block for a total of 1920 lexical decisions per participant. See Wagenmakers et al. (2008) for all other details.
We test whether the word vs non-word response proportion manipulation affected the response threshold parameters of the LBA model in the expected direction, using the marginal likelihood to compare five LBA models. In addition to the notation introduced above, we use w and nw for the stimulus proportion manipulation, where w denotes the condition with 75% words and 25% non-words (and vice versa for nw), and W and NW to denote ‘word’ and ‘non-word’ responses, respectively.
- Model I: again assumes constancy in LBA parameters across conditions,
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}}\right). $$
- Model II: assumes that the response threshold varies as a function of the proportion manipulation (w, nw) but does not differ for responses of words or non-words (W, NW),
$$ \alpha_{j}=\left( \alpha_{b_{j}^{(w)}},\alpha_{b_{j}^{(nw)}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}}\right). $$
- Model III: assumes that the two responses (“word”, W; “non-word”, NW) are governed by accumulators with different response thresholds, and that these thresholds are also free to vary across the proportion manipulation,
$$ \alpha_{j}=\left( \alpha_{b_{j}^{(w,W)}},\alpha_{b_{j}^{(w,NW)}},\alpha_{b_{j}^{(nw,W)}},\alpha_{b_{j}^{(nw,NW)}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}}\right), $$
where \(b_{j}^{(w,W)}\) and \(b_{j}^{(w,NW)}\) refer to the threshold parameters for word and non-word responses in the 75% word condition, respectively, and similarly for \(b_{j}^{(nw,W)}\) and \(b_{j}^{(nw,NW)}\) in the 75% non-word condition.
- Model IV: has a parallel structure to Model III except that the proportion manipulation (w, nw) and response (W, NW) are assumed to selectively influence non-decision time rather than the response threshold,
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e)}},\alpha_{v_{j}^{(c)}},\alpha_{\tau_{j}^{(w,W)}},\alpha_{\tau_{j}^{(w,NW)}},\alpha_{\tau_{j}^{(nw,W)}},\alpha_{\tau_{j}^{(nw,NW)}}\right), $$
where the random effects \(\tau _{j}^{(w,W)}\), \(\tau _{j}^{(w,NW)}\), \(\tau _{j}^{(nw,W)}\), and \(\tau _{j}^{(nw,NW)}\) are defined analogously to the threshold parameters of Model III.
- Model V: assumes that the proportion manipulation influences the rate of evidence accumulation for correct and incorrect “word” and “non-word” responses, separately for each proportion condition,
$$ \alpha_{j}=\left( \alpha_{b_{j}},\alpha_{A_{j}},\alpha_{v_{j}^{(e,w,W)}},\alpha_{v_{j}^{(e,w,NW)}},\alpha_{v_{j}^{(e,nw,W)}},\alpha_{v_{j}^{(e,nw,NW)}},\alpha_{v_{j}^{(c,w,W)}},\alpha_{v_{j}^{(c,w,NW)}},\alpha_{v_{j}^{(c,nw,W)}},\alpha_{v_{j}^{(c,nw,NW)}},\alpha_{\tau_{j}}\right), $$
where \(v_{j}^{(e,\cdot,\cdot)}\) and \(v_{j}^{(c,\cdot,\cdot)}\) are the vectors of mean drift rates for incorrect and correct responses, respectively. The relevant pair of correct and incorrect drift rates for a given datum is contingent on the proportion condition of the trial (75% words or non-words; w, nw) and the observed response (word or non-word; W, NW). Phrased differently, each datum (the RT and response choice pair from a single trial) contains information to update 2 of the 8 drift-rate parameters (see the sketch following this list).
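To make the last point concrete, the fragment below sketches how the relevant drift-rate pair for one trial could be indexed from the proportion condition and the observed response under Model V; the struct drift and its field names are hypothetical and purely illustrative of the bookkeeping.

% prop_cond is 'w' or 'nw' (75% words vs. 75% non-words); response is 'W' or 'NW'.
key = [prop_cond '_' response];        % e.g., 'w_W' or 'nw_NW'

% drift.c and drift.e hold the four correct and four error drift rates of Model V,
% with fields w_W, w_NW, nw_W, nw_NW (hypothetical layout).
v_correct = drift.c.(key);             % drift rate of the accumulator matching the correct response
v_error   = drift.e.(key);             % drift rate of the competing (error) accumulator

% Only this pair enters the LBA density for the trial, so each datum informs
% 2 of the 8 drift-rate random effects.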
Table 4 shows the log of the marginal likelihoods for the five models. As in the previous applications, the Monte Carlo standard errors of the estimates are small for all models, although the more complex Model V has slightly higher standard errors; nevertheless, even this larger standard error is still much smaller than the differences between models in the log of the marginal likelihoods.
Models that do not allow any parameter variation (I) or a simple threshold change across proportion condition but not response type (II) provide a poor representation of the data; these models do not account for response bias, which suggests there were considerable bias effects present in the data. Assuming independent drift rates for each combination of correct/incorrect, response type and proportion condition (V) provided a much improved fit, though it was still inferior to models that assume a selective influence of the response threshold (III) or non-decision time (IV) on the response type × proportion condition interaction. The best account of the data is one in which the biased responses are assumed to arise from a change in the response thresholds (III). The parameter estimates of the preferred model suggest that the biased responses arose from lowered response thresholds for more probable responses.
Appendix E: Proofs
Using the same notation as in Section A.1, we follow Pitt et al. (2012) and define the joint prior density of 𝜃 and z as \(p(\theta )\exp (z) g_{N}(z|\theta )\), so that the posterior density of 𝜃 and z is
$$ \pi_{N}(\theta,z) := \frac{p(y|\theta)\,p(\theta)\exp(z)\,g_{N}(z|\theta)}{p(y)} $$
and has p(𝜃|y) as a marginal. Hence,
$$ p(y) = \iint \widetilde{w}(\theta,z)\,\widetilde{g}_{\text{IS}}(\theta,z)\,d\theta\, dz = \mathbb{E}_{\widetilde{g}_{\text{IS}}}\left[\widetilde{w}(\theta,z)\right], $$
where
$$ \widetilde{g}_{\text{IS}}(\theta,z) := g_{\text{IS}}(\theta)\,g_{N}(z|\theta) \quad\text{and}\quad \widetilde{w}(\theta,z) = \frac{p(y|\theta)\exp(z)\,p(\theta)}{g_{\text{IS}}(\theta)} = \frac{\hat{p}_{N}(y|\theta)\,p(\theta)}{g_{\text{IS}}(\theta)}. $$(32)
Thus, we can consider \(\widetilde g_{IS} \) in Eq. 32 as an importance density in z and 𝜃, which will lead to the same IS2 estimate for the marginal likelihood as in “Estimating the marginal likelihood by IS2”.
Proof of Theorem 1
Unbiasedness holds because we are dealing with importance sampling. The assumptions in Theorem A1 ensure that the \(\widetilde {w}(\theta _{i})\) are i.i.d. with a finite second moment. The results of (i) and (ii) then follow. □
Proof of Theorem A1
If \(\text {Supp}(\pi )\subseteq \text {Supp}(g_{\text {IS}})\), then \(\text {Supp}(\pi _{N})\subseteq \text {Supp}(\widetilde g_{\text {IS}})\). This, together with the existence and finiteness of \(\mathbb {E}_{\pi }(\varphi )\), ensures that \(\mathbb {E}_{\widetilde {g}_{\text {IS}}}[\varphi (\theta )\widetilde {w}(\theta ,z)]=p(y)\mathbb {E}_{\pi }(\varphi )\) and \(\mathbb {E}_{\widetilde {g}_{\text {IS}}}[\widetilde {w}(\theta ,z)]=p(y)\) exist and are finite. Result (i) follows from the strong law of large numbers. To prove (ii), write
$$ \sqrt{M}\left({\hat{\mathbb{E}}}_\pi(\varphi)-\mathbb{E}_\pi(\varphi)\right) = \frac{\frac{1}{\sqrt{M}}{\sum}_{i=1}^{M}\left(\varphi(\theta_{i})-\mathbb{E}_\pi(\varphi)\right)\widetilde{w}(\theta_{i},z_{i})}{\frac{1}{M}{\sum}_{i=1}^{M}\widetilde{w}(\theta_{i},z_{i})}. $$
Let \(X_{i}=\left (\varphi (\theta _{i})-\mathbb {E}_{\pi }(\varphi )\right )\widetilde {w}(\theta _{i},z_{i})\), i = 1,...,M, \(S_{M}=\frac {1}{M}{\sum }_{i=1}^{M} X_{i}\) and \(Y_{M}=\frac 1{M}{\sum }_{i=1}^{M}\widetilde {w}(\theta _{i},z_{i})\). The Xi are independently and identically distributed with \(\mathbb {E}_{\widetilde {g}_{\text {IS}}}(X_{i})=0\) and
$$ {\nu^{2}_{N}} := \mathbb{V}_{\widetilde{g}_{\text{IS}}}(X_{i}) = \mathbb{E}_{\widetilde{g}_{\text{IS}}}\left[\left(\varphi(\theta)-\mathbb{E}_\pi(\varphi)\right)^{2}\widetilde{w}(\theta,z)^{2}\right], $$
which is finite under the conditions in (ii). By the central limit theorem for a sum of independently and identically distributed random variables with a finite second moment, \(\sqrt {M}S_{M}\stackrel {d}{\to }\mathcal {N}(0,{\nu ^{2}_{N}})\). By the strong law of large numbers, \(Y_{M}\stackrel {P}{\to }\mathbb {E}_{\widetilde {g}_{\text {IS}}}(\widetilde {w}(\theta ,z))=p(y)\). By Slutsky’s theorem,
$$ \sqrt{M}\left({\hat{\mathbb{E}}}_\pi(\varphi)-\mathbb{E}_\pi(\varphi)\right) = \frac{\sqrt{M}\,S_{M}}{Y_{M}} \stackrel{d}{\to} \mathcal{N}\left(0,\frac{{\nu^{2}_{N}}}{p(y)^{2}}\right). $$
The asymptotic variance is given by
$$ \sigma^{2}_{\text{IS}^{2}}(\varphi) = \frac{{\nu^{2}_{N}}}{p(y)^{2}} = \mathbb{E}_\pi\left\{\left(\varphi(\theta)-\mathbb{E}_\pi(\varphi)\right)^{2}\frac{\pi(\theta)}{g_{\text{IS}}(\theta)}\,\mathbb{E}_{g_{N}}[\exp(2z)]\right\}, $$
which is Eq. 21. To prove (iii), write
$$ \widehat{\sigma}^{2}_{\text{IS}^{2}}(\varphi) = \frac{\frac{1}{M}{\sum}_{i=1}^{M}\left(\varphi(\theta_{i})-\hat{\mathbb{E}}_\pi(\varphi)\right)^{2}\widetilde{w}(\theta_{i})^{2}}{\left(\frac{1}{M}{\sum}_{i=1}^{M}\widetilde{w}(\theta_{i})\right)^{2}}, $$
and apply the strong law of large numbers to the numerator and denominator, together with part (i).
□
Proof of Theorem A2
Under Assumption A2, \(g_{N}(z|\theta )={\mathcal {N}}(-\sigma ^{2}/2,\sigma ^{2})\) and \(\mathbb {E}_{g_{N}(z|\theta )}[\exp (2z)]=\exp (\sigma ^{2})\). From Eqs. 21 and 22,
$$ \sigma^{2}_{\text{IS}^{2}}(\varphi) = \exp(\sigma^{2})\,\mathbb{E}_\pi\left\{\left(\varphi(\theta)-\mathbb{E}_\pi(\varphi)\right)^{2}\frac{\pi(\theta)}{g_{\text{IS}}(\theta)}\right\} = \exp(\sigma^{2})\,{\sigma}_{\text{IS}}^{2}(\varphi). $$
□