Issues of model selection have dominated the theoretical and applied statistical literature for decades. Model selection methods such as ridge regression, the lasso, and the elastic net have replaced ad hoc methods such as stepwise regression as a means of model selection. In the end, however, these methods lead to a single final model that is often taken to be the model considered ahead of time, thus ignoring the uncertainty inherent in the search for a final model. One method that has enjoyed a long history of theoretical developments and substantive applications, and that accounts directly for uncertainty in model selection, is Bayesian model averaging (BMA). BMA addresses the problem of model selection by not selecting a final model, but rather by averaging over a space of possible models that could have generated the data. The purpose of this paper is to provide a detailed and up-to-date review of BMA with a focus on its foundations in Bayesian decision theory and Bayesian predictive modeling. We consider the selection of parameter and model priors as well as methods for evaluating predictions based on BMA. We also consider important assumptions regarding BMA and extensions of model averaging methods to address these assumptions, particularly the method of Bayesian stacking. Simple empirical examples are provided and directions for future research relevant to psychometrics are discussed.
Keywords: Bayesian model averaging, Bayesian stacking, prediction

Issues surrounding the specification of statistical models for prediction and explanation have dominated the theoretical and applied statistical literature for decades. The underlying issue is the well-known bias-variance trade-off: under-specified models can lead to parameter bias, while over-specified models can lead to poor predictions in future samples. Model selection methods such as ridge regression (Hoerl and Kennard 1970; Hoerl 1985), the lasso (Tibshirani 1996), the elastic net (Zou and Hastie 2005), and their Bayesian counterparts (Hsiang 1975; Park and Casella 2008; Li and Lin 2010), among others, have been advocated as a means of regularizing models to provide a balance between bias and variance. In the end, however, these methods lead to a single final model that is often taken to be the model considered ahead of time. The problem is that, in practice, model uncertainty often goes unnoticed, and the impact of this uncertainty can be quite serious. Hoeting et al. (1999) wrote
Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. (pg. 382)
Similar observations were made earlier by Leamer (1978, pg. 91) who noted that
... ambiguity over a model should dilute information about the regression coefficients, since part of the evidence is spent to specify the model.
and by Draper et al. (1987, pg. iii) who stated
This [model selection] tends to underestimate Your actual uncertainty, with the result that Your actions both inferentially in science and predictively in decision-making, are not sufficiently conservative. [Capitalization authors.]
The Bayesian framework recognizes that model selection is conducted under pervasive uncertainty insofar as a particular model is typically chosen among a set of competing models that could also have generated the data. Although a number of methods exist in the Bayesian literature to aid in improving model prediction, including sensitivity analyses via posterior predictive checking (Gelman et al. 1996) as well as the Bayesian regularization methods cited earlier, in the end, a single model is chosen for prediction and/or explanatory purposes. As the quotes by Hoeting et al. (1999), Leamer (1978), and Draper et al. (1987) suggest, it is risky to settle on a single model. Rather, it may be prudent to draw predictive strength through combining models. Arguably, the most popular approach for addressing the problem of model uncertainty from a Bayesian perspective lies in the method of Bayesian model averaging (BMA).
The purpose of this paper is to provide a detailed review of the methodology of BMA. A review of the extant literature on BMA reveals interesting connections to Bayesian decision theory and Bayesian predictive modeling. As such, we will draw on a number of seminal sources addressing the problem of model uncertainty quantification from a Bayesian perspective. In particular, we will draw heavily from Bernardo and Smith (2000), Vehtari and Ojanen (2012), Piironen and Vehtari (2017), and Draper (2013) with respect to the idea of belief models and Bayesian decision theory; Bernardo and Smith (2000), Clyde and Iversen (2013), and Clarke and Clarke (2018) with respect to so-called M -frameworks; Raftery and his colleagues (Madigan and Raftery 1994; Hoeting et al. 1999; Raftery et al. 1997) with respect to historical developments in BMA and algorithms; and Clyde and Iversen (2013), Vehtari et al. (2019), and Yao et al. (2018) with respect to Bayesian stacking as means of addressing certain assumptions underlying BMA.
The organization of this paper is as follows. In the next section, we discuss Bayesian predictive modeling as embedded in Bayesian decision theory. Here we discuss the concepts of expected utility and expected loss, and frame these ideas within the use of information-theoretic methods for judging decisions. We show that the action which optimizes the expected utility is the BMA solution. Next, we discuss the statistical elements of BMA, including connections to Bayes factors, computational considerations, and the problem of parameter and model priors. This is followed by a simple example of BMA in linear regression modeling and a comparison of results based on different parameter and model prior settings. This is then followed by a presentation of methods for evaluating the quality of a solution based on BMA, including the use of scoring rules and how they tie back to the information-theoretic concepts discussed earlier in the paper. We then discuss the main problem associated with conventional BMA, namely that BMA assumes that the true data generating model is contained in the set of models being averaged, and demonstrate the method of Bayesian stacking that directly deals with this assumption. A simple example of Bayesian stacking is provided. The paper concludes with a discussion of challenges and opportunities associated with considering model uncertainty in psychometrics and the social and behavioral sciences more generally.
This overview of BMA is situated in the Bayesian predictivist framework discussed in Bernardo and Smith (2000). An excellent review of Bayesian predictive modeling can also be found in Vehtari and Ojanen (2012), and we will borrow notation from their paper.
Arguably, the overarching goal of statistics is prediction. In other words, a key aim of statistics is to develop accurate predictive models, and, all other things being equal, a given model is to be preferred over competing models if it provides better predictions of what actually occurred (Dawid 1984). Indeed, it is hard to feel confident about inferences drawn from a model that does a poor job of predicting the extant data. The problem, however, is how to develop accurate predictive models and, importantly, how to evaluate their accuracy. The approach to developing accurate predictive models discussed in this review is BMA, and the evaluation of BMA-based analyses is best situated in the context of Bayesian decision theory.
Bayesian decision theory (see, e.g., Good 1952; Lindley 1991; Berger 2013) provides a natural and intuitive approach to evaluating Bayesian predictive models generally, and BMA in particular. Specifically, as will be expanded on below, Bayesian decision theory casts the problem of predictive evaluation in the context of maximizing the expected utility of a model—that is, the benefit that is accrued from using a particular model to predict future observations. The greater the expected utility, the better the model is at predictive performance in comparison to other models.
Let D = {y_i, x_i}, i = 1, …, n, be a set of data assumed to be fixed in the Bayesian sense, where y_i is an outcome of interest and x_i is a (possibly vector-valued) set of predictors. Further, let (ỹ, x̃) be future observations of the outcome and the set of predictors, respectively. Finally, let M = {M_k}, k = 1, …, K, represent a set of models specified to provide a prediction of the outcome ỹ, and let M_k represent a specific chosen model.
The elements of Bayesian decision theory that we adopt in this paper have been described by Bernardo and Smith (2000) and Vehtari and Ojanen (2012), among many others. These elements consist of (a) an unknown state of the world, denoted as ω ∈ Ω; (b) an action a ∈ A, where A is the action space; (c) a utility function u(a, ω) : A × Ω → R that rewards an action a when the state of the world is realized as ω; and (d) p(ω | D), representing one's current belief about the state of the world conditional on observing the data, D.
To provide a context for these ideas, and in anticipation of our empirical example, consider the problem of predicting reading performance measured on 15-year-old students in the USA using data from the OECD Program for International Student Assessment (PISA) (OECD 2017). In line with Bernardo and Smith (2000), Lindley (1991), Vehtari and Ojanen (2012), and Berger (2013) and the notation given previously, (a) the states of the world correspond to the future observations of reading literacy ỹ ∈ Y; (b) the action a ∈ A is the actual prediction of those future observations; (c) the utility function u(a, ỹ) defines the reward attached to the prediction; and (d) p(ỹ | D, M*) is a posterior predictive distribution that encodes our belief about the future reading literacy observations conditional on the data, D, and a belief model, M* (described later).
The goal of predictive modeling is to optimize the utility of taking an action a. A number of utility functions exist, but common utility functions rest on the negative quadratic loss function
$$ u(a, \tilde{y}) = -(\tilde{y} - a)^2. \qquad (1) $$

The optimal action a* is the one that maximizes the posterior expected utility, written as (see Clyde and Iversen 2013)
$$ a^{*} = \underset{a \in A}{\arg\sup} \int_{\Omega} u(\omega, a)\, p(\omega \mid D)\, d\omega. \qquad (2) $$

The idea here is to take the action a that maximizes the utility u when the future observation is ỹ. Clyde and Iversen (2013) show that the optimal decision obtains when a* = E(ỹ | D), which is the posterior predictive mean of ỹ given the data D. Under the assumption that the true model exists and is among the set of models under consideration, this can be expressed as
$$ E(\tilde{y} \mid D) = \sum_{k=1}^{K} E(\tilde{y} \mid M_k, D)\, p(M_k \mid D) = \sum_{k=1}^{K} p(M_k \mid D)\, \hat{\tilde{y}}_{M_k}, \qquad (3) $$

where $\hat{\tilde{y}}_{M_k}$ is the posterior predictive mean of ỹ under model M_k. The expression in (3) is Bayesian model averaging.
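To make Eq. (3) concrete, the following minimal R sketch forms a model-averaged prediction as a posterior-model-probability-weighted combination of model-specific posterior predictive means. All numbers are hypothetical and serve only to illustrate the weighted average; they do not come from the paper's analysis.

```r
# Hypothetical posterior predictive means of the outcome under K = 3 candidate models
pred_mean <- c(M1 = 480, M2 = 495, M3 = 505)

# Hypothetical posterior model probabilities p(M_k | D); these must sum to 1
pmp <- c(M1 = 0.60, M2 = 0.30, M3 = 0.10)

# Eq. (3): the BMA prediction is the PMP-weighted average of the model-specific predictions
bma_pred <- sum(pmp * pred_mean)
bma_pred
```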
It is important to note that when considering the selection of a single model, one might be tempted to choose the model with the highest posterior model probability (PMP) p ( M k | D ) . In the case of only two models, the model with the largest PMP will be the closest to the BMA solution. However, for more than two models, Clyde and Iversen (2013) point out that the model closest to the BMA solution might not be the one with the largest PMP.
In a complex real-world setting such as the one we find ourselves in when trying to develop a predictive model of reading literacy, substantive discussions would suggest that many different models could be entertained as reasonable models of reading literacy. In this case, we agree with Clyde and Iversen (2013) that to maximize our utility it would be best to average over the space of possible models via BMA.
Bayesian model averaging has had a long history of theoretical developments and practical applications. Early work by Leamer (1978) laid the foundation for BMA. Fundamental theoretical work on BMA was conducted in the mid-1990s by Madigan and his colleagues (e.g., Madigan and Raftery 1994; Raftery et al. 1997; Hoeting et al. 1999). Additional theoretical work was conducted by Clyde (1999, 2003). Draper (1995) discussed how model uncertainty can arise even in the context of experimental designs, and Kass and Raftery (1995) provided a review of Bayesian model averaging and the costs of ignoring model uncertainty. Reviews of the general problem of model uncertainty can be found in Clyde and George (2004) and more recently in Steel (2020) with a focus on economics. A review of Bayesian model averaging with a focus on psychology can be found in Hinne et al. (2020).
Practical applications and developments of Bayesian model averaging can be found across a wide variety of domains. A perusal of the extant literature shows applications of Bayesian model averaging to economics (e.g., Fernández et al. 2001b), political science (e.g., Montgomery and Nyhan 2010), the bioinformatics of gene expression (e.g., Yeung et al. 2005), weather forecasting (e.g., Raftery et al. 2005; Sloughter et al. 2013), propensity score analysis (Kaplan and Chen 2014), structural equation modeling (Kaplan and Lee 2015), missing data (Kaplan and Yavuz 2019), and probabilistic forecasting with large-scale assessment data (Kaplan and Huang, under review).
The popularity of BMA across many different disciplines is due to the fact that BMA is known to provide better out-of-sample predictive performance than any single model under consideration, as measured by the logarithmic scoring rule (Raftery and Zheng 2003). In addition, Bayesian model averaging has been implemented in the R software packages BMA (Raftery et al. 2015), BMS (Zeugner and Feldkircher 2015), and BAS (Clyde 2017). These packages are quite general, allowing Bayesian model averaging over linear models, generalized linear models, and survival models, with flexible handling of parameter and model priors.
Following Madigan and Raftery (1994), consider a quantity of interest such as a future observation. Again denoting this quantity as ỹ, our goal is to obtain an optimal prediction of ỹ in the sense that the utility of predicting ỹ is maximized. Next, consider a set of competing models M = {M_k}, k = 1, …, K, that are not necessarily nested. The posterior distribution of ỹ given data D can be written as a mixture distribution,
$$ p(\tilde{y} \mid D) = \sum_{k=1}^{K} p(\tilde{y} \mid M_k)\, p(M_k \mid D), \qquad (4) $$

where p(M_k | D) is the posterior probability of model M_k, written as
$$ p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)}, \qquad (5) $$

where the first term in the numerator on the right-hand side of (5) is the probability of the data given model k, also referred to as the integrated likelihood, written as
$$ p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k, \qquad (6) $$

where p(θ_k | M_k) is the prior distribution of the parameters θ_k under model M_k (Raftery et al. 1997). The posterior model probabilities can be considered mixing weights associated with the mixture distribution given in (4) (Clyde and Iversen 2013). The second term in the numerator of (5), p(M_k), is the prior model probability for model k, allowing each model to have a different prior probability based on past performance of that model or a belief regarding which of the models might be the true model. The denominator of (5) ensures that the posterior model probabilities sum to 1.0, as long as the true model is in the set of models under consideration. We will defer the discussion of a true model until later and then explicitly deal with the case where the true model is not in the set of models under consideration.
An important feature of (5) is that p(M_k | D) captures the posterior uncertainty that a given model is the true model, and this uncertainty will likely vary across models. Herein lies the problem of model selection: given the choice of a particular model, the analyst effectively ignores the uncertainty in other models that could have generated the data. Of course, (5) could be used as a method for model selection by simply choosing the model with the largest posterior model probability. However, model uncertainty is still being ignored because, often in practice, the posterior probability of this model will not be 1.0, which is what is implicitly assumed when one selects a single model and acts as though it were the model the analyst had in mind all along.
Yet another common approach to model selection is the Bayes factor, which provides a way to quantify the odds that the data favor one model over another (Kass and Raftery 1995). A key benefit of Bayes factors is that the models do not have to be nested. To motivate the Bayes factor, consider two competing models, denoted as M_k and M_l, which could be nested within a larger space of alternative models. Let θ_k and θ_l be the two parameter vectors associated with these models. These could be two regression models with a different number of variables, or two structural equation models specifying very different directions of mediating effects. The goal is to develop a quantity that expresses the extent to which the data support M_k over M_l. One such quantity is the posterior odds of M_k over M_l, expressed as
$$ \frac{p(M_k \mid D)}{p(M_l \mid D)} = \frac{p(D \mid M_k)}{p(D \mid M_l)} \times \frac{p(M_k)}{p(M_l)}. \qquad (7) $$

The first term on the right-hand side of (7) is the ratio of two integrated likelihoods. This ratio is referred to as the Bayes factor for M_k over M_l, denoted here as BF_kl. In words, our prior opinion regarding the odds of M_k over M_l, given by p(M_k)/p(M_l), is weighted by our consideration of the data, given by p(D | M_k)/p(D | M_l).
A connection between the Bayes factor in (7) and the posterior model probability in (5) has been pointed out by others (see, e.g., Clyde 1999). Specifically, when examining more than two models, and assuming equal prior odds, the Bayes factor for M_k over M_l can be written as
$$ BF_{kl} = \frac{p(M_k \mid D)}{p(M_l \mid D)}. \qquad (8) $$

Assuming that we fix the first model, M_1, as the baseline model, (5) can be re-expressed as
$$ p(M_k \mid D) = \frac{BF_{k1}\, p(M_k)}{\sum_{l=1}^{K} BF_{l1}\, p(M_l)}. \qquad (9) $$

As pointed out by Hoeting et al. (1999), BMA can be difficult to implement. In particular, they note that the number of terms in (4) can be quite large, the corresponding integrals can be hard to compute, the specification of p(M_k) may not be straightforward, and choosing the class of models to average over is also challenging. To address the problem of computing (6), the Laplace method, which has been used productively for the computation of Bayes factors (Kass and Raftery 1995), can be applied, and under certain circumstances this leads to a simple BIC approximation (see Tierney and Kadane 1986; Raftery 1996). The problem of reducing the overall number of models that enter the summation in (4) has led to two interesting solutions. One solution is based on the so-called Occam's window criterion (Madigan and Raftery 1994), and the other is based on a Metropolis sampler referred to as Markov chain Monte Carlo model composition (MC^3) (Madigan and York 1995).
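As a rough illustration of the BIC route just mentioned, the R sketch below approximates posterior model probabilities from BIC values under equal prior model probabilities, using the standard relation log BF_kl ≈ −(BIC_k − BIC_l)/2. The BIC values are made up for illustration; this is a simplification, not the exact Laplace derivation in the cited sources.

```r
# Hypothetical BIC values for three competing regression models
bic <- c(M1 = 10123.4, M2 = 10125.0, M3 = 10131.2)

# Approximate log Bayes factors relative to the lowest-BIC model: log BF = -(BIC_k - BIC_min)/2
log_bf <- -(bic - min(bic)) / 2

# Approximate posterior model probabilities under equal prior model probabilities
pmp_approx <- exp(log_bf) / sum(exp(log_bf))
round(pmp_approx, 3)
```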
To motivate the idea behind Occam's window, consider the problem of finding the best subset of predictors in a linear regression model. Following closely the discussion given in Raftery et al. (1997), we would initially start with a very large number of predictors, with the goal of paring this down to a smaller number of predictors that provide accurate predictions. As noted in the earlier quote by Hoeting et al. (1999), the concern in drawing inferences from a single best model is that the choice of a single set of predictors ignores uncertainty in model selection. Occam's window provides an approach to BMA that reduces the subset of models under consideration; but instead of settling on a final "best" model, we integrate over the parameters of the smaller set of models with weights reflecting the posterior uncertainty in each model.
The algorithm proceeds as follows (Raftery et al. 1997). In the initial step, the space of possible models is reduced by implementing the so-called "leaps and bounds" algorithm developed by Furnival and Wilson (1974) in the context of best subsets regression (see also Raftery 1995). This initial step can substantially reduce the number of models, after which Occam's window can be employed. The general idea is that models are eliminated from (4) if they predict the data less well than the model that provides the best predictions, based on a caliper value C chosen in advance by the analyst. The caliper C sets the width of Occam's window. Formally, consider again a set of models {M_k}, k = 1, …, K. Then, the set A′ is defined as
$$ A' = \left\{ M_k : \frac{\max_l \{ p(M_l \mid y) \}}{p(M_k \mid y)} \le C \right\}. \qquad (10) $$

In words, (10) compares the model with the largest posterior model probability, max_l {p(M_l | y)}, to a given model's posterior probability, p(M_k | y). If the ratio in (10) is greater than the chosen value C, then that model is discarded from the set A′ of models to be included in the model averaging. Notice that the set of models contained in A′ is based on Bayes factor values.
The set A′ now contains the models to be considered for model averaging. In the second, optional, step, models are discarded from A′ if they receive less support from the data than simpler sub-models. Formally, models are further excluded from (4) if they belong to the set

$$ B = \left\{ M_k : \exists\, M_l \in A',\ M_l \subset M_k,\ \frac{p(M_l \mid y)}{p(M_k \mid y)} > 1 \right\}. \qquad (11) $$
Again, in words, (11) states that a model M_k is excluded if there exists a model M_l within the set A′ that is simpler than (nested in) M_k and receives more support from the data. That is, if a complex model receives less support from the data than a simpler sub-model, again based on the Bayes factor, then it belongs to B and is excluded from the averaging. Notice that this second step corresponds to the principle of Occam's razor (Madigan and Raftery 1994).
With step 1 and step 2, the problem of reducing the size of the model space for BMA is simplified by replacing (4) with
$$ p(\tilde{y} \mid y, A) = \sum_{M_k \in A} p(\tilde{y} \mid M_k, y)\, p(M_k \mid y, A), \qquad (12) $$

where the models under consideration for BMA are those that are in A′ but not in B. Formally, A = A′ \ B.
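The R sketch below illustrates the first Occam's window step in Eq. (10): models whose posterior probability falls too far below that of the best model are dropped before averaging. The posterior model probabilities and the caliper value are hypothetical, chosen only to make the filtering visible.

```r
# Hypothetical posterior model probabilities for six candidate models
pmp <- c(M1 = 0.40, M2 = 0.25, M3 = 0.15, M4 = 0.10, M5 = 0.07, M6 = 0.03)

# Occam's window caliper (illustrative value): discard a model when the best model's PMP
# is more than C times larger than that model's PMP, as in Eq. (10)
C <- 5
keep <- (max(pmp) / pmp) <= C
A_prime <- names(pmp)[keep]            # models retained in A'

# Renormalize the retained PMPs so they can serve as averaging weights
pmp_Aprime <- pmp[keep] / sum(pmp[keep])
```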
Madigan and Raftery (1994) outline an approach to the choice between two models to be considered for Bayesian model averaging. To make the approach clear, consider the case of just two models, M_1 and M_0, where M_0 is the simpler of the two, as when M_0 contains fewer predictors than M_1 in a regression analysis. In terms of the log posterior odds, if the log odds are positive, indicating support for M_0, then we reject M_1. If the log posterior odds are large and negative, then we reject M_0 in favor of M_1. Finally, if the log posterior odds lie between the pre-set criteria, then both models are retained. For linear regression models, the leaps and bounds algorithm combined with Occam's window is available via the bicreg function in the R package BMA (Raftery et al. 2015).
Markov chain Monte Carlo model composition (MC^3) is based on the Metropolis–Hastings algorithm (see, e.g., Gilks et al. 1996) and is also designed to reduce the space of possible models that can be explored via Bayesian model averaging. Following Hoeting et al. (1999), the MC^3 algorithm proceeds as follows. First, let M represent the space of models of interest; in the case of linear regression, this would be the space of all possible combinations of variables. Next, the theory behind MCMC allows us to construct a Markov chain {M(t), t = 1, 2, …} which converges to the posterior distribution of model k, that is, p(M_k | y).
The manner in which models are retained under MC^3 is as follows. First, for any given model currently explored by the Markov chain, we can define a neighborhood for that model which includes models with one more variable and one fewer variable than the current model. So, for example, if the model space is built from four predictors x1, x2, x3, and x4, and the Markov chain is currently examining the model with x2 and x3, then the neighborhood of this model would include {x2}, {x3}, {x2, x3, x4}, and {x1, x2, x3}. Next, a transition matrix is formed such that moving from the current model M to a new model M′ has probability zero if M′ is not in the neighborhood of M and has a constant probability if M′ is in the neighborhood of M. The model M′ is then accepted for model averaging with probability
$$ \min\left\{ 1, \frac{p(M' \mid y)}{p(M \mid y)} \right\}; $$

otherwise, the chain stays in model M.
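A minimal R sketch of the MC^3 acceptance step just described. The posterior model probabilities here are stand-ins (in practice they are known only up to proportionality via the integrated likelihood times the model prior); the point is the Metropolis-style accept/reject rule.

```r
set.seed(123)

# Stand-in (unnormalized) posterior model probabilities of the current and proposed models;
# in practice these are proportional to the integrated likelihood times the model prior
post_current  <- 0.012   # proportional to p(M  | y)
post_proposed <- 0.020   # proportional to p(M' | y)

# Metropolis-style acceptance: move to M' with probability min(1, p(M'|y) / p(M|y))
accept_prob <- min(1, post_proposed / post_current)
move <- runif(1) < accept_prob   # TRUE: the chain moves to M'; FALSE: it stays at M
```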
One of the key steps when implementing BMA is to choose priors for both the parameters of the model and the model space itself. Discussions of the choice of parameter and model priors can be found in Fernández et al. (2001a); Liang et al. (2008); Eicher et al. (2011); Feldkircher and Zeugner (2009), with applications found in Fernández et al. (2001a) and Kaplan and Huang (under review). A large number of choices for model and parameter priors are implemented in the R software program BMS (Zeugner and Feldkircher 2015). This section discusses the extant choices of parameter and model priors, following closely the discussion given in Kaplan and Huang (under review).
The choice of parameter priors available in the extant BMA software rests on variations of Zellner’s g-prior (Zellner 1986). Specifically, Zellner introduced a natural-conjugate Normal-Gamma g-prior for regression coefficients β under the normal linear regression model, written as,
$$ y_i = x_i' \beta + \varepsilon_i, $$

where ε_i is i.i.d. N(0, σ²). For a given model, say M_k, Zellner's g-prior can be written as
$$ \beta_k \mid \sigma^2, M_k, g \sim N\!\left(0,\ \sigma^2 g\, (x_k' x_k)^{-1}\right). $$

Feldkircher and Zeugner (2009) have argued for using the g-prior for two reasons: its consistency in asymptotically uncovering the true model, and its role as a penalty term for model size.
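As a small illustration of the prior just displayed, the R sketch below builds the g-prior covariance matrix σ²g(X'X)^{-1} for a hypothetical design matrix; the values of N, Q, g, and σ² are arbitrary and used only for demonstration.

```r
set.seed(1)

# Hypothetical design matrix with N = 100 observations and Q = 3 centered predictors
N <- 100; Q <- 3
X <- scale(matrix(rnorm(N * Q), N, Q), center = TRUE, scale = FALSE)

# Zellner g-prior covariance for the regression coefficients: sigma^2 * g * (X'X)^{-1}
sigma2 <- 1
g <- N                                  # the unit information prior sets g = N
prior_cov  <- sigma2 * g * solve(t(X) %*% X)
prior_mean <- rep(0, Q)                 # the g-prior is centered at zero
```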
The g-prior has been the subject of some criticism. In particular, Feldkircher and Zeugner (2009) have pointed out that the particular choice of g can have a very large impact on posterior inferences drawn from BMA. Specifically, small values of g can yield a posterior mass that is spread out across many models, while large values of g can yield a posterior mass that is concentrated on fewer models. Feldkircher and Zeugner (2009) use the term supermodel effect to describe how values of g impact the posterior statistics, including posterior model probabilities (PMPs) and posterior inclusion probabilities (PIPs).
To account for the supermodel effect, researchers (Fernández et al. 2001a; Liang et al. 2008; Eicher et al. 2011; Feldkircher and Zeugner 2009) have proposed alternative priors based on extensions of the work of Zellner (1986). Generally speaking, these alternatives can be divided into two categories: fixed priors and flexible priors. Fernández et al. (2001a) recommended using benchmark priors, which belong to the class of fixed priors, when sample sizes are large. Liang et al. (2008) introduced mixtures of g-priors to address the inconsistency that arises when using fixed priors and showed their advantages compared to other default priors. Instead of only employing model-dependent priors, Feldkircher and Zeugner (2009) proposed a hyper-g prior that "lets the data choose," thus reducing the sensitivity of the posterior mass to the choice of g. In a detailed study, Eicher et al. (2011) compared twelve candidate default priors and concluded that the unit information prior (UIP) combined with the uniform model prior outperformed the other choices.
The set of fixed priors that are available in BMS are (see Zeugner and Feldkircher 2015):
Unit Information Prior: g = N. This is a typical default prior. Liang et al. (2008) suggested using UIP in combination with the uniform model prior to yield the best predictive performance.
Risk Inflation Criterion Prior (RIC): g = Q², where Q is the number of predictors. Foster and George (1994) showed that selecting the model with the highest PMP is equivalent to selecting the model with the highest RIC as long as g = Q².
Benchmark Risk Inflation Criterion (BRIC): g = max(N, Q²). This is a combination of the UIP and RIC: when N ≤ Q², Fernández et al. (2001a) recommend using g = Q², and when N > Q², they recommend g = N in the variable selection context.
Hannan–Quinn Prior (HQ): g = log(N)³. This prior is based on the Hannan–Quinn criterion for model selection, with Hannan and Quinn (1979) advocating C_HQ = 3 for large N.
The set of flexible priors available in BMS is as follows (see Zeugner and Feldkircher 2015):
Local Empirical Bayes (EBL): g_k = max(0, F_k − 1), where F_k = R_k²(N − Q_k − 1) / [(1 − R_k²) Q_k]; F_k is the F-statistic and R_k² is the coefficient of determination for model M_k. This approach estimates g separately for each model with maximum likelihood methods based on the observed data (George and Foster 2000; Liang et al. 2008; Hansen and Yu 2001).
Hyper-g priors: This family of priors was proposed for data-dependent shrinkage. Following Feldkircher and Zeugner (2009), the hyper-g prior is a Beta prior on the shrinkage factor g/(1 + g), that is, g/(1 + g) ~ Beta(1, α/2 − 1), with E[g/(1 + g)] = 2/α. Instead of eliciting g directly, the hyper-g prior requires the elicitation of the hyperparameter α ∈ (2, ∞). As α approaches 2, the prior mass on the shrinkage factor g/(1 + g) concentrates near 1, while for α = 4 the prior on the shrinkage factor is uniform. In the context of noisy data, the hyper-g prior will distribute posterior model probabilities more uniformly across the model space. In the case of low noise in the data, the hyper-g prior will concentrate the posterior mass on a few models, perhaps even more so than in the fixed prior case with large g (Feldkircher and Zeugner 2009).
Here we discuss three model priors that are available in the BMS program: (a) the uniform model prior, (b) the binomial model prior, and (c) the Beta-binomial model prior.
Uniform model prior: The uniform model prior is a common default prior for Bayesian model averaging. Specifically, if there are Q predictors, then the prior probability assigned to each model in the space of 2^Q models is 2^{-Q}. The difficulty with the uniform model prior was pointed out by Zeugner and Feldkircher (2015), who noted that the uniform model prior implies an expected model size of $\sum_{q=0}^{Q} \binom{Q}{q}\, q\, 2^{-Q} = Q/2$. However, the distribution of model sizes is not even: there are more models of size 2 or 5 than there are of size 1 or 6. The result is that the uniform model prior actually places more mass on intermediate-size models. A demonstration of the impact of this problem is given in Zeugner and Feldkircher (2015).
Binomial model prior: To address the problem with the uniform model prior, Zeugner and Feldkircher (2015) proposed placing a fixed inclusion probability, denoted as θ, on each predictor. Then, for model k of size q_k, the prior model probability is p(M_k) = θ^{q_k} (1 − θ)^{Q − q_k}. Notice that the expected model size, say m̄, is Qθ, and thus the analyst's prior expected model size is m̄. Moreover, if θ = .5, then the binomial model prior reduces to the uniform model prior. In practice, this suggests that choosing θ < .5 weights the posterior mass toward smaller models, and vice versa (Zeugner and Feldkircher 2015).
Beta-binomial model prior: The binomial prior discussed above suffers from the fact that the inclusion probability θ is fixed. Following Ley and Steel (2009), greater flexibility is provided by treating θ as random. A logical choice for the probability distribution of θ is the Beta distribution with hyperparameters a , b > 0 , viz., θ ∼ Beta ( a , b ) . Under the Beta-binomial prior the first and second moments of the model size m ¯ are,
$$ E(\bar{m}) = \frac{a}{a+b}\, Q, \qquad \operatorname{var}(\bar{m}) = \frac{ab\,(a+b+Q)}{(a+b)^2\,(a+b+1)}\, Q. $$

With such a wide variety of choices available for parameter and model priors, it is important to have a method for evaluating the impact of these choices when applying BMA to substantive problems. Given that the utility of BMA lies in its optimal predictive performance, a reasonable method for evaluation should be based on measures that assess predictive performance, referred to as scoring rules.
Scoring rules provide a measure of the accuracy of probabilistic predictions (or synonymously, forecasts), and a prediction can be said to be "well-calibrated" if the assigned probability of the outcome matches the actual proportion of times that the outcome occurred (Dawid 1982). 1
A scoring rule is a utility function (Gneiting and Raftery 2007), and the goal of the forecaster is to be honest and provide a forecast that will maximize the utility. One can consider scoring rules from a subjectivist Bayesian perspective. Here, Winkler (1996) quotes de Finetti (1962, pg. 359):
The scoring rule is constructed according to the basic idea that the resulting device should oblige each participant to express his true feelings, because any departure from his own personal probability results in a diminution of his own average score as he sees it.
Because scoring rules only require the stated probabilities and the realized outcomes, they can be developed for ex-post or ex-ante probability evaluations. However, as suggested by Winkler (1996), the ex-ante perspective on probability evaluation should lead us to consider strictly proper scoring rules, because these rules are maximized in expectation if and only if the forecaster honestly reports their true probabilities.
Following the discussion and notation given in Winkler (1996; see also Jose et al. 2008), let p represent the forecaster's subjective probability distribution of an outcome of interest, let r represent the forecaster's reported forecast probability, and let e_i represent the probability distribution that assigns probability one if event i occurs and probability zero to all other events. Then, a scoring rule, denoted as S(r, p), provides a score S(r, e_i) if event i occurs. The expected score obtained when the forecaster reports r when their true distribution is p is

$$ S(r, p) = \sum_i p_i\, S(r, e_i). $$

The scoring rule is strictly proper if S(p, p) ≥ S(r, p) for every r and p, with equality when r = p (Jose et al. 2008, pg. 1147).
A large number of scoring rules have been reviewed in the literature (see, e.g., Winkler 1996; Bernardo and Smith 2000; Jose et al. 2008; Merkle and Steyvers 2013; Gneiting and Raftery 2007). Here, however, we highlight two related strictly proper scoring rules that are commonly used to evaluate predictions arising from Bayesian model averaging: the log predictive density score, and the Kullback–Leibler divergence score (see, e.g., Fernández et al. 2001b; Hoeting et al. 1999; Kaplan and Yavuz 2019; Kaplan and Huang, under review, for examples).
The log predictive density score (LPS) (Good 1952; Bernardo and Smith 2000) can be written as
$$ \mathrm{LPS} = -\sum_i \log p(\tilde{y}_i \mid x, y, \tilde{x}_i), $$

where p(ỹ_i | x, y, x̃_i) is the predictive density for the i-th person, x and y represent the model information for the remaining individuals, and x̃_i is the information on the predictors for individual i. The model with the lowest log predictive score is deemed best in terms of long-run predictive performance.
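A minimal R sketch of the log predictive score for a continuous outcome, assuming, purely for illustration, normal predictive densities whose means and standard deviations would come from some fitted model; all numbers are hypothetical.

```r
# Hypothetical held-out outcomes and the corresponding predictive means and SDs from a model
y_new   <- c(512, 478, 530, 465)
pred_mu <- c(505, 470, 520, 480)
pred_sd <- c(40, 40, 40, 40)

# Log predictive score: negative sum of log predictive densities at the observed values
lps <- -sum(dnorm(y_new, mean = pred_mu, sd = pred_sd, log = TRUE))
lps   # lower values indicate better long-run predictive performance
```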
Closely related to the log predictive score is the Kullback–Leibler divergence (KLD) score, also referred to as relative entropy (Kullback and Leibler 1951; Kullback 1959, 1987). Here we consider two distributions, p(y) and g(y | θ), where p(y) could be the distribution of observed reading literacy scores, and g(y | θ) could be the prediction of these reading scores based on a model. The KLD between these two distributions can be written as
$$ \mathrm{KLD}(p, g) = \int p(y) \log \frac{p(y)}{g(y \mid \theta)}\, dy, $$

where KLD(p, g) is the information lost when g(y | θ) is used to approximate p(y). For example, the actual reading outcome scores might be compared to the outcomes predicted using Bayesian model averaging under different choices of model and parameter priors. The model with the lowest KLD measure is deemed best in the sense that it loses the least information when its predictive distribution is used to approximate the distribution of the actual reading outcomes.
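As an illustration of the divergence just defined, the R sketch below computes the KLD between two normal densities by numerical integration; the "true" and "approximating" distributions are hypothetical stand-ins for the observed and model-predicted reading score distributions.

```r
# Hypothetical "true" log-density log p(y) and model-based approximation log g(y | theta)
log_p <- function(y) dnorm(y, mean = 500, sd = 90,  log = TRUE)   # stand-in for observed scores
log_g <- function(y) dnorm(y, mean = 490, sd = 100, log = TRUE)   # stand-in for predicted scores

# KLD(p, g) = E_p[ log p(y) - log g(y) ], evaluated here by numerical integration
integrand <- function(y) exp(log_p(y)) * (log_p(y) - log_g(y))
kld <- integrate(integrand, lower = -Inf, upper = Inf)$value
kld   # smaller values mean less information is lost when g approximates p
```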
The LPS and KLD scoring rules are applicable to continuous outcomes and we will focus on these two in our example below. However, it should be noted that BMA can be applied to binary outcomes, and here, a popular scoring rule that is based on quadratic loss is the Brier score (Brier 1950). Following the notation above, the Brier score can be defined as
$$ -(\mathbf{e}_i - \mathbf{p})^2, $$

where the forecast p estimates an indicator vector of an event e. For example, p may represent the forecast probability of rain on a given day, and e_i represents the realization of the event, scored 1/0 according to whether rain occurred on that day or not. The Brier score penalizes the forecaster in proportion to the squared Euclidean distance between the forecast and the event (Jose et al. 2008, p. 1148).
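A tiny R illustration of the Brier score for a sequence of binary forecasts, computed here in its common loss form (lower is better) rather than as the negatively signed utility above; the probabilities and outcomes are made up.

```r
# Hypothetical forecast probabilities of an event and the realized outcomes (1 = occurred)
p_forecast <- c(0.9, 0.2, 0.7, 0.4)
event      <- c(1,   0,   1,   1)

# Mean Brier score: average squared distance between forecast and outcome (lower is better)
brier <- mean((event - p_forecast)^2)
brier
```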
This example will draw on reading literacy data obtained from the Program for International Student Assessment (PISA). Launched in 2000 by the Organization for Economic Cooperation and Development, PISA is a triennial international survey that aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students. In 2018, 600,000 students, statistically representative of 32 million 15-year-old students in 79 countries and economies, took an internationally agreed-upon two-hour test. Students were assessed in science, mathematics, reading, collaborative problem solving, and financial literacy. PISA is arguably the most important policy-relevant international survey that is currently operating (OECD 2002).
The benefit of developing optimally predictive models for large-scale assessments such as PISA lies in recognizing that these assessments can be used to monitor progress toward internationally agreed-upon educational goals such as the United Nations (UN) Sustainable Development Goals (SDGs). Also, as educational systems around the world face new challenges due to the COVID-19 pandemic, developing optimally predictive models may help identify the long-run impact of this unprecedented health crisis on global education.
Following the overview in Kaplan and Kuger (2016), the sampling framework for PISA follows a two-stage stratified sample design. Each country/economy provides a list of all “PISA-eligible” schools, and this list constitutes the sampling frame. Schools are then sampled from this frame with sampling probabilities that are proportional to the size of the school, with the size being a function of the estimated number of PISA-eligible students in the school. The second stage of the design requires sampling students within the sampled schools. A target cluster size of 35 students within schools was desired, though for some countries, this target cluster size was negotiable.
We will focus on the reading literacy results from PISA 2018. The method of assessment for PISA follows closely the spiraling design and plausible value methodologies originally developed for the National Assessment of Educational Progress (see, e.g., OECD 2017). In addition to these so-called "cognitive outcomes," policymakers and researchers alike have begun to focus increasing attention on the non-academic contextual aspects of schooling. Context questionnaires provide important variables for models predicting cognitive outcomes, and these variables have become important outcomes in their own right, often referred to as "non-cognitive outcomes" (see, e.g., Heckman and Kautz 2012). PISA also assesses these non-cognitive outcomes via a half-hour internationally agreed-upon context questionnaire (see Kuger et al. 2016). The list of variables used in this example is given in Table 1.
PISA 2018 predictors of reading literacy.
Variable name | Variable label |
---|---|
FEMALE | Sex (1 = Female) |
ESCS | Index of economic, social, and cultural status |
METASUM | Meta-cognition: summarising |
PERFEED | Perceived feedback |
HOMEPOS | Home possessions (WLE) a |
ADAPTIVE | Adaptive instruction (WLE) |
TEACHINT | Perceived teacher’s interest |
ICTRES | ICT resources (WLE) |
JOYREAD | Joy/Like reading |
ATTLNACT | Attitude towards school: learning activities |
COMPETE | Competitiveness |
WORKMAST | Work mastery |
GFOFAIL | General fear of failure |
SWBP | Subjective well-being: positive affect |
MASTGOAL | Mastery goal orientation |
BELONG | Subjective well-being: sense of belonging to school (WLE) |
SCREADCOMP | Perception of reading competence |
SCREADDIFF | Perception of reading difficulty (WLE) |
PISADIFF | Perception of difficulty of the PISA test |
PV1READ | First plausible value reading score (outcome variable) b |
a Weighted likelihood estimates. See OECD (2018) for more details.
b Plausible values. See OECD (2018) for more details.
For this example, we use the software package BMS (Zeugner and Feldkircher 2015), which implements the so-called birth/death (BD) algorithm as a default for conducting MC^3. It is beyond the scope of this paper to describe the BD algorithm or the other algorithm choices in the BMS program; see Zeugner and Feldkircher (2015) for more detail.
The analysis steps for this example are as follows:
We begin by implementing BMA with default unit information priors for the model parameters and the uniform prior on the model space (an illustrative code sketch of this default setup is given after this list). We will outline the major components of the results, including the posterior model probabilities and the posterior inclusion probabilities.
We next examine the results under different combinations of parameter and model priors available in BMS and compare results using the LPS and KLD.
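The following R sketch shows how such a default analysis might be set up with the bms() function in BMS, assuming the interface documented in Zeugner and Feldkircher (2015). The data frame name ('pisa') and the sampler settings are placeholders, not the exact code used for the analysis reported below.

```r
library(BMS)

# 'pisa' is a placeholder data frame whose first column is the outcome (PV1READ) and whose
# remaining columns are the predictors listed in Table 1
bma_fit <- bms(pisa,
               burn = 10000, iter = 50000,  # MC^3 sampler settings (placeholders)
               g = "UIP",                   # unit information parameter prior, g = N
               mprior = "uniform",          # uniform prior over the model space
               mcmc = "bd")                 # birth/death sampler, the BMS default

coef(bma_fit)                    # PIPs, posterior means, posterior SDs, Cond.Pos.Sign
topmodels.bma(bma_fit)[, 1:5]    # inclusion patterns and PMPs of the top five models
```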
The Bayesian model averaging results under unit information priors for the model parameters and the uniform prior for the model space are shown in Tables 2 and 3. It should be noted that the results demonstrate some sensitivity to the choice of parameter and model priors; results under other prior settings are available on the author's website (http://bmer.wceruw.org/index.html). We note that there are 19 predictors and thus 2^19 = 524,288 models in the full space of models to be visited. Table 2 presents a summary of the birth/death algorithm used to implement MC^3 in BMS. We find that the algorithm visited only 471 models (0.09%) out of the total model space; however, these models accounted for 100% of the posterior model mass. 2 The column labeled "Avg. # predictors" shows that across all of the models explored by the algorithm, the average number of predictors was 11.8 out of 19.
Summary of birth/death algorithm and top posterior model probabilities.
Algorithm summary:

Model space (2^K) | Models visited | % visited | % top models | Avg. # predictors |
---|---|---|---|---|
524,288 | 471 | 0.09 | 100 | 11.8 |

PMPs for top five models:

Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
---|---|---|---|---|
0.35 | 0.20 | 0.06 | 0.04 | 0.03 |
Summary of BMA with unit information parameter priors and uniform model priors.
Predictor | PIP | Post. Coef. | Post. SD | Cond. Pos. Sign |
---|---|---|---|---|
ESCS | 1.00 | 18.97 | 1.30 | 1.00 |
METASUM | 1.00 | 27.99 | 1.29 | 1.00 |
TEACHINT | 1.00 | 12.53 | 1.51 | 1.00 |
JOYREAD | 1.00 | 10.33 | 1.32 | 1.00 |
GFOFAIL | 1.00 | 11.06 | 1.21 | 1.00 |
MASTGOAL | 1.00 | - 13.34 | 1.50 | 0.00 |
SCREADCOMP | 1.00 | 10.13 | 1.49 | 1.00 |
PISADIFF | 1.00 | - 29.71 | 1.46 | 0.00 |
PERFEED | 0.98 | - 5.00 | 1.55 | 0.00 |
SWBP | 0.87 | - 3.95 | 1.92 | 0.00 |
WORKMAST | 0.86 | 4.28 | 2.16 | 1.00 |
FEMALE | 0.64 | 5.14 | 4.37 | 1.00 |
ADAPTIVE | 0.12 | 0.45 | 1.31 | 1.00 |
SCREADDIFF | 0.08 | - 0.19 | 0.78 | 0.00 |
COMPETE | 0.07 | 0.19 | 0.79 | 1.00 |
BELONG | 0.05 | - 0.15 | 0.72 | 0.00 |
HOMEPOS | 0.02 | 0.02 | 0.29 | 1.00 |
ICTRES | 0.01 | - 0.02 | 0.23 | 0.00 |
ATTLNACT | 0.00 | 0.00 | 0.09 | 1.00 |
In the second row of Table 2 we present the posterior model probabilities (PMPs) associated with the top five models out of the 471 models explored by the algorithm. It is important to note that Model 1 would also be associated with the lowest Bayesian information criterion. Hence, on the basis of the low PMP for Model 1 (0.35), we can see that selecting Model 1, and acting as though this is the model we considered ahead of time, considerably underestimates the uncertainty in our model choice. Moreover, as Clyde and Iversen (2013) remind us, this model might not be the one closest to the BMA solution.
The results of the BMA are shown in Table 3. The column labeled "PIP" shows the posterior inclusion probability for each variable, referring to the proportion of models visited by the algorithm in which the variable appeared. For example, the PIP for ESCS is 1.00, meaning that across all the models selected by the algorithm, ESCS appeared in 100% of the models. In contrast, ATTLNACT appears in only 0.09% of the models visited by the algorithm. The PIP thus provides a different perspective on variable importance. The columns labeled "Post. Coef." and "Post. SD" are the model-averaged posterior estimates of the regression coefficients and their posterior standard deviations, respectively. The column labeled "Cond. Pos. Sign" refers to the probability that the sign of the respective regression coefficient is positive, conditional on its inclusion in the model. We find, for example, that the sign of ESCS is positive in 100% of the models in which ESCS appears. By contrast, the probability that the sign of the PISADIFF effect is positive is zero, meaning that in 100% of the models visited by the algorithm, the sign of PISADIFF is negative. 3
We find that the first 12 predictors (ESCS through FEMALE) have relatively high PIPs. The majority of these predictors have PIPs of 1.0, indicating their importance. It is also interesting to note that these predictors contain a mix of demographic measures (e.g., ESCS, FEMALE), attitudes/perceptions (e.g., TEACHINT, JOYREAD, SCREADCOMP), and cognitive strategies involved in reading (e.g., METASUM).
Table 4 presents the results of comparing LPS and KLD values for different parameter priors under the fixed prior setting (upper half) and the flexible prior setting (lower half), respectively. Owing to the large sample size, the findings show relative robustness to the choice of parameter and model priors.
KLD and LPS values for fixed and flexible priors.
Model prior | KLD: UIP | KLD: RIC | KLD: BRIC | KLD: HQ | LPS: UIP | LPS: RIC | LPS: BRIC | LPS: HQ |
---|---|---|---|---|---|---|---|---|
Uniform | 0.015 | 0.015 | 0.015 | 0.015 | 5.843 | 5.842 | 5.843 | 5.842 |
Binomial (m = 2) | 0.015 | 0.015 | 0.015 | 0.015 | 5.844 | 5.844 | 5.844 | 5.844 |
Binomial (m = 4) | 0.015 | 0.015 | 0.015 | 0.015 | 5.844 | 5.843 | 5.844 | 5.843 |
Beta-binomial | 0.015 | 0.015 | 0.015 | 0.015 | 5.843 | 5.842 | 5.843 | 5.842 |
Model prior | KLD: EBL | KLD: HQ-3 | KLD: HQ-4 | KLD: HG-UIP | KLD: HG-RIC | KLD: HG-BRIC | LPS: EBL | LPS: HQ-3 | LPS: HQ-4 | LPS: HG-UIP | LPS: HG-RIC | LPS: HG-BRIC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Uniform | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 5.841 | 5.841 | 5.841 | 5.841 | 5.841 | 5.841 |
Binomial (m = 2) | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 5.844 | 5.844 | 5.844 | 5.844 | 5.844 | 5.844 |
Binomial (m = 4) | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 5.843 | 5.843 | 5.843 | 5.843 | 5.843 | 5.843 |
Beta-binomial | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 5.841 | 5.841 | 5.841 | 5.841 | 5.841 | 5.841 |
These results are based on a sample size of 2500 respondents due to a sample size limitation in the BMS software.
Earlier, our discussion made mention of belief models and true models, and a perusal of the extant literature on model averaging reveals a distinction between a so-called actual belief model (sometimes referred to as a Bayesian belief model), M*, and a true model, denoted as M_T. Unfortunately, there does not appear to be a consensus about the meaning of belief models or true models, and in some cases they are viewed as roughly the same thing. For example, Bernardo and Smith (2000) introduced the idea of the actual belief model but seem to be describing it as the true model, and in fact labeled the actual belief model as M_T, suggesting that M* is the true but unknown data generating model. This position appears to be held by Clyde and Iversen (2013), who adopt a similar notation and seem to use the terms "belief model" and "true model" interchangeably.
In contrast, Vehtari and Ojanen (2012) suggest that M* is different from M_T and is something that we have access to insofar as it is derived from what we have learned from our encounter with the data. For our example, M* would be the result of what has been learned from the construction and criticism of a substantive model of reading literacy. That is, if (a) one has specified a reasonable probability model for the reading literacy outcome, (b) one has access to a rich enough set of policy-relevant predictors of reading literacy, (c) all of the important prior uncertainties have been captured, and (d) the model has withstood criticism, say in the form of posterior predictive checks (Gelman et al. 1996), then this model is M*. Regarding M_T, Vehtari and Ojanen (2012, p. 155) suggest that "the properties of the true model are specified by the modeller a priori, and they are not learned from the data properties." Vehtari and Ojanen (2012) view the specification of M_T in very general terms, such as an i.i.d. assumption regarding the outcome of interest.
With respect to model averaging, the distinction between M* and M_T is important in practice. First, BMA assumes that M_T ∈ M. If that assumption does not hold, then conventional BMA does not make sense, because the priors on the model space are elicited to reflect the analyst's belief about the existence of the true model within the full set of models under consideration. Second, as noted by Vehtari and Ojanen (2012), the process of model averaging begins with assuming the existence of a true model that must be approximated from the data, and this approximation is often based on an actual belief model that the researcher implicitly holds. Moreover, when using BMA software such as BMA or BMS, an initial model must be specified to initiate the search through the model space. The question that arises is whether this initial model is M* or M_T. It does not seem reasonable that this model is M_T, if by M_T we mean some general probability model for the outcome, not conditional on any covariates. Rather, this initial model is much more akin to what is meant by M*, having perhaps resulted from giving careful consideration to an initial model based on past research, expert opinion, and so forth.
To clarify terminology, we prefer the following set of distinctions. First, there is a true model M_T that we do not fully know, except in the case of computer simulation studies (Clarke and Clarke 2018), and it may be a highly complex process associated with many covariates in perhaps complicated non-linear and structural relationships. Second, M* is our best, or most convenient, approximation to M_T and forms the empirical launching point for model averaging. Finally, the goal of model averaging is to start with M* and locate a model M_k that is as close to M_T as possible, with "closeness" defined in terms of an index such as the Kullback–Leibler divergence (Kullback and Leibler 1951; Kullback 1959, 1987).
Our discussion of belief models and true models is necessary in order to address a critical problem when conducting BMA in practice, namely whether it is reasonable to assume that M_T ∈ M. If we assume that M_T is in the space of models under consideration, this is referred to as the M-closed framework, introduced by Bernardo and Smith (2000) and further discussed in Clyde and Iversen (2013).
As implied earlier, the M-closed framework for BMA may be especially difficult to warrant in the social and behavioral sciences. Nevertheless, as pointed out by Bernardo and Smith (2000), there may be cases in which it is reasonable to act as though there is a true model, keeping in mind that Bernardo and Smith (2000) seem to suggest that M_T is what we are referring to here as M*. For example, we may wish to act as though M-closed holds when a model has demonstrated good predictive capabilities under a wide variety of situations, even though new uncertainties arise when it is carried into a new situation. Taking our example of the prediction of reading literacy, an analyst typically would specify a set of variables that cover the range of what theory and past analyses have suggested are important predictors of reading competencies: (a) measures of socio-demographics, (b) measures of teacher practices in support of reading literacy, (c) perceptions of classroom and school environments, including instructional resources, and (d) student attitudes and dispositions toward reading. In this example, Bayesian analyses from prior relevant studies such as PISA 2009 (the previous reading-focused cycle of PISA) (OECD 2009) might serve to provide informative priors for the analysis of reading data from PISA 2018. Given policy and research papers that derive from the analyses of PISA 2018, it may be reasonable to specify an initial belief model, M*. However, the analyst might now recognize that there is uncertainty in that choice of M* and wish to address that uncertainty using BMA to optimize the prediction of reading competencies for future cycles of PISA. Such uncertainties may arise from applying a model estimated on one PISA country to another, or because changes in educational policies and practices due to the COVID-19 pandemic attach much greater uncertainty to a model that has worked well in the past. As long as the analyst is comfortable assigning model priors, the M-closed framework can be adopted. Nevertheless, the truth or falsity of the M-closed framework notwithstanding, it is important to reiterate that conventional BMA takes place under the M-closed framework, and, indeed, readily available BMA software typically employs a non-informative prior over the space of models as a default, with the idea that the true model lies in the model space.
With the M-closed assumption unlikely to hold in practice, we are faced with the problem of how to obtain the benefits of model averaging with respect to predictive accuracy. One approach would be to create a list of simpler "proxy" linear models, {M_k}, k = 1, …, K, specified for clarity of communication and ease of analysis (Bernardo and Smith 2000). Each of these models would be evaluated in light of the true model. This is referred to as the M-complete framework (Bernardo and Smith 2000). Under M-complete, BMA would not, in principle, be conducted, as it does not make sense to place a discrete prior on the model space when one does not believe that M_T ∈ M. Instead, as suggested by Clyde and Iversen (2013), Yao et al. (2018), and Vehtari and Ojanen (2012), one simply selects the model M_k ∈ M that maximizes expected utility with respect to the predictive distributions. However, this means that a single model is being used for predictive purposes, with the result that model uncertainty is again not being addressed.
More recently, Clarke and Clarke (2018) discussed the idea that M-complete could constitute a range of inaccessibility to M_T, and that methods such as BMA could be justified under M-complete insofar as the model priors would encode one's belief as to how good an approximation a given model is to M_T under an M-closed problem. In any event, modeling under the M-complete framework does not provide an approach that directly addresses the problem of model uncertainty when M-closed is hard to maintain.
If it is difficult to warrant model priors as required under M-closed, and if selecting a single model under M-complete that maximizes expected utility is not satisfactory, then we need an approach that allows for model averaging without the need to assume that M_T ∈ M. This is referred to as the M-open framework (Bernardo and Smith 2000). An example of an M-open problem is specifying a set of regression models with different choices of predictors. These different regression models would represent reasonable alternative belief models. Then, rather than using posterior model probabilities as weights, each model would yield a separate score without presuming the existence of a true model underlying any of them. The models would then be combined using their scores as weights, and the resulting predictive distribution would be obtained. This type of model averaging in the M-open framework describes the methodology of Bayesian stacking, which we consider next.
The method of stacking was originally developed in the machine learning literature by Wolpert (1992) and Breiman (1996) and brought into the Bayesian paradigm by Clyde and Iversen (2013). The basic idea behind stacking is to enumerate a set of $K$ ($k = 1, 2, \ldots, K$) models and then create a weighted combination of their predictions. Returning to our reading literacy example, we can specify a set of candidate (belief) models of reading literacy as
$$y = f_k(x) + \epsilon,$$
where $f_k$ are different models of the reading literacy outcome; e.g., some models may include only demographic predictors, while others may include various combinations of attitudes and behaviors related to reading literacy. Predictions from these models are then combined (stacked) as (see Le and Clarke 2017)
$$\tilde{y} = \sum_{k=1}^{K} \hat{w}_k \hat{f}_k(x),$$
where $\hat{f}_k$ estimates $f_k$. The weights $\hat{w}_k$ ($\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_K$) are obtained as
$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \left( y_i - \sum_{k=1}^{K} w_k \hat{f}_{k,-i}(x_i) \right)^2, \tag{24}$$
where $\hat{f}_{k,-i}(x_i)$ is an estimate of $f_k$ based on $n-1$ observations, leaving the $i$th observation out.
We see from (24) that a method is needed to estimate $f_k$ based on $n-1$ observations, leaving the $i$th observation out, and the most common approach is leave-one-out cross-validation (LOOCV). LOOCV is a special case of $k$-fold cross-validation ($k$-fold CV) with $k = n$. In $k$-fold CV, the sample is split into $k$ groups (folds), and each fold in turn serves as the validation set with the remaining $k-1$ folds serving as the training set. For LOOCV, each observation serves as the validation set with the remaining $n-1$ observations serving as the training set. Leave-one-out cross-validation is available in the R software program loo (Vehtari et al. 2019).⁴
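To make the structure of (24) concrete, the following is a minimal R sketch, not the code used for the analyses in this paper: it assumes a hypothetical data frame dat with outcome y and predictors x1 and x2, takes each $f_k$ to be an ordinary linear regression, computes the leave-one-out predictions by brute-force refitting, and constrains the weights to the simplex through a softmax parameterization.

```r
# Minimal sketch of squared-error stacking as in Eq. (24). Assumes a data
# frame 'dat' with outcome 'y' and hypothetical predictors 'x1' and 'x2'.
model_formulas <- list(y ~ x1, y ~ x2, y ~ x1 + x2)  # hypothetical candidate models

n <- nrow(dat)
K <- length(model_formulas)

# Leave-one-out predictions f_hat_{k,-i}(x_i): refit each model n times,
# each time leaving out observation i and then predicting it
loo_pred <- matrix(NA_real_, nrow = n, ncol = K)
for (k in seq_len(K)) {
  for (i in seq_len(n)) {
    fit_ki <- lm(model_formulas[[k]], data = dat[-i, ])
    loo_pred[i, k] <- predict(fit_ki, newdata = dat[i, , drop = FALSE])
  }
}

# Minimize the squared-error criterion over weights on the simplex,
# using a softmax parameterization of the K weights
objective <- function(par) {
  w <- exp(c(par, 0)); w <- w / sum(w)
  sum((dat$y - as.vector(loo_pred %*% w))^2)
}
opt   <- optim(rep(0, K - 1), objective, method = "BFGS")
w_hat <- exp(c(opt$par, 0)); w_hat <- w_hat / sum(w_hat)
w_hat  # estimated stacking weights
```

The Bayesian stacking weights discussed below are instead based on the log score and approximate LOOCV rather than squared error and exhaustive refitting, but the logic of combining leave-one-out predictions with optimized weights is the same.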
Following Vehtari et al. (2017), let $y = (y_1, \ldots, y_n)$ be an $n$-dimensional vector of data following a distribution conditional on parameters $\theta$, viz., $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$. Given a prior distribution $p(\theta)$ on the parameters, we can obtain the posterior distribution $p(\theta \mid y)$, as well as the posterior predictive distribution of a predicted value $\tilde{y}_i$, written as $p(\tilde{y}_i \mid y) = \int p(\tilde{y}_i \mid \theta)\, p(\theta \mid y)\, d\theta$. Bayesian LOOCV rests on the derivation of the expected log point-wise predictive density (elpd) for new data, defined as
$$\text{elpd} = \sum_{i=1}^{n} \int p_t(\tilde{y}_i) \log p(\tilde{y}_i \mid y)\, d\tilde{y}_i, \tag{25}$$
where $p_t(\tilde{y}_i)$ represents the distribution of the true but unknown data-generating process for the predicted values $\tilde{y}_i$, and where (25) is approximated by cross-validation procedures. The elpd provides a measure of predictive accuracy for the $n$ data points taken one at a time (Vehtari et al. 2017). From here, the Bayesian LOO estimate can be written as
$$\text{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),$$
where
$$p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta$$
is the leave-one-out predictive distribution, with the log predictive score used to assess predictive accuracy.
It is useful to note that an information criterion based on LOO (LOOIC) can be easily derived as
$$\text{LOOIC} = -2\, \widehat{\text{elpd}}_{\text{loo}},$$
which places the LOOIC on the “deviance scale” (see Vehtari et al. 2017 for more details on the implementation of the LOOIC in loo). Among a set of competing models, the one with the smallest LOOIC is considered best from an out-of-sample point-wise predictive point of view.
As pointed out by Vehtari et al. (2017), exact LOOCV values can be time-consuming to calculate, and this may be one reason why LOOCV is not widely adopted. To remedy this, Vehtari et al. (2017) developed a fast and stable approach to approximating LOOCV referred to as Pareto-smoothed importance sampling LOO (PSIS-LOO). The PSIS approach is implemented in loo (Vehtari et al. 2019).
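For illustration only, and assuming a hypothetical data frame pisa2018 with a reading outcome named readscore, the following sketch shows how $\widehat{\text{elpd}}_{\text{loo}}$, the LOOIC, and the PSIS diagnostics can be obtained for a single rstanarm model with loo.

```r
library(rstanarm)  # Bayesian regression models estimated with Stan
library(loo)       # PSIS-LOO cross-validation (Vehtari et al. 2019)

# Hypothetical fit: reading outcome regressed on two of the PISA attitude scales
fit <- stan_glm(readscore ~ JOYREAD + PISADIFF, data = pisa2018,
                family = gaussian(), seed = 1234)

# PSIS-LOO: reports elpd_loo with its standard error, p_loo, and
# looic = -2 * elpd_loo
loo_fit <- loo(fit)
print(loo_fit)

# Pareto k diagnostics: counts of observations for which the importance-sampling
# approximation to exact LOOCV may be unreliable
pareto_k_table(loo_fit)
```

The print method reports point estimates and standard errors for elpd_loo, the effective number of parameters p_loo, and looic, while the Pareto $k$ table indicates, observation by observation, whether the PSIS approximation can be trusted.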
It is interesting to note that LOOCV has connections to other types of weights that can be used for stacking. For example, in the case of maximum likelihood estimation, LOOCV weights are asymptotically equivalent to the Akaike information criterion (AIC) weights (Akaike 1973) used in frequentist model averaging applications (Yao et al. 2018; see also Burnham and Anderson 2002; Fletcher 2018). In addition, so-called pseudo-BMA weights were proposed by Geisser and Eddy (1979; see also Gelfand 1996). This approach replaces marginal likelihoods with Bayesian LOOCV predictive densities. The difficulty with pseudo-BMA weights is that they do not take into account uncertainty in future data distributions. To address this, Yao et al. (2018) proposed an approach that combines the Bayesian bootstrap (see Rubin 1981) with the elpd defined earlier. They refer to this approach as pseudo-BMA+ and show that it performs better than BMA and pseudo-BMA in $\mathcal{M}$-open settings, but not as well as stacking based on the log score.
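To see how these weighting schemes differ computationally, the sketch below assumes that loo1, ..., loo4 are loo objects obtained as above for four candidate models. Pseudo-BMA weights are proportional to $\exp(\widehat{\text{elpd}}_{\text{loo},k})$, pseudo-BMA+ stabilizes these with the Bayesian bootstrap over the pointwise elpd contributions, and stacking weights are found by optimizing the log score; all three are available through loo_model_weights() in the loo package.

```r
# Assume loo1, ..., loo4 are loo objects for four candidate models (see above)
loo_list <- list(loo1, loo2, loo3, loo4)

# Pseudo-BMA weights "by hand": proportional to exp(elpd_loo), normalized
# (subtracting the maximum first for numerical stability)
elpds        <- sapply(loo_list, function(l) l$estimates["elpd_loo", "Estimate"])
w_pseudo_bma <- exp(elpds - max(elpds)) / sum(exp(elpds - max(elpds)))

# The same weights via loo, plus pseudo-BMA+ and log-score stacking
loo_model_weights(loo_list, method = "pseudobma", BB = FALSE)  # pseudo-BMA
loo_model_weights(loo_list, method = "pseudobma", BB = TRUE)   # pseudo-BMA+ (Bayesian bootstrap)
loo_model_weights(loo_list, method = "stacking")               # stacking on the log score
```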
For this paper, we demonstrate Bayesian stacking using the software program loo with the same PISA 2018 data set used to demonstrate BMA. The analysis steps for this demonstration are as follows:
1. Specify four models of reading literacy. From Table 1, Model 1 includes only demographic measures (FEMALE, ESCS, HOMEPOS, ICTRES); Model 2 includes only attitudes and behaviors specifically directed toward reading (JOYREAD, PISADIFF, SCREADCOMP, SCREADDIF); Model 3 includes predictors related to academic mindset as well as general well-being (METASUM, GFOFAIL, MASTGOAL, SWBP, WORKMAST, ADAPTIVITY, COMPETE); and Model 4 includes attitudes toward school (PERFEED, TEACHINT, BELONG).
2. Obtain results for log-score stacking weights, pseudo-BMA weights, and pseudo-BMA+ weights.
3. Obtain posterior predictive distributions using the R software program rstanarm (Goodrich et al. 2020).
4. Obtain KLD measures comparing the predicted distribution of reading scores to the observed distribution.

All code and data for this example are available on the author's website (http://bmer.wceruw.org/index.html); a simplified illustrative sketch of these steps follows.
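As a rough guide to the workflow, the following sketch mirrors the steps above using rstanarm and loo. The data frame name pisa2018, the outcome name readscore, the default priors, and the mixture construction of the stacked predictive distribution are assumptions made for illustration; the exact code used for the reported results is available at the URL above.

```r
library(rstanarm)
library(loo)

# Step 1: the four candidate models, using the Table 1 variable groupings; the
# data frame 'pisa2018' and outcome name 'readscore' are placeholders
forms <- list(
  readscore ~ FEMALE + ESCS + HOMEPOS + ICTRES,
  readscore ~ JOYREAD + PISADIFF + SCREADCOMP + SCREADDIF,
  readscore ~ METASUM + GFOFAIL + MASTGOAL + SWBP + WORKMAST + ADAPTIVITY + COMPETE,
  readscore ~ PERFEED + TEACHINT + BELONG
)
fits <- lapply(forms, function(f)
  stan_glm(f, data = pisa2018, family = gaussian(), seed = 1234))

# Step 2: stacking, pseudo-BMA, and pseudo-BMA+ weights from the loo objects
loos      <- lapply(fits, loo)
w_stack   <- loo_model_weights(loos, method = "stacking")
w_pbma    <- loo_model_weights(loos, method = "pseudobma", BB = FALSE)
w_pbma_bb <- loo_model_weights(loos, method = "pseudobma", BB = TRUE)

# Step 3: posterior predictive draws from each model (S draws by n students),
# combined into a stacked predictive by sampling model indices with the
# stacking weights (one common way to form the mixture; an assumption here)
yrep <- lapply(fits, posterior_predict)
S    <- nrow(yrep[[1]])
pick <- sample(seq_along(fits), size = S, replace = TRUE, prob = as.numeric(w_stack))
yrep_stacked <- do.call(rbind, lapply(seq_len(S), function(s) yrep[[pick[s]]][s, ]))
# Step 4 (the KLD comparison) is sketched further below.
```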
Table 5 presents the results for Bayesian stacking. We find that Model 2, which includes predictors related to attitudes and behaviors directed toward reading, has the highest weight regardless of how the weights were calculated. Pseudo-BMA and pseudo-BMA+ place almost all of the weight on Model 2, whereas the stacking weights based on the log predictive score are somewhat more spread out, with Model 3 having the next highest weight. We also find that Model 2 has the lowest LOOIC value.
Table 5. Log-score stacking, pseudo-BMA, and pseudo-BMA+ weights, along with LOOIC and Kullback–Leibler divergence (KLD).

| | Stacking | Pseudo-BMA | Pseudo-BMA+ | LOOIC |
|---|---|---|---|---|
| Model 1 | 0.001 | 0.000 | 0.000 | 48277.66 |
| Model 2 | 0.594 | 1.000 | 0.995 | 47644.65 |
| Model 3 | 0.405 | 0.000 | 0.005 | 47860.30 |
| Model 4 | 0.000 | 0.000 | 0.000 | 48757.57 |
| KLD | 0.016 | 0.018 | 0.018 | |
The bottom row of Table 5 presents the KLD measures obtained from comparing the distribution of predicted reading scores to the observed reading scores for each method of obtaining the weights. Here we find that the lowest KLD value is obtained under the log-score stacking weights, so that stacking with log-score weights provides the best predictive performance overall. It may be interesting to note that the KLD values for the BMA results in Table 4 are uniformly lower than the KLD values in Table 5, although it needs to be reiterated that BMA assumes an $\mathcal{M}$-closed framework.
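The KLD comparisons in the bottom row of Table 5 can be approximated, for example, by comparing kernel density estimates of the observed and predicted reading scores on a common grid. The sketch below illustrates that idea using the yrep_stacked draws from the previous sketch; it is not necessarily the exact computation used to produce the table.

```r
# Rough sketch: KL divergence D(p || q) between the observed score density p
# and the model-averaged predictive density q, via kernel density estimates
kld_hat <- function(obs, pred, n_grid = 512) {
  rng <- range(c(obs, pred))
  p   <- density(obs,  from = rng[1], to = rng[2], n = n_grid)
  q   <- density(pred, from = rng[1], to = rng[2], n = n_grid)
  dx  <- p$x[2] - p$x[1]
  eps <- 1e-10  # guard against taking the log of zero
  sum(p$y * (log(p$y + eps) - log(q$y + eps))) * dx
}

# e.g., comparing observed scores to the stacked posterior predictive draws
kld_hat(pisa2018$readscore, as.numeric(yrep_stacked))
```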
Although the orientation of this paper was focused on Bayesian methods for quantifying model uncertainty, it should be pointed out that issues of model uncertainty and model averaging have also been addressed within the frequentist domain. The topic of frequentist model averaging (FMA) has been covered extensively in Hjort and Claeskens (2003), Claeskens and Hjort (2008), and Fletcher (2018). Our focus on Bayesian model averaging is based on some important advantages over FMA. As noted by Steel (2020), (a) BMA is optimal (under $\mathcal{M}$-closed) in terms of prediction as measured by the log predictive density score; (b) BMA is easier to implement in situations where the model space is large, owing to very fast algorithms such as MC$^3$; (c) BMA naturally leads to substantively valuable interpretations of posterior model probabilities and posterior inclusion probabilities; and (d) in the majority of content domains wherein model averaging is required, BMA is more frequently used than FMA.
As the history of model uncertainty quantification illustrates, the problem has received considerable attention in statistical theory and practice, particularly in the fields of economics and weather forecasting. However, relatively less attention has been paid to the problem of model uncertainty in psychometrics. Given the pervasive nature of model uncertainty and the consequences of ignoring it for predictive accuracy, certain challenges and opportunities for research present themselves. The first challenge relates to shifting our research orientation away from developing models for explanation and toward developing models for prediction. Of course, these orientations are not mutually exclusive, but we argue that prediction should take precedence over explanation simply because it is hard to warrant explanations of psychological phenomena derived from a psychometric model if the model has difficulty predicting what has actually occurred (see Dawid 1982). Posterior predictive checking (Gelman et al. 1996) in the context of model selection or model averaging should become standard practice across the social and behavioral sciences. The second challenge involves cataloging uniquely psychometric methods, such as item response theory and factor analysis, and then identifying the elements of model uncertainty that might arise when such methods are applied to real problems. For example, an interesting practical question concerns the extent of model uncertainty in the estimation of plausible values developed for large-scale assessments such as NAEP and PISA (Mislevy 1991; Mislevy et al. 1992). The estimation of plausible values is essentially a missing data problem, and although Kaplan and Yavuz (2019) showed how BMA could be incorporated into the multiple imputation setting, they did not extend their method to the full machinery of plausible value methodology. Indeed, the problem of model uncertainty as it pertains to the estimation of latent variables generally, whether through IRT or factor analysis, would be very interesting to explore and is obviously quite relevant to psychometric theory and practice. Promising research on this topic has begun with Rights et al. (2018), but much more work remains.
In conclusion, this review highlighted the ubiquitous problem of model uncertainty and the availability of Bayesian methods to address it. Given that ignoring model uncertainty raises the risk of “... over-confident inferences and decisions that are more risky than one thinks they are” (Hoeting et al. 1999, pg. 382), future research should continue to be directed to the problem of model uncertainty in the social and behavioral sciences.
The author would like to thank Mingya Huang for valuable contributions to the empirical examples in this paper, and Bertrand Clarke, David Draper, Jonah Gabry, Aki Vehtari, and Yuling Yao for valuable discussions on the problem of model uncertainty. Any errors of commission or omission in this paper are solely the responsibility of the author.
1 Scoring rules should be distinguished from scoring functions. The former describe rules for predictive distributions, while the latter describe rules for point predictions and include variants of squared error.
2 This percentage is obtained by summing over the PMPs for all models explored by the algorithm and dividing by the total number of those models.
3 The probabilities listed under the Cond. Pos. Sign results will often range from zero to one, but for these results, it appears that all 471 models show clarity with respect to the sign of the posterior coefficients.
4 The widely applicable information criterion (WAIC) has also been advocated for model selection. Although the WAIC and LOOCV are asymptotically equivalent (Watanabe 2010), the implementation of LOOCV in the loo package is more robust in finite samples with weak priors or influential observations (Vehtari et al. 2017).