Bayesian Measures of Model Complexity and Fit | Journal of the Royal Statistical Society Series B: Statistical Methodology | Oxford Academic

Abstract

We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. Using an information theoretic argument we derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. In general pD approximately corresponds to the trace of the product of Fisher’s information and the posterior covariance, which in normal models is the trace of the ‘hat’ matrix projecting observations onto fitted values. Its properties in exponential families are explored. The posterior mean deviance is suggested as a Bayesian measure of fit or adequacy, and the contributions of individual observations to the fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. Adding pD to the posterior mean deviance gives a deviance information criterion for comparing models, which is related to other information criteria and has an approximate decision theoretic justification. The procedure is illustrated in some examples, and comparisons are drawn with alternative Bayesian and classical proposals. Throughout it is emphasized that the quantities required are trivial to compute in a Markov chain Monte Carlo analysis.


1. Introduction

The development of Markov chain Monte Carlo (MCMC) methods has made it possible to fit increasingly large classes of models with the aim of exploring real world complexities of data (Gilks et al., 1996). This ability naturally leads us to wish to compare alternative model formulations with the aim of identifying a class of succinct models which appear to describe the information in the data adequately: for example, we might ask whether we need to incorporate a random effect to allow for overdispersion, what distributional forms to assume for responses and random effects, and so on.

Within the classical modelling framework, model comparison generally takes place by defining a measure of fit, typically a deviance statistic, and complexity, the number of free parameters in the model. Since increasing complexity is accompanied by a better fit, models are compared by trading off these two quantities and, following early work of Akaike (1973), proposals are often formally based on minimizing a measure of expected loss on a future replicate data set: see, for example, Efron (1986), Ripley (1996) and Burnham and Anderson (1998). A model comparison using the Bayesian information criterion also requires the specification of the number of parameters in each model (Kass and Raftery, 1995), but in complex hierarchical models parameters may outnumber observations and these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the smoothing and neural network literature (Wahba, 1990; Moody, 1992; MacKay, 1995; Ripley, 1996). This paper suggests Bayesian measures of complexity and fit that can be combined to compare models of arbitrary structure.

In the next section we use an information theoretic argument to motivate a complexity measure pD for the effective number of parameters in a model, as the difference between the posterior mean of the deviance and the deviance at the posterior estimates of the parameters of interest. This quantity can be trivially obtained from an MCMC analysis and algebraic forms and approximations are unnecessary for its use. We nevertheless investigate some of its formal properties in the following three sections: Section 3 shows that pD is approximately the trace of the product of Fisher’s information and the posterior covariance matrix, whereas in Section 4 we show that for normal models pD corresponds to the trace of the ‘hat’ matrix projecting observations onto fitted values and we illustrate its form for various hierarchical models. Its properties in exponential families are explored in Section 5.

The posterior mean deviance D¯ can be taken as a Bayesian measure of fit or ‘adequacy’, and Section 6 shows how in exponential family models an observation’s contributions to D¯ and pD can be used as residual and leverage diagnostics respectively. In Section 7 we tentatively suggest that the adequacy D¯ and complexity pD may be added to form a deviance information criterion DIC which may be used for comparing models. We describe how this parallels the development of non-Bayesian information criteria and provide a somewhat heuristic decision theoretic justification. In Section 8 we illustrate the use of this technique on some reasonably complex examples. Finally, Section 9 draws some conclusions concerning these proposed techniques.

2. The complexity of a Bayesian model

2.1. ‘Focused’ full probability models

Parametric statistical modelling of data y involves the specification of a probability model p(y|θ), θ ∈ Θ. For a Bayesian ‘full’ probability model, we also specify a prior distribution p(θ) which may give rise to a marginal distribution

Particular choices of p(y|θ) and p(θ) will be termed a model ‘focused’ on Θ. Note that we might further parameterize our prior with unknown ‘hyperparameters’ψ to create a hierarchical model, so that the full probability model factorizes as

p(y,θ,ψ)=p(y,θ)p(θ∣ψ)p(ψ).

Then, depending on the parameters in focus, the model may compose the likelihood p(y|θ) and prior

or the likelihood

and prior p(ψ). Both these models lead to the same marginal distribution (1) but can be considered as having different numbers of parameters. A consequence is that in hierarchical modelling we cannot uniquely define a ‘likelihood’ or ‘model complexity’ without specifying the level of the hierarchy that is the focus of the modelling exercise (Gelfand and Trevisani, 2002). In fact, by focusing our models on a particular set of parameters Θ, we essentially reduce all models to non-hierarchical structures.

For example, consider an unbalanced random-effects one-way analysis of variance (ANOVA) focused on the group means:

yi∣θi∼N(θi,τi−1), θi∼N(ψ,λ−1), i=1,…,p.

(2)

This model could also be focused on the overall mean ψ to give

in which case it could reasonably be considered as having a different complexity.

It is natural to wish to measure the complexity of a focused model, both in its own right, say to assess the degrees of freedom of estimators, and as a contribution to model choice: for example, criteria such as BIC (Schwarz, 1978), AIC (Akaike, 1973), TIC (Takeuchi, 1976) and NIC (Murata et al., 1994) all trade off model fit against a measure of the effective number of parameters in the model. However, the foregoing discussion suggests that such measures of complexity may not be unique and will depend on the number of parameters in focus. Furthermore, the inclusion of a prior distribution induces a dependence between parameters that is likely to reduce the effective dimensionality, although the degree of reduction may depend on the data that are available. Heuristically, complexity reflects the ‘difficulty in estimation’ and hence it seems reasonable that a measure of complexity may depend on both the prior information concerning the parameters in focus and the specific data that are observed.

2.2. Is there a true model?

We follow Box (1976) in believing that ‘all models are wrong, but some are useful’. However, it can be useful to posit a ‘true’ distribution pt(Y) of unobserved future data Y since, for any focused model, this defines a ‘pseudotrue’ parameter value θt (Sawa, 1978) which specifies a likelihood p(Y|θt) that minimizes the Kullback–Leibler distance Et[ log {pt(Y)}/p(Y|θt)] from pt(Y). Having observed data y, under reasonably broad conditions (Berk, 1966; Bunke and Milhaud, 1998) p(θ|y) converges to θt as information on the components of θ increases. Thus Bayesian analysis implicitly relies on p(Y|θt) being a reasonable approximation to pt(Y), and we shall indicate where we make use of this ‘good model’ assumption.

2.3. True and estimated residual information

The residual information in data y conditional on θ may be defined (up to a multiplicative constant) as −2 log {p(y|θ)} (Kullback and Leibler, 1951; Burnham and Anderson, 1998) and can be interpreted as a measure of ‘surprise’ (Good, 1956), logarithmic penalty (Bernardo, 1979) or uncertainty. Suppose that we have an estimator θ˜(y) of the pseudotrue parameter θt. Then the excess of the true over the estimated residual information will be denoted

dΘ{y,θt,θ˜(y)}=−2log{p(y∣θt)}+2log[p{y∣θ˜(y)}].

(3)

This can be thought of as the reduction in surprise or uncertainty due to estimation, or alternatively the degree of ‘overfitting’ due to θ˜(y) adapting to the data y. We now argue that dΘ may form the basis for both classical and Bayesian measures of model dimensionality, with each approach differing in how it deals with the unknown true parameters in dΘ.

2.4. Classical measures of model dimensionality

In a non-Bayesian likelihood-based context, we may take θ˜(y) to be the maximum likelihood estimator θ^(y)⁠, expand 2 log {p(y|θt)} around 2log[p{y∣θ^(y)}]⁠, take expectations with respect to the unknown true sampling distribution pt(Y) and hence show (Ripley, 1996) (page 34) that

Et[dΘ{Y,θt,θ˜(Y)}]≈p∗=tr(KJ−1),

(4)

where

J=−Et[∂2log{p(Y∣θt)}∂θ2],K=vart[∂log{p(Y∣θt)}∂θ].

(5)

This is the measure of complexity that is used in TIC (Takeuchi, 1976). Burnham and Anderson (1998) (page 244) pointed out that

where Σ=J−1K__J−1 is the familiar ‘sandwich’ approximation to the variance–covariance matrix of the θ^(y) (Huber, 1967). If pt(y)=p(y|θt), i.e. one of the models is true, then K=J and p*=p, the number of independent parameters in Θ.

For example, in a fixed effect ANOVA model

yi∣θi∼N(θi,τi−1), i=1,…,p,

with τi−1s known,

dΘ{y,θt,θ^(y)}=∑iτi(yi−θit)2,

whose expectation under pt(Y) is p=Σi  τi  Et(Yi−_θ_t)2. If the model is true, Et(Yi−θt)2=τi−1 and so p=p.

Ripley (1996) (page 140) showed how this procedure may be extended to ‘regularized’ models in which a specified prior term p ( θ ) is introduced to form a penalized log-likelihood. Replacing log ( p ) by log { p ( y | θ )}+ log { p ( θ )} in equations (5) yields a more general definition of p that was derived by Moody (1992) and termed the ‘effective number of parameters’. This is the measure of dimensionality that is used in NIC ( Murata et al., 1994 ): the estimation of p is generally not straightforward ( Ripley, 1996 ).

In the random-effects ANOVA example with θi ∼ N(ψ,λ−1), ψ and λ known, let ρi =τi/(τi+λ) be the intraclass correlation coefficient in the _i_th group. We then obtain

which becomes

if the likelihood is true.

2.5. A Bayesian measure of model complexity

From a Bayesian perspective, the unknown θt may be replaced by a random variable θ. Then dΘ{y,θ,θ˜(y)} can be estimated by its posterior expectation with respect to p(θ|y), denoted

pD{y,Θ,θ˜(y)}=Eθ∣y[dΘ{y,θ,θ˜(y)}]=Eθ∣y[−2log{p(y∣θ)}]+2log[p{y∣θ˜(y)}].

(9)

pD{y,Θ,θ˜(y)} is our proposal as the effective number of parameters with respect to a model with focus Θ: we shall usually drop the arguments pD{y,Θ,θ˜(y)} from the notation. In our examples we shall generally take θ˜(y)=E(θ∣y)=θ¯⁠, the posterior mean of the parameters. However, we note that it is not strictly necessary to use the posterior mean as an estimator of either dΘ or θ, and the mode or median could be justified (Section 2.6).

Taking f(y) to be some fully specified standardizing term that is a function of the data alone, pD may be written as

where

D(θ)=−2log{p(y∣θ)}+2log{f(y)}.

We shall term D(θ) the ‘Bayesian deviance’ in general and, more specifically, for members of the exponential family with E(Y)=μ(θ) we shall use the saturated deviance D(θ) obtained by setting f(y)=p{y|μ(θ)=y}: see Section 8.1.

Equation (10) shows that pD can be considered as a ‘mean deviance minus the deviance of the means’. A referee has pointed out the related argument used by Meng and Rubin (1992), who showed that such a difference, between the average of log-likelihood ratios and the likelihood ratio evaluated at the average (over multiple imputations) of the parameters, is the key quantity in estimating the degrees of freedom of a test.

For example, in the random-effects ANOVA (2) with ψ and Ν known,

which is −2 log (likelihood) standardized by the term −2 log {f(y)}=Σi log (2_π_/τi) obtained from setting θi=yi. Now θi∣y~N{ρiyi+(1−ρi)ψ,ρiτi−1} and hence it can be shown that the posterior distribution of D(θ) has the form

D(θ)∼∑ρiχ2{1,(yi−ψ)2(1−ρi)λ},

where χ2(a,b) is a non-central χ2-distribution with mean a+b. Thus, since ρiλ=(1−_ρ_i)τi, we have

D(θ)¯=∑ρi+∑τi(1−ρi)2(yi−ψ)2,D(θ¯)=∑τi(1−ρi)2(yi−ψ)2,

and so

The effective number of parameters is therefore the sum of the intraclass correlation coefficients, which essentially measures the sum of the ratios of the precision in the likelihood to the precision in the posterior. This exactly matches Moody’s approach (8) when the model is true.

If ψ is unknown and given a uniform hyperprior we obtain a posterior distribution ψ~ N{y¯,(λΣρi)−1}⁠, where y¯=Σρiyi/Σρi⁠. It is straightforward to show that

D(θ)¯=∑ρi+λ∑ρi(1−ρi)(yi−y¯)2+∑ρi(1−ρi)/∑ρi,D(θ¯)=λ∑ρi(1−ρi)(yi−y¯)2,

and so pD=Σ ρi+Σ ρi(1−_ρ_i)/Σ ρi. If the groups are independent, λ=0,ρi=1 and pD=p. If the groups all have the same mean, _λ_→ ∞ , ρi→0 and pD→1. If all group precisions are equal, pD=1+(_p_−1)ρ, as obtained by Hodges and Sargent (2001).

2.6. Some observations on pD

  • (a)

    Equation (10) may be rewritten as

  • which can be interpreted as a classical ‘plug-in’ measure of fit plus a measure of complexity. Thus our Bayesian measure of fit, D(θ)¯⁠, could perhaps be better considered as a measure of ‘adequacy’, and we shall use these terms interchangeably. However, in Section 7.3 we shall suggest that an additional penalty for complexity may be reasonable when making model comparisons.

  • (b)

    Simple use of the Bayes theorem reveals the expression

pD=Eθ∣y[−2log{p(θ∣y)p(θ)}]+2log{p(θ˜∣y)p(θ˜)},

  • which can be interpreted as (minus twice) the posterior estimate of the gain in information provided by the data about θ, minus the plug-in estimate of the gain in information.

  • (c)

    It is reasonable that the effective number of parameters in a model might depend on the data, the choice of focus Θ and the prior information (Section 2.1). Less attractive, perhaps, is that pD may also depend on the choice of estimator θ˜(y) , since this can produce a lack of invariance of pD to apparently innocuous transformations, such as making inferences on logits instead of probabilities in Bernoulli trials. Our usual choice of the posterior mean is largely based on the subsequent ability to investigate approximate forms for pD (Section 3), and the positivity properties described below. A choice of, say, posterior medians would produce a measure of model complexity that was invariant to univariate 1–1 transformations, and we explore this possibility in Section 5.

  • (d)

    It follows from equation (10) and Jensen’s inequality that, when using the posterior mean as an estimator θ˜(y),pD⩾ 0 for any likelihood that is log-concave in θ , with 0 being approached for a degenerate prior on θ . Non-log-concave likelihoods can, however, give rise to a negative pD in certain circumstances. For example, consider a single observation from a Cauchy distribution with deviance D ( θ )=2 log {1+( y − θ ) 2 }, with a discrete prior assigning probability 1/11 to θ =0 and 10/11 to θ =3. If we observe y =0, then the posterior probabilities are changed to 0.5 and 0.5, and so θ¯=1.5 . Thus pD=D(θ)¯−D(θ¯)=log(10)−2log(13/4)=log(160/169)<0 . Our experience has been that negative pD s indicate substantial conflict between the prior and data, or where the posterior mean is a poor estimator (such as a symmetric bimodal distribution).

  • (e)

    The posterior distribution that is used in obtaining pD conditions on the truth of the model, and hence pD may only be considered an appropriate measure of the complexity of a model that reasonably describes the data. This is reflected in the finding that pD in the simple ANOVA example (11) will not necessarily be approximately equivalent to the classical p* (7) if the assumptions of the model are substantially inaccurate. This good model assumption (Section 2.2) is further considered when we come to comparisons of models (Section 7.3).

  • (f)

    Provided that D ( θ ) is available in closed form, pD may be easily calculated after an MCMC run by taking the sample mean of the simulated values of D ( θ ), minus the plug-in estimate of the deviance using the sample means of the simulated values of θ . No ‘small sample’ adjustment is necessary. This ease of computation should be contrasted with the frequent difficulty within the classical framework with deriving the functional form of the measure of dimensionality and its subsequent estimation.

  • (g)

    Since the complexity depends on the focus, a decision must be made whether nuisance parameters, e.g. variances, are to be included in Θ or integrated out before specifying the model p ( y | θ ). However, such a removal of nuisance parameters may create computational difficulties.

pD has been defined and is trivially computable by using MCMC methods, and so strictly speaking there is no need to explore exact forms or approximations. However, to provide insight into the behaviour of pD , the following three sections consider the form of pD in different situations and draw parallels with alternative suggestions: note that we are primarily concerned with the ‘preasymptotic’ situation in which prior opinion is still influential and the likelihood has not overwhelmed the prior.

3. Forms for pD based on normal approximations

In Section 2.1 we argued that focused models are essentially non-hierarchical with a likelihood p(y|θ) and prior p(θ). Before considering particular assumptions for these we examine the form of pD under two general conditions: approximately normal likelihoods and negligible prior information.

3.1. pD assuming a normal approximation to the likelihood

We may expand D(θ) around Eθ∣y(θ)=θ¯ to give, to second order,

D(θ)≈D(θ¯)+(θ−θ¯)T∂D∂θ|θ¯+12(θ−θ¯)T∂2D∂θ2|θ¯(θ−θ¯),

(13)

=D(θ¯)−2(θ−θ¯)TLθ¯′−(θ−θ¯)TLθ¯′′(θ−θ¯)

(14)

where L= log {p(y|θ)} and L′ and L′′ represent first and second derivatives with respect to θ. This corresponds to a normal approximation to the likelihood.

Taking expectations of equation (14) with respect to the posterior distribution of θ gives

Eθ∣y{D(θ)}≈D(θ¯)−E[tr{(θ−θ¯)TLθ¯′′(θ−θ¯)}]=D(θ¯)−E[tr{Lθ¯′′(θ−θ¯)(θ−θ¯)T}]=D(θ¯)−tr[Lθ¯′′E{(θ−θ¯)(θ−θ¯)T}]=D(θ¯)+tr(−Lθ¯′′V)

where V=E{(θ−θ¯)(θ−θ¯)T} is the posterior covariance matrix of θ, and −Lθ¯′′ is the observed Fisher information evaluated at the posterior mean of θ. Thus

which can be thought of as a measure of the ratio of the information in the likelihood about the parameters as a fraction of the total information in the likelihood and the prior. We note the parallel with the classical p* in equation (6).

We also note that

where Q′′=∂2 log {p(θ|y)}/∂_θ_2 and P′′=∂2 log {p(θ)}/∂_θ_2, and hence approximation (15) can be written

pD≈tr(−Qθ¯′′V)−tr(−Pθ¯′′V).

Under approximate posterior normality V−1≈−Qθ¯′′ and hence

where p is the cardinality of Θ.

3.2. pD for approximately normal likelihoods and negligible prior information

Consider a focused model in which p(θ) is assumed to be dominated by the likelihood, either because of assuming a ‘flat’ prior or by increasing the sample size. Assume that the approximation

holds, where θ¯=θ^ are the maximum likelihood estimates such that Lθ^′=0 (Bernardo and Smith (1994), section 5.3). From equation (14)

D(θ)≈D(θ^)−(θ−θ^)TLθ^′′(θ−θ^)≈D(θ^)+χp2,

(18)

since, by approximation (17), −(θ−θ^)TLθ^′′(θ−θ^) has an approximate χ2-distribution with p degrees of freedom.

Rearranging approximation (18) and taking expectations with respect to the posterior distribution of θ reveals that

i.e. pD will be approximately the true number of parameters: this approximation could also be derived by letting Pθ¯′′→0 in approximation (16). This approximate identity is illustrated in Section 8.1.

We note in passing that we might use MCMC output to estimate the classical deviance D(θ^) of any likelihood-based model by

Although the maximum likelihood deviance is theoretically the minimum of D over all feasible values of θ⁠, D(θ^) will generally be very badly estimated by the sample minimum over an MCMC run, and so the estimator given by equation (19) may be preferable.

4. pD for normal likelihoods

In this section we illustrate the formal behaviour of pD for normal likelihoods by using exact and approximate identities. However, it is important to keep in mind that in practice such forms are unnecessary for computation and that pD should automatically allow for fixed effects, random effects and unknown precisions.

4.1. The normal linear model

We consider the general hierarchical normal model described by Lindley and Smith (1972). Suppose that

where all matrices and vectors are of appropriate dimension, and C1 and C2 are assumed known and θ is the focus: unknown precisions are considered in Section 4.5. Then the standardized deviance is D(θ)=(y−A1θ)TC1−1(y−A1θ) and the posterior distribution for θ is normal with mean θ¯=Vb and covariance V: V and b will be left unspecified for the moment. Expressing y_−_A1θ as y−A1θ¯+A1θ¯−A1θ reveals that

D(θ)=D(θ¯)−2(y−A1θ¯)TC1−1A1(θ−θ¯)+(θ−θ¯)TA1TC1−1A1(θ−θ¯).

Taking expectations with respect to the posterior distribution of θ eliminates the middle term and gives

and thus pD=tr(A1TC1−1A1V) We note that A1TC1−1A1 is the Fisher information −_L_′′,V is the posterior covariance matrix and hence

an exact version of approximation (15). It is also clear that in this context pD is invariant to affine transformations of θ.

If ψ is assumed known, then Lindley and Smith (1972) showed that V−1=A1TC1−1A1+C2−1 and hence from equation (21)

as an exact version of approximation (16); then 0⩽pD⩽p⁠, and p_−_pD is the measure of the ‘shrinkage’ of the posterior estimates towards the prior means. If (C2−1V)−1=A1TC1−1A1C2+Ip has eigenvalues λi+1,i=1,…,p, then

and hence the upper bound for pD is approached as the eigenvalues of C2 become large, i.e. the prior becomes flat. It can further be shown, in the case A1=In, that pD is the sum of the squared canonical correlations between data Y and the ‘signal’θ.

4.2. The ‘hat’ matrix and leverages

A revealing identity is found by noting that b=A1TC1−1y and the fitted values for the data are given by y^=A1θ¯=A1Vb=A1VA1TC1−1y⁠. Thus the hat matrix that projects the data onto the fitted values is H=A1VA1TC1−1⁠, and

pD=tr(A1TC1−1A1V)=tr(A1VA1TC1−1)=tr(H).

(24)

This identity also holds assuming that ψ is unknown with a uniform prior, in which case Lindley and Smith (1972) showed that V−1=A1TC1−1A1+C2−1−C2−1A2(A2TC2−1A2)−1A2TC2−1⁠.

The identification of the effective number of parameters with the trace of the hat matrix is a standard result in linear modelling and has been applied to smoothing (Wahba, 1990) (page 63) and generalized additive models (Hastie and Tibshirani (1990), section 3.5), and is also the conclusion of Hodges and Sargent (2001) in the context of general linear models. The advantage of using the deviance formulation for specifying pD is that all matrix manipulation and asymptotic approximation is avoided: see Section 4.4 for further discussion. Note that tr(H) is the sum of terms which in regression diagnostics are identified as the individual leverages, the influence of each observation on its fitted value: we shall return to this identity in Section 6.3.

Ye (1998) considered the independent normal model

and suggested that the effective number of parameters should be Σ i  hi , where

the average sensitivity of an unspecified estimate θ˜i to a small change in yi . This is a generalization of the trace of the hat matrix discussed above. In the context of the normal linear models, it is straightforward to show that EY∣θ(θ¯)=Hθ , and hence pD =tr( H ) matches Ye’s suggestion for model complexity. Further connections with Ye (1998) are described in Section 7.2.

4.3. Example: Laird–Ware mixed models

Laird and Ware (1982) specified the mixed normal model as

where the covariance matrices C1 and D are currently assumed known. The random effects are β , and the fixed effects are α , and placing a uniform prior on α we can write this model within the general Lindley–Smith formulation (20) by setting θ =( α , β ), A1 =( X , Z ), ψ =0 and C2 as a block diagonal matrix with ∞ in the top left-hand block, D in the bottom right and 0 elsewhere.

We have already shown that in these circumstances pD=tr{A1TC1−1A1(A1TC1−1A1+C2−1)−1}⁠, and substituting in the appropriate entries for the Laird–Ware model gives pD=tr(V*V−1), where

V∗=(XTC1−1XXTC1−1ZZTC1−1XZTC1−1Z),V=(XTC1−1XXTC1−1ZZTC1−1XZTC1−1Z+D−1)

which is the precision of the parameter estimates assuming that D−1=0, relative to the precision assuming informative D.

4.4. Frequentist approaches to model complexity: smoothing and normal non-linear models

A common model in semiparametric regression is

y∼N(Xα+β,τ−1C1),β∼N(0,λ−1D),

where β is a vector of length n of function values of the nonparametric part of an interpolation spline (Wahba, 1990; van der Linde, 1995) and C1 and D are assumed known. Motivated by the need to estimate the unknown scale factors τ−1 and λ−1, for many years the effective number of parameters has been taken to be the trace of the hat matrix (Wahba (1990), page 63) and so, for example, τ^−1 is the residual sum of squares divided by the ‘effective degrees of freedom’_n_−tr(H). In this class of models this measure of complexity coincides with pD. Interest in regression diagnostics (Eubank, 1985; Eubank and Gunst, 1986) and cross-validation to determine the smoothing parameter τ/λ (Wahba (1990), section 4.2) also drew attention to the diagonal entries of the hat matrix as leverage values.

Links to partially Bayesian interpolation models have been provided by Kimeldorf and Wahba (1970) and Wahba (1978, 1983) and further work built on these ideas. For example, another large class of models can be formulated by using the following extension to the Lindley–Smith model:

y∼N{g(θ),τ−1C1},θ∼N(A2ψ,λ−1D)

where g is a non-linear expression as found, for example, in pharmacokinetics or neural networks: in many situations A2ψ will be 0 and C1 and D will be identity matrices. Define

q(θ)=(y−g(θ))TC1−1(y−g(θ)),r(θ)=(θ−A2ψ)TD−1(θ−A2ψ)

as the likelihood and prior residual variation. MacKay (1992) suggested estimating τ and λ by maximizing the ‘type II’ likelihood p(y|λ,τ) derived from integrating out the unknown θ from the likelihood. Setting derivatives equal to 0 eventually reveals that

τ^−1=q(θ¯)n−pD,λ^−1=r(θ¯)pD,

which are the fitted likelihood and prior residual variation, divided by the appropriate effective degrees of freedom: pD=tr(H) is the key quantity.

These results were derived by MacKay (1992) in the context of ‘regularization’ in complex interpolation models such as neural networks, in which the parameters θ are standardized and assumed to have independent normal priors with mean 0 and precision λ. Then expression (16) may be written

However, MacKay’s use of approximation (26) requires the evaluation of tr(V), whereas our pD arises without any additional computation. We would also recommend including λ and τ in the general MCMC estimation procedure, rather than relying on type II maximum likelihood estimates (Ripley (1996), page 167). In this and the smoothing context a fully Bayesian analysis requires prior distributions for τ−1 and λ−1 to be specified (van der Linde, 2000), and this will both change the complexity of the model and require a choice of estimator of the precisions. We shall now illustrate the form of pD in the restricted situation of unknown τ−1.

4.5. Normal models with unknown sampling precision

Introducing unknown variances as part of the focus confronts us with the need to choose a form for the plug-in posterior estimates. We may illustrate this issue by extending the general hierarchical normal model (20) to the conjugate normal–gamma model with an unknown scale parameter τ in both the likelihood and the prior (Bernardo and Smith (1994), section 5.2.1). Suppose that

y∼N(A1θ,τ−1C1),θ∼N(A2ψ,τ−1C2),

(27)

and we focus on (θ,τ). The standardized deviance is D(θ,τ)=τ  q(θ)−_n_ log (τ), where

is the residual variation. Then, for a currently unspecified estimator τ^⁠,

pD=Eθ,τ∣y(D∣θ,τ)−D(θ¯,τ^)=Eτ∣y[Eθ∣τ,y{τq(θ)}−nlog(τ)]−{τ^q(θ¯)−nlog(τ^)}=tr(H)+q(θ¯)(τ¯−τ^)−n{log(τ)¯−log(τ^)}

(28)

where H=A1TC1−1A1(A1TC1−1A1+C2−1)−1 is the hat matrix which does not depend on τ. Thus the additional uncertain scale parameter adds the second two terms to the complexity of the model.

A conjugate prior _τ_∼gamma(a,b) leads to a posterior distribution τ|_y_∼gamma(a+n/2,b+S/2), where

S=(y−A1A2ψ)T(C1+A1TC2A1)−1(y−A1A2ψ).

It remains to choose the estimator τ^ to place in equation (28), and we shall consider two options.

Suppose that we parameterize in terms of τ and use

making the second term in equation (28) 0. Now if X ∼ gamma(a,b), then E{ log (X)}=ψ(a)− log (b) where ψ is the digamma function, and so log(τ)¯=ψ(a+n/2)−log(b+S/2)⁠. Hence the term contributing to pD due to the unknown precision is

pD−tr(H)=−n{ψ(a+n2)−log(a+n2)}≈1−2a−132a+n

using the approximation ψ(x)≈ log (x)−1/2_x_−1/12_x_2. This term will tend to 1+1/3_n_ as prior information becomes negligible and hence will be close to the ‘correct’ value of 1 for moderate sample sizes.

If we were to parameterize in terms of log (τ) and to use τ^=exp{log(τ)¯}⁠, the third term in equation (28) is 0 and the second term can be shown to be 1−_O_(n−1). Thus for reasonable sample sizes the choice of parameterization of the unknown precision will make little difference to the measure of complexity. However, in Section 7 we shall argue that the log-scale may be more appropriate owing to the better approximation to likelihood normality.

5. Exponential family likelihoods

We assume that we have p groups of observations, where each of the ni observations in group i has the same distribution. Following McCullagh and Nelder (1989), we define a one-parameter exponential family for the _j_th observation in the _i_th group as

log{p(yij∣θi,ϕ)}=wi{yijθi−b(θi)}/ϕ+c(yij,ϕ),

(29)

where

μi=E(Yij∣θi,ϕ)=b′(θi),V(Yij∣θi,ϕ)=b″(θi)ϕ/w,i,

and wi is a constant. If the canonical parameterization Θ is the focus of the model, then writing b¯i=Eθi∣y{b(θi)} we easily obtain that the contribution of the _i_th group to the effective number of parameters is

pDiΘ=2niwi{b¯i−b(θ¯i)}/ϕ.

(30)

These likelihoods highlight the issue of the lack of invariance of pD to reparameterization, since the mean parameterization Ο will give a different complexity pDiΟ⁠. This is first explored within simple binomial and Poisson models with conjugate priors, and then exact and approximate forms of pD are examined for generalized linear and generalized linear mixed models.

5.1. Binomial likelihood with conjugate prior

In the notation of equation (29), φ=1,wi=1 and θ=logit(μ)= log {μ/(1−_μ_)}, and the (unstandardized) deviance is

D(μi)=−2yilog(μi)−2(ni−yi)log(1−μi)

where yi=Σjyij. A conjugate prior μi={1+ exp (−_θ_i)}−1∼beta(a,b) provides a posterior μi∼beta(a+yi,b+ni−_y_i) with mean (a+yi)/(a+b+ni). Now, if X_∼beta(a,b), then E{ log (X)}=ψ(a)−_ψ(a+b) and E{ log (1−_X_)}=ψ(b)−_ψ_(a+b) where ψ is the digamma function, and hence it can be shown that

D(μi)¯=D(θi)¯=−2yiψ(a+yi)−2(ni−yi)ψ(b+ni−yi)+2niψ(a+b+ni)D(μi¯)=−2yilog(a+yi)−2(ni−yi)log(b+ni−yi)+2nilog(a+b+ni)D(θi)=−2yiψ(a+yi)+2yiψ(b+ni−yi)+2nilog[1+exp{ψ(a+yi)−ψ(b+ni−yi)}],D(μimed )=D(θimed )=−2yilog(μimed )−2(ni−yi)log(1−μimed )

where Οimed  denotes the posterior median of Οi.

Exact pDis are obtainable by subtraction, and Fig. 1 shows how the value of pDi depends on the parameterization, the data and the prior. We may also gain further insight into the behaviour of pDi by considering approximate formulae for the mean and canonical parameterizations by using ψ(x)≈log(x)−1/2x≈log(x−12)⁠. This leads to

pDiμ≈yia+yi+ni−yib+ni−yi−nia+b+ni,7pDiΘ≈nia+b+ni−12.

(31)

We make the following observations.

Binomial likelihood—contribution of the i th group to the effective number of parameters under various parameterizations (canonical pDiΘ , mean pDiμ and median pDimed  ) as a function of the data (sample size ni and observed proportion yi / ni ) and prior (effective prior sample size a + b and prior mean a /( a + b )): we are seeking agreement between alternative parameterizations with little dependence on data

Fig. 1

Binomial likelihood—contribution of the i th group to the effective number of parameters under various parameterizations (canonical pDiΘ , mean pDiμ and median pDimed  ) as a function of the data (sample size ni and observed proportion yi / ni ) and prior (effective prior sample size a + b and prior mean a /( a + b )): we are seeking agreement between alternative parameterizations with little dependence on data

5.1.1. Behaviour of pD

For all three parameterizations, as the sample size in each group increases relative to the effective prior sample size, its contribution to pDi tends towards 1.

5.1.2. Agreement between parameterizations

The agreement between parameterizations is generally reasonable except in the situations in which the prior sample size is 10 times that of the data. While the canonical parameterization has pDi≈1/11, the mean and median give increased pDi for extreme prior means.

5.1.3. Dependence on data

With the exception of the sparse data and weak prior scenario for which the approximate formulae do not hold, the canonical pDiΘ does not depend on the data observed and is approximately the ratio of the sample size to the effective posterior sample size. When the mean and median forms depend on data (say when ni=1 and a + b=10), pDi is higher in situations of prior–data conflict.

5.2. Poisson likelihood with conjugate prior

In the notation of equation (29), φ=1,wi=1 and θ= log (μ), and the (unstandardized) deviance is D(μi)=−2_y_i log (μi)+2_n_iμi. A conjugate prior μi= exp (θi)∼gamma(a,b) gives a posterior μi∼gamma(a+yi,b+ni) with mean (a+yi)/(b+ni). If _X_∼gamma(a,b), then E{ log (X)}=ψ(a)− log (b) and hence we can show that

D(μi)¯=D(θi)¯=−2yi{ψ(a+yi)−log(b+ni)}+2nia+yib+ni, D(μi¯)=−2yi{log(a+yi)−log(b+ni)}+2nia+yib+ni,D(θ¯i)=−2yi{ψ(a+yi)−log(b+ni)}+2niexp{ψ(a+yi)}b+ni,D(μimed )=D(θimed )=−2yilog(μimed )+2niμimed .

Exact pDis are obtainable by subtraction. Fig. 2 shows how the value of pDi relates to the parameterization, the data and the prior. Using the same approximation as previously, approximate pDis for the mean and canonical parameterizations are

pDiμ≈yi/(a+yi),pDiΘ≈ni/(b+ni).

Poisson likelihood—contribution of the i th group to the effective number of parameters under various parameterizations (canonical pDiΘ , mean pDiμ and median pDimed  ) as a function of the data (sample size ni and observed total yi ) and prior (mean nia / b and ‘sample size’ b )

Fig. 2

Poisson likelihood—contribution of the i th group to the effective number of parameters under various parameterizations (canonical pDiΘ , mean pDiμ and median pDimed  ) as a function of the data (sample size ni and observed total yi ) and prior (mean nia / b and ‘sample size’ b )

5.2.1. Behaviour of pDi

For all three parameterizations, as the sample size in each group increases relative to the effective prior sample size, its contribution to pDi tends towards 1.

5.2.2. Agreement between parameterizations

The agreement between parameterizations is best when there is no conflict between the prior expectation and the data, but it can be substantial when such conflict is extreme. The median estimator leads to a pDi that is intermediate between those derived from the canonical and mean parameterizations.

5.2.3. Dependence on data

Except in the situation of a single yi=0 with weak prior information, the approximation for the canonical pDiΘ is very accurate and so pDiΘ does not depend on the data observed. There can be a substantial dependence for the mean parameterization, with pDiΟ being higher when the prior mean underestimates the data.

5.2.4. Conclusion

In conclusion, for both binomial and Poisson data there is reasonable agreement between the different pDis provided that the model provides a reasonable fit to the data, i.e. there is not strong conflict between the prior and data. The canonical parameterization appears preferable, both for its lack of dependence on the data and for its generally close approximation to the invariant pDi based on a median estimator. Thus we would not normally expect the choice of parameterization to have a strong effect, although in Section 8.3 we present an example of a Bernoulli model where this choice does prove to be important.

Here we shall focus on the canonical parameterization in terms of θi, both for the reasons outlined above and because its likelihood should better fulfil a normal approximation (Slate, 1994): related identities are available for the mean parameterization in terms of Οi=Ο(θi). We emphasize again that the approximate identities that are derived in this and the following section are only for understanding the behaviour of pD in idealized circumstances (i.e. known precision parameters) and are not required for computation in practical situations.

Following McCullagh and Nelder (1989) we assume that the mean Οi of yij is related to a set of covariates xi through a link function g(Οi)=xiTι⁠, and that g is the canonical link θ(Ο). The second-order Taylor series expansion of D(θi) around D(θ¯i) yields an approximate normal distribution for working observations and hence derivations of Section 3 apply. We eventually obtain

where W is diagonal with entries

the generalized linear model iterated weights (McCullagh and Nelder (1989), page 40): φ is assumed known.

Under an N(α0,C2) prior on α, the prior contribution to the negative Hessian matrix at the mode is just C2−1⁠, so under the canonical link the approximate normal posterior has variance

again producing pD as a measure of the ratio of the ‘working’ likelihood to posterior information.

5.4. Generalized linear mixed models

We now consider the class of generalized linear mixed models with canonical link, in which g(Οi)=xiTι+ziTβ⁠, where β_∟_N(0,D) (Breslow and Clayton, 1993) and D is assumed known.

Using the same argument as for generalized linear models (Section 5.3), we find that

pD≈tr[(X,Z)TW(X,Z)V{(α,β)∣y}]≈tr(V∗V−1),

where

V∗=(XTW−1XXTW−1ZZTW−1XZTW−1Z),V=(XTW−1XXTW−1ZZTW−1XZTW−1Z+D−1).

This matches the proposal of Lee and Nelder (1996) except their D−1 is a diagonal matrix of the second derivatives of the prior likelihood for each random effect.

6. Diagnostics for fit and influence

6.1. Posterior expected deviance as a Bayesian measure of fit or ‘adequacy’

The posterior mean of the deviance Eθ∣y{D(θ)}=D(θ)¯ has often been used to compare models informally: see, for example, Dempster (1974) (reprinted as Dempster (1997a)), Raghunathan (1988), Zeger and Karim (1991), Gilks et al. (1993) and Richardson and Green (1997). These researchers have, however, not been explicit about whether, or how much, such a measure might be traded off against increasing complexity of a model: Dempster (1997b) suggested plotting log-likelihoods from MCMC runs but hesitated to dictate a model choice procedure. We shall discuss this further in Section 7.3. In Section 2.6 we argued that D(θ)¯ already incorporates some penalty for complexity and hence we use the term ‘adequacy’ and ‘Bayesian fit’ interchangeably.

6.2. Sampling theory diagnostics for lack of Bayesian fit

Suppose that all aspects of the model were assumed true. Then before observing data Y our expectation of the posterior expected deviance is

EY(D¯)=EY[Eθ∣y{D(θ)}]=Eθ(EY∣θ[−2log{p(Y∣θ)}+2log{f(Y)}])

(32)

by reversing the conditioning between Y and θ. If f(Y)=p{Y∣θ^(Y)} where θ^(Y) is the standard maximum likelihood estimate, then

EY∣θ(−2log[p(Y∣θ)p{Y∣θ^(Y)}])

is simply the expected likelihood ratio statistic for the fitted values θ^(Y) with respect to the true null model θ and hence under standard conditions is approximately E(χp2)=p⁠, the dimensionality of θ. From equation (32) we therefore expect, if the model is true, the posterior expected deviance (standardized by the maximized log-likelihood) to be EY(D¯)≈Eθ(p)=p⁠, the number of free parameters in θ. This might be appropriate for checking the overall goodness of fit of the model.

In particular, consider the one-parameter exponential family where p=n, the total sample size. The likelihood is maximized by substituting yi for the mean of yi, and the posterior mean of the standardized deviance has approximate sampling expectation n if the model is true. This will be exact for normal models with known variance, but in general it will only be reliable if each observation provides considerable information about its mean (McCullagh and Nelder (1989), page 36). Note that comparing D¯ with n is precisely the same as comparing the ‘classical’ fit D(θ¯) with n_−_pD, the effective degrees of freedom.

It is then natural to consider the contribution Di of each observation i to the overall mean deviance, so that

where dri=±√D¯i (with the sign given by the sign of yi−E(yi∣θ¯)⁠) termed the Bayesian deviance residual, defined analogously to McCullagh and Nelder (1989), page 39. See Section 8.1 for an application of this procedure.

6.3. Leverage diagnostics

In Section 4.1 we noted that in normal linear models the contribution pDi of each observation i to pD turned out to be its leverage, defined as the relative influence that each observation has on its own fitted value. For yi conditionally independent given θ, it can be shown that

pDi=−2(Eθ∣y[log{p(θ∣yi)p(θ)}]−log{p(θ¯∣yi)p(θ¯)})

which reflects its interpretation as the difficulty in estimating θ with yi.

It may be possible to exploit this interpretation in general model fitting, and as a by-product of MCMC estimation to obtain estimates of leverage for each observation. Such diagnostics are illustrated in Section 8.1.

7. A model comparison criterion

7.1. Model ‘selection’

There has been a long and continuing debate about whether the issue of selecting a model as a basis for inferences is amenable to a strict mathematical analysis using, for example, a decision theoretic paradigm: see, for example, Key et al. (1999). Our approach here can be considered to be semiformal. Although we believe that it is useful to have measures of fit and complexity, and to combine them into overall criteria that have some theoretical justification, we also feel that an overformal approach to model ‘selection’ is inappropriate since so many other features of a model should be taken into account before using it as a basis for reporting inferences, e.g. the robustness of its conclusions and its inherent plausibility. In addition, in many contexts it may not be appropriate to ‘choose’ a single model. Our development closely follows that of Section 2.

A characteristic that is common to both Bayesian and classical approaches is the concept of an independent replicate data set Yrep, derived from the same data-generating mechanism as gave rise to the observed data. Suppose that the loss in assigning to a set of data Y a probability p(Y∣θ˜) is ℒ(Y,θ˜)⁠. We assume that we shall favour models p(Y∣θ˜) for which ℒ(Y,θ˜) is expected to be small, and thus a criterion can be based on an estimate of EYrep ∣θt{ℒ(Yrep ,θ˜)}⁠.

A natural, but optimistic, estimate of this quantity is the ‘apparent’ loss ℒ{y,θ˜(y)} that is suffered on repredicting the observed y that gave rise to θ˜(y)⁠. We follow Efron (1986) in defining the ‘optimism’ that is associated with this estimator as cΘ, where

EYrep ∣θt[ℒ{Yrep ,θ˜(y)}]=ℒ{y,θ˜(y)}+cΘ{y,θt,θ˜(y)}.

(33)

Both classical and Bayesian approaches to estimating the optimism cΘ will now be examined when assuming a logarithmic loss function ℒ(Y,θ˜)=−2log{p(Y∣θ˜)}⁠: as in Section 2, the classical approach attempts to estimate the sampling expectation of cΘ, whereas the Bayesian approach is based on a direct calculation of the posterior expectation of cΘ.

7.2. Classical criteria for model comparison

From the previous discussion, approximate forms for the expected optimism

π(θt)=EY∣θt[cΘ{Y,θt,θ˜(Y)}]

will, from equation (33), yield criteria for a comparison of models that are based on minimizing

E^Yrep ∣θt[ℒ{Yrep ,θ˜(y)}]=ℒ{y,θ˜(y)}+π^(θt).

(34)

Efron (1986) derived the expression for π(θt) for exponential families and for general loss functions. In particular, for the logarithmic loss function, Efron showed that

where Y^i is the fitted value arising from the estimator θ˜: if θ˜ corresponds to maximum likelihood estimation based on a linear predictor with p parameters, then πE(θt)≈2_p_. Hence Efron’s result can be thought of as generalizing Akaike (1973), who sought to minimize the expected Kullback–Leibler distance between the true and estimated predictive distribution and showed under broad conditions that π(θt)≈2_p_.

This in turn suggests that πE/2, derived from equation (35), may be adopted as a measure of complexity in more complex modelling situations. Ye and Wong (1998) extended the work mentioned in Section 4.2 to show that πE/2 for exponential families can be expressed as a sum of the average sensitivity of the fitted values y^i to a small change in yi: this quantity is termed by Ye and Wong the ‘generalized degrees of freedom’ when using a general estimation procedure. In normal models with linear estimators y^i=θ˜i(y)=Σjhijyj⁠, and so π(θt)=2 tr(H). Finally, Ripley (1996) extended the analysis described in Section 2.4 to show that if the model assumed is not true then π(θt)≈2_p_, where p is defined in equation (4). See Burnham and Anderson (1998) for a full and detailed review of all aspects of estimation of π(θt).

These classical criteria for general model comparison are thus all based on equation (34) and can all be considered as corresponding to a plug-in estimate of fit, plus twice the effective number of parameters in the model. We shall now adapt this structure to a Bayesian context.

7.3. Bayesian criteria for model comparison

Gelfand and Ghosh (1998) and Laud and Ibrahim (1995) both attempted strict decision theoretic approaches to model choice based on expected losses on replicate data sets. Our approach is more informal, in aiming to identify models that best explain the observed data, but with the expectation that they are likely to minimize uncertainty about observations generated in the same way. Thus, by analogy with the classical results described above, we propose a deviance information criterion DIC, defined as a classical estimate of fit, plus twice the effective number of parameters, to give

by definition of pD (10): equation (37) shows that DIC can also be considered as a Bayesian measure of fit or adequacy, penalized by an additional complexity term pD . From the results in Section 3.2, we immediately see that in models with negligible prior information DIC will be approximately equivalent to Akaike’s criterion.

An approximate decision theoretic justification for DIC can be obtained by mimicking the development of Ripley (1996) (page 33) and Burnham and Anderson (1998) (chapter 6). Using the logarithmic loss function in equation (33), we obtain

cΘ{y,θt,θ˜(y)}=EYrep ∣θt{Drep (θ˜)}−D(θ˜)

where −2log[p{Yrep ∣θ˜(y)}] is denoted Drep (θ˜) and so on: note in this section that D is an unstandardized deviance (f(·)=1). It is convenient to expand cΘ into the three terms

cΘ=EYrep ∣θt{Drep (θ˜)−Drep (θt)}+EYrep ∣θt{Drep (θt)−D(θt)}+{D(θt)−D(θ˜)};

(38)

we shall denote the first two terms by ℒ1 and ℒ2 respectively and, since we are taking a Bayesian perspective, replace the true θt by a random quantity θ.

Expanding the first term to second order gives

ℒ1(θ,θ˜)≈EYrep ∣θ{−2(θ˜−θ)TLrep ,θ′−(θ˜−θ)TLrep ,θ′′(θ˜−θ)}

where Lrep,θ= log {p(Yrep|θ)}. Since EYrep ∣θ(Lrep ,θ′)=0 from standard results for score statistics, we obtain after some rearrangement

ℒ1(θ,θ˜)≈tr{Iθ(θ˜−θ)(θ˜−θ)T}

where Iθ=EYrep ∣θ(−Lrep ,θ′′) is the assumed Fisher information in Yrep, and hence also in y. Making the good model assumption (Section 2.2), this might reasonably be approximated by the observed information at the estimated parameters, so

ℒ1(θ,θ˜)≈tr{−Lθ˜′′(θ˜−θ)(θ˜−θ)T}.

(39)

Suppose that under a particular model assumption we obtain a posterior distribution p(θ|y). Then from approximations (38) and (39) our posterior expected optimism when adopting this model and the estimator θ˜ is

Eθ∣y(cΘ)≈tr[−Lθ˜′′Eθ∣y{(θ−θ˜)(θ−θ˜)T}]+Eθ∣y{ℒ2(y,θ)}+Eθ∣y{D(θ)−D(θ˜)}.

Using the posterior mean θ¯ as our estimator makes the expected optimism

Eθ∣y(cΘ)≈tr(−Lθ¯′′V)+Eθ∣y{ℒ2(y,θ)}+pD,

(40)

where V again is defined as the posterior covariance of θ, and pD=D¯−D(θ¯)⁠. Now

ℒ2(y,θ)=EYrep ∣θ[−2log{p(Yrep ∣θ)}]+2log{p(y∣θ)},

and so EY[Eθ|Y{ℒ2(Y,θ)}]=E[EY|θ{ℒ2(Y,θ)}]=0. We have already shown in approximation (15) that pD≈tr(−Lθ¯′′V)⁠, and hence from expressions (33) and (40) the expected posterior loss when adopting a particular model is

D(θ¯)+Eθ∣y(cΘ)≈D(θ¯)+2pD=DIC,

neglecting a term Eθ|y{ℒ2(y,θ)} which is expected to be 0. This derivation has assumed that D is an unstandardized deviance: common standardization across models will leave unchanged the property that differences in DIC are estimates of differences in expected loss in prediction.

We make the following observations concerning this admittedly heuristic justification of DIC. First, for the general normal linear model (20), it is straightforward to show that ℒ2(y,θ)=p−(y−A1θ)TC1−1(y−A1θ) where p is the dimensionality of θ, and hence for true θ has sampling distribution p−χp2 with mean 0 and variance 2_p_. This parallels the classical development in which Ripley (1996) (page 34) pointed out that the equivalent term is O(√_n_): we would hope that this factor will tend to cancel when assessing differences in DIC, but this requires further investigation.

Second, this development draws heavily on the approximations in Section 3 and hence encourages parameterizations in which likelihood normality is more plausible.

Third, we are attempting to evaluate the consequences of assuming a particular model, using an analysis that is based on that very assumption. This use of the good model assumption (Section 2.2) argues for the use of DIC in comparing models that have already been shown to e adequate candidates for explaining the observations.

8. Examples

pD and DIC have already been applied by other researchers in a variety of contexts, such as alternative models for diagnostic probabilities in screening studies ( Erkanli et al., 1999 ), longitudinal binary data using Markov regression models ( Erkanli et al., 2001 ), spline models with Bernoulli responses ( Biller and Fahrmeir, 2001 ), multistage models for treatment usage which combine to form a total DIC ( Gelfand et al., 2000 ), complex spatial models for Poisson counts ( Green and Richardson, 2000 ), pharmacokinetic modelling ( Rahman et al., 1999 ) and structures of Bayesian neural networks ( Vehtari and Lampinen, 1999 ). The following examples illustrate the use of pD and DIC to compare alternative prior and likelihood structures.

8.1. The spatial distribution of lip cancer in Scotland

We consider data on the rates of lip cancer in 56 districts in Scotland (Clayton and Kaldor, 1987; Breslow and Clayton, 1993). The data include observed (yi) and expected (Ei) numbers of cases for each county i (where the expected counts are based on the age- and sex-standardized national rate applied to the population at risk in each county) plus the ‘location’ of each county expressed as a list (𝒜i) of its ni adjacent counties. We assume that the cancer counts within each county yi follow a Poisson distribution with mean exp (θi)Ei where exp (θi) denotes the underlying true area-specific relative risk of lip cancer. We then consider the following set of candidate models for θi, reflecting different assumptions about the between-county variation in (log-) relative risk of lip cancer: model 1,

model 2,

model 3,

model 4,

model 5,

An improper uniform prior is placed on α0, independent (proper) normal priors with large variance are specified for each αi (i=1,…,56), γi are exchangeable random effects with a normal prior distribution having zero mean and precision λ, and δi are spatial random effects with a conditional autoregressive prior (Besag, 1974) given by

δi∣δ\i~normal(1ni∑j∈Aiδj,1niλδ).

A sum-to-zero constraint is imposed on the {δi} for identifiability, and weakly informative gamma(0.5,0.0005) priors are assumed for the random effects precision parameters λ and λ. These five models cover the spectrum between the pooled model 1 that makes no allowance for variation between the true risk ratios in each county and the saturated model 5 that assumes independence between the county-specific risk ratios (essentially yielding the maximum likelihood estimates θ^i=log(yi/Ei)⁠). The random-effects models 2–4 allow the county-specific relative risks to be similar but not identical, with the autoregressive term allowing for the possibility of spatially correlated variation.

We use the saturated deviance (McCullagh and Nelder (1989), page 34)

D(θ)=2∑i[yilog{yi/exp(θi)Ei}−{yi−exp(θi)Ei}]

obtained by taking −2log{f(y)}=−2Σilog{p(yi∣θ^i)} as the standardizing factor (see Section 2.5). This allows calculation of absolute measures of fit (see Section 6.2). For model comparisons, however, it is sufficient to take the standardizing factor as f(y)=1. For each model we ran two independent chains of an MCMC sampler in WinBUGS (Spiegelhalter et al., 2000) for 15 000 iterations each, following a burn-in period of 5000 iterations. As suggested by Dempster (1997b), Fig. 3 shows a kernel density smoothed plot of the resulting posterior distributions of the deviance under each competing model. Apart from revealing the obvious unacceptability of model 1, this clearly illustrates the difficulty of formally comparing posterior deviances on the basis of such plots alone.

Posterior distributions of the deviance for each model considered in the lip cancer example: ——, model 1; ⋯⋯·, model 2; - - - - - - -, model 3; –––, model 4; ——, model 5

Fig. 3

Posterior distributions of the deviance for each model considered in the lip cancer example: ——, model 1; ⋯⋯·, model 2; - - - - - - -, model 3; –––, model 4; ——, model 5

The deviance summaries proposed in this paper are shown for the lip cancer data in Table 1: D¯ is simply the mean of the posterior samples of the saturated deviance; D(Ο¯) is calculated by plugging the posterior mean of Οi= exp (θi)Ei into the saturated deviance; D(θ¯) is calculated by plugging the posterior means of the relevant parameters (ι0, ιi, γi and/or δi) into the linear predictor θi and then evaluating the saturated deviance; D(med) is calculated by plugging the posterior median of θi (or, equivalently, of Οi) into the saturated deviance. The results are remarkably similar for the three alternative parameterizations of the plug-in deviance. For fixed effects models we would expect from Section 3.2 that pD should be approximately the true number of independent parameters. For the pooled model 1, pD=1.0 as expected, whereas, for the saturated model 5, pD ranges from 52.8 to 55.9 depending on the parameterization that is used, which is close to the true value of 56 parameters. The models containing spatial random effects (either with or without additional exchangeable effects) both have around 31 effective parameters, whereas the model with only exchangeable random effects has about 12 additional effective parameters. On the basis of the results of Section 5.2 comparing pD for Poisson likelihoods with different priors, this suggests that the spatial model provides stronger prior information than does the exchangeable model for these data.

Table 1

Deviance summaries for the lip cancer data using three alternative parameterizations (mean, canonical and median) for the plug-in deviance†

ModelD¯D(Ο¯)pDΟDICΟD(θ¯)pDθDICθD ( med )pDmed DICmed
1, pooled381.7380.71.0382.7380.71.0382.7380.71.0382.7
2, exchangeable61.118.242.9104.017.743.4104.517.643.5104.6
3, spatial58.326.631.789.927.131.289.527.231.189.3
4, exchangeable + spatial57.926.131.889.726.531.489.326.631.389.2
5, saturated55.90.055.9111.73.152.8108.61.454.5110.4

†Exchangeable means an exchangeable random effect; spatial is a spatially correlated random effect.

Turning to the comparison of DIC for each model, we first note that DIC is subject to Monte Carlo sampling error, since it is a function of stochastic quantities generated under an MCMC sampling scheme. Whereas computing the precise standard errors for our DIC values is a subject of on-going research, the standard errors for the DÂŻ-values are readily obtained and provide a good indication of the accuracy of DIC and pD. In any case, in several runs using different initial values and random-number seeds for this example, the DIC and pD-estimates obtained never varied by more than 0.5. As such, we are confident that, even allowing for Monte Carlo error, either of models 3 or 4 is superior (in terms of DIC performance) to models 2 or 5, which are in turn superior to model 1. A comparison of DIC for models 3 and 4 suggests that the two spatial models are virtually indistinguishable in terms of the overall fit: pragmatically, we might prefer reporting model 3 since its DIC is only marginally greater than the more complex model 4.

Considering now the absolute measure of fit suggested in Section 6.2, we compare the values of DÂŻ in Table 1 with the sample size n=56. This suggests that all models except the pooled model 1 provide an adequate overall fit to the data, and that the comparison is essentially based on their complexity alone.

Following the discussion in Section 6, Fig. 4 shows a plot of deviance residuals dri against leverages pDi for each of the five models considered. The broken curves marked on each plot are of the form x2+y=c and points lying along such a parabola will each contribute an amount DICi=c to the overall DIC for that model. For models 2–5, parabolas are marked at values of c= 1, 2, 5, and any data point whose contribution DICi is greater than 2 is labelled by its observation number. For model 1, parabolas are marked at c= 1, 10, 50, since the size of the deviance residuals and individual contributions to DIC are much larger and, for clarity, only points for which DICi is greater than 10 are marked by their observation number. Observations 55 and 56, the only districts with yi=0, are clearly identified as potential outliers under each of the random-effects models 2–4, as is observation 1 (the district with the highest observed risk ratio yi/Ei). A few other observations (2, 3, 4, 53 and 54) have contributions DICi that are just larger than 2 under model 2: with the exception of the three districts already discussed, these five districts have the most extreme observed risk ratios and so their estimates tend to be shrunk furthest under the exchangeable model. Observations 14, 15, 45 and 50 appear to be outliers in models 3 and 4 which have a spatial effect, but not in the remaining models. A further investigation reveals that the observed risk ratios in these districts are extreme compared with those in each of their neighbouring districts. For example district 50 has only six cases compared with 19.6 expected, whereas each of its three neighbouring districts have high observed counts (17, 16 and 16) relative to those expected (7.8, 10.5 and 14.4). The spatial prior in models 3 and 4 causes the estimated rate in district 50 to be smoothed towards the mean of its neighbours’ rates, thus leading to the discrepancy between observed and fitted values, and since the observation still exercises considerable weight on its fitted value the leverage is high as well. However, overall we might not consider that there is sufficient evidence to cast doubt on any particular observations.

Diagnostics for the lip cancer example—residuals versus leverages (the parabolas indicate contributions of 1, 2 or 5 to the total DIC (apart from model 1): (a) model 1; (b) model 2; (c) model 3; (d) model 4; (e) model 5

Fig. 4

Diagnostics for the lip cancer example—residuals versus leverages (the parabolas indicate contributions of 1, 2 or 5 to the total DIC (apart from model 1): (a) model 1; (b) model 2; (c) model 3; (d) model 4; (e) model 5

8.2. Robust regression using the stack loss data

Spiegelhalter et al. (1996) (pages 27–29) considered a variety of error structures for the oftanalysed stack loss data of Brownlee (1965) . Here the response variable y , the amount of stack loss (escaping ammonia in an industrial application), is regresssed on three predictor variables: air flow x1 , temperature x2 and acid concentration x3 . Assuming the usual linear regression structure

where zij=(xij−x¯.j)/sd(x.j) , the standardized covariates, the presence of a few prominent outliers among the n =21 cases motivates a comparison of the following four error distributions: model 1,

model 2,

model 3,

model 4,

(where DE denotes the double-exponential (Laplace) distribution and td denotes Student’s t -distribution with d degrees of freedom).

A well-known alternative to the direct fitting of many symmetric but non-normal error distributions is through scale mixtures of normals (Andrews and Mallows, 1974). From page 210 of Carlin and Louis (2000), we have the alternate td-formulation model 5,

yi∼normal(μi,1wiτ),wi∼1dχd2=gamma(d2,d2).

Unlike our other examples the form of the likelihood changes with each model, so we must use the full normalizing constants when computing −2 log {p(y|μ,τ)}.

Following Spiegelhalter et al. (1996) we set d=4, and for each model we placed essentially flat priors on the βj (actually normal with mean 0 and precision 0.000 01) and log (τ) (actually gamma(0.001,0.001) on τ) and ran the Gibbs sampler in BUGS for 5000 iterations following a burn-in period of 1000 iterations.

Replacing τ and wi by their posterior means where necessary for the D(θ¯)-calculation, the resulting deviance summaries are shown in Table 2 (note that the mean parameterization and the canonical parameterization are equivalent here, since the mean μi is a linear function of the canonical β-parameters). Beginning with a comparison of the first four models, the estimates of pD are all just over 5, the correct number of parameters for this example. The DIC-values imply that model 2 (double exponential) is best, followed by the t4-, the logistic and finally the normal models. Clearly this order is consistent with the models’ respective abilities to accommodate outliers.

Table 2

Deviance results for the stack loss data

ModelD¯D(θ¯)pDDIC
1, normal110.1105.05.1115.2
2, double exponential107.9102.35.6113.5
3, logistic109.5104.25.3114.8
4, t4108.7103.25.5114.2
5, t4 as scale mixture102.194.57.6109.7

Turning to the normal scale mixture representation for the t4-likelihood (model 5), the pD-value is 7.6, suggesting that the wi random effects contribute only an extra 2–2.5 parameters. However, the model’s smaller DIC-value implies that the extra mixing parameters are worthwhile in an overall quality-of-fit sense. We emphasize that the results from models 4 and 5 need not be equal since, although they lead to the same marginal likelihood for the yi, they correspond to different prediction problems.

Finally, plots of deviance residuals versus leverages (which are not shown) clearly identify the observations determined to be ‘outlying’ by several previous researchers who analysed this data set.

8.3. Longitudinal binary observations: the six-cities study

To illustrate how the mean and canonical parameterizations (introduced in Section 5 and further discussed in Section 9) can sometimes lead to different conclusions, our next example considers a subset of data from the six-cities study, a longitudinal study of the health effects of air pollution: see Fitzmaurice and Laird (1993) for the data and a likelihood-based analysis. The data consist of repeated binary measurements yij of the wheezing status (1, yes; 0, no) of child i at time j, i=1,…,I,j=1,…,J, for each of I=537 children living in Stuebenville, Ohio, at J=4 time points. We are given two predictor variables: aij, the age of child i in years at measurement point j (7, 8, 9 or 10 years), and si, the smoking status of child i‘s mother (1, yes; 0, no). Following the Bayesian analysis of Chib and Greenberg (1998), we adopt the conditional response model

Yij∼Bernoulli(pij),pij≡Pr(Yij=1)=g−1(μij),μij=β0+β1zij1+β2zij2+β3zij3+bi,

where zijk=xijk−x¯.k,k=1,2,3⁠, and xij1=aij, xij2=si and xij3=aijsi, a smoking–age interaction term. The bi are individual-specific random effects, initially given an exchangeable N(0,λ−1) specification, which allow for dependence between the longitudinal responses for child i. The model choice issue here is to determine the most appropriate link function g(·) among three candidates, namely the logit, the probit and the complementary log–log-links. More formally, our three models are model 1,

g(pij)=logit(pij)=log{pij/(1−pij)},

model 2,

g(pij)=probit(pij)=Φ−1(pij),

and model 3,

g(pij)=cloglog(pij)=log{−log(1−pij)}.

Since the Bernoulli likelihood is unaffected by this choice, in all cases the deviance takes the simple form

D=−2∑i,j{yijlog(pij)+(1−yij)log(1−pij)}.

Placing flat priors on the βk and a gamma(0.001,0.001) prior on λ, and running the Gibbs sampler for 5000 iterations following a burn-in period of 1000 iterations produces the deviance summaries in Table 3 for the canonical and mean parameterizations: the canonical parameterization constructs θ¯ as the mean of the linear predictors β and bi, and then uses the appropriate linking transformation (logit, probit or complementary log–log) to obtain the imputed means for the pij. The mean parameterization simply uses the means of the pij themselves when computing D(θ¯)⁠. Natarajan and Kass (2000) have pointed out potential problems with the gamma(0.001,0.001) prior on λ, but in this context the 537 random effects ensure that these findings are robust to the choice of prior for λ.

Table 3

Results for both parameterizations of the Bernoulli panel data

ModelDÂŻResults for the canonicalResults for the mean
parameterizationparameterization
------------
D(θ¯)pD
------------
1, logit1166.4917.7248.7
2, probit1148.6885.9262.7
3, complementary log–log1180.9956.5224.4

The posterior standard deviation √_λ_−1 of the random effects is estimated to be 2.2 (standard deviation 0.2), which indicates extremely high unexplained overdispersion and hence considerable prior–data conflict: this should warn us of a potential lack of robustness in our procedure. We have a sample size of ni=4 for each of I=537 individuals, and an average pDi for the canonical parameterization of around 0.4–0.5. From approximation (31), this indicates a prior sample size a+b of around 4–6. Referring to the evidence in Fig. 1 concerning low prior and observation sample sizes (ni=1;a+b=1), we might expect the mean parameterization to display decreased complexity compared with the canonical, and this is borne out in the results. DIC prefers the complementary log–log-link under the canonical parameterization, but the probit link under the mean parameterization. We repeat that we prefer the canonical results because of the improved normality of the likelihoods and their lack of dependence on observed data: however, none of the models explain the data very well, and the lack of consensus suggests caution in using any of the models.

9. Discussion

Here we briefly discuss relationships to other suggestions and give some guidance on the practical use of the techniques described in this paper.

9.1. Relationship of pD and DIC to other suggestions

9.1.1. Cross-validation

Stone (1977) showed the asymptotic equivalence of model comparison based on cross-validation and AIC, whereas Wahba (1990) (page 52) showed how a generalized cross-validation criterion leads to the use of n −tr( H ) as a denominator in the estimation of residual mean-squared error. We would expect our measure of model complexity pD to be strongly related to cross-validatory assessment, but this requires further investigation.

9.1.2. Other predictive loss functions

Kass and Raftery (1995) criticized Akaike (1973) for using a plug-in predictive distribution as we have done in Section 7.3, rather than the full predictive distribution obtained by integrating out the unknown parameters. A criterion based on this predictive distribution is also invariant to reparameterizations. Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) suggested minimizing a predictive ‘discrepancy measure’ E { d ( Ynew , y )| y }, where Ynew is a draw from the posterior predictive distribution p ( Ynew | y ), and we might for instance take d ( Ynew , y )=( Ynew − y ) T ( Ynew − y ). They showed that their measures also have attractive interpretations as weighted sums of ‘goodness of fit’ and ‘predictive variability penalty’ terms. However, a proper choice of the criterion requires fairly involved analytic work, as well as several subjective choices about the utility function that is appropriate for the problem at hand. Furthermore, the one-way ANOVA model in Section 2.5 gives rise to a fit term equivalent to D(θ¯) , and a predictive variability term equal to pD + p . Thus their suggestion is equivalent in this context to the comparison by our Bayesian measure of fit D¯ which, although invariant to parameterization, does not seem to penalize complexity sufficiently.

In general the use of a plug-in estimate appears to ‘cost’ an extra penalty of pD.

9.1.3. Bayes factors

Bayes factors are criteria based on a comparison of the marginal likelihoods (1) (Kass and Raftery, 1995), and a common approximation is the Bayesian (or Schwarz) information criterion (Schwarz, 1978), which for a model with p parameters and n observations is given by

BIC=−2log{p(y∣θ^)}+plog(n).

Bernardo and Smith (1994) (chapter 6) argued that this formulation may only be appropriate in circumstances where it was really believed that one and only one of the competing models was in fact true, and the crucial issue was to choose this correct model, and that in other circumstances criteria based on short-term prediction, such as cross-validation, may be more appropriate. We support this view and refer to Han and Carlin (2001) for a review of some of the computational and conceptual difficulties in using Bayes factors to compare complex hierarchical models. Whether DIC can be justified as a basis for model averaging remains open for investigation.

9.2. Practical issues in using DIC

9.2.1. Invariance

pD may be only approximately invariant to the chosen parameterization, since different fitted deviances D(θ¯) may arise from substituting posterior means of alternative choices of θ . The example in Section 8.3 shows that this choice could be important with Bernoulli data.

In Section 5 we explored the use of the posterior median as an estimator leading to an invariant pD. This has two possible disadvantages: we do not have a proof that pD will be positive and some additional computational difficulty in that the full sample needs to be retained. In addition the approximate properties based on Taylor series expansions in Section 3 may not hold, although this may be only of theoretical interest. Currently we recommend calculation of DIC on the basis of several different estimators, with a preference for posterior means based on parameterizations obeying approximate likelihood normality.

9.2.2. Focus of analysis

As we saw in the stack loss example of Section 8.2, there may be sensitivity to apparently innocuous restructuring of the model: this is to be expected since by making such changes we are altering the definition of a replicate data set, and hence one would expect DIC to change. For example, consider a model comprising a mixture of normal distributions. If this assumption was solely to obtain a flexible functional form, then the appropriate likelihood would comprise the mixture. If, however, we were interested in the membership of individual observations, then the likelihoods would be normal and the membership variables would contribute to the complexity of the model. Thus the parameters in the focus of a model should ideally depend on the purpose of the investigation, although in practice it is likely that the focus may be chosen on computational grounds as providing likelihoods that are available in closed form.

9.2.3. Nuisance parameters

Strictly speaking, nuisance parameters should first be integrated out to leave a likelihood depending solely on parameters in focus. In practice, however, parameters such as variances are likely to be included in the focus and add to the estimated complexity: we would recommend posterior means of log-variances as estimators.

9.2.4. What is an important difference in DIC?

Burnham and Anderson (1998) suggested models receiving AIC within 1–2 of the ‘best’ deserve consideration, and 3–7 have considerably less support: these rules of thumb appear to work reasonably well for DIC. Certainly we would like to ensure that differences are not due to Monte Carlo error: although this is straightforward for D¯ , Zhu and Carlin (2000) have explored the difficulty of assessing the Monte Carlo error on DIC.

9.2.5. Asymptotic consistency

As with AIC, DIC will not consistently select the true model from a fixed set with increasing sample sizes. We are not greatly concerned about this: we neither believe in a true model nor would expect the list of models being considered to remain static as the sample size increased.

9.3. Conclusion

In conclusion, our suggestions have a similar ‘information theoretic’ background to frequentist measures of model complexity and criteria for model comparison but are based on expectations with respect to parameters in place of sampling expectations. DIC can thus be viewed as a Bayesian analogue of AIC, with a similar justification but wider applicability. It is also applicable to any class of model, involves negligible additional analytic work or Monte Carlo sampling and appears to perform reasonably across a range of examples. We feel that pD and DIC deserve further investigation as tools for model assessment and comparison.

Discussion on the paper by Spiegelhalter, Best, Carlin and van der Linde

S. P. Brooks ( University of Cambridge )

This is a wonderful paper containing a wide array of interesting ideas. It seems to me very much like a first step (and in the right direction) and I am sure that it will be seen as both a focus and a source of inspiration for future developments in this area.

As the authors point out, their pD and the deviance information criterion (DIC) statistics have already been widely used within the Bayesian literature. Given this history and in the previous absence of a published source for these ideas, it is easy to misunderstand what pD actually does. Certainly, before reading this paper, but having read several others which use the DIC, I thought that the pD-statistic was a clever way of avoiding the problem that Bayesians have when it comes to calculating the number of parameters in any hierarchical model. Essentially the problem is one of deciding which variables in the posterior are model parameters and which are hyperparameters arising from the prior. However, pD does not help us here and that is why we have Section 2.1 explaining that this choice is up to the reader. The authors refer to this as choosing the ‘focus’ for the analysis. Sadly, in many cases the calculation of pD will be impossible for the focus of primary interest since the deviance will not be available in closed from (this includes random effects and state space models, for example), so this remains an open problem.

What pDdoes do is to tell you, once you have chosen your focus, how many parameters you lose (or even gain?) by being Bayesian. The number of degrees of freedom (or parameters) in a model is clear from the (focused) likelihood. However, by combining the likelihood with the prior we almost always impose additional restrictions on the parameter space, effectively reducing the degrees of freedom of our model. Take the authors’ saturated model of Section 8.1, in which parameters α1,…,α56 are given a prior with some unknown mean μ and fixed variance σ2. Clearly, in the limit as σ2 goes to 0, we essentially remove the 56 individual parameters αi and effectively replace them with a single parameter μ. I guess that this is fairly obvious with hindsight as is the case with many great ideas. None-the-less it is a credit to the authors firstly for seeing it and, more importantly, for actually deriving a procedure for dealing with it.

This prior-induced parameter reduction can be clearly observed in Fig. 5 in which we plot the value of pDθ against log(σ2) both for a hyperprior μ ∼ N(0,1000) and for μ=0 (the authors are unclear about which, if either, they actually use in Section 8.1). We can see that, as σ2 decreases, the effective number of parameters decreases to either 1 or 0 depending on whether or not μ itself is a parameter, i.e. which prior is chosen. It is interesting to note the rapid decline in pD for variances between 1 and 0.01, but what is particularly interesting about this plot is that, as σ2 increases, pD converges to a fixed maximum well below 56, the number of parameters in the likelihood. As an experiment, if we take σ2=1030 or even the Jeffreys prior for the μi, a value for pD exceeding 53.1 is never obtained (modulo Monte Carlo error). This suggests that we automatically lose three parameters just by being Bayesian, even if we are as vague as we could possibly be with our prior. Quoting Bernardo and Smith (1994), page 298, ‘every prior specification has some informative posterior or predictive implications …. There is no ‘‘objective’’ prior that represents ignorance.‘ Of course, the authors’Table 1 suggests that if we took the median as the basis for the calculation of pD then we might obtain different results; indeed we seem to regain several parameters this way! Unfortunately, analytic investigation of the pD-statistic is essentially limited to the case where we take θ˜(y) to be the posterior mean, so we have little idea of the extent and nature of the variability across parameterizations. This choice is likely to have a significant effect on any inference based on the corresponding pD-statistic and further (no doubt simulation-based) investigation along these lines would certainly be very helpful.

Plot of pDθ for the saturated model of Section 8.1 demonstrating its dependence on the prior variance for the random effects: ——, pD -statistic with an N (0, 1000) hyperprior for μ : - - - - -, corresponding value when we fix μ =0; ⋯⋯·, number of parameters in the likelihood

Fig. 5

Plot of pDθ for the saturated model of Section 8.1 demonstrating its dependence on the prior variance for the random effects: ——, pD -statistic with an N (0, 1000) hyperprior for μ : - - - - -, corresponding value when we fix μ =0; ⋯⋯·, number of parameters in the likelihood

As well as the construction of the pD-statistic, the paper also derives a new criterion for model comparison labelled the DIC. The authors provide a heuristic justification for the DIC, but there are clearly several alternatives. One obvious extension of the usual Akaike information criterion (AIC) statistic to the Bayesian context is to calculate its posterior expectation, EAIC=D(θ)¯+2p (rather than evaluating it at the posterior mode under a flat prior), or to take the deviance calculated at the posterior mean, i.e. taking D(θ¯)+2p⁠. Of course, as with the DIC, posterior medians, modes etc. could also be taken and similar extensions could be applied to the corrected AIC statistic and the Bayesian information criterion for example. Further, the number of parameters in each of these expressions might be replaced by pD to gain even more potential criteria. Table 4 gives the posterior model probabilities and posterior-averaged information criteria (based on p, rather than pD), including DIC, for autoregressive models of various orders fitted to the well-known lynx data (Priestley (1981), section 5.5). We note the broad agreement between the DIC, EAIC and EAICc (as is common in my own experience and, I think, expected by the authors), but that EBIC locates an entirely different model. We note also that the posterior model probabilities correctly identify the fact that two models appear to describe the data well and it is the only criterion to identify correctly the existence of two distinct modes in the posterior.

Table 4

Effective number of parameters, values of DIC and the posterior expectation of various information criteria for fitting an autoregressive model of order k (with k +1 parameters including the error variance) to the lynx data†

kpDDICEAICEBICEAICcπ ( K = k )wkDICwkEAICwkEBICwEAICc
11.88206.66206.78209.51206.810.0000.0000.0000.0000.000
22.85126.58127.72133.19127.830.2430.0000.0030.8580.011
33.78127.06129.27137.48129.500.0160.0000.0010.1010.005
44.76125.52128.75139.70129.120.0070.0000.0020.0330.006
55.70125.23129.52143.20130.080.0020.0000.0010.0060.004
66.62126.30131.68148.09132.460.0010.0000.0040.0000.001
77.60122.34128.72147.88129.780.0020.0000.0020.0010.004
88.61121.81129.19151.08130.560.0020.0000.0010.0000.003
99.58122.75131.16155.79132.890.0010.0000.0010.0000.001
1010.54118.94128.40155.76130.530.0020.0010.0020.0000.003
1111.33106.51117.16147.26119.750.1540.4310.5660.0010.624
1212.61106.89118.27151.10121.360.2680.3560.3250.0000.280
1313.56108.74121.17156.74124.810.1350.1420.0760.0000.050
1414.46110.77124.30162.61128.540.0670.0510.0160.0000.008
1515.37112.896127.42168.47132.320.0000.0190.0030.0000.001

†Criterion entries in bold indicate the model minimizing the relevant criterion, whereas those in italics denote alternative plausible models under the rules of thumb discussed in Section 9.2.4. Probabilities π or weights w in bold denotes the top two models in each case. Here, EAICc denotes the posterior mean of the corrected EAIC (Burnham and Anderson, 1998). π(K=k) the corresponding posterior model probability under a flat prior across models and the wkX the corresponding Akaike weights (or equivalent). The posterior model probabilities were kindly provided by Ricardo Ehlers.

Given the number of approximations and assumptions that are required to obtain the DIC it can only really be used as a broad brush technique for discriminating between obviously disparate models, in much the same way as any of the alternative information criteria suggested above might be used. However, in many realistic applications there may be two or more models with sufficiently similar DIC that it is impossible to choose between the two. The only sensible choice in this circumstance is to model-average (see Section 9.1.3). Burnham and Anderson (1998), section 4.2, suggested the use of AIC weights and these are also given in Table 4 together with the corresponding weights for the other criteria. Essentially, these are obtained by subtracting from each AIC the value associated with the ‘best’ model and then setting

where ΔAIC(k) denotes the transformed AIC-value for model k. These weights are then normalized to sum to 1 over the models under consideration.

Note the distinct differences between the weights and the posterior model probabilities given in Table 4, suggesting that only one or the other can really make any sense. We note here that similar comparisons have been made in the context of other examples. In the context of a log-linear contingency table analysis, King (2001), Table 2.5, found that two models have posterior probability 0.557 and 0.057 but corresponding DIC weights of 0.062 and 0.682 respectively. Similar examples in which the DIC and posterior model probabilities give wildly different results are provided by King and Brooks (2001). Do the authors have any feel for why these two approaches might give such different results? Which would they recommend be used and do they have any suggestions for alternative DIC-based weights for model averaging which might lead to more sensible results? Surely, the only sensible approach is to calculate posterior model probabilities via transdimensional Markov chain Monte Carlo methods. When, then, do the authors suggest that the DIC might be used? What, in practical terms is the question that the DIC is answering as opposed to the posterior model probabilities?

The incorporation of the DIC-statistic into WinBUGS 1.4 ensures its ultimate success, but I have grave misgivings concerning the blind application of a ‘default’ DIC-statistic for model determination problems particularly given its heuristic derivation and the series of essentially arbitrary assumptions and approximations on which it is based. The authors ‘recommend calculation of DIC on the basis of several different estimators’. The option to choose different parameterizations is not available in the beta version of WinBUGS 1.4; will it be added to later versions? What about options for the all-important choice of focus? What do the authors suggest we do when the same parameterization is not calculable for all models being compared? Could not the choice of parameterization for each model adversely influence the results, particularly for models with large numbers of parameters (where a small percentage change in pD might mean a large absolute change in the corresponding DIC)?

The paper, like any good discussion paper, leaves various other open questions. For example: why take 𝔼θ|y[dΘ] in equation (9) and not the mode or median; how should we decide when to take θ^ to be the mean, median, mode etc. as this will surely lead to different comparative results for the DIC; when is pD negative and why; in an entirely practical sense, how does model comparison with the DIC compare with that via posterior model probabilities and why do they differ—can both be ‘correct’ in any meaningful way? On page 613, the authors write ‘pD and DIC deserve further investigation as tools for model assessment and comparison’ and I would certainly agree that they do. I have very much enjoyed thinking about some of these ideas over the past few weeks and I am very grateful to the authors for the opportunity and motivation to do so. It therefore gives me great pleasure to propose the vote of thanks.

Jim Smith ( University of Warwick, Coventry )

I shall not address technical inaccuracies but just present four foundational problems that I have with the model selection in this paper.

  • (a)

    Bayesian models are designed to make plausible predictive statements about future observables. The predictive implications of all the prior settings on variances in the worked examples in Section 8 are unbelievable. They do not represent carefully elicited expert judgments but the views of a vacuous software user. Early in Section 1 the authors state that they want to identify succinct models ‘which appear to describe the information [about wrong ‘‘true’’ parameter values (see Section 2.2)?] in the data accurately’. But in a Bayesian analysis a separation between information in the data and in the prior is artificial and inappropriate. For example where do I input extraneous data used as the basis of my prior? When do I stop calling this data (and so include it in D (·)) and instead call it prior information? This forces the authors to use default priors. A Bayesian analysis on behalf of a remote auditing expert (Smith, 1996) might require the selection of a prior that is robust within a class of belief of different experts (e.g. Pericchi and Walley (1991)). Default priors can sometimes be justified for simple models. Even then, models within a selection class need to have compatible parameterizations: see Moreno et al. (1998). However, in examples where ‘the number of parameters outnumbers observations’—they claim their approach addresses—default priors are unlikely to exhibit any robustness. In particular, outside the domain of vague location estimation or separating variance estimation (discussed in Section 4), apparently default priors can have strong influence on model implications and hence selection.

  • (b)

    Suppose that we need to select models whose predictive implications we do not believe. Surely we should try to ensure that prior information in each model corresponds to predictive statements that are comparable. Such issues, not addressed here, are considered by Madigan and Raftery (1991) for simple discrete Bayesian models. But outside linear models with known variances this is a difficult problem. Furthermore it is well known that calibration is a fast function ( Cooke, 1991 ). In particular apparently inconsequential deviations from the features of a model ‘not in focus’ tend to dominate D ( θ ) and D(θ)¯⁠. A trivial example of this occurs when we plan to forecast X2 having observed an independent identically distributed X1 =0.01 which under models M1 and M2 have respective Gaussian distributions N (100,10 000) and N (0,0.001). Then, for most priors, model M1 is strongly preferred although its predictions about X2 are less ‘useful’ (Section 2.2). The authors’ premise that all the models they entertain are ‘wrong’ allows these calibration issues to bite theoretically even in the limit, unlike their asymptotically consistent rivals. The authors, however, do no more than to acknowledge the existence of this core difficulty after the example in Section 8.3.

  • (c)

    Suppose that problems (a) and (b) do not bite. Then the ‘vector of parameters of focus’ (POF) will have a critical influence on any ensuing inference. How in practice do we specify this? The authors state without elaboration that this ‘should depend on the purpose of the investigation’ (Section 9.2.2). But it appears that in practice the POF is calculated on ‘computational grounds’, their software capability driving their inference. The high influence of the choice of the POF is illustrated in the example in Section 8.2. Here models 4 and 5 are predictively identical but model 5 has a significantly smaller deviance information criterion DIC than model 4. The authors conclude that ‘the extra mixing parameters are worthwhile’: why? In what practical sense is this helpful? This example illustrates that the unguided choice of the POF will often be inferentially critical. Incidentally in this example the order of DIC is not (as stated) consistent with the thickness of tails of the sample distribution, the thickest-tailed distribution being model 4.

  • (d)

    But ignoring all these difficulties there still remains the acknowledged choice of (re)parameterization governing the choice of θ¯ which initially we shall assume to be the mean. Consider the case when the POF θ is one dimensional with strictly increasing posterior distribution function F ( θ | y ), and G is a distribution function of a random variable with mean μ . Then the reparameterization of θ to ϕμ=Gμ−1{F(θ∣y)} has E(ϕμ)=μ.  (or D(ϕ¯) is arbitrary within the range of D (·). Thus, contrary to Section (5.1.4), the choice of parameterization of θ with non-degenerate posterior will always be critical. But no general selection guidance is given here. In observation (c) of Section 2.6 the authors suggest the use of the posterior median instead of the mean if this can be calculated easily from their output: not a solution when the POF is more than one dimensional. Even familiar transforms of marginal medians to contrasts and means or means and variances to means and coefficients of variation will not exhibit the required sorts of invariance.

There may be theoretical reasons to use DIC but I do not believe that this paper gives them. So my suggestion to a practitioner would be: if you must use a formal selection criterion do not use DIC. I second the vote of thanks.

The vote of thanks was passed by acclamation.

Aki Vehtari ( Helsinki University of Technology )

The authors mention that the deviance information criterion DIC estimates the expected loss, with deviance as the loss function. This connection should be emphasized more. It should be remembered that the estimation of the expected deviance was Akaike’s motivation for deriving the very first information criterion AIC (Akaike, 1973). In prediction and decision problems, it is natural to assess the predictive ability of the model by estimating the expected utilities, as the principle of rational decisions is based on maximizing the expected utility (Good, 1952) and the maximization of expected likelihood maximizes the information gained (Bernardo, 1979). It is often useful to use other than likelihood-based utilities. For example, in classification problems it is much more meaningful for the application expert to know the expected classification accuracy than just the expected deviance value (Vehtari, 2001). Given an arbitrary utility function u, it is possible to use Monte Carlo samples to estimate Eθ[u¯(θ)] and u¯(Eθ[θ])⁠, and then to compute an expected utility estimate as

u¯DIC=u¯(Eθ[θ])+2{Eθ[u¯(θ)]−u¯(Eθ[θ])},

which is a generalization of DIC (Vehtari, 2001).

The authors also mention the known asymptotic relationship of AIC to cross-validation (CV). Equally important is to note that the same asymptotic relationship holds also for NIC (Stone (1977), equation (4.5)). The asymptotic relationship is not surprising, as it is known that CV can also be used to estimate expected utilities with Bayesian justification (Bernardo and Smith (1994), chapter 6, Vehtari (2001) and Vehtari and Lampinen (2002a)). Below some main differences between CV and DIC are listed. See Vehtari (2001) and Vehtari and Lampinen (2002b) for full discussion and empirical comparisons. CV can use full predictive distributions. In the CV approach, there are no parameterization problems, as it deals directly with predictive distributions. CV estimates the expected utility directly, but it can also be used to estimate the effective number of parameters if desired. In the CV approach, it is easy to estimate the distributions of the expected utility estimates, which can for example be used to determine automatically whether the difference between two models is ‘important’. Importance sampling leave-one-out CV (Gelfand et al., 1992; Gelfand, 1996) is computationally as light as DIC, but it seems to be numerically more unstable. k-fold CV is very stable and reliable, but it requires k times more computation time to use. k-fold CV can also handle finite range dependences in the data. For example, in the six-cities study, the wheezing statuses of a single child at different ages are not independent. DIC, which assumes independence, underestimates the expected deviance. In k-fold CV it is possible to group the dependent data and to handle independent groups and thus to obtain better estimates (Vehtari, 2001; Vehtari and Lampinen, 2002b).

Martyn Plummer ( International Agency for Research on Cancer, Lyon )

I congratulate the authors on their thought-provoking paper. I would like to offer one constructive suggestion and one criticism.

Firstly, I have a proposal for a modified definition of the effective number of parameters pD. Starting from the Kullback–Leibler information divergence between the predictive distributions at two different values of θ

I(θ0,θ1)=EYrep ∣θ0[log{p(Yrep ∣θ0)p(Yrep ∣θ1)}],

I suggest that pD be defined as the expected value of I(θ0,θ1) when θ0 and θ1 are independent samples from the posterior distribution of θ. This modified definition yields exactly the same expression for pD in the normal linear model with known variance. In general, it should give a similar estimate of pD when θ has an asymptotic normal distribution. This version of pD can also be decomposed into influence diagnostics when the likelihood factorizes as in Section 6.3. It has the theoretical advantages of being non-negative and co-ordinate free. A practical advantage is that pD can be estimated via Markov chain Monte Carlo sampling using two parallel chains by taking the sample average of

log{p(Yrep 0∣θ0)p(Yrep 0∣θ1)}

where the superscript denotes the chain to which each quantity belongs. The Monte Carlo error of this estimate is easily calculated and the difficulties discussed by Zhu and Carlin (2000) can thus be avoided.

For exponential family models, I(θ0,θ1) can be expressed in closed form and there is no need to simulate replicate observations Yrep. When the scale parameter φ is known, the expression for pDi simplifies to

pDi=niwicov{θi,μ(θi)∣Y}/ϕ.

This gives a surprising resolution to the problem of whether to use the canonical or mean parameterization to estimate pD.

On a more negative note, I am not convinced by the heuristic derivation of the deviance information criterion DIC in Section 7.3. I followed this derivation for the linear model of Section 4.1, for which it is not necessary to make any approximations. The term with expectation 0, neglected in the final expression, is p−pD−D(θ¯)⁠. Adding this to DIC gives an expected loss of p+pD which is not useful as a model choice criterion. I am not suggesting that the use of DIC is wrong, but a formal derivation is lacking.

Mervyn Stone ( University College London )

The paper is rather economical with the ‘truth’. The truth of pt(Y) corresponds fixedly to the conditions of the experimental or observational set-up that ensures independent future replication Yrep or internal independence of y=y=(y1,…,yn) (not excluding an implicit concomitant x). For pt(Y)≈_p_(Y|θt),θ must parameterize a scientifically plausible family of alternative distributions of Y under those conditions and is therefore a necessary‘focus’ if the ‘good [true] model’ idea is to be invoked: think of tossing a bent coin. Changing focus is not an option.

Any connection of pD with cross-validatory assessment would need truth as pt(y)=pt(y1)…pt(yn). If l= log (p) is an acceptable measure of predictive success, A=∑il(yi∣θ˜−i) is a one-out estimate of Ept(Y)[∑il{Yi∣θ˜(y)}]⁠. Multiplied by −2, this connects with equation (33) only when the θ-model is true with Y1,…,Yn independent.

Extending Stone (1977) to the posterior mode for prior p(θ), with n large, A≈Lθ˜(y)−Π(y) where

Π(y)=−tr{Lθ˜′′+l″(θ˜)}−1∑ilθ˜′(yi)lθ˜′(yi)T

and l(θ)= log {p(θ)}. If l′′(θ˜) is negative definite, the typically non-negative penalty Π(y) is smaller for the posterior mode than for the maximum likelihood estimate. For the maximum likelihood estimate, l′′(θ˜)=O gives Π(y) estimating p, but the general form probably gives Ripley’s p.

If Section 7.3 could be rigorously developed (the use of EY does look suspicious!), another connection (via equation (33)) might be that DIC ≈−2_A_. But, since Section 7.3 invokes the ‘good model’ assumption and small |θ˜−θ| for the Taylor series expansion (i.e. large n), such a connection would be as contrived as that of A with the Akaike information criterion: why not stick with the pristine (nowadays calculable) form of A—which does not need large n or truth, and which accommodates estimation of θ at the independence level of a hierarchical Bayesian model? If sensitivity of the logarithm to negligible probabilities is objectionable, Bayesians should be happy to substitute a subjectively preferable measure of predictive success.

Christian P. Robert ( UniversitĂŠ Paris Dauphine ) and D. M. Titterington ( University of Glasgow )

A question that arises regarding this thought challenging paper was actually raised in the discussion of Aitkin (1991), namely that the data seem to be used twice in the construction of pD. Indeed, y is used the first time to produce the posterior distribution π(θ|y) and the associated estimate θ˜(y)⁠. The (Bayesian) deviance criterion then computes the posterior expectation of the observed likelihood p(y|θ),

∫log{p(y∣θ)}π(dθ∣y)∝∫log{p(y∣θ)}p(y∣θ)π(dθ),

and thus uses y again, similarly to Aitkin’s posterior Bayes factor

This repeated use of y would appear to be a potential factor for overfitting.

It thus seems more pertinent (within the Bayesian paradigm) to follow an integrated approach along the lines of the posterior expected deviance of Section 6.2,

∫EY∣θ[−2log{p(Y∣θ)}+2log{f(Y)}]π(dθ∣y)

because this quantity would be strongly related to the posterior expected loss defined by the logarithmic deviance,

d(θ,θ˜)=EY∣θ[log{p(Y∣θ)}−log{p(Y∣θ˜)}],

advocated in Robert (1996) and Dupuis and Robert (2002) as an intrinsic loss adequate for model fitting. In fact, the connection between pD, the deviance information criterion and the logarithmic deviance would suggest the use of this loss d(θ,θ˜) to compute the estimate plugged in pD as the intrinsic Bayes estimator

θπ(y)=argminθ˜{Eθ∣y(EY∣θ[log{p(Y∣θ)}−log{p(Y∣θ˜)}])}=argmax[EY∣y{p(Y∣θ˜)}]

where the last expectation is computed under the predictive distribution. Not only does this make sense because of the aforementioned connection, but it also provides an estimator that is completely invariant to reparameterization and thus avoids the possibly difficult choice of the parameterization of the problem. (See Celeux et al. (2000) for an illustration in the set-up of mixtures.)

J. A. Nelder ( Imperial College of Science, Technology and Medicine, London )

My colleague Professor Lee has made some general points connecting the subject of this paper to our work on likelihood-based hierarchical generalized linear models. I want to make one specific point and two general ones.

  • (a)

    Professor Dodge has shown that, of the 21 observations in the stack loss data set, only five have not been declared to be outliers by someone! Yet there is a simple model in which no observation appears as an outlier. It is a generalized linear model with gamma distribution, log-link and linear predictor x2 + log ( x1 )* log ( x3 ). This gives the following entries for Table 2 in the paper

    (I am indebted to Dr Best for calculating these). It is clearly better than the existing models used in Table 2 .

  • (b)

    This example illustrates my first general point. I believe that the time has passed when it was enough to assume an identity link for models while allowing the distribution only to change. We should take as our base-line set of models at least the generalized linear model class defined by distribution, link and linear predictor, with choice of scales for the covariates in the last named.

  • (c)

    My second general point is that there is, for me, not nearly enough model checking in the paper (I am assuming that the use of such techniques is not against the Bayesian rules). For example, if a set of random effects is sufficiently large in number and the model postulates that they are normally distributed, their estimates should be graphed to see whether they look like a sample from such a distribution. If they look, for example, strongly bimodal, then the model must be revised.

Anthony Atkinson ( London School of Economics and Political Science )

This is an interesting paper which tackles important problems. In my comments I concentrate on regression models: the points extend to the more complicated models at the centre of the authors’ presentation.

It is stressed in Section 7.1 that information criteria assume a replication of the observations; in regression this would be with the same X-matrix. But, the simulations of Atkinson (1980) showed that, to predict over a different region, higher values of the penalty coefficient than two in equation (36) are needed. Do the authors know of any analytical results in this area?

Information criteria for model selection are based on aggregate statistics. Fig. 4 shows an alternative and more informative breakdown of one criterion into the contributions of individual observations than that given by Weisberg (1981). However, it does not show the effect of the deletion of observations on model choice. Atkinson and Riani (2000) used the forward search to analyse the stack loss data, for which symmetrical error distributions were considered in Section 8.2. Their Fig. 4.28 shows that the square-root transformation is the only one supported by all the data. The forward plot of residuals, Fig. 3.27, is stable, with observations 4 and 21 outlying. This diagnostic technique complements the choice of a model using information criteria calculated over a set of models that is too narrow.

An example of model choice potentially confounded by the presence of several outliers is provided by 108 observations on the survival of patients following liver surgery from Neter et al. (1996), pages 334 and 438. There are four explanatory variables. Fig. 6 shows the evolution of the added variable t-tests for the variables during the forward search with log(survival time) as the response: the evidence for the importance of all variables except x4 increases steadily during the search. Atkinson and Riani (2002) modify the data to produce two different effects. The forward plots of the t-tests in Fig. 7(a) show that now x1 is non-significant at the end of the search. The plot identifies the group of modified observations which have this effect on the t-test for x1. Fig. 7(b) shows the effect of a different contamination, which makes x4 significant at the end of the search.

Transformed surgical unit data: forward plot of the four added variable t -statistics: three variables are needed in the model— x4 is not significant

Fig. 6

Transformed surgical unit data: forward plot of the four added variable t -statistics: three variables are needed in the model— x4 is not significant

Modified transformed surgical unit data: (a) outliers render x1 non-significant; (b) now the outliers make x4 significant (both (a) and (b) show forward plots of added variable t -statistics)

Fig. 7

Modified transformed surgical unit data: (a) outliers render x1 non-significant; (b) now the outliers make x4 significant (both (a) and (b) show forward plots of added variable t -statistics)

The use of information criteria in the selection of models is a first step, which needs to be complemented by diagnostic tests and plots. These examples show that the forward search is an extremely powerful tool for this purpose. It also requires many fits of the model to subsets of the data. Can it be combined with the appreciable computations of the authors’ Markov chain Monte Carlo methods?

A. P. Dawid ( University College London )

This paper should have been titled ‘Measures of Bayesian model complexity and fit’, for it is the models, not the measures, that are Bayesian. Once the ingredients of a problem have been specified, any relevant question has a unique Bayesian answer. Bayesian methodology should focus on specification issues or on ways of calculating or approximating the answer. Nothing else is required.

Classical criteria overfit complex models, necessitating some form of penalization, and this paper lies firmly in that tradition. But with Bayesian techniques (Kass and Raftery, 1995) overfitting is not a problem: the marginal likelihood automatically penalizes model complexity without any need for further adjustment. In particular, Bayesian model choice is consistent in the ‘good model’ case (Dawid, 1992a). In Section 9.2.5 the authors brush aside the failure of their deviance information criterion procedure to share this consistency property; but should we not seek reassurance that a procedure performs well in those simple cases for which its performance can be readily assessed, before trusting it on more complex problems?

I contest the view (Section 9.1.3) that likelihood is relevant only under the good model assumption: from a decision theoretic perspective, we can always regard the ‘log-loss’ scoring rule S(p,y):=− log {p(y)} as a measure of the inadequacy of an assessed density p(·) in the light of empirical data y (Dawid, 1986). Moreover, when y is a sequence yn=(y1,…,yn) of not necessarily independent or identically distributed variables, we have

−log{p(yn)}=∑i=1n−log{p(yi∣yi−1)},

(41)

the _i_th term measuring the performance of the Bayesian probability forecast for yi on the basis of analysis of earlier data only (Cowell et al. (1999), chapters 10 and 11). This representation clearly demonstrates why unadjusted marginal likelihood offers a valid measure of model fit: each ‘test’ observation yi is always entirely disjoint from the associated ‘training’ data yi−1. If desired, we can generalize this prequential formulation of marginal likelihood by inserting other loss functions (Dawid, 1992b) or using other model fitting methods (Skouras and Dawid, 1999). Such procedures exhibit a natural consistency property even under model misspecification (Dawid, 1991; Skouras and Dawid, 2000).

One place where a Bayesian might want a measure of model complexity is as a substitute for p in the Bayes information criterion approximation to marginal likelihood, e.g. for hierarchical models. But in such cases the definition of the sample size n can be just as problematic as that of the model dimension p. What we need is a better substitute for the whole term p log (n).

Andrew Lawson and Allan Clark ( University of Aberdeen )

We would like to make several comments on this excellent paper.

Our prime concern here is the fact that the deviance information criterion DIC is not designed to provide a sensible measure of model complexity when the parameters in the model take the form of locations in some ℛ-dimensional space. In the spatial context, this could mean the locations of cluster centres or, more generally, the components of a mixture. Clearly the averaging of parameters in these contexts is nonsensical but is a fundamental ingredient of DIC’s penalty term D(θ¯)⁠. Even if an alternative measure of central tendency is used it remains inappropriate to average over configurations where locations in the chosen space are parameters (e.g. cluster detection modelling in spatial epidemiology (McKeague and Loiseaux, 2002; Gangnon and Clayton, 2002). In the case of the Bayes information criterion, however, it might be possible to replace the penalty p ln (n) by an average number of parameters (in a reversible jump context) such as p¯⁠, where p is the number of parameters and n the sample size. This would at least approximately accommodate the varying dimension but would not require the averaging of parameters (as compared with DIC). This was suggested in Lawson (2000).

The second point of concern is the relationship of the goodness of fit to convergence of the Markov chain Monte Carlo samplers for which DIC is designed. If posterior marginal distributions are multimodal then the conventional convergence diagnostic will fail (as they will usually find too much variability in individual chains), and also DIC will average over the modes.

We are also somewhat concerned and puzzled by the results for the Scottish lip cancer data set. In Table 1, excepting the saturated model, the largest penalty terms are for the exchangeable model and not those with either spatial or spatial and exchangeable components. We also note that it is not strictly appropriate to fit a spatial-only model without the exchangeable component.

Finally we note that alternative approaches have recently been proposed (Plummer, 2002).

JosÊ M. Bernardo ( Universitat de València )

This interesting paper discusses rather polemic issues and offers some reasonable suggestions. I shall limit my comments to some points which could benefit from further analysis.

  • (a)

    The authors point out that their proposal is not invariant under reparameterization and show that differences may be large. The use of the median would make the result invariant in one dimension, but it is not trivial to extend this to many dimensions. An attractive, general invariant estimator is the intrinsic estimator obtained by minimizing the reference posterior expectation of the intrinsic loss δ(θ^,θ) ( Bernardo and Suarez, 2002 ) defined as the minimum logarithmic divergence between p(x∣θ^) and p ( x | θ ). Under regularity conditions and moderate or large samples, this is well approximated by ( E [ θ | x ]+ M [ θ | x ])/2, the average between the reference posterior mean and mode. Other invariant estimators may be obtained by minimizing the posterior expectation of δ(θ^,θ) obtained from either a proper subjective prior or an improper prior which, as the reference prior, is obtained from an algorithm which is invariant under reparameterization.

  • (b)

    The authors use ‘essentially flat’ or ‘weakly informative’ priors, i.e. conjugate-like priors with very small parameter values. This is dangerous and is not recommended. There is no reason to believe that those priors are weakly informative on the parameters of interest. Indeed, these limiting proper priors can have hidden undesirable features such as strong biases (cf. the Stein paradox). Moreover, they may approximate a prior function which would result in an improper posterior and using a ‘vague’ proper prior in that case does not solve the problem; the answer will then typically be extremely sensitive to the hyperparameters chosen for the vague proper prior and, since the Markov chain Monte Carlo algorithm will converge because the posteriors are guaranteed to be proper, one might not notice anything wrong. If full, credible, subjective elicitation is not possible then one should use formal methods to derive an appropriate reference prior.

  • (c)

    The authors’ brief comment (in Section 9.2.4) on the calibration of the deviance information criterion DIC is too short to offer guidance. With Bayes factors, we have a direct interpretation of the numbers obtained. The Bayesian reference criterion ( Bernardo, 1999 ) is defined in terms of natural information units (and may also be described in terms of log-odds). Is there a natural interpretation for DIC?

  • (d)

    The important particular case of nested models is not discussed in the paper. Would the authors comment on the behaviour on DIC in that case (and hence on their implication on precise hypothesis testing)? For instance, what is DIC’s recommendation for the simple canonical problem of testing a value for a normal mean? It seems to me that, like Akaike’s information criterion or the Bayesian reference criterion (but not the Bayes information criterion or Bayes factors), DIC would avoid Lindley’s paradox. Is this so?

Sujit K. Sahu ( University of Southampton )

This impressive paper shows how the very complicated business of model complexity can be assessed easily by using Markov chain Monte Carlo methods. My comments mostly concern the foundational aspects of the methods proposed and the interrelationship of the deviance information criterion DIC and other Bayesian model selection criteria.

The paper provides a long list of models and the associated pD, the effective number of parameters. In each of these cases pD is interpreted nicely in terms of model quantities. However, there is an unappealing feature of pD that I would like to point out in the discussion below.

Consider the set-up leading to equation (23). Assume further that A1=1,C1=1 and C2=τ2. Thus the likelihood is N(θ,1) and the prior is N(0,τ2). Then equation (23) yields that

Assuming τ2 to be finite it is seen that pD increases to 1 as _n_→ ∞ . The unappealing point is that the effective number of parameters is larger for larger sample sizes; conventional intuition suggests otherwise. The number of unknowns (i.e. the effective number of parameters) should decrease as more data are obtained under this very simple static model. In spite of the authors’ views on asymptotics or consistency, this point deserves further explanation as it is valid even when small sample sizes are considered.

In Section 9.1 the relationship between DIC and other well-known Bayesian model selection criteria including the Bayes factor is discussed. Although DIC is not to be viewed as a formal model choice criterion (according to the authors), it is often (and it will be) used to perform model selection; see for example the references cited by the authors. In this regard a more precise statement about the relationship between the Bayes factor and DIC can be made. I illustrate this with the above simple example taken from the paper.

Assume that the observation model is N(θ,1) and the prior for θ is N(0,τ2). Suppose that model 0 specifies that H0:θ=0 and model 1 says that H1:_θ_≠0. I assume that both n and τ2 are finite and thus avoid the problems with interpretation of the Bayes factor and Lindley’s paradox. Using the Bayes factor, model 0 will be selected if

ny¯2<(1+nτ2)log(1+nτ2)nτ2.

In contrast, DIC selects model 0 if

Clearly, if DIC selects model 0 then the Bayes factor will also select model 0. It is also observed that the Bayes factor allows for higher |yÂŻ|-values without rejecting the simpler model. In effect DIC is seen to have the much discussed poor behaviour of a conventional significance test which criticizes the simpler null hypothesis too much and often rejects it when it should not.

Sylvia Richardson ( Imperial College School of Medicine, London )

I restrict my comments on this far-reaching paper to the use of the deviance information criterion DIC for choosing within a family of models and the behaviour of pD as a penalization.

My first remark concerns the spatial example of Section 8. The DIC-values for the ‘spatial’ and the ‘spatial plus exchangeable’ models are nearly identical. Thus, the authors resort to external pragmatic considerations for preferring the simpler model, while the more complex one is not penalized.

Turning to mixture models and the comparison between models with different numbers of components, I discuss two situations. The first concerns simple Gaussian mixtures with an unknown number of components; yi~∑j=1kwjf(⋅∣θj),i=1,…,n⁠, where f(·|θj) is Gaussian. To calculate DIC in this setting, let us focus on mixtures as flexible distributions and use the conditional density for a new observation y:g(y)=p(y*|y,w,θ,k) to calculate the deviance D(g)=−2∑i=1nlog{g(yi)} and take its expectation over the Markov chain Monte Carlo run, conditional on k. We have pD(k)=E{D(g)}−D(g^k)⁠, where g^k=p(y*∣y,k)⁠.

Two cases of Gaussian mixtures were simulated (one replication): a well-separated bimodal mixture (bimod), 0.5 N(−1.5,0.5)+0.5 N(1.5,0.5), and an overlapping skewed bimodal mixture (skew): 0.75 N(0, 1) + 0.25 N(1.5, 0.33), each with 200 data points.

In the clear-cut bimod case, DIC(k) is lower for k=2, with a small incremental increase in both E(D|y,k) and pD as extra components are being fitted (Table 5). In the more challenging skew case, the pattern of DIC-values shows that this data set requires more than two components to be adequately fitted, but the values of DIC and pD stay surprisingly flat between three and six components. Note that the predictive density plots conditional on k=3,4,5 are completely superimposed (Fig. 8), indicating that more than three components can be considered as overfitting the data, in the sense that they give alternative explanations that are no better but involve increasing numbers of parameters.

Table 5

Performance of DIC for mixture models with different numbers of components

Results for the following values of k:
k = 2
------
Bimod (n = 200)
DIC(k)566.7
E ( Dy , k )
pD3.3
Skew (n = 200)
DIC(k)545.5
E ( Dy , k )
pD5.2
North–south (n = 94)
DIC(k)110.5
E ( Dy , k )
pD16.3

Predictive densities for the skew data set: ⋯⋯, k =2; ——, unconditional (results for k =3, 4, 5 are superimposed)

Fig. 8

Predictive densities for the skew data set: ⋯⋯, k =2; ——, unconditional (results for k =3, 4, 5 are superimposed)

The second situation is that of spatial mixture models proposed in Green and Richardson (2002) in the context of disease mapping. DIC was calculated by focusing on area-specific risk. Referring, for example, to the simple north–south (two-component) contrast defined in that paper, we find that DIC stays stable as k increases, decreasing E(D|y,k) values being compensated by increasing pD. On the basis of a mean-square error criterion between the estimated and the underlying risk surface, a deterioration of the fit would be seen with values of 0.14, 0.15 and 0.16 for k=2,3,4 respectively.

Thus pD acts as a sufficient penalization only in the simplest case. In other cases, DIC does not distinguish between alternative fits with increasing number of parameters.

Peter Green ( University of Bristol )

I have two rather simple comments on this interesting, important and long-awaited paper.

The first concerns using basic distribution theory to give a surprising new perspective on pD in the normal case, perhaps identifying a missed opportunity in exposition.

Consider first a decomposition of data as focus plus noise:

where X and Z are independent n-vectors, normally distributed with fixed means and variances, and var(Z) is non-singular. The deviance is

and so

pD=E[D(X)∣Y]−D(E[X∣Y])=tr{var(Z)−1var(Z∣Y)},

(42)

using the standard expression for the expectation of a quadratic form. Several results in the paper have this form, possibly in disguise. However,

var(Z∣Y)=var(Z)−cov(Z,Y)var(Y)−1cov(Y,Z)=var(Z)−var(Z)var(Y)−1var(Z)=var(Z)var(Y)−1{var(Y)−var(Z)},

yielding the much more easily interpretable

This allows a very clean derivation of examples in Sections 2.5 and 4.1–4.3. For example, in the Lindley and Smith model we have var(Z)=C1 and var(X)=A1C2A1T⁠, and so

pD=tr{(A1C2A1T+C1)−1A1C2A1T}=tr{A1TC1−1A1(A1TC1−1A1+C2−1)−1},

as in equation (21) of the paper.

Turning now to hierarchical models, consider a decomposition into k independent terms

where all Zi are normal, and var(Zk) is non-singular. These represent all the various terms of the model: fixed effects with priors, random effects with different structures, errors at various levels; again all means and variances are fixed. Then for any level l=1,2,…,_k_−1 we may take the sum of the first l terms as the focus and the rest as noise.

Version (42) of pD above is then not very promising:

pD(l)=tr{var(∑i=l+1kZi)−1var(∑i=l+1kZi∣Y)},

but expression (43) gives the more compelling

pD(l)=tr{var(Y)−1var(∑i=1lZi)}.

(44)

Thus pD has generated a decomposition of the overall degrees of freedom n=Σl tr{var(Y)−1var(Zl)} into non-negative terms attributable to the levels l=1,2,…,k, just as in frequentist nested model analysis of variance. (We must take care with improper priors in using expression (44), and terms should be treated as limits as precisions go to 0.) Of course, expressions (43) and (44) fail to hold with unknown variances or with non-normal models, but the observations above do provide further motivation for accepting pD as a measure of complexity, and suggest exploring more thoroughly its role in hierarchical models.

My second point notes that the paper has no examples with discrete ‘parameters’. Conditional distributions in hierarchical models with purely categorical variables can be computed by using probability propagation methods (Lauritzen and Spiegelhalter, 1988), avoiding Markov chain Monte Carlo methods, so that pD is again a cheap local computation. Presumably marginal posterior modes would be used for θ¯⁠. Certainly this is a context where pD can be negative. Can connections be drawn with existing model criticism criteria in probabilistic expert systems?

The following contributions were received in writing after the meeting.

Kenneth P. Burnham ( US Geological Survey and Colorado State University, Fort Collins )

This paper is an impressive contribution to the literature and I congratulate the authors on their achievements therein. My comments focus on the model selection aspect of the deviance information criterion DIC. My perspectives on model selection are given in Burnham and Anderson (2002), which has a focus on the Akaike information criterion AIC as derived from Kullback–Leibler information theory. A lesson that we learned was that, if the sample size n is small or the number of estimated parameters p is large relative to n, a modified AIC should be used, such as AICc=AIC+2_p_(p+1)/(_n_−_p_−1). I wonder whether DIC needs such a modification or if it really automatically adjusts for a small sample size or large p, relative to n. This would be a useful issue for the authors to explore in detail.

At a deeper level I maintain that model selection should be multimodel inference rather than just inference based on a single best model. Thus, model selection to me has become the computation of a set of model weights (probabilities in a Bayesian approach), based on the data and the set of models, that sum to 1. Given these weights and the fitted models (or posterior distributions), model selection uncertainty can be assessed and model-averaged inferences made. The authors clearly have this issue in mind as demonstrated by the last sentence of Section 9.1.3. I urge them to pursue this much more general implementation of model selection and to seek a theoretical or empirical basis for it with DIC.

There is a matter that I am confused about. The authors say ‘… we essentially reduce all models to non-hierarchical structures’ (third page), and ‘Strictly speaking, nuisance parameters should first be integrated out …‘ (Section 9.2.3). Does this mean that we cannot make full inferences about models with random effects? Can DIC be applied to random-effects models? It seems so on the basis of their lip cancer example (Section 8.1). Can I have a model with fixed effects τ, random effects φ1,…,φk, with postulated distribution g(φ|θ),θ as fixed effects (plus priors on all fixed effects) and have my focus be all of τ,φ and θ? Thus, I obtain shrinkage-type inferences about the φi; I do not integrate out the φ (AIC has been adapted to this usage).

The authors make a point (page 612) that I wish to make more strongly. It will usually not be appropriate to ‘choose’ a single model. Unfortunately, standard statistical model selection has been to select a single model and to ignore any selection uncertainty in the subsequent inferences.

Maria DeIorio ( University of Oxford ) and Christian P. Robert ( UniversitĂŠ Paris Dauphine )

Amidst the wide scope of possible extensions of their paper, the authors mention the case of mixtures

which is quite interesting, as it illustrates the versatility of the deviance information criterion DIC under different representations of the same model.

In this set-up, if the pjs are known, the associated completed likelihood is

L{θ∣(x1,z1),…,(xn,zn)}∝∏i=1nf(xi∣θzi)=∏j=1k∏i:zi=jf(xi∣θj).

(45)

Therefore, conditional on the latent variables z=(z1,…,zn), and setting the saturated deviance f(x) to 1, define

[DIC∣z]=∑j=1k∑i:zi=j(−4E[log{f(xi∣θj)}∣x,z}+2log{f(xi∣θ^j)}])

where θ^j=E(θj∣x,z) (under proper identifiability constraints; see Celeux et al. (2000)). The integrated DIC is then

DIC1=∑z∈Z[DIC∣z]Pr(z∣x),

where Pr(z|x) can be approximated (Casella et al., 1999).

A second possibility is the observed DIC, DIC2, based on the observed likelihood, which does not use the latent variables z. (We note the strong dependence of DIC on the choice of the saturated function f and the corresponding lack of clear guidance outside exponential families. For instance, if f(xi) goes from the marginal density to the extreme alternative where both θ1 and θ2 are set equal to xi, DIC2 goes from −31.71 to 166.6 in the following example.)

A third possibility is the full DIC, DIC3, based on the completed likelihood (45) when it incorporateśz as an additional parameter, in which case the saturated deviance could be the normal standardized deviance, although we still use f(x)=1 for comparison.

The three possibilities above lead to rather different figures, as shown by Table 6 for the simulated data set in Fig. 9; Table 6 exhibits in addition a lack of clear domination of the mixture (k=2) versus the normal distribution (k=1) (second column), except when z is set to its true value (third column) or estimated (last column). Note that, for the full DIC, pD is far from 102; this may be because, for some combinations of z, the likelihood is the same. (This also relates to the fact that z is not a parameter in the classical sense.)

Table 6

Comparison of the three different criteria DIC 1 , DIC 2 and DIC 3 for a simulated sample of 100 observations from 0.5 𝒩 (5,1.5)+0.5 𝒩 (7.5,8) with a conjugate prior θ1 ∼ 𝒩 (4,5) and θ2 ∼ 𝒩 (8, 5), and of DIC based on the true complete sample ( x , z ) and DIC for the single-component normal model (with an 𝒩 (6, 5) prior and a variance set of 6.07)

Results for the following models:
Normal
(k=1)
DIC465.1
ΔDIC—
pD0.99

Histogram of the simulated data set and true density

Fig. 9

Histogram of the simulated data set and true density

David Draper ( University of California, Santa Cruz )

The authors of this interesting paper talk about Bayesian model assessment, comparison and fit, but—if their work is to be put seriously to practical use—the real point of the paper is Bayesian model choice: we are encouraged to pick the model with the smallest deviance information criterion DIC among the class of ‘good’ models (those which are ‘adequate candidates for explaining the observations’). (It is implicit that somehow this class has been previously specified by means that are not addressed here—would the authors comment on how this set of models is to be identified in general?) However, in the case of model selection it would seem self-evident that to choose a model you have to say to what purpose the model will be put, for how else will you know whether your model is sufficiently good? We can, perhaps, use DIC to say that model 2 is better than model 1, and we can, perhaps, compare D¯ with ‘the number of free parameters in θ‘ to ‘check the overall goodness of fit’ of model 2, but we cannot use the authors’ methods to say whether model 2 is sufficiently good, because the real world definition of this concept has not been incorporated into their methods. It seems hard to escape the fact that specifying the purpose to which a model will be put demands a decision theoretic basis for model choice; thus (Draper, 1999) I am firmly in the camp of Key et al. (1999).

See Draper and Fouskakis (2000) and Fouskakis and Draper (2002) for an example from health policy that puts this approach into practice, as follows. Most attempts at variable selection in generalized linear models conduct what might be termed a benefit-only analysis, in which a subset of the available predictors is chosen solely on the basis of predictive accuracy. However, if the purpose of the modelling is to create a scale that will be used—in an environment of constrained costs, which is frequently the case—to make predictions of outcome values for future observations, then the model selection process must seek a subset of predictors which trades off predictive accuracy against data collection cost. We use stochastic optimization methods to maximize the expected utility in a decision theoretic framework in the space of all 2p possible subsets (for p of the order of 100), and because our predictors vary widely in how much they cost to collect (which will also often be true in practice) we obtain subsets which are sharply different from (and much better than) those identified by benefit-only methods for performing ‘optimal’ variable selection in regression, including DIC.

Alan E. Gelfand ( Duke University, Durham ) and Matilde Trevisani ( University of Trieste )

The authors’ generally informal approach motivates several remarks which we can only briefly develop here. First, in Section 2.1, we think that better terminology would be ‘focused on p(y|θ)‘ with ‘interest in the models for θ‘, as in, for example, the example in Section 8.1 where there is no θ in the likelihood for any of the given models. Even the example in Section 8.2, where θ does not change across models, emphasizes the focus on p(y|θ) since f(y) depends on the choice of p. So, here, a relative comparison of the models depends on the choices made for the _f_s. Without a clear prescription for f (once we leave the exponential family), the opportunity exists to fiddle the support for a model.

Though the functional form of the Bayesian deviance does not depend on p(θ), DIC and pD will. With the authors’ hierarchical specification,

p(y,θ,ψ)=p(y∣θ)p(θ∣ψ)p(ψ),

the effective degrees of freedom will depend on p(ψ). But, also, under this specification, rather than p(y|θ), we can put a different distribution, p(y|ψ), in focus. Again, it seems preferable not to speak in terms of ‘parameters in focus’.

Moreover, since p(y|θ) and p(y|ψ) have the same marginal distribution p(y), a coherent model choice criterion must provide the same value under either focus. Otherwise, a particular hierarchical specification could be given more or less support according to which distribution we focus on. But let DIC1,pD1 and f1(y) be associated with p(y|θ) and DIC2,pD2 and f2(y) with p(y|ψ). To have DIC1=DIC2 requires, after some algebra, that

ln{f2(y)}−ln{f1(y)}=pD1−pD2+E[ln{p(y∣ψ)∣y}]−E[ln{p(y∣θ)∣y}].

Just as the functional form of f1(y) depends only on the form of p(y|θ), the form for f2(y) should depend only on p(y|ψ). Evidently this is not so. For instance, under the authors’ example in expression (2), f1(y)=0. The above expression yields the non-intuitive choice

ln{f2(y)}=∑wi+12∑ln(1−wi)−λvar(ψ∣y)∑wi2−λ2∑wi2{yi−E(ψ∣y)}2

where wi=τi/(τi+λ). This issue is discussed further in Gelfand and Trevisani (2002).

Jim Hodges ( University of Minnesota, Minneapolis )

This is a most interesting paper, presenting a method of tremendous generality and, as a bonus, a fine survey of related methods. I can think of a dozen models for which I would like to see pD, but I shall ask for just one: a balanced one-way random-effects model with unknown between-group precision, in which each group has its own unknown error precision, these latter precisions being modelled as draws from, say, a common gamma distribution with unknown parameters. Thus the precisions will be shrunk as well as the means, and presumably the two kinds of shrinkage will affect each other. The focus could be either the means or the precisions, or preferably both at once.

One thing is troubling: the possibility of a negative measure of complexity (Section 2.6, comment (d)). Hodges and Sargent (2001) is linked (shackled?) to linear model theory, in which complexity is defined as the dimension of the subspace of ℜn in which the fitted values lie. In our generalization, the fitted values may be restricted to ‘using’ only part of a basis vector’s dimension, because they are stochastically constrained by higher levels of the model’s hierarchy. (Basing complexity on fitted values may remove the need to specify a focus, although, if true, this is not obvious.) In this context, zero complexity makes sense: the fitted values lie in a space of dimension 0 specified entirely by a degenerate prior. Negative complexity, however, is uninterpretable in these terms. The authors attribute negative complexity to a poor model fit, which suggests that pD describes something more than the fitted values’ complexity per se. Perhaps the authors could comment further on this.

Youngjo Lee ( Seoul National University )

It is very interesting to see the Bayesian view of Section 4.2 of Lee and Nelder (1996), which used extended or h-likelihood and in which we introduced various test statistics. For a lack of fit of the model we proposed using the scaled deviance

Dr=−2(log{p(y∣θ˜t)}−log[p{y∣μ(θ)=y}])

with degrees of freedom E(Dr), estimated by n−tr(−Lθ˜′′V) where −Lθ˜′′=V* as in Sections 4.3 and 5.4 of this paper. We considered a wider class of models, which we called hierarchical generalized linear models (HGLMs) (see also Lee and Nelder (2001a, b)), but some of our proofs hold more widely than this, so that, for example, Section 3.1 of this paper is summarized in our Appendix D, etc. For model complexity the authors define in equation (9) the scaled deviance

Dm=−2[log{p(y∣θ)}−log{p(y∣θ˜t)}].

Dr and Dm are the scaled deviances for the residual and model respectively, whose degrees of freedom add up to the sample size n. We are very glad that the authors have pointed out the importance of the parameterization of θ in forming deviances. We extended the canonical parameters of Section 5 to arbitrary links by defining the h-likelihood on a particular scale of the random parameters, namely one in which they occur linearly in the linear predictor. In HGLMs the degrees of freedom for fixed effects are integers whereas those for random effects are fractions. Thus, a GLM has integer degrees of freedom pm=rank(X) because C2−1δ is 0 in Section 5, whereas the estimated degrees of freedom of Dm in HGLMs are fractions. Lee and Nelder (1996) introduced the adjusted profile h-likelihood eliminating θ, and this can be used to test various structures of the dispersion parameters λ discussed in the examples of Section 8: see the model checking plots for the lip cancer data in Lee and Nelder (2001b). Lee and Nelder (2001a) justified the simultaneous elimination of fixed and random nuisance parameters. It will be interesting to have the Bayesian view of the adjusted profile h-likelihood.

Xavier de Luna ( UmeĂĽ University )

This interesting paper presents Bayesian measures of model complexity and fit which are useful at different stages of a data analysis. My comments will focus on their use for model selection. In this respect, one of the noticeable contributions of the paper is to propose a Bayesian analogue, the deviance information criterion DIC, to the Akaike information criterion AIC and TIC. Both DIC and TIC are generalizations of AIC. The former may be useful in a Bayesian data analysis, whereas the frequentist criterion TIC has the advantage of not requiring the ‘good model’ assumption discussed by the authors.

Such ‘information-based’ criteria use measures of model complexity (denoted p or pD in the paper). It should, however, be emphasized that models can be compared without having to define and compute their complexity. Instead, out-of-sample validation methods, such as cross-validation (Stone, 1974) or prequential tests (Dawid, 1984) can be used in wide generality. Moreover, to use an estimate of p in a model selection criterion, some characteristics of the data-generating mechanism (DGM)—‘true model’ in the paper—must be known. For instance, depending on the DGM either AIC-type or Bayes information type criteria are asymptotically optimal (see Shao (1997) for a formal treatment of linear models). Thus, when little is known about the DGM, out-of-sample validation provides a formal and general framework to perform model selection as was presented in de Luna and Skouras (2003), in which accumulated prediction errors (defined with a loss function chosen in accordance with the purpose of the data analysis) were advocated to compare and choose between different model selection strategies. When many models are under scrutiny, out-of-sample validation may be computationally prohibitive and generally yields high variability in the selection of a model. In such cases, different model selection strategies based on p* (making—implicitly or explicitly—diverse DGM assumptions) can be applied to reduce the dimension of the selection problem. Accumulated prediction errors can then be used to identify the best strategy while making very few assumptions on the DGM.

Xiao-Li Meng ( Harvard University, Cambridge, and University of Chicago )

The summary made me smile, for the ‘mean of the deviance − deviance of the mean’ theme once injected a small dose of excitement into my student life. I was rather intrigued by the ‘cuteness’ of expressions (3.4) and (3.8) of Meng and Rubin (1992), and seeing a Bayesian analogue of our likelihood ratio version certainly brought back fond memories. My excitement back then was short lived as I quickly realized that all I of a well-known variance formula. Let D(x,μ)=(x_−_μ)2 be the deviance, a case of realized discrepancy of Gelman et al. (1996); then

1n∑i=1n(xi−x¯)2=D(xi,μ)¯−D(x¯,μ). 

(46)

Although equation (46) is typically mentioned (with Îź set to 0) for computational convenience, it is the back-bone of the theme under quadratic or normal approximations, or more generally with log-concave likelihoods, beyond which assumptions become much harder to justify or derive. (Obviously, equation (46) is applicable for posterior or likelihood averaging by switching x and Îź.)

Section 1 contained a small puzzle. I wondered why Ye (1998) was omitted from the list of ‘the most ambitious attempts’, because Ye’s ‘data derivative’ perspective goes far beyond the independent normal model cited in Section 4.2 (for example, it addresses data mining). It also provides a more original and insightful justification than normal approximations, especially considering that Markov chain Monte Carlo sampling is most needed in cases where such approximations are deemed unacceptable.

Section 2.1 presented a bigger puzzle. The authors undoubtedly would agree that a statement like ‘In hierarchical modelling we cannot uniquely define a ‘‘posterior’’ or ‘‘model complexity’’ without specifying the level of the hierarchy that is the focus of the modelling exercise’ is tautological. Surely the ‘posterior’ and thus the corresponding ‘model complexity’ depend on the level or parameter(s) of interest. So why does the statement become a meaningful motivation when the word posterior is replaced by ‘likelihood’? There is even some irony here, because hierarchical models are models where there are unambiguous and uncontroversial marginal likelihoods—both L(θ|y)=p(y|θ) and L(φ|y)=p(y|φ) in Section 2.1 are likelihoods in the original sense.

Although limitations on space prevent me from describing my reactions when reading the rest, I do wish that DIC would stick out in the dazzling AIC—TIC alphabet contest, so we would all be less compelled to look for UIC (unified or useful information criterion?) ….

The authors replied later, in writing, as follows.

We thank all the contributors for their wide-ranging and provocative discussion. Our reply is organized according to a number of recurring themes, but constraints on space mean that it is impossible to address all the points raised. Echoing Brooks’s opening remarks, our hope is that discussants and readers will be sufficiently inspired to pursue the ideas proposed in this paper and to address some of the unresolved issues highlighted in the discussion.

Model focus and definition of deviance

Our notion of the ‘focus’ of a model and its relationship to the prediction problem of interest provoked some controversy. The crucial role of the model focus is to define the (parameterization of the) likelihood, and we appreciate Gelfand and Trevisani’s suggestion of the term ‘focus on p(y|θ)‘, with interest in the structure of θ, rather than models ‘focused on θ‘. In all our examples the likelihood has been taken to be p(y|θ) (using the notation of Section 2.1) leading to models with a closed form likelihood but an unknown number of effective parameters that we propose to estimate by pD. However, as Brooks points out, if the focus is on p(y|ψ) (i.e. integrating over the random effects θ), then in general the likelihood will no longer be available in closed form, and other methods must be sought to evaluate p(y|ψ): in this circumstance the number of parameters will be the dimension of ψ or less, depending on the strength of the prior information on ψ.

Smith and others ask how the model focus should be chosen in practice. We argue that the focus is operationalized by the prediction problem of interest. For example, if the random effects θ in a hierarchical model relate to observation units such as schools or hospitals or geographical areas, where we might reasonably want to make future predictions for those same units, then taking p(y|θ) as the focus is sensible. The prediction problem is then to predict a new Yi,rep conditional on the posterior estimate of θi for that unit. However, if the random effects relate to individual people, say, then we are often interested in population-average inference rather than subject-specific inference, so we may want to predict responses for a new or ‘typical’ individual rather than an individual who is already in the data set. In this case, it is appropriate to integrate over the _θ_s and to predict Yrep for a new individual conditional on ψ, leading to a model focused on p(y|ψ). A crucial insight is that a predictive probability statement such as p(Yrep|y) is not uniquely defined without specifying the level of the hierarchy that is kept fixed in the prediction—this defines the focus of the model. In summary, we feel that the issue of focus with respect to predictive model assessment and selection is an issue in hierarchical modelling and not specifically Bayesian.

When the forms of the likelihoods differ between models being compared, it is clearly vital to be careful that any standardizing terms that are used in the deviance are common. As observed by Smith, a comparison of models with focus at different levels of the hierarchy may not be meaningful as they correspond to different prediction problems.

Features of pD

Several discussants questioned the definition or performance of pD. As to the definition we maintain our claim (in spite of Dawid’s comment) that it is in our models that there is a genuine Bayesian interest in quantifying the interaction between Y and Θ in probabilistic terms. One can indeed often think of pD in terms of dimensionality as Hodges suggests, but in general we prefer to think of it as a feature of the joint distribution of Y and Θ. This frees it from the shackles imposed by normal linear model theory. Such a measure of interaction or model complexity may, for example, be used to reparameterize hyperparameters ψ to facilitate an intuitively interpretable specification of model priors on ψ (Holmes and Denison, 1999). Still, as suggested by Brooks, pD may turn out to be only a step towards a (better) definition of model complexity such as that suggested by Plummer: we feel that the quantity that he proposes is intuitively intriguing and that it may be particularly appropriate in exponential families, but we wonder about its general validation and justification.

Our uncertainty about whether to recommend pD as a definition or as an estimate of a quantity still to be defined makes it difficult to judge proposals for an ‘improvement’. For example, using an invariant estimator such as that proposed by Robert and Titterington or Bernardo instead of θ¯ is tempting as part of a definition, but it takes into account only one feature of pD while destroying others such as the trace approximation. Similarly the occurrence of a negative value of pD, typically observed if the model fits poorly, might resemble a negative estimate for a positive parameter. We take a pragmatic point of view and look forward to theoretical progress that provides insight into why pD generally appears to work well. Green provides a valuable insight into the interpretation of pD in the normal case, using an attractive decomposition of the total predictive variance of the observables.

Replying to those discussants who were concerned about observing pD<n under ‘flat’ priors, we re-emphasize that pD = n was obtained theoretically only in the normal case or under normal approximations. There is no proof that pD=n for general distributions. In the case of Brooks’s illustration using the Scottish lip cancer data, in which he shows that pD appears to ‘lose’ two or three (modulo Monte Carlo error) parameters under such priors, we point out that two of the 56 observations in this data set are 0 with small expected values and so contribute negligibly to the Poisson deviance. We have replicated his analysis replacing these two observations by non-zero counts, and we found that pD increases by about 2 to around 55.5.

We certainly do not recommend the unthinking use of default priors, a concern of Smith and Bernardo: on the contrary, one of our main aims is to demonstrate how an informative prior reduces model complexity. Typically a large number of parameters p relative to a small sample size n is compensated by using an informative prior, and the deviance information criterion DIC and pD adjust accordingly without any need for additional adjustment for small sample size (see Burnham, and Lawson and Clark’s comment on the example in Section 8.1).

There is evidence (Daniels and Kass, 1999, 2001) that, in the absence of missing data, the use of default priors for variance components typically has little effect on the posteriors for the main effects in a model. Still, Smith and Bernardo observe that the flat priors that may maximize pD are not necessarily weakly informative, and we agree. Reference priors that are least informative in an information theoretical sense can be easily studied in some of our examples. For example, Fig. 1 displays the performance of the beta(12,12) reference prior (corresponding to a prior sample size of ni=a+b=1) for the binomial likelihood, and the approximation (31) indicates that pDiΘ based on the reference prior is greater than pDiΘ based on the uniform beta(1, 1) prior (which has prior sample size ni=2). Similarly for a Poisson likelihood the reference prior π(μi)∝√_μ_i yields a Γ(yi+12,ni) posterior distribution corresponding to a=12,b→0⁠. Hence pDiμ≈yi/(yi+12) and pDiΘ≈ni/ni=1 might be compared with the values shown in Fig. 2.

Properties of DIC

Another main part of the discussion focused on the properties and performance of DIC. Plummer doubted the usefulness of the expected loss that DIC approximates, but he has included a standardizing constant in the loss function which should not be present (we have made this clearer in the paper). The expected loss in the (independent) normal linear case is then p+pD+n log (2_π__σ_2): this says that when comparing ‘good’ models with the same σ2s the expected loss is minimized with a degenerate prior in which no parameters are estimated. This seems entirely reasonable, as all the models have equivalent fit, and so distinction is based on complexity alone. Of course in practice either σ2 will be estimated or σ2 will vary between models, and hence the appropriate trade-off between fit and complexity will naturally arise. A practical aspect, related to the need for ‘good’ models in the derivation of DIC, is that the term ℒ2 ignored by DIC will tend to be negative with poorly fitting models and hence to inflate DIC: the approximation of DIC to expected loss will thus tend automatically to penalize models that are not ‘good’.

Though we agree with Brooks that owing to its heuristic derivation DIC may be considered as a ‘broad brush technique’, we do not regard it to be as arbitrary as the alternatives that he suggests. In particular we do not feel that terms of ‘fit’ and ‘complexity’ can be arbitrarily combined, but we re-emphasize that a measure of model complexity results from correcting overfit due to an approximation of the expected loss that ‘uses the observations twice’. Similarly we would like to see a justification of Vehtari’s estimates of expected utilities as valid approximations generalizing DIC.

Bernardo asks for the application of DIC to nested models and hypothesis testing, in particular the occurrence of Lindley’s paradox. This is an interesting question partially answered by the example discussed in Section 8.1 where some of the competing models are nested. The key point is that DIC is designed to take into account priors that are concentrated on parameters which are specified in a model, thus effectively assigning prior probability 0 to hypothetically omitted parameters (if there are remaining parameters). Let us consider Lindley’s paradox in the following version: when comparing using the Bayes factor X¯N(μ0,σ2/n) with X¯N(μ,σ2/n) where μ_∼_N(μ1,τ2), evidence in favour of H0:μ=μ0 becomes overwhelming as τ2→ ∞ even if x¯ would cause the rejection of H0 at any arbitrary significance level. If σ2 is known μ is the only parameter in the model. To apply DIC we compare the model X¯~N(μ,σ2/n) with prior μ_∼_N(μ0,τ2),τ2→0, corresponding to H0 with the model with the same likelihood but prior μ_∼_N(μ1,τ2),τ2→ ∞ . Then D(μ)=n(x¯−μ)2/σ2,D(μ)¯=(n/σ2){D(μ¯)+var(μ∣x¯)} and pD=n/σ2var(μ∣x¯)⁠. For τ2→0, pD→0,μ¯→μ0 and DIC→_D_(μ0). Similarly, for τ2→ ∞ , pD→1,μ¯→x¯ and DIC→D(x¯)+2=2⁠. Hence the model with the flat prior—the ‘alternative hypothesis’—is favoured if D(μ0)>2 or |√n(x¯−μ0)/σ|>1.414 which corresponds to a rejection of H0 at a significance level _α_≈0.16—exactly the behaviour of the Akaike information criterion. Thus Lindley’s paradox is not observed. Similarly Sahu contrasts the prior concentrated on μ0=0 with an informative prior N(0,τ2) which is centered at μ0, also. Thus it is reasonable to reject H0 using DIC if the data are suitably compatible with the ‘alternative’ prior. However, we do not accept an assessment of DIC that uses Bayes factors as a ‘gold standard’, since they are dealing with different prediction problems (see below).

Several discussants (Brooks, Bernardo, Burnham and Smith) were concerned with the lack of calibration of DIC. However, unlike the Bayesian reference criterion (Bernardo, 1999), which is based on a Kullback–Leibler distance and therefore a relative measure, DIC is an approximation to an absolute expected loss, and we cannot calibrate it (externally). Correspondingly, ‘coherence’ of model choice cannot be required in terms of equal DIC-values as Gelfand and Trevisani or Smith claim but can only be discussed in terms of model ranking by DIC. Note, by the way, that Plummer’s alternative measure of model complexity, as well as our pD, are defined relatively, indicating that these measures might be calibrated.

Finally, we certainly do not claim that applying DIC is an exhaustive tool for model assessment. Although we feel that our Fig. 4 is a step in the right direction, additional techniques such as those discussed by Nelder and Atkinson are certainly needed for refined analyses.

Applications

There were various comments on the interpretation of pD in the Scottish lip cancer analysis (Lawson and Clark, and Richardson) and in mixture models (Richardson, and DeIorio and Robert). Here we tend to think of pD as the estimable dimension of the parameter space or, alternatively, as the size of the parameter space that is identifiable by the data. We repeat that the spatial model 3 in the lip cancer example (Section 8.1) provides stronger prior information than the exchangeable model 2 leading to a smaller pD. Only the sum of the spatial and exchangeable random effects is uniquely identifiable in model 4 and so pD remains virtually unchanged compared with the spatial-only model 3, thus justifying the lack of an additional ‘penalty’ for the apparently more complex model. The same is true for mixture models, where increasing the number of components does not necessarily increase the identifiable parameter space. We do appreciate the discussion of DIC in mixture models introduced by DeIorio and Robert, and by Richardson (though Richardson does not appear to have calculated DIC as we have defined it, but a different criterion based on predictive deviances). DeIorio and Robert’s example nicely illustrates a range of possibilities for defining DIC in this case, although we re-emphasize that a comparison of models with different focus (e.g. their DIC2versus DIC3) may not be meaningful, and we further note that their integrated DIC (DIC1) does not correspond to our definition of DIC.

In response to Lawson and Clark’s query about averaging ‘location’ parameters, we point to Green’s comment concerning the calculation of pD and DIC for models with discrete parameters, and his suggestion that marginal posterior modes could be used for θ¯ in this case.

We thank Nelder and Atkinson for their refinements to the analysis of the stack loss data (Section 8.2). We disagree with Smith that our models 4 and 5 for these data are predictively identical since, as already discussed, the prediction problem addressed by model 4 integrates over the random effects and corresponds to predicting stack loss for a new chimney, whereas model 5 conditions on the random effects and corresponds to predicting future stack loss for the 21 chimneys in the data set.

Alternatives to DIC

Several discussants (Brooks, Dawid and Sahu) feel that DIC suffers in comparison with more traditional Bayesian model selection criteria based on posterior model probabilities and Bayes factors. Here we can only repeat that our deliberate intention was to offer an alternative to Bayes factors, which are most suitable when the entire collection of candidate models can be specified ahead of time (the ‘ℳ closed’ case of Bernardo and Smith (1994)). In our practical experience, the model-building, criticism and rebuilding process is typically an iterative ‘ℳ open’ one in which the ultimate model collection is rarely known ahead of time, and here DIC may emerge as more appropriate. Moreover, Bayes factors address how well the prior has predicted the observed data; this prior predictive emphasis ultimately leads to the Lindley paradox. DIC instead addresses how well the posterior might predict future data generated by the same mechanism that gave rise to the observed data; this posterior predictive outlook might be considered intuitively more appealing in many practical contexts. We emphasize that these techniques are intended to answer different questions and cannot be expected to give the same conclusions: in any case, posterior model probabilities may be highly dependent on within- and between-model priors, so their comparison with DIC is not straightforward. On a related point, several discussants (Brooks, Burnham and Draper) mention the possible alternative of model averaging. We do not, however, see any justification for transforming DIC-values to relative probabilities, and in any case the prior on the model space may be difficult to develop, and might even reasonably be related to model complexity!

Dawid wishes for a better definition of p log (n) (instead of just p) for use in the Bayesian information criterion (BIC) but previous work has shown that many such definitions are justifiable asymptotically (e.g. Volinsky and Raftery (2000)), so this line of research does not appear promising. Regarding the suggestion by Lawson and Clark of using pÂŻlog(n) as a penalty for the BIC, this of course assumes that the number of parameters p is a suitable measure of model complexity. But most spatial models of the type that they refer to will involve random effects, where such use of the raw parameter count p would be inappropriate; indeed, this is precisely the situation that pD was designed to address.

Vehtari and de Luna argue persuasively on behalf of cross-validation as an alternative to our posterior predictive approach that avoids a definition of complexity. Whereas no knowledge of the data-generating mechanism is required for cross-validation, the data-generating mechanism is necessary in a fully Bayesian analysis. Still, cross-validation as an alternative estimation method was also used to estimate model complexity by Efron (1986). We certainly acknowledge the potential of this approach, particularly in comparisons of different model selection strategies. We agree with Stone concerning further investigation of model assessment procedures in which the model is not assumed to be correct, and we refer to Konishi and Kitagawa (1996) (whose GIC adds yet further to the alphabet).

In conclusion, it is clear that several of the discussants feel that our pragmatic aims are muddying otherwise pure Bayesian waters. We feel, however, that the huge increase in the use of Bayesian methods in complex practical problems means that full elicitation of informative priors and utilities is simply not feasible in most situations, and that reasonably simple and robust methods for prior specification, model criticism and model comparison are necessary. We hope that we have made a positive contribution to the final concern.

Acknowledgements

We are very grateful for the generous discussion and criticism of the participants in the programme on neural networks and machine learning that was held at the Isaac Newton Institute for Mathematical Sciences in 1997, and to Andrew Thomas for so quickly implementing our changing ideas into WinBUGS. NGB received partial support from Medical Research Council grant G9803841, and BPC received partial support from National Institute of Allergy and Infectious Diseases grant 1-R01-AI41966.