Bayesian inference for cases where the data generating process allows for a conjugate setup. Conjugacy in Bayesian statistics is the scenario in which the prior and posterior distributions belong to the same family, for example the Beta in the case of binary outcome data (Binomial likelihood) or the Gamma in the case of count data (Poisson likelihood).
Bayes’ theorem provides an optimal means of updating our beliefs in the light of new evidence. Following Laplace’s formalisation[^1] we have

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),$$

where we can think of $\theta$ as representing a parameter of interest, for example the probability parameter in a binomial or geometric model, the rate parameter in a Poisson model or the mean parameter in a normal (Gaussian) model.
In the case of all of the above sampling models, when we are inferring on one parameter that defines our data generating process[^2], we have recourse to convenient conjugate priors and simple closed-form expressions for the posterior hyperparameters.
A family of prior belief distributions on the parameter of interest is conjugate to the sampling model (likelihood) when the posterior it yields, once combined with the sampling model via Bayes’ theorem, belongs to that same family.
Conjugate priors lead to expedient and computationally trivial updating rules to compute the posterior beliefs given our prior and a sample of data, for example from an experiment, and allow for sequential updates plugging in the posterior from the last update as the prior for the next.
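To make the sequential-updating point concrete, here is the identity written out for two conditionally independent batches of data, $y_1$ and $y_2$ (standard algebra, not specific to any one model):

$$p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta)\, p(y_1 \mid \theta)\, p(\theta) \propto p(y_2 \mid \theta)\, p(\theta \mid y_1),$$

so updating on $y_1$ first and then using $p(\theta \mid y_1)$ as the prior for $y_2$ yields the same posterior as updating on both batches at once.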
We are, however, not limited to the single-parameter case. We are also able to make inference jointly on a vector of parameters and still have conjugate priors. In particular, we’ll look at this for the mean and variance, $\theta$ and $\sigma^2$, of a normal sampling model. Whilst still having a conjugate prior, it will be composed of multiple components as we will decompose the joint model into the product of conditional and marginal distributions, $p(\theta, \sigma^2) = p(\theta \mid \sigma^2)\, p(\sigma^2)$.
So what follows is a look at some of these common conjugate prior-sampling model combinations with proofs to make the arguments more grounded and some code to allow you to acquaint yourself with the numbers and functional forms by experimenting with them yourself. We’ll start with the Beta-Binomial, look at the Gamma-Poisson and finish with the Normal-Normal and Normal-Normal-Inverse Gamma combinations of Bayesian conjugate pairs.
Beta-Binomial
The Beta distribution[^3] is a family of continuous probability distributions supported on the interval $[0, 1]$ and is parametrised by two strictly positive shape parameters, $\alpha$ and $\beta$. Its probability density function (PDF) is defined as follows

$$p(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \quad \theta \in [0, 1],$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function and constitutes a normalising constant with components given by the Gamma function.
The Beta distribution can be generalised for multiple variables into the Dirichlet distribution[^4], i.e. to model multiple probabilities. In other words, the Beta distribution is the special case of the Dirichlet distribution for two categories, i.e. a single free probability.
In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the:
- Bernoulli
- binomial
- negative binomial; and
- geometric
distributions.
The beta distribution is a suitable model for the random behavior of probabilities, proportions or percentages.
Eliciting the Beta as the Conjugate Prior of the Binomial Sampling Model (Proof)
We can prove the conjugacy of the Beta prior in the case of a binomial sampling model as follows. Suppose $\theta \sim \text{Beta}(\alpha, \beta)$ and $Y \mid \theta \sim \text{Binomial}(n, \theta)$. Then

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \theta^{y} (1 - \theta)^{n - y} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{\alpha + y - 1} (1 - \theta)^{\beta + n - y - 1},$$

which is the kernel of a $\text{Beta}(\alpha + y,\, \beta + n - y)$ density, so the posterior belongs to the same family as the prior.

Overall, we have the prior-posterior relationship that

$$\theta \sim \text{Beta}(\alpha, \beta), \quad Y \mid \theta \sim \text{Binomial}(n, \theta) \;\Rightarrow\; \theta \mid Y = y \sim \text{Beta}(\alpha + y,\, \beta + n - y).$$
In Practice (Code)
Section forthcoming
We simulate values, $y^{(s)}$, of the random variable, $Y$, which is our outcome of interest and whose realisations are conditionally independent given $\theta$. If we simulate (sufficiently many) values, $y^{(s)}$, using draws of the parameter, $\theta^{(s)}$, we effectively sample from the marginal $p(y)$ (sampling across the distribution of $\theta$ and producing $y$ given samples of $\theta$).
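Until the full code section lands, here is a minimal R sketch of the conjugate update and of that marginal simulation. The prior hyperparameters and the data values are assumptions chosen purely for illustration:

```r
# Minimal sketch (assumed values; the post's own code is forthcoming):
# Beta prior -> Binomial data -> Beta posterior, plus simulation from the
# marginal (prior predictive) distribution of Y.

set.seed(1)

# Prior Beta(alpha, beta) on the probability theta (assumed hyperparameters)
alpha <- 2; beta <- 2

# Observed data: y successes out of n trials (hypothetical example values)
n <- 20; y <- 14

# Conjugate update: theta | y ~ Beta(alpha + y, beta + n - y)
alpha_post <- alpha + y
beta_post  <- beta + n - y
c(posterior_mean = alpha_post / (alpha_post + beta_post))

# Sampling from the marginal p(y): draw theta from the prior, then y | theta
theta_sim <- rbeta(10000, alpha, beta)
y_sim     <- rbinom(10000, size = n, prob = theta_sim)
mean(y_sim)   # compare with n * alpha / (alpha + beta), the marginal mean
```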
Gamma-Poisson
The Gamma distribution is a two-parameter family of continuous probability distributions supported on the positive real line, i.e. on the interval $(0, \infty)$. It subsumes the exponential, Erlang and chi-squared distributions. Its PDF is defined as follows

$$p(\theta \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, \theta^{a - 1} e^{-b\theta}, \quad \theta > 0.$$

The above definition makes use of the parametrisation according to the two strictly positive shape and rate (or inverse scale) parameters, $a$ and $b$.
The gamma is the Bayesian conjugate prior for the
- Poisson
- exponential
- normal (with known mean); and
- Pareto
distributions, amongst others.
Eliciting the Gamma as the Conjugate Prior of the Poisson Sampling Model (Proof)
We can show the conjugacy of the Gamma distribution as a prior in the case of a Poisson sampling model, i.e. when the random variable that we are collecting data on is distributed according to a Poisson process, for example in the case of births, road accidents or men kicked to death by horses among ten Prussian army corps over 20 years.
We’ll go about proving conjugacy in the same vein as before by computing the posterior distribution of $\theta$, or rather $p(\theta \mid y_1, \dots, y_n)$ to within some proportionality constant, that is to say by finding its shape.

Suppose we have some data $y_1, \dots, y_n \mid \theta \overset{iid}{\sim} \text{Poisson}(\theta)$ and impose our prior $\theta \sim \text{Gamma}(a, b)$.
We can find an expression for the posterior to within a scaling constant, $c$, by straightforwardly collecting algebraic terms after substituting in the relevant expressions for the Poisson and Gamma PDFs, as follows

$$p(\theta \mid y_1, \dots, y_n) \propto p(\theta)\, p(y_1, \dots, y_n \mid \theta) \propto \theta^{a - 1} e^{-b\theta} \cdot \theta^{\sum_{i=1}^n y_i} e^{-n\theta} = \theta^{a + \sum_{i=1}^n y_i - 1}\, e^{-(b + n)\theta}.$$

To within a scaling constant, $c$, the final expression is a Gamma distribution with new shape and rate parameters, $a + \sum_{i=1}^n y_i$ and $b + n$.

So we have shown the conjugacy of the Gamma distribution family for the Poisson sampling model and, overall, we have the prior-posterior relationship

$$\theta \sim \text{Gamma}(a, b), \quad Y_1, \dots, Y_n \mid \theta \overset{iid}{\sim} \text{Poisson}(\theta) \;\Rightarrow\; \theta \mid y_1, \dots, y_n \sim \text{Gamma}\!\left(a + \textstyle\sum_{i=1}^n y_i,\; b + n\right).$$
Estimation and prediction proceed in a manner similar to that in the binomial model. The posterior expectation of $\theta$ is a linear (convex) combination of the prior expectation and the sample average:

$$E[\theta \mid y_1, \dots, y_n] = \frac{a + \sum_{i=1}^n y_i}{b + n} = \frac{b}{b + n}\, \frac{a}{b} + \frac{n}{b + n}\, \frac{\sum_{i=1}^n y_i}{n}.$$

$b$ is interpreted as the number of prior observations; $a$ is interpreted as the sum of counts from those $b$ prior observations.

For large $n$, the information from the data dominates the prior information:

$$n \gg b \;\Rightarrow\; E[\theta \mid y_1, \dots, y_n] \approx \bar{y}, \qquad \text{Var}[\theta \mid y_1, \dots, y_n] \approx \frac{\bar{y}}{n}.$$
Predictions about additional data can be obtained with the posterior predictive distribution:

$$
\begin{aligned}
p(\tilde{y} \mid y_1, \dots, y_n) &= \int_0^\infty p(\tilde{y} \mid \theta)\, p(\theta \mid y_1, \dots, y_n)\, d\theta \\
&= \int_0^\infty \frac{\theta^{\tilde{y}} e^{-\theta}}{\tilde{y}!} \cdot \frac{(b + n)^{a + \sum y_i}}{\Gamma(a + \sum y_i)}\, \theta^{a + \sum y_i - 1} e^{-(b + n)\theta}\, d\theta \\
&= \frac{(b + n)^{a + \sum y_i}}{\Gamma(\tilde{y} + 1)\, \Gamma(a + \sum y_i)} \int_0^\infty \theta^{a + \sum y_i + \tilde{y} - 1}\, e^{-(b + n + 1)\theta}\, d\theta.
\end{aligned}
$$
Evaluation of this complicated integral looks daunting, but it turns out that it can be done without any additional calculus. Let’s use what we know about the gamma density, namely that it integrates to one:

$$\int_0^\infty \frac{b^{a}}{\Gamma(a)}\, \theta^{a - 1} e^{-b\theta}\, d\theta = 1$$

for any values $a, b > 0$.
This means that

$$\int_0^\infty \theta^{a - 1} e^{-b\theta}\, d\theta = \frac{\Gamma(a)}{b^{a}}$$

for any values $a, b > 0$.
Now substitute in $a + \sum_{i=1}^n y_i + \tilde{y}$ instead of $a$ and $b + n + 1$ instead of $b$ to get

$$\int_0^\infty \theta^{a + \sum y_i + \tilde{y} - 1}\, e^{-(b + n + 1)\theta}\, d\theta = \frac{\Gamma\!\left(a + \sum y_i + \tilde{y}\right)}{(b + n + 1)^{a + \sum y_i + \tilde{y}}}.$$
After simplifying some of the algebra, this gives

$$p(\tilde{y} \mid y_1, \dots, y_n) = \frac{\Gamma\!\left(a + \sum y_i + \tilde{y}\right)}{\Gamma(\tilde{y} + 1)\, \Gamma\!\left(a + \sum y_i\right)} \left(\frac{b + n}{b + n + 1}\right)^{a + \sum y_i} \left(\frac{1}{b + n + 1}\right)^{\tilde{y}}$$

for $\tilde{y} \in \{0, 1, 2, \dots\}$, which is a negative binomial distribution.
In Practice (Code)
Section forthcoming
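As a stopgap for the forthcoming code, here is a minimal R sketch of the Gamma-Poisson update and of the negative binomial posterior predictive derived above. The hyperparameters and counts are made-up illustrative values:

```r
# Minimal sketch (assumed values; the post's own code is forthcoming):
# Gamma prior -> Poisson counts -> Gamma posterior, and the posterior
# predictive for a new count via the negative binomial.

set.seed(7)

# Prior Gamma(a, b) on the rate theta (assumed hyperparameters)
a <- 2; b <- 1

# Observed counts (hypothetical example values)
y <- c(3, 1, 4, 2, 0, 3)
n <- length(y)

# Conjugate update: theta | y ~ Gamma(a + sum(y), b + n)
a_post <- a + sum(y)
b_post <- b + n
c(posterior_mean = a_post / b_post)

# Posterior predictive: negative binomial with size = a_post and
# prob = b_post / (b_post + 1), matching the expression derived above
y_tilde <- rnbinom(10000, size = a_post, prob = b_post / (b_post + 1))
mean(y_tilde)   # approximately a_post / b_post
```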
The Normal Model
Theory (Math)
The Normal, or Gaussian, distribution is a two-parameter family of continuous probability distributions supported on the real line. Its PDF is defined as follows

$$p(y \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(y - \theta)^2}{2\sigma^2} \right\}, \quad y \in \mathbb{R},$$

where the parameter $\theta$ is the mean as well as the median and mode of the distribution, and the parameter $\sigma^2$ is its variance.
The Normal family of distributions is important given its ubiquity. This arises from the fact that the mean of many observations of a random process (the so-called sample mean) converges to a normal distribution when the number of samples (the sample size) becomes large enough, a result known as the Central Limit Theorem[^5]. For this reason many quantities are approximately normally distributed since they result from, i.e. constitute the sum of, many independent processes. An example would be biological phenotypes like birds’ wing lengths, which are determined by the contributions of large numbers of proteins, themselves encoded by a myriad of gene variants (alleles).
Bayesian inference for the Normal sampling model is more complicated than for the previous two examples because it is a two-parameter model. We’ll approach inference in the Bayesian regime by splitting the problem into three cases; in the first two, we assume we know[^6] one of the two parameters of the sampling model:
- Known or fixed sampling variance, $\sigma^2$, to perform inference on the population mean, $\theta$
- Known or fixed sampling mean, $\theta$, to perform inference on the population variance, $\sigma^2$
We can show the conjugacy of the Normal prior to the Normal sampling model (i.e. likelihood)

$$Y_1, \dots, Y_n \mid \theta \overset{iid}{\sim} \mathcal{N}(\theta, \sigma^2), \qquad \theta \sim \mathcal{N}(\mu_0, \tau_0^2),$$

where $\mu_0$ and $\tau_0^2$ denote the prior mean and prior variance of $\theta$.

Now let’s see if $p(\theta \mid y_1, \dots, y_n)$ takes the form of a normal density

$$p(\theta \mid y_1, \dots, y_n) \propto p(\theta)\, p(y_1, \dots, y_n \mid \theta) \propto \exp\!\left\{-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right\} \exp\!\left\{-\frac{\sum_{i=1}^n (y_i - \theta)^2}{2\sigma^2}\right\}.$$

The exponent is a quadratic function of $\theta$, so the posterior is itself a normal density, confirming conjugacy.
First Case: $\theta$ unknown; $\sigma^2$ known / fixed
Inference on the Normal is easier if we assume that we know or fix the sampling variance, $\sigma^2$, leaving only the mean, $\theta$, to infer.
Analogy: We can think of this by analogy to the earlier conjugate models, interpreting the prior on $\theta$ as carrying the information of $\kappa_0$ prior observations with sample mean $\mu_0$, so that the prior variance can be written $\tau_0^2 = \sigma^2 / \kappa_0$.
The predictive distribution has more uncertainty than the data distribution.
Reparametrisation: We use the precision, the reciprocal of the variance, $1/\sigma^2$ (and $1/\tau_0^2$ for the prior), in terms of which the updating rule below is simply additive.
If $\theta \sim \mathcal{N}(\mu_0, \tau_0^2)$ and $Y_1, \dots, Y_n \mid \theta \overset{iid}{\sim} \mathcal{N}(\theta, \sigma^2)$,

then $\theta \mid y_1, \dots, y_n \sim \mathcal{N}(\mu_n, \tau_n^2)$, where

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \qquad \text{and} \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \bar{y}}{\sigma^2} \right).$$
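A minimal R sketch of this updating rule follows; all numbers are assumed illustrative values rather than anything from the post:

```r
# Known/fixed sampling variance sigma^2; Normal prior on theta (assumed values)
sigma2 <- 4              # fixed sampling variance
mu0 <- 0; tau02 <- 100   # prior mean and prior variance for theta

y <- c(9.1, 10.4, 8.7, 11.2, 10.0)   # hypothetical data
n <- length(y); ybar <- mean(y)

# Posterior precision is the sum of the prior and data precisions
taun2 <- 1 / (1 / tau02 + n / sigma2)
mun   <- taun2 * (mu0 / tau02 + n * ybar / sigma2)
c(posterior_mean = mun, posterior_variance = taun2)
```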
Second Case: $\theta$ known / fixed; $\sigma^2$ unknown
Third Case (General Case): Both $\theta$ and $\sigma^2$ unknown
Relationship between Gamma and Inverse Gamma Distributions
To derive the inverse gamma distribution from the gamma, we use the change of variables method, also known as the transformation theorem for random variables.
Parameterizations
There are at least a couple of common parameterizations of the gamma distribution. For our purposes, a gamma distribution has density

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}$$

for $x > 0$. With this parameterization, a gamma distribution has mean $\alpha/\beta$ and variance $\alpha/\beta^2$.

Define the inverse gamma (IG) distribution to have the density

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha - 1} e^{-\beta/x}$$

for $x > 0$.
Relation to the gamma distribution
With the above parameterizations, if $X$ has a gamma$(\alpha, \beta)$ distribution then $Y = 1/X$ has an IG$(\alpha, \beta)$ distribution. To see this, apply the transformation theorem.
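For concreteness, the change of variables can be written out as follows (a standard calculation sketched here, not quoted from Cook’s note): with $Y = 1/X$, $x = 1/y$ and $|dx/dy| = 1/y^2$,

$$f_Y(y) = f_X\!\left(\tfrac{1}{y}\right) \left|\frac{dx}{dy}\right| = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\frac{1}{y}\right)^{\alpha - 1} e^{-\beta / y} \cdot \frac{1}{y^2} = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{-\alpha - 1} e^{-\beta / y},$$

which is exactly the IG$(\alpha, \beta)$ density defined above.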
John D. Cook (October 3, 2008) Inverse Gamma Distribution. https://www.johndcook.com/inverse_gamma.pdf.
In Practice
In either the one-parameter case with one parameter held fixed or the joint case with both parameters unknown, $\{\theta, \sigma^2\}$, when we are sampling from the (joint) posterior distribution by sampling $\sigma^2$ first and then sampling $\theta$ conditionally on the value of $\sigma^2$ drawn on that iteration of our MCMC, we are using the fact that the joint distribution $p(\theta, \sigma^2 \mid y_1, \dots, y_n)$ is a completion of $p(\sigma^2 \mid y_1, \dots, y_n)$. A completion of a marginal distribution is a joint distribution that admits it as a marginal, here $p(\theta, \sigma^2 \mid y_1, \dots, y_n) = p(\theta \mid \sigma^2, y_1, \dots, y_n)\, p(\sigma^2 \mid y_1, \dots, y_n)$.
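Ahead of the fuller code section, here is a minimal R sketch of that composition sampling for the fully conjugate normal model, using the normal-inverse-gamma style parametrisation found in Hoff; the hyperparameter names and all numbers are assumptions for illustration only:

```r
# Minimal sketch (assumed values): Monte Carlo sampling from the joint
# posterior p(theta, sigma^2 | y) by drawing sigma^2 from its marginal
# posterior and then theta | sigma^2 -- the "completion" idea above.

set.seed(42)

# Simulated data (hypothetical example values)
y <- rnorm(25, mean = 10, sd = 2)
n <- length(y); ybar <- mean(y); s2 <- var(y)

# Prior hyperparameters (assumed): prior mean mu0 with "prior sample size"
# kappa0, prior variance guess sigma02 with "prior sample size" nu0
mu0 <- 8; kappa0 <- 1
sigma02 <- 4; nu0 <- 1

# Conjugate posterior hyperparameters
kappan  <- kappa0 + n
nun     <- nu0 + n
mun     <- (kappa0 * mu0 + n * ybar) / kappan
sigman2 <- (nu0 * sigma02 + (n - 1) * s2 +
            kappa0 * n * (ybar - mu0)^2 / kappan) / nun

# Joint posterior draws: sigma^2 from its (inverse-gamma) marginal via 1/gamma,
# then theta | sigma^2 from a normal
S <- 10000
sigma2_draws <- 1 / rgamma(S, shape = nun / 2, rate = nun * sigman2 / 2)
theta_draws  <- rnorm(S, mean = mun, sd = sqrt(sigma2_draws / kappan))

# Posterior summaries
quantile(theta_draws, c(0.025, 0.5, 0.975))
quantile(sqrt(sigma2_draws), c(0.025, 0.5, 0.975))
```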
Reference Table Conjugate Bayesian Inference
| Likelihood | Model Parameters | Conjugate Prior Distribution | Prior Parameters | Posterior Parameters | Interpretation of Parameters | Posterior Predictive |
|---|---|---|---|---|---|---|
| Binomial | $\theta$ (probability) | Beta | $\alpha$, $\beta$ | $\alpha + y$, $\beta + n - y$ | $\alpha$ successes, $\beta$ failures | Beta-Binomial |
| Poisson | $\theta$ (rate) | Gamma | $a$, $b$ | $a + \sum_{i=1}^n y_i$, $b + n$ | $a$ total occurrences in $b$ intervals | Negative Binomial |
| Normal (fixed variance $\sigma^2$) | $\theta$ (mean) | Normal | $\mu_0$, $\tau_0^2$ | $\mu_n = \tau_n^2\left(\mu_0/\tau_0^2 + n\bar{y}/\sigma^2\right)$, $\tau_n^2 = \left(1/\tau_0^2 + n/\sigma^2\right)^{-1}$ | --- | Normal |
The above table is adapted from the one on the Wikipedia entry for the Conjugate prior.
Forthcoming Sections
- Add reparametrizations
- Add R code
- Add Normal
- Follow-up on MCMC
References and Notes
- Hoff, Peter (2009) A First Course in Bayesian Statistical Methods. https://pdhoff.github.io/book/. Peter Hoff’s book is eminently readable and didactic. Most of the proofs in this post are adapted from ones found therein.
- Kevin P. Murphy (October 3, 2007) Conjugate Bayesian analysis of the Gaussian distribution. https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf.
- Persi Diaconis, Donald Ylvisaker (1979) “Conjugate Priors for Exponential Families”. The Annals of Statistics, 7(2), 269-281. https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full.
Footnotes
[^1]: Whilst looking for an example source crediting Pierre-Simon Laplace for formalising Bayes’ Theorem mathematically, I came across the brilliant post A History of Bayes’ Theorem by lukeprog from the 29th of August 2011, which is a pithy synopsis of the book The Theory That Would Not Die by Sharon McGrayne.

[^2]: In the case of the normal model, we can perform one-parameter inference if we assume a fixed, known variance.

[^3]: See Aerin Kim’s excellent post entitled Beta Distribution — Intuition, Examples, and Derivation for an introductory walkthrough of the Beta distribution.

[^4]: The channel Mathematical Monk has some excellent expositional video lectures on the Dirichlet distribution as chapters ML 7.7.A1 - 7.8 inclusive (four total) of his Machine Learning series.

[^5]: Note the sample mean is itself a random variable and that there are weak and strong forms of the Central Limit Theorem, which assume certain conditions including that the random variable being sampled from itself has finite mean and variance.

[^6]: As usual, this is the kind of simplification that makes derivations or computations tractable. In practice, we might not know the sampling variance but might still choose to fix it to a constant value to make life easier, depending on our goal.