This is a look at Bayesian inference for cases where the data generating process allows for a conjugate setup. Conjugacy in Bayesian statistics is the scenario in which the prior and posterior distributions belong to the same family, for example the Beta in the case of binary outcome data (binomial likelihood) or the Gamma in the case of count data (Poisson likelihood).

Bayes’ theorem provides an optimal means of updating our beliefs in the light of new evidence. Following Laplace’s formalisation1 we have

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

where we can think of $\theta$ as representing a parameter of interest, for example the probability of success in a binomial or geometric model, the rate parameter $\lambda$ in a Poisson model or the mean $\mu$ in a normal (Gaussian) model.

In the case of all of the above sampling models, when we are inferring on a single parameter that defines our data generating process2, we have recourse to convenient conjugate priors and simple closed-form expressions for the posterior hyperparameters.

A prior belief distribution on the parameter of interest is conjugate to the sampling model (likelihood) when the posterior it yields, once combined with the sampling model via Bayes’ theorem, belongs to the same family of distributions as the prior.

Conjugate priors lead to expedient and computationally trivial updating rules for computing posterior beliefs given our prior and a sample of data, for example from an experiment, and allow for sequential updating, plugging in the posterior from the last update as the prior for the next.
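As a minimal sketch of this sequential updating (using a Beta prior with binomial data, anticipating the Beta-Binomial section below; the prior hyperparameters and the two batches of outcomes are made up for illustration):

```r
# Sequential conjugate updating with a Beta prior and binomial data:
# updating on two batches one after the other gives the same posterior
# as updating on the pooled data in one go.
a0 <- 2; b0 <- 2                      # arbitrary Beta prior hyperparameters

batch1 <- c(1, 0, 1, 1, 0)            # first batch of Bernoulli outcomes
batch2 <- c(0, 1, 1, 1, 1, 0, 1)      # second batch

# Update on batch 1, then use that posterior as the prior for batch 2
a1 <- a0 + sum(batch1); b1 <- b0 + length(batch1) - sum(batch1)
a2 <- a1 + sum(batch2); b2 <- b1 + length(batch2) - sum(batch2)

# Single update on the pooled data
pooled <- c(batch1, batch2)
a_all <- a0 + sum(pooled); b_all <- b0 + length(pooled) - sum(pooled)

c(a2, b2) == c(a_all, b_all)          # TRUE TRUE
```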

We are, however, not limited to the single-parameter case. We can also make inference jointly on a vector of parameters and still have conjugate priors. In particular, we’ll look at this for the mean and variance, $\theta$ and $\sigma^2$, of a normal sampling model. Whilst still having a conjugate prior, it will be composed of multiple components, as we will decompose the joint model into the product of conditional and marginal distributions, $p(\theta, \sigma^2) = p(\theta \mid \sigma^2)\, p(\sigma^2)$.

So what follows is a look at some of these common conjugate prior-sampling model combinations with proofs to make the arguments more grounded and some code to allow you to acquaint yourself with the numbers and functional forms by experimenting with them yourself. We’ll start with the Beta-Binomial, look at the Gamma-Poisson and finish with the Normal-Normal and Normal-Normal-Inverse Gamma combinations of Bayesian conjugate pairs.

Beta-Binomial

The beta distribution3 is a family of continuous probability distributions supported on the interval $[0, 1]$ and is parameterized by two strictly positive shape parameters, $\alpha$ and $\beta$. Its probability density function (PDF) is defined as follows

$$p(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}, \qquad \theta \in [0, 1],$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function and constitutes a normalising constant with components given by the Gamma function.
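As a quick numerical sanity check of the density above, one can compare the formula against R’s built-in dbeta() (the shape parameters and evaluation grid here are arbitrary):

```r
# Check the Beta density formula against R's built-in dbeta()
alpha <- 3; beta_ <- 5                 # arbitrary shape parameters
theta <- seq(0.01, 0.99, by = 0.01)    # evaluation grid on (0, 1)

# theta^(alpha-1) * (1-theta)^(beta-1) / B(alpha, beta)
manual <- theta^(alpha - 1) * (1 - theta)^(beta_ - 1) / beta(alpha, beta_)

all.equal(manual, dbeta(theta, alpha, beta_))   # TRUE
```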

The Beta distribution can be generalised for multiple variables into the Dirichlet distribution4, i.e. to model multiple probabilities. In other words, the Beta distribution is an instance of a Dirichlet distribution for a single random variable.
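One way to see this connection numerically is via the standard construction of a Dirichlet draw as normalised independent Gamma draws; with two components, the first coordinate is Beta-distributed. A sketch with arbitrary concentration parameters:

```r
set.seed(1)

# A Dirichlet(a) draw can be constructed by normalising independent
# Gamma(a_k, 1) draws; with K = 2 components the first coordinate
# follows a Beta(a_1, a_2) distribution.
a <- c(2, 5)                                  # arbitrary concentration parameters
S <- 1e5
g1 <- rgamma(S, shape = a[1], rate = 1)
g2 <- rgamma(S, shape = a[2], rate = 1)
first_coord <- g1 / (g1 + g2)

# Compare empirical quantiles with those of Beta(2, 5)
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(quantile(first_coord, probs), 3)
round(qbeta(probs, a[1], a[2]), 3)            # should be very close
```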

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the:

  • Bernoulli
  • binomial
  • negative binomial; and
  • geometric

distributions.

The beta distribution is a suitable model for the random behavior of probabilities, proportions or percentages.

Eliciting the Beta as the Conjugate Prior of the Binomial Sampling Model (Proof)

We can prove the conjugacy of the Beta prior in the case of a binomial sampling model as follows. Suppose $\theta \sim \text{Beta}(\alpha, \beta)$ and $Y \mid \theta \sim \text{Binomial}(n, \theta)$, and that we observe $Y = y$ successes out of $n$ trials. Then

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \theta^{y}(1 - \theta)^{n - y} \cdot \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{\alpha + y - 1}(1 - \theta)^{\beta + n - y - 1},$$

which is the kernel of a $\text{Beta}(\alpha + y, \beta + n - y)$ density.

Overall, we have the prior-posterior relationship that

$$\theta \sim \text{Beta}(\alpha, \beta), \quad Y \mid \theta \sim \text{Binomial}(n, \theta) \;\Longrightarrow\; \theta \mid Y = y \sim \text{Beta}(\alpha + y, \beta + n - y).$$

In Practice (Code)

Section forthcoming

We simulate values, $y^{(1)}, \dots, y^{(S)}$, of the random variable, $Y$, which is our outcome of interest and whose values are conditionally independent given $\theta$. If we simulate (sufficiently many) values, $y^{(s)}$, using draws $\theta^{(s)}$ of the parameter, $\theta$, we effectively sample from the marginal $p(y)$ (sampling across the distribution of $\theta$ and producing $y$ given samples of $\theta$).
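A minimal sketch of the Beta-Binomial update and of this simulation strategy in R; the prior hyperparameters, the observed counts and the size of the future batch are all made up for illustration:

```r
set.seed(42)

# Beta(a, b) prior on theta; binomial data: y successes in n trials
a <- 1; b <- 1                       # uniform Beta(1, 1) prior (illustrative)
n <- 20; y <- 14                     # made-up data: 14 successes in 20 trials

# Conjugate update: theta | y ~ Beta(a + y, b + n - y)
a_post <- a + y
b_post <- b + n - y

# Posterior summaries
a_post / (a_post + b_post)                       # posterior mean
qbeta(c(0.025, 0.975), a_post, b_post)           # 95% credible interval

# Sampling the posterior predictive for a future batch of m trials:
# draw theta^(s) from the posterior, then y_tilde^(s) ~ Binomial(m, theta^(s))
S <- 1e5
m <- 10
theta_s <- rbeta(S, a_post, b_post)
y_tilde <- rbinom(S, size = m, prob = theta_s)
table(y_tilde) / S                               # Monte Carlo predictive pmf
```

The final table is a Monte Carlo estimate of the Beta-Binomial posterior predictive distribution, obtained exactly as described above: sampling across the distribution of $\theta$ and producing $y$ given each sampled $\theta$.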

Gamma-Poisson

The gamma distribution is a two-parameter family of continuous probability distributions supported on the positive real line, i.e. in the interval $(0, \infty)$. It subsumes the exponential, Erlang and chi-square distributions. Its PDF is defined as follows

$$p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0.$$

The above definition makes use of the parametrisation according to the two strictly positive shape and rate (or inverse scale) parameters, $\alpha$ and $\beta$.
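As before, a quick sanity check of the shape-rate density formula against R’s built-in dgamma() (arbitrary parameter values and grid):

```r
# Check the Gamma density formula (shape-rate parametrisation) against dgamma()
alpha <- 2; beta_ <- 3                    # arbitrary shape and rate
lambda <- seq(0.1, 5, by = 0.1)

manual <- beta_^alpha / gamma(alpha) * lambda^(alpha - 1) * exp(-beta_ * lambda)

all.equal(manual, dgamma(lambda, shape = alpha, rate = beta_))   # TRUE
```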

The gamma is the Bayesian conjugate prior for the

  • Poisson
  • exponential
  • normal (with known mean); and
  • Pareto

distributions, amongst others.

Eliciting the Gamma as the Conjugate Prior of the Poisson Sampling Model (Proof)

We can show the conjugacy of the Gamma distribution as a prior in the case of a Poisson sampling model, i.e. when the random variable that we are collecting data on is distributed according to a Poisson process, for example in the case of births, road accidents or men kicked to death by horses among ten Prussian army corps over 20 years.

We’ll go about proving conjugacy in the same vein as before by computing the posterior distribution of $\lambda$, $p(\lambda \mid y_1, \dots, y_n)$, to within some proportionality constant, that is to say by finding its shape.

Suppose we have some data $y_1, \dots, y_n \mid \lambda \overset{\text{i.i.d.}}{\sim} \text{Poisson}(\lambda)$ and impose our prior $\lambda \sim \text{Gamma}(\alpha, \beta)$.

We can find an expression for the posterior to within a scaling constant by straightforwardly collecting algebraic terms after substituting in the relevant expressions for the Poisson and Gamma PDFs, as follows

$$p(\lambda \mid y_1, \dots, y_n) \propto p(y_1, \dots, y_n \mid \lambda)\, p(\lambda) \propto \left( \prod_{i=1}^{n} \lambda^{y_i} e^{-\lambda} \right) \lambda^{\alpha - 1} e^{-\beta \lambda} = \lambda^{\alpha + \sum_i y_i - 1}\, e^{-(\beta + n)\lambda}.$$

To within a scaling constant, the final expression is a gamma distribution with new shape and rate parameters, $\alpha + \sum_{i=1}^{n} y_i$ and $\beta + n$.

So we have shown the conjugacy of the gamma distribution family for the Poisson sampling model and overall, we have the prior-posterior relationship

$$\lambda \sim \text{Gamma}(\alpha, \beta), \quad Y_1, \dots, Y_n \mid \lambda \overset{\text{i.i.d.}}{\sim} \text{Poisson}(\lambda) \;\Longrightarrow\; \lambda \mid y_1, \dots, y_n \sim \text{Gamma}\!\left(\alpha + \textstyle\sum_i y_i,\; \beta + n\right).$$

Estimation and prediction proceed in a manner similar to that in the binomial model. The posterior expectation of $\lambda$ is a linear (convex) combination of the prior expectation and the sample average:

$$E[\lambda \mid y_1, \dots, y_n] = \frac{\alpha + \sum_i y_i}{\beta + n} = \frac{\beta}{\beta + n} \cdot \frac{\alpha}{\beta} + \frac{n}{\beta + n} \cdot \frac{\sum_i y_i}{n}.$$

Here $\beta$ is interpreted as the number of prior observations and $\alpha$ as the sum of counts from those $\beta$ prior observations.

For large $n$, the information from the data dominates the prior information:

$$n \gg \beta \;\Longrightarrow\; E[\lambda \mid y_1, \dots, y_n] \approx \bar{y}, \qquad \text{Var}[\lambda \mid y_1, \dots, y_n] \approx \frac{\bar{y}}{n}.$$

Predictions about additional data can be obtained with the posterior predictive distribution:

$$p(\tilde{y} \mid y_1, \dots, y_n) = \int_0^\infty p(\tilde{y} \mid \lambda)\, p(\lambda \mid y_1, \dots, y_n)\, d\lambda.$$

Evaluation of this complicated integral looks daunting, but it turns out that it can be done without any additional calculus. Let’s use what we know about the gamma density, namely that it integrates to one:

$$1 = \int_0^\infty \frac{b^{a}}{\Gamma(a)}\, \lambda^{a - 1} e^{-b\lambda}\, d\lambda$$

for any values $a, b > 0$.

This means that

$$\int_0^\infty \lambda^{a - 1} e^{-b\lambda}\, d\lambda = \frac{\Gamma(a)}{b^{a}}$$

for any values $a, b > 0$.

Now substitute in $\alpha + \sum_i y_i + \tilde{y}$ instead of $a$ and $\beta + n + 1$ instead of $b$ to get

$$p(\tilde{y} \mid y_1, \dots, y_n) = \frac{(\beta + n)^{\alpha + \sum_i y_i}}{\Gamma(\tilde{y} + 1)\, \Gamma(\alpha + \sum_i y_i)} \int_0^\infty \lambda^{\alpha + \sum_i y_i + \tilde{y} - 1}\, e^{-(\beta + n + 1)\lambda}\, d\lambda = \frac{(\beta + n)^{\alpha + \sum_i y_i}}{\Gamma(\tilde{y} + 1)\, \Gamma(\alpha + \sum_i y_i)} \cdot \frac{\Gamma(\alpha + \sum_i y_i + \tilde{y})}{(\beta + n + 1)^{\alpha + \sum_i y_i + \tilde{y}}}.$$

After simplifying some of the algebra, this gives

$$p(\tilde{y} \mid y_1, \dots, y_n) = \frac{\Gamma(\alpha + \sum_i y_i + \tilde{y})}{\Gamma(\tilde{y} + 1)\, \Gamma(\alpha + \sum_i y_i)} \left( \frac{\beta + n}{\beta + n + 1} \right)^{\alpha + \sum_i y_i} \left( \frac{1}{\beta + n + 1} \right)^{\tilde{y}}, \qquad \tilde{y} \in \{0, 1, 2, \dots\},$$

which is a negative binomial distribution.
In Practice (Code)

Section forthcoming
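A minimal sketch of the Gamma-Poisson machinery above in R; the prior hyperparameters and the vector of counts are made up for illustration, and the posterior predictive is checked against dnbinom() using the negative binomial form derived above:

```r
set.seed(7)

# Gamma(a, b) prior (shape a, rate b) on the Poisson rate lambda
a <- 2; b <- 1                          # illustrative prior hyperparameters
y <- c(3, 1, 4, 2, 2, 5, 0, 3)          # made-up Poisson counts
n <- length(y)

# Conjugate update: lambda | y ~ Gamma(a + sum(y), b + n)
a_post <- a + sum(y)
b_post <- b + n

# Posterior mean is a convex combination of prior mean and sample mean
a_post / b_post
(b / (b + n)) * (a / b) + (n / (b + n)) * mean(y)   # identical value

# Posterior predictive for one new count is Negative Binomial with
# size = a_post and prob = b_post / (b_post + 1); check by Monte Carlo
S <- 1e5
lambda_s <- rgamma(S, shape = a_post, rate = b_post)
y_tilde  <- rpois(S, lambda_s)

round(table(factor(y_tilde, levels = 0:8)) / S, 3)
round(dnbinom(0:8, size = a_post, prob = b_post / (b_post + 1)), 3)
```

The Monte Carlo frequencies and the analytical negative binomial probabilities should agree to within simulation error.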

The Normal Model

Theory (Math)

The Normal, or Gaussian, distribution is a two-parameter family of continuous probability distributions supported on the real line. Its PDF is defined as follows

$$p(y \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \theta)^2}{2\sigma^2} \right\}, \qquad y \in \mathbb{R},$$

where the parameter $\theta$ is the mean as well as the median and mode of the distribution, and the parameter $\sigma^2$ is its variance.

The Normal family of distributions is important given its ubiquity. This arises from the fact that the mean of many observations of a random process (the so-called sample mean) converges to a normal distribution when the number of samples (the sample size) becomes large enough, a result known as the Central Limit Theorem5. For this reason many quantities are approximately normally distributed since they result from, i.e. constitute the sum of, many independent processes. An example would be biological phenotypes like birds’ wing lengths, which are determined by the contributions of large numbers of proteins, themselves encoded by a myriad of gene variants (alleles).
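A quick simulation illustrates the point (a sketch: sample means of skewed exponential draws are already approximately normal for a moderate sample size):

```r
set.seed(123)

# Means of n = 50 draws from a (skewed) exponential distribution:
# by the Central Limit Theorem their distribution is approximately normal.
n_reps <- 1e4
n      <- 50
sample_means <- replicate(n_reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the true mean, 1
sd(sample_means)     # close to 1 / sqrt(n) = 0.141
hist(sample_means, breaks = 50, freq = FALSE,
     main = "Sampling distribution of the mean", xlab = "sample mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, lwd = 2)
```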

Bayesian inference for the Normal sampling model is more complicated than for the previous two examples because it is a two-parameter model. We’ll approach inference in the Bayesian regime by splitting the problem into three cases where, in the first two cases, we assume we know6 one of these two parameters of the sampling model.

  1. Known or fixed sampling variance, $\sigma^2$, to perform inference on the population mean, $\theta$
  2. Known or fixed sampling mean, $\theta$, to perform inference on the population variance, $\sigma^2$
  3. Both $\theta$ and $\sigma^2$ unknown, to perform joint inference on the pair $(\theta, \sigma^2)$

We can show the conjugacy of the Normal prior to the Normal sampling model (i.e. likelihood)

$$p(y_1, \dots, y_n \mid \theta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \theta)^2}{2\sigma^2} \right\} \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2 \right\},$$

where we place the prior

$$\theta \sim N(\mu_0, \tau_0^2).$$

Now let’s see if $p(\theta \mid y_1, \dots, y_n, \sigma^2)$ takes the form of a normal density

First Case: $\theta$ unknown; $\sigma^2$ known / fixed

Inference on the Normal is easier if we assume that we know the sampling variance, $\sigma^2$.

Analogy: We can think of this by analogy to

The predictive distribution has more uncertainty than the data distribution.

Reparametrization: We can work in terms of the precision, the inverse variance $1/\sigma^2$ (and $1/\tau_0^2$ for the prior), under which the posterior precision is simply the sum of the prior and data precisions.

If $\theta \sim N(\mu_0, \tau_0^2)$ and $Y_1, \dots, Y_n \mid \theta \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2)$,

then $\theta \mid y_1, \dots, y_n \sim N(\mu_n, \tau_n^2)$, where

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \qquad \text{and} \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n\bar{y}}{\sigma^2} \right).$$
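A minimal sketch of this update in R; the assumed-known sampling variance, the prior hyperparameters and the simulated data are all illustrative:

```r
set.seed(99)

# Normal sampling model with known variance sigma^2, Normal prior on theta
sigma2 <- 4                                 # assumed-known sampling variance
mu0 <- 0; tau0_2 <- 10                      # illustrative prior mean and variance

y <- rnorm(15, mean = 2.5, sd = sqrt(sigma2))   # simulated data
n <- length(y); ybar <- mean(y)

# Conjugate update (precision-weighted combination of prior and data)
tau_n_2 <- 1 / (1 / tau0_2 + n / sigma2)                  # posterior variance
mu_n    <- tau_n_2 * (mu0 / tau0_2 + n * ybar / sigma2)   # posterior mean

c(posterior_mean = mu_n, posterior_var = tau_n_2)
qnorm(c(0.025, 0.975), mean = mu_n, sd = sqrt(tau_n_2))   # 95% credible interval

# Posterior predictive for one future observation: N(mu_n, tau_n_2 + sigma2),
# i.e. with more uncertainty than the data distribution alone
qnorm(c(0.025, 0.975), mean = mu_n, sd = sqrt(tau_n_2 + sigma2))
```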

Second Case: $\theta$ known / fixed; $\sigma^2$ unknown

Third Case (General Case): Both $\theta$ and $\sigma^2$ unknown

Relationship between Gamma and Inverse Gamma Distributions

To derive the inverse gamma distribution from the gamma, we use the change of variables method, also known as the transformation theorem for random variables.

1 Parameterizations

There are at least a couple of common parameterizations of the gamma distribution. For our purposes, a gamma($\alpha$, $\beta$) distribution, with shape $\alpha$ and rate $\beta$, has density

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}$$

for $x > 0$. With this parameterization, a gamma distribution has mean $\alpha / \beta$ and variance $\alpha / \beta^2$.

Define the inverse gamma (IG) distribution to have the density

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha - 1} e^{-\beta / x}$$

for $x > 0$.

2 Relation to the gamma distribution

With the above parameterizations, if $X$ has a gamma($\alpha$, $\beta$) distribution then $Y = 1/X$ has an IG($\alpha$, $\beta$) distribution. To see this, apply the transformation theorem with $y = g(x) = 1/x$, so that $x = 1/y$ and $\left| \frac{dx}{dy} \right| = 1/y^2$:

$$f_Y(y) = f_X(1/y) \left| \frac{dx}{dy} \right| = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left( \frac{1}{y} \right)^{\alpha - 1} e^{-\beta / y} \cdot \frac{1}{y^2} = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{-\alpha - 1} e^{-\beta / y}, \qquad y > 0.$$

John D. Cook (October 3, 2008) Inverse Gamma Distribution. https://www.johndcook.com/inverse_gamma.pdf.
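A quick numerical check of this relationship (arbitrary shape and rate values; the quantiles of $1/X$ are compared with those implied by the gamma distribution directly):

```r
set.seed(11)

# If X ~ Gamma(shape = a, rate = b) then 1/X ~ Inverse-Gamma(a, b).
# Check by comparing simulated quantiles of 1/X with those implied by
# the gamma distribution itself, since P(1/X <= q) = P(X >= 1/q).
a <- 3; b <- 2
x <- rgamma(1e5, shape = a, rate = b)
y <- 1 / x

probs <- c(0.1, 0.5, 0.9)
round(quantile(y, probs), 3)
round(1 / qgamma(1 - probs, shape = a, rate = b), 3)   # should match closely
```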

In Practice

In either the one-parameter case with $\sigma^2$ fixed or the joint case with both $\theta$ and $\sigma^2$ unknown, when we are sampling from the (joint) posterior distribution by sampling $\sigma^2$ first and then sampling $\theta$ conditionally on the value of $\sigma^2$ drawn on that iteration of our MCMC, we are using the fact that the joint distribution $p(\theta, \sigma^2 \mid y_1, \dots, y_n) = p(\theta \mid \sigma^2, y_1, \dots, y_n)\, p(\sigma^2 \mid y_1, \dots, y_n)$ is a completion of the marginal posterior of the parameter of interest. A completion of a distribution is a joint distribution, over the original variable together with auxiliary variables, whose marginal recovers the original distribution.
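As a sketch of this two-step sampling scheme, the code below uses one standard conjugate setup for the normal model with both parameters unknown, namely the prior $\theta \mid \sigma^2 \sim N(\mu_0, \sigma^2/\kappa_0)$ and $1/\sigma^2 \sim \text{Gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$, together with its usual closed-form posterior updates. The hyperparameter values and the data are made up, and because the posterior is available in closed form the draws here are direct Monte Carlo rather than MCMC:

```r
set.seed(2024)

# Joint posterior sampling for a Normal model with both theta and sigma^2
# unknown, under the conjugate prior
#   theta | sigma^2 ~ N(mu0, sigma^2 / k0),  1/sigma^2 ~ Gamma(nu0/2, nu0*s0_2/2)
mu0 <- 0; k0 <- 1; nu0 <- 1; s0_2 <- 1      # illustrative hyperparameters
y <- rnorm(25, mean = 3, sd = 2)            # simulated data
n <- length(y); ybar <- mean(y); s2 <- var(y)

# Standard conjugate updates for this setup
kn   <- k0 + n
mun  <- (k0 * mu0 + n * ybar) / kn
nun  <- nu0 + n
sn_2 <- (nu0 * s0_2 + (n - 1) * s2 + k0 * n * (ybar - mu0)^2 / kn) / nun

# Sample sigma^2 first, then theta conditionally on each sigma^2 draw
S        <- 1e4
sigma2_s <- 1 / rgamma(S, shape = nun / 2, rate = nun * sn_2 / 2)
theta_s  <- rnorm(S, mean = mun, sd = sqrt(sigma2_s / kn))

quantile(theta_s,  c(0.025, 0.5, 0.975))   # posterior for theta
quantile(sigma2_s, c(0.025, 0.5, 0.975))   # posterior for sigma^2
```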

Reference Table: Conjugate Bayesian Inference

| Likelihood | Model Parameters | Conjugate Prior Distribution | Prior Parameters | Posterior Parameters | Interpretation of Parameters | Posterior Predictive |
|---|---|---|---|---|---|---|
| Binomial | $\theta$ (probability) | Beta | $\alpha$, $\beta$ | $\alpha + y$, $\beta + n - y$ | $\alpha$ prior successes, $\beta$ prior failures | Beta-Binomial |
| Poisson | $\lambda$ (rate) | Gamma | $\alpha$, $\beta$ | $\alpha + \sum_i y_i$, $\beta + n$ | $\alpha$ total occurrences in $\beta$ intervals | Negative Binomial |
| Normal (fixed variance $\sigma^2$) | $\theta$ (mean) | Normal | $\mu_0$, $\tau_0^2$ | $\mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{y}}{\sigma^2}\right)$, $\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}$ | --- | Normal |

The above table is adapted from the one on the Wikipedia entry for the Conjugate prior.

Forthcoming Sections

  • Add reparametrizations
  • Add R code
  • Add Normal
  • Follow-up on MCMC

References and Notes

Footnotes

  1. Whilst looking for an example source crediting Pierre-Simon Laplace for formalising Bayes’ Theorem mathematically, I came across the brilliant post A History of Bayes’ Theorem by lukeprog from the 29th of August 2011, which is a pithy synopsis of the book The Theory That Would Not Die by Sharon McGrayne.

  2. In the case of the normal model, we can perform one-parameter inference if we assume a fixed, known variance.

  3. See Aerin Kim’s excellent post entitled Beta Distribution — Intuition, Examples, and Derivation for an introductory walkthrough of the Beta distribution.

  4. The channel Mathematical Monk has some excellent expositional video lectures on the Dirichlet distribution as chapters ML 7.7.A1 - 7.8 inclusive (four total) of his Machine Learning series.

  5. Note the sample mean is itself a random variable and that there are weak and strong forms of the Central Limit Theorem which assume certain conditions including that the random variable being sampled from itself has finite mean and variance.

  6. As usual, this is the kind of simplification that makes derivations or computations tractable. In practice, we might not know the sample variance but might still choose to fix it to a constant value to make life easier, depending on our goal.