This post looks at Bayesian inference for the cases where the data generating process allows for a conjugate setup. Conjugacy in Bayesian statistics is the scenario in which the prior and posterior distributions belong to the same family, for example the Beta in the case of binary outcome data (binomial likelihood) or the Gamma in the case of count data (Poisson likelihood).

Bayes’ theorem provides an optimal means of updating our beliefs in the light of new evidence. Following Laplace’s formalisation1 we have

\[\underbrace{P(\theta \vert y)}_{\text{posterior}} = \dfrac{\overbrace{P(y \vert \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{predictive distribution}}} = \dfrac{P(y \vert \theta) \cdot p(\theta)}{\int_\Theta{P(y \vert \theta) \cdot p(\theta) d\theta}} \propto \underbrace{P(y \vert \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}\]

where we can think of $\theta$ as representing a parameter of interest, for example the $p$ probability parameter in a binomial or geometric model, the $\lambda$ rate parameter in a Poisson model or the $\mu$ parameter in a normal (Gaussian) model.

In the case of all of the above sampling models, when we are inferring a single parameter that defines our data generating process2, we have recourse to convenient conjugate priors and simple closed-form expressions for the posterior hyperparameters.

A family of prior distributions on the parameter of interest is conjugate to the sampling model (likelihood) when the posterior it yields, once combined with the sampling model via Bayes’ theorem, belongs to that same family.

Conjugate priors lead to expedient, computationally trivial updating rules for computing the posterior beliefs given our prior and a sample of data, for example from an experiment. They also allow for sequential updates, plugging in the posterior from the last update as the prior for the next, as the sketch below illustrates.
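As a small illustration (using the Beta-Binomial pair covered below, with made-up hyperparameters and data), a single update on the pooled data yields exactly the same posterior as two sequential updates in which the intermediate posterior is reused as the new prior:

```r
# Beta-Binomial updating: one update on all the data vs. two sequential updates.
# The hyperparameters and data below are made up purely for illustration.
a0 <- 1; b0 <- 1                  # prior Beta(a0, b0)

y1 <- 4; n1 <- 10                 # first batch: 4 successes in 10 trials
y2 <- 7; n2 <- 15                 # second batch: 7 successes in 15 trials

# Single update on the pooled data
a_all <- a0 + (y1 + y2)
b_all <- b0 + (n1 + n2) - (y1 + y2)

# Sequential updates: yesterday's posterior is today's prior
a_seq <- (a0 + y1) + y2
b_seq <- (b0 + n1 - y1) + (n2 - y2)

c(a_all, b_all)                   # 12 15
c(a_seq, b_seq)                   # 12 15 -- identical posterior
```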

We are however not limited to the single-parameter case. We can also make inferences jointly on a vector of parameters and still have conjugate priors. In particular, we’ll look at this for the mean and variance, $\mu$ and $\sigma^2$, of a normal sampling model. The prior will still be conjugate, but it will be composed of multiple components, since we will decompose the joint distribution into the product of a conditional and a marginal, $p(\mu, \sigma^2) = p(\mu \vert \sigma^2) \cdot p(\sigma^2)$.

So what follows is a look at some of these common conjugate prior-sampling model combinations with proofs to make the arguments more grounded and some code to allow you to acquaint yourself with the numbers and functional forms by experimenting with them yourself. We’ll start with the Beta-Binomial, look at the Gamma-Poisson and finish with the Normal-Normal and Normal-Normal-Inverse Gamma combinations of Bayesian conjugate pairs.

Beta-Binomial

The beta distribution3 is a family of continuous probability distributions supported on the interval $[0, 1]$ and is parameterized by two strictly positive shape parameters, $\alpha > 0$ and $\beta > 0$. Its probability density function (PDF) is defined as follows

\[p(\theta) = \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}} {B(\alpha,\beta)} = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \theta^{\alpha-1} \cdot (1-\theta)^{\beta-1}\]

where $B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function and constitutes a normalising constant with components given by $\Gamma(.)$ the Gamma function.
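As a quick numerical sanity check of this definition (the values of $\alpha$ and $\beta$ below are arbitrary), the explicit formula agrees with R’s built-in dbeta:

```r
# Evaluate the Beta PDF explicitly and compare with dbeta().
alpha <- 2.5; beta_ <- 4          # arbitrary shape parameters (beta_ avoids masking base::beta)
theta <- seq(0.01, 0.99, by = 0.01)

# Explicit density using the gamma-function form of the normalising constant
dens_manual <- gamma(alpha + beta_) / (gamma(alpha) * gamma(beta_)) *
  theta^(alpha - 1) * (1 - theta)^(beta_ - 1)

all.equal(dens_manual, dbeta(theta, alpha, beta_))   # TRUE
```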

The Beta distribution can be generalised for multiple variables into the Dirichlet distribution4, i.e. to model multiple probabilities. In other words, the Beta distribution is an instance of a Dirichlet distribution for a single random variable.

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the:

  • Bernoulli
  • binomial
  • negative binomial; and
  • geometric

distributions.

The beta distribution is a suitable model for the random behavior of probabilities, proportions or percentages.

Eliciting the Beta as the Conjugate Prior of the Binomial Sampling Model (Proof)

We can prove the conjugacy of the Beta prior in the case of a binomial sampling model as follows.

\[\begin{aligned} p(\theta \mid y) &= \frac{p(\theta) p(y \mid \theta)}{p(y)} && \text{; substitute beta and binomial} \\ \\ &= \frac{1}{p(y)} \times \underbrace{\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \theta^{a-1}(1-\theta)^{b-1}}_{\text{beta prior}} \times \underbrace{\binom{n}{y} \theta^{y}(1-\theta)^{n-y}}_{\text{binomial sampling model}} && \text{; collect terms} \\ \\ &= c(n, y, a, b) \times \theta^{a+y-1}(1-\theta)^{b+n-y-1} && c(n, y, a, b) \text{ is a constant} \\ \\ &= \operatorname{dbeta}(\theta, a+y, b+n-y) && \operatorname{dbeta} \text{ is R syntax for the beta PDF} \\ \\ &\implies \theta \ \vert \ y \sim \text{Beta}(a+y, b+n-y) \end{aligned}\]

Overall, we have the prior-posterior relationship

\[\left\{\begin{array}{c} \theta \sim \operatorname{beta}(\alpha, \beta) \\ Y \sim \operatorname{binomial}(n, \theta) \end{array}\right\} \implies \{\theta \mid Y = y\} \sim \operatorname{beta}(\alpha + y, \beta + n - y)\]

In Practice (Code)

Section forthcoming
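In the meantime, here is a minimal R sketch of the conjugate update, assuming a Beta(1, 1) prior and made-up data with $y = 7$ successes in $n = 10$ trials, checked against a brute-force grid approximation:

```r
# Beta-Binomial conjugate update with made-up data.
a <- 1; b <- 1                       # Beta(1, 1) (uniform) prior
y <- 7; n <- 10                      # observed successes / trials

a_post <- a + y                      # posterior shape parameters
b_post <- b + n - y

a_post / (a_post + b_post)                    # posterior mean
qbeta(c(0.025, 0.975), a_post, b_post)        # 95% credible interval

# Sanity check against a grid approximation of prior x likelihood
theta <- seq(0, 1, length.out = 1001)
unnorm <- dbeta(theta, a, b) * dbinom(y, n, theta)
post_grid <- unnorm / sum(unnorm) / diff(theta)[1]
max(abs(post_grid - dbeta(theta, a_post, b_post)))   # small (discretisation error)
```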

We simulate values, $y_i$, of the random variable $Y$, our outcome of interest, which are conditionally independent given $\theta$. If we draw (sufficiently many) values of $\theta$ from its distribution and simulate a $y_i$ given each draw, we effectively sample from the marginal distribution of $Y$.
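A sketch of that marginal (prior predictive) simulation in R, with arbitrary hyperparameters; the resulting distribution of $Y$ is the Beta-Binomial, which we can compare against its closed-form PMF:

```r
# Sample from the marginal distribution of Y by first drawing theta from the
# prior and then drawing Y | theta from the binomial sampling model.
set.seed(1)
a <- 2; b <- 2; n <- 10              # arbitrary prior hyperparameters and sample size
n_sim <- 1e5

theta_sim <- rbeta(n_sim, a, b)                         # theta ~ Beta(a, b)
y_sim     <- rbinom(n_sim, size = n, prob = theta_sim)  # Y | theta ~ Binomial(n, theta)

# The marginal of Y is the Beta-Binomial; compare the empirical and closed-form PMFs.
table(y_sim)[1:3] / n_sim                                  # empirical P(Y = 0), P(Y = 1), P(Y = 2)
choose(n, 0:2) * beta(a + 0:2, b + n - 0:2) / beta(a, b)   # closed-form P(Y = 0), P(Y = 1), P(Y = 2)
```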

Gamma-Poisson

The gamma distribution is a two-parameter family of continuous probability distributions supported on the positive real line, i.e. in the interval $(0, \infty)$. It subsumes the exponential, Erlang and chi-square distributions. Its PDF is defined as follows

\[f(\theta) = \dfrac{b^a}{\Gamma(a)} \cdot \theta^{a - 1} \cdot e^{-b \theta }\]

The above definition makes use of the parametrisation according to the two strictly positive shape and rate (or inverse scale) parameters, $a > 0$ and $b > 0$.
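As a quick check (with arbitrary values of $a$ and $b$), the formula above matches R’s dgamma under the shape/rate parametrisation, and equivalently under shape/scale with scale $= 1/b$:

```r
# The shape/rate parametrisation used above vs. R's dgamma().
a <- 3; b <- 2                      # arbitrary shape and rate
theta <- seq(0.1, 10, by = 0.1)

dens_manual <- b^a / gamma(a) * theta^(a - 1) * exp(-b * theta)

all.equal(dens_manual, dgamma(theta, shape = a, rate = b))      # TRUE
all.equal(dens_manual, dgamma(theta, shape = a, scale = 1 / b)) # TRUE
```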

The gamma is the Bayesian conjugate prior for the

  • Poisson
  • exponential
  • normal (with known mean); and
  • Pareto

distributions, amongst others.

Eliciting the Gamma as the Conjugate Prior of the Poisson Sampling Model (Proof)

We can show the conjugacy of the Gamma distribution as a prior in the case of a Poisson sampling model, i.e. when the random variable that we are collecting data on is distributed according to a Poisson process, for example in the case of births, road accidents or men kicked to death by horses among ten Prussian army corps over 20 years.

We’ll go about proving conjugacy in the same vein as before by computing the posterior distribution of $\theta$, $p(\theta \mid y_{1}, \ldots, y_{n})$ or $p(\theta_{posterior})$ to within some proportionality constant, that is to say by finding its shape.

Suppose we have some data $Y_{1}, \ldots, Y_{n} \mid \theta \sim \operatorname{Poisson}(\theta)$ and impose the prior $\theta \sim \operatorname{Gamma}(a, b)$.

We can find an expression for the posterior to within a scaling constant, $c(.)$, by straightforwardly collecting algebraic terms after substituting in the relevant expressions for the Poisson and Gamma PDFs, as follows

\[\begin{aligned} p\left(\theta \mid y_{1}, \ldots, y_{n}\right) &= \dfrac{p(\theta) \times p\left(y_{1}, \ldots, y_{n} \mid \theta\right)}{p\left(y_{1}, \ldots, y_{n}\right)} \\ \\ &= \frac{1}{p\left(y_{1}, \ldots, y_{n}\right)} \times \underbrace{\frac{b^{a}}{\Gamma(a)} \theta^{a-1} e^{-b \theta}}_{\text{gamma prior}} \times \underbrace{\prod_{i=1}^{n} \frac{\theta^{y_{i}} e^{-\theta}}{y_{i} !}}_{\text{Poisson sampling model}} \\ \\ &= \left\{\theta^{a-1} e^{-b \theta}\right\} \times\left\{\theta^{\sum y_{i}} e^{-n \theta}\right\} \times c\left(y_{1}, \ldots, y_{n}, a, b\right) \\ \\ &= \left\{\theta^{a+\sum y_{i}-1} e^{-(b+n) \theta}\right\} \times c\left(y_{1}, \ldots, y_{n}, a, b\right) \end{aligned}\]

To within a scaling constant, $c(.)$, the final expression is a gamma distribution with new shape and rate parameters, $a^* = a_{posterior} = a + \sum y_{i} > 0$ and $b^* = b_{posterior} = b + n > 0$.

So we have shown the conjugacy of the gamma distribution family for the Poisson sampling model, and overall we have the prior-posterior relationship

\[\left\{\begin{array}{c} \theta \sim \operatorname{gamma}(a, b) \\ Y_{1}, \ldots, Y_{n} \mid \theta \sim \operatorname{Poisson}(\theta) \end{array}\right\} \Rightarrow\left\{\theta \mid Y_{1}, \ldots, Y_{n}\right\} \sim \operatorname{gamma}\left(a+\sum_{i=1}^{n} Y_{i}, b+n\right) .\]

Estimation and prediction proceed in a manner similar to that in the binomial model. The posterior expectation of $\theta$ is a linear (convex) combination of the prior expectation and the sample average:

\[\begin{aligned} \mathrm{E}\left[\theta \mid y_{1}, \ldots, y_{n}\right] &=\frac{a+\sum y_{i}}{b+n} \\ &=\frac{b}{b+n} \frac{a}{b}+\frac{n}{b+n} \frac{\sum y_{i}}{n} \end{aligned}\]

$b$ is interpreted as the number of prior observations; $a$ is interpreted as the sum of counts from $b$ prior observations.
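A quick numerical check of this weighted-average identity, with made-up prior hyperparameters and counts:

```r
# The Gamma-Poisson posterior mean is a weighted average of the prior mean a/b
# and the sample mean ybar; all numbers below are made up.
a <- 2; b <- 1                      # prior hyperparameters
y <- c(3, 1, 4, 0, 2)               # made-up Poisson counts
n <- length(y)

post_mean <- (a + sum(y)) / (b + n)
weighted  <- (b / (b + n)) * (a / b) + (n / (b + n)) * mean(y)

all.equal(post_mean, weighted)      # TRUE
```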

For large $n$, the information from the data dominates the prior information:

\[n \gg b \Rightarrow \mathrm{E}\left[\theta \mid y_{1}, \ldots, y_{n}\right] \approx \bar{y}, \operatorname{Var}\left[\theta \mid y_{1}, \ldots, y_{n}\right] \approx \bar{y} / n\]

Predictions about additional data can be obtained with the posterior predictive distribution:

\[\begin{aligned} p\left(\tilde{y} \mid y_{1}, \ldots, y_{n}\right) &=\int_{0}^{\infty} p\left(\tilde{y} \mid \theta, y_{1}, \ldots, y_{n}\right) p\left(\theta \mid y_{1}, \ldots, y_{n}\right) d \theta \\ &=\int p(\tilde{y} \mid \theta) p\left(\theta \mid y_{1}, \ldots, y_{n}\right) d \theta \\ &=\int \operatorname{dpois}(\tilde{y}, \theta) \operatorname{dgamma}\left(\theta, a+\sum y_{i}, b+n\right) d \theta \\ &=\int\left\{\frac{1}{\tilde{y} !} \theta^{\tilde{y}} e^{-\theta}\right\}\left\{\frac{(b+n)^{a+\sum y_{i}}}{\Gamma\left(a+\sum y_{i}\right)} \theta^{a+\sum y_{i}-1} e^{-(b+n) \theta}\right\} d \theta \\ &=\frac{(b+n)^{a+\sum y_{i}}}{\Gamma(\tilde{y}+1) \Gamma\left(a+\sum y_{i}\right)} \int_{0}^{\infty} \theta^{a+\sum y_{i}+\tilde{y}-1} e^{-(b+n+1) \theta} d \theta . \end{aligned}\]

Evaluation of this complicated integral looks daunting, but it turns out that it can be done without any additional calculus. Let’s use what we know about the gamma density:

$1=\int_{0}^{\infty} \frac{b^{a}}{\Gamma(a)} \theta^{a-1} e^{-b \theta} d \theta \quad$ for any values $a, b>0$.

This means that

$\int_{0}^{\infty} \theta^{a-1} e^{-b \theta} d \theta=\frac{\Gamma(a)}{b^{a}} \quad$ for any values $a, b>0 .$

Now substitute in $a+\sum y_{i}+\tilde{y}$ instead of $a$ and $b+n+1$ instead of $b$ to get

\[\int_{0}^{\infty} \theta^{a+\sum y_{i}+\tilde{y}-1} e^{-(b+n+1) \theta} d \theta=\frac{\Gamma\left(a+\sum y_{i}+\tilde{y}\right)}{(b+n+1)^{a+\sum y_{i}+\tilde{y}}}\]

After simplifying some of the algebra, this gives

\[p\left(\tilde{y} \mid y_{1}, \ldots, y_{n}\right)=\frac{\Gamma\left(a+\sum y_{i}+\tilde{y}\right)}{\Gamma(\tilde{y}+1) \Gamma\left(a+\sum y_{i}\right)}\left(\frac{b+n}{b+n+1}\right)^{a+\sum y_{i}}\left(\frac{1}{b+n+1}\right)^{\tilde{y}}\]

In Practice (Code)

Section forthcoming
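In the meantime, here is a minimal R sketch of the Gamma-Poisson update and of the posterior predictive distribution, assuming a Gamma(2, 1) prior and made-up count data; the closed-form predictive derived above is compared against R’s dnbinom:

```r
# Gamma-Poisson conjugate update with made-up count data.
a <- 2; b <- 1                      # Gamma(a, b) prior (shape, rate), assumed
y <- c(3, 1, 4, 0, 2, 2)            # made-up Poisson counts
n <- length(y)

a_post <- a + sum(y)                # posterior shape
b_post <- b + n                     # posterior rate

a_post / b_post                                  # posterior mean of theta
qgamma(c(0.025, 0.975), a_post, rate = b_post)   # 95% credible interval

# Posterior predictive: the closed form derived above is a negative binomial
y_tilde <- 0:10
pred_manual <- gamma(a_post + y_tilde) / (gamma(y_tilde + 1) * gamma(a_post)) *
  (b_post / (b_post + 1))^a_post * (1 / (b_post + 1))^y_tilde

all.equal(pred_manual,
          dnbinom(y_tilde, size = a_post, prob = b_post / (b_post + 1)))   # TRUE
```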

The Normal Model

Theory (Math)

The Normal, or Gaussian, distribution is a two-parameter family of continuous probability distributions supported on the real line. Its PDF is defined as follows

\[f(\theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \operatorname{exp}\left\{-\frac{1}{2}\left(\frac{\theta - \mu}{\sigma}\right)^2\right\}\]

where the $\mu$ parameter is the mean as well as the median and mode of the distribution, and the $\sigma^2$ parameter is its variance.

The Normal family of distributions is important given its ubiquity. This arises from the fact that the mean of many observations of a random process (the so-called sample mean) converges to a normal distribution when the number of samples (the sample size) becomes large enough, a result known as the Central Limit Theorem5. For this reason many quantities are approximately normally distributed since they result from, i.e. constitute the sum of, many independent processes. An example would be biological phenotypes like birds’ wing lengths, which are determined by the contributions of large numbers of proteins, themselves encoded by a myriad of gene variants (alleles).

Bayesian inference for the Normal sampling model is more complicated than for the previous two examples because it is a two-parameter model. We’ll approach inference in the Bayesian regime by splitting the problem into three cases where, in the first two cases, we assume we know6 one of these two parameters of the sampling model.

  1. Known or fixed sampling variance, $\sigma^2$, to perform inference on the population mean, $\theta$
  2. Known or fixed sampling mean, $\theta$, to perform inference on the population variance, $\sigma^2$
  3. Both $\theta$ and $\sigma^2$ unknown, to perform joint inference on the mean and variance

We can show the conjugacy of the Normal prior to the Normal sampling model (i.e. likelihood) for the first case. With the prior $\theta \sim \operatorname{normal}(\mu_{0}, \tau_{0}^{2})$ and the sampling model $Y_{1}, \ldots, Y_{n} \mid \theta, \sigma^{2} \sim \text{i.i.d. } \operatorname{normal}(\theta, \sigma^{2})$, multiplying the prior by the likelihood and collecting the terms inside the exponent (ignoring the common factor of $-\frac{1}{2}$) gives

\[\begin{array}{c} \frac{1}{\tau_{0}^{2}}\left(\theta^{2}-2 \theta \mu_{0}+\mu_{0}^{2}\right)+\frac{1}{\sigma^{2}}\left(\sum y_{i}^{2}-2 \theta \sum y_{i}+n \theta^{2}\right)=a \theta^{2}-2 b \theta+c \end{array}\]

where

\[a=\frac{1}{\tau_{0}^{2}}+\frac{n}{\sigma^{2}}, \quad b=\frac{\mu_{0}}{\tau_{0}^{2}}+\frac{\sum y_{i}}{\sigma^{2}}, \quad \text { and } c=c\left(\mu_{0}, \tau_{0}^{2}, \sigma^{2}, y_{1}, \ldots, y_{n}\right)\]

Now let’s see if $p\left(\theta \mid \sigma^{2}, y_{1}, \ldots, y_{n}\right)$ takes the form of a normal density

\[\begin{aligned} p\left(\theta \mid \sigma^{2}, y_{1}, \ldots, y_{n}\right) & \propto \exp \left\{-\frac{1}{2}\left(a \theta^{2}-2 b \theta\right)\right\} \\ &=\exp \left\{-\frac{1}{2} a\left(\theta^{2}-2 b \theta / a+b^{2} / a^{2}\right)+\frac{1}{2} b^{2} / a\right\} \\ & \propto \exp \left\{-\frac{1}{2} a(\theta-b / a)^{2}\right\} \\ &=\exp \left\{-\frac{1}{2}\left(\frac{\theta-b / a}{1 / \sqrt{a}}\right)^{2}\right\} \end{aligned}\]

which is the kernel of a normal density with mean $b / a$ and variance $1 / a$, i.e. $\theta \mid \sigma^{2}, y_{1}, \ldots, y_{n} \sim \operatorname{normal}(b / a, 1 / a)$.

First Case: $\theta$ unknown; $\sigma^2_{sample}$ known / fixed

Inference on the Normal is easier if we assume that we know the sampling variance, $\sigma^2$.

Analogy: With the prior $\theta \sim N(\mu, \tau^2)$, we can think of the marginal (prior predictive) distribution of $Y$ by analogy to

\[Y \vert \theta \sim N(\theta, \sigma^2)\] \[W + \theta \vert \theta \sim N(\theta, \sigma^2); \quad W \sim N(0, \sigma^2)\] \[(Y \vert \theta) =_{distribution} (W + \theta \vert \theta)\] \[(Y, \theta) =_{distribution} (W + \theta, \theta) \implies Y =_{distribution} W + \theta \iff Y \sim N(\mu + 0, \tau^2 + \sigma^2)\]

The predictive distribution has more uncertainty than the data distribution.
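A short Monte Carlo check of this (with made-up values of $\mu$, $\tau^2$ and $\sigma^2$): simulating $\theta$ from its prior and then $Y \mid \theta$ from the sampling model recovers a marginal mean of $\mu$ and a marginal variance of $\tau^2 + \sigma^2$:

```r
# Monte Carlo check that marginally Y ~ N(mu, tau^2 + sigma^2),
# with theta ~ N(mu, tau^2) and Y | theta ~ N(theta, sigma^2); values made up.
set.seed(42)
mu <- 2; tau_sq <- 3; sigma_sq <- 1
n_sim <- 1e5

theta_sim <- rnorm(n_sim, mu, sqrt(tau_sq))
y_sim     <- rnorm(n_sim, theta_sim, sqrt(sigma_sq))

c(mean(y_sim), var(y_sim))          # approx. mu = 2 and tau^2 + sigma^2 = 4
```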

Reparametrization: It is often convenient to work with the precision rather than the variance, i.e. to parametrize the model by $(\mu, 1 / \sigma^2)$.

If $\theta \sim p(\theta)$ and

\[p(\theta) \propto \exp \left \{ - \frac{1}{2} a \theta^2 + b \theta \right \}\]

then $\theta \sim \operatorname{Normal}\left(\mu = \frac{b}{a}, \sigma^2 = \frac{1}{a}\right)$.
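Putting the pieces of the first case together numerically (prior hyperparameters, known $\sigma^2$ and data all made up), the posterior mean $b/a$ and variance $1/a$ can be checked against a brute-force grid approximation of prior $\times$ likelihood:

```r
# Posterior for theta with known sampling variance sigma^2 (made-up numbers).
mu0 <- 0; tau0_sq <- 4              # prior mean and variance, assumed
sigma_sq <- 1                       # known sampling variance
y <- c(1.2, 0.7, 1.9, 1.4, 0.3)     # made-up data
n <- length(y)

a <- 1 / tau0_sq + n / sigma_sq
b <- mu0 / tau0_sq + sum(y) / sigma_sq

mu_n  <- b / a                      # posterior mean
var_n <- 1 / a                      # posterior variance
c(mu_n, var_n)

# Check against a grid approximation of prior x likelihood
theta <- seq(-3, 4, length.out = 2001)
unnorm <- dnorm(theta, mu0, sqrt(tau0_sq)) *
  sapply(theta, function(t) prod(dnorm(y, t, sqrt(sigma_sq))))
post_grid <- unnorm / sum(unnorm) / diff(theta)[1]
max(abs(post_grid - dnorm(theta, mu_n, sqrt(var_n))))   # small (discretisation error)
```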

Second Case: $\theta$ known / fixed; $\sigma^2_{sample}$ unknown

Third Case (General Case): Both $\theta$ and $\sigma^2_{sample}$ unknown

Relationship between Gamma and Inverse Gamma Distributions

To derive the inverse gamma distribution from the gamma, we use the change of variables method, also known as the transformation theorem for random variables.

Parameterizations

There are at least a couple of common parameterizations of the gamma distribution. For our purposes, a gamma $(\alpha, \beta)$ distribution has density

\[f(x)=\frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha-1} \exp (-x / \beta)\]

for $x>0$. With this parameterization, a gamma $(\alpha, \beta)$ distribution has mean $\alpha \beta$ and variance $\alpha \beta^{2}$.

Define the inverse gamma (IG) distribution to have the density

\[f(x)=\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha-1} \exp (-\beta / x)\]

for $x>0$.

Relation to the Gamma Distribution

With the above parameterizations, if $X$ has a gamma $(\alpha, \beta)$ distribution then $Y=1 / X$ has an $\mathrm{IG}(\alpha, 1 / \beta)$ distribution. To see this, apply the transformation theorem.

\[\begin{aligned} f_{Y}(y) &=f_{X}(1 / y) \left \vert \frac{d}{d y} y^{-1} \right \vert \\ &=\frac{1}{\Gamma(\alpha) \beta^{\alpha}} y^{-\alpha+1} \exp (-1 /(\beta y)) y^{-2} \\ &=\frac{(1 / \beta)^{\alpha}}{\Gamma(\alpha)} y^{-\alpha-1} \exp (-(1 / \beta) / y) \end{aligned}\]

John D. Cook (October 3, 2008) Inverse Gamma Distribution. https://www.johndcook.com/inverse_gamma.pdf.
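A quick simulation of this relation (arbitrary $\alpha$ and $\beta$, with $\beta$ playing the role of the scale parameter above): draw from the gamma, invert, and overlay the IG density defined above:

```r
# If X ~ gamma(alpha, scale = beta) then Y = 1/X ~ IG(alpha, 1/beta).
set.seed(7)
alpha <- 3; beta_ <- 2              # arbitrary values; beta_ is the *scale* parameter here

x_sim <- rgamma(1e5, shape = alpha, scale = beta_)
y_sim <- 1 / x_sim

# Inverse gamma density with parameters (alpha, 1/beta), per the definition above
dinvgamma <- function(y, a, b) b^a / gamma(a) * y^(-a - 1) * exp(-b / y)

hist(y_sim, breaks = 200, freq = FALSE, xlim = c(0, 3),
     main = "1/X vs. IG(alpha, 1/beta) density", xlab = "y")
curve(dinvgamma(x, alpha, 1 / beta_), add = TRUE, lwd = 2)
```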

In Practice

In either the one-parameter case with fixed $\sigma^2$ or the joint case with both $(\theta, \sigma^2)$ unknown, where the joint posterior factorises as $p(\theta \vert \sigma^2, y_1, \dots, y_n) \cdot p(\sigma^2 \vert y_1, \dots, y_n)$, we can sample from the (joint) posterior distribution by drawing $\sigma^2$ first and then drawing $\theta$ conditionally on the value of $\sigma^2$ drawn on that iteration of our MCMC. In doing so we are using the fact that the joint distribution is a completion of $\theta$. A completion
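As a sketch of that composition sampling: the conjugate posterior hyperparameters are not derived in this post, so the parametrisation and the numbers below are assumptions, namely the standard results $1/\sigma^2 \mid y_1, \ldots, y_n \sim \operatorname{gamma}(\nu_n / 2, \nu_n \sigma_n^2 / 2)$ and $\theta \mid \sigma^2, y_1, \ldots, y_n \sim \operatorname{normal}(\mu_n, \sigma^2 / \kappa_n)$, with placeholder values for $\mu_n, \kappa_n, \nu_n, \sigma_n^2$:

```r
# Sampling from the joint posterior p(theta, sigma^2 | y) by composition:
# draw sigma^2 from its marginal posterior, then theta | sigma^2.
# The hyperparameters below are placeholders, assumed to have been computed
# already from the prior and the data via the conjugate update.
mu_n <- 1.8; kappa_n <- 12; nu_n <- 11; sigma2_n <- 0.9

n_sim <- 1e4
sigma2_sim <- 1 / rgamma(n_sim, nu_n / 2, rate = nu_n * sigma2_n / 2)  # inverse-gamma draw
theta_sim  <- rnorm(n_sim, mu_n, sqrt(sigma2_sim / kappa_n))           # theta | sigma^2 draw

quantile(theta_sim, c(0.025, 0.5, 0.975))    # marginal posterior summary for theta
```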

Reference Table: Conjugate Bayesian Inference

| Likelihood Model | Parameters | Conjugate Prior Distribution | Prior Parameters | Posterior Parameters | Interpretation of Parameters | Posterior Predictive |
|---|---|---|---|---|---|---|
| Binomial | $p$ (probability) | Beta | $a, b \in \mathbb{R}^+$ | $a+y$, $b + n - y$ | $a$ successes, $b$ failures | Beta-Binomial |
| Poisson | $\lambda$ (rate) | Gamma | $a, b \in \mathbb{R}^+$ | $a + \sum_{i=1}^n Y_i$, $b + n$ | $a$ total occurrences in $b$ intervals | Negative Binomial |
| Normal (fixed variance $\sigma^2$) | $\mu$ (mean) | Normal | $\mu_0, \tau_{0}^{2}$ | $\left(\dfrac{1}{\tau_{0}^{2}}+\dfrac{n}{\sigma^{2}}\right)^{-1}\left(\dfrac{\mu_{0}}{\tau_{0}^{2}}+\dfrac{\sum_{i=1}^{n} y_{i}}{\sigma^{2}}\right)$, $\left(\dfrac{1}{\tau_{0}^{2}}+\dfrac{n}{\sigma^{2}}\right)^{-1}$ | | Normal |

The above table is adapted from the one on the Wikipedia entry for the Conjugate prior.

Forthcoming Sections

  • Add reparametrizations
  • Add R code
  • Add Normal
  • Follow-up on MCMC

References and Notes

  1. Whilst looking for an example source crediting Pierre-Simon Laplace for formalising Bayes’ Theorem mathematically, I came across the brilliant post A History of Bayes’ Theorem by lukeprog from the 29th of August 2011, which is a pithy synopsis of the book The Theory That Would Not Die by Sharon McGrayne. 

  2. In the case of the normal model, we can perform one-parameter inference if we assume a fixed, known variance. 

  3. See Aerin Kim’s excellent post entitled Beta Distribution — Intuition, Examples, and Derivation for an introductory walkthrough of the Beta distribution. 

  4. The channel Mathematical Monk has some excellent expositional video lectures on the Dirichlet distribution as chapters ML 7.7.A1 - 7.8 inclusive (four total) of his Machine Learning series. 

  5. Note the sample mean is itself a random variable and that there are weak and strong forms of the Central Limit Theorem which assume certain conditions including that the random variable being sampled from itself has finite mean and variance. 

  6. As usual, this is the kind of simplification that makes derivations or computations tractable. In practice, we might not know the sample variance but might still choose to fix it to a constant value to make life easier, depending on our goal.