Generalized Linear Models and the Exponential Family
An introduction to the exponential family of probability distributions. Familiarity with the exponential family is the basis for understanding the Generalized Linear Modelling (GLM) framework, which includes logistic and Poisson (log-linear) regression models for binary (Binomial) and count (Poisson) data.
Nelder and Wedderburn (1972)[^1] proposed the Generalized Linear Models (GLM) regression framework, which unifies the modelling of variables generated from many different stochastic distributions, including the normal (Gaussian), binomial, Poisson, exponential, gamma and inverse Gaussian.
They did this by re-expressing many common distributions as members of the more general exponential family of distributions, which has the form
\[f(y) = \exp\left\{\frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}\]
where
- $a(\phi)$
- $b(\theta)$
- $c(y, \phi)$
are all known functions. It is also common that $a(\phi)$ has the simple form
\[a(\phi) = \phi / p\]
where $p$ is a known prior weight, which is often $1$.
The parameter $\theta$ is the canonical (location) parameter and $\phi$ is the dispersion (scale) parameter.
It can be shown that if $Y \sim \mathcal{P}(\theta, \phi)$ for a distribution $\mathcal{P}$ in the Exponential family then the mean and variance are given as
\[\begin{gathered} \mathbb{E}[Y] = \mu = b'(\theta) \\ \operatorname{Var}[Y] = \sigma^2 = b''(\theta)a(\phi) \end{gathered}\]
where $b'(\theta)$ and $b''(\theta)$ are the first and second derivatives of $b(\theta)$ with respect to $\theta$. Both identities follow from differentiating $\int f(y) \, dy = 1$ with respect to $\theta$ under the integral sign, once for the mean and twice for the variance. Naturally, when $a(\phi) = \phi / p$ the variance has the simpler form
\[\operatorname{Var}[Y] = \sigma^2 = \phi b''(\theta) / p\]
As mentioned, the above formulation subsumes the normal (Gaussian), binomial, Poisson, exponential, gamma and inverse Gaussian distributions[^2]. Together with link functions, introduced below, this enables the modelling of variables generated according to these distributions within one simple regression framework, for example allowing us to fit all of these models uniformly with the iteratively reweighted least squares (IRLS) algorithm.
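To make this concrete, here is a minimal sketch of fitting one such model with the statsmodels library (my choice of library, not something mandated by the theory; any GLM implementation would do). Its GLM class fits by IRLS, and swapping the family, say Gaussian or Binomial for Poisson, reuses exactly the same machinery.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 1)))  # design matrix with an intercept
beta = np.array([0.5, 0.3])

# Simulate a Poisson response whose log-mean is linear in the predictors.
y = rng.poisson(np.exp(X @ beta))

# statsmodels fits GLMs by iteratively reweighted least squares (IRLS).
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)  # estimates should land close to [0.5, 0.3]
```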
Alternative Formulation
We can alternatively denote the density of a distribution or model, $\mathcal{P}$, in the $d$-dimensional Exponential Family as
\[p(x ; \theta)=\exp \left(\sum_{i=1}^{d} \eta_{i}(\theta) T_{i}(x)-B(\theta)\right) h(x)\]
where
- $\eta_{i}(\theta) \in \mathbb{R}$ are called the natural parameters
- $T_{i}(x) \in \mathbb{R}$ are its sufficient statistics
- $B(\theta)$ is called the log-partition function because it is the logarithm of the normalising factor: $B(\theta) = \log \int \exp \left(\sum_{i=1}^{d} \eta_{i}(\theta) T_{i}(x)\right) h(x) \, dx$
- $h(x) \in \mathbb{R}$ is the base measure
I prefer this formulation as I find it more intuitive.
Example: The Gaussian Distribution is a member of the Exponential Family
The Gaussian has the following density (PDF)
\[f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}\right\}\]
Expanding the square, $(y-\mu)^2 = y^2 - 2y\mu + \mu^2$, and moving the normalising constant inside the exponent, we can express this in the above form as
\[f(y) = \exp\left\{ \frac{y \mu-\frac{1}{2}\mu^2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\}\]
where $\mu / \sigma^2$ is the coefficient on the random variable $y$, yielding $\theta = \mu$ and $\phi = \sigma^2$ as the mapping of the Gaussian's parameters onto the general form.
In this instance we have $a(\phi) = \phi$, i.e. the prior weight is $p=1$. Equally, we have $b(\theta) = \frac{1}{2} \theta^2$ and $c(y, \phi) = - \frac{y^2}{2\phi} - \frac{1}{2}\log(2\pi\phi)$, where the form of $c(y, \phi)$ results in part from having to incorporate the normalising constant, previously outside the exponent, into the exponentiation.
We can trivially verify the mean and variance formulae hold by taking the first and second derivatives of $b(\theta) = \frac{1}{2} \theta^2$
\[\begin{aligned} b(\theta) &= \frac{1}{2} \theta^2 = \frac{1}{2} \mu^2 \\ b'(\theta) &= \theta = \mu \\ b''(\theta) &= 1 \end{aligned}\]
Since $\mathbb{E}[Y] = b'(\theta) = \mu$, the expectation formula is verified, and for the variance we have $a(\phi) \cdot b''(\theta) = \phi b''(\theta) / p = \phi = \sigma^2$, so the variance formula also recovers the familiar one for the Gaussian.
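If you would rather not take the derivatives by hand, a quick symbolic check (a minimal sketch using sympy, assuming it is available) confirms them:

```python
import sympy as sp

theta = sp.symbols('theta')
b = theta**2 / 2  # b(theta) for the Gaussian

print(sp.diff(b, theta))     # theta, i.e. b'(theta) = mu
print(sp.diff(b, theta, 2))  # 1, so Var[Y] = a(phi) * 1 = sigma**2
```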
Example: The Exponential Distribution is a member of the Exponential Family
For the Exponential distribution we have the density
\[f(x)= \lambda \exp\left\{ -\lambda x\right\} = \exp\left\{ -\lambda x + \log(\lambda) \right\}\]
So, following the second formulation with $\theta = \lambda$ and considering only the univariate case ($d=1$), we have
- $\eta(\theta) = - \theta$
- $T(x) = x$
- $B(\theta) = - \log(\theta)$
We can consider $h(x) = 1$ or even $h(x) = \mathbb{I}[x \geq 0]$ to include the support inside the density expression.
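We can also check numerically that $B(\theta)$ really is the logarithm of the normalising factor: written in this form, the density should integrate to $1$ over the support. A minimal sketch with scipy (an assumed dependency):

```python
import numpy as np
from scipy.integrate import quad

theta = 1.7  # an arbitrary rate parameter

# Density in exponential-family form: exp(eta(theta) * T(x) - B(theta)) * h(x),
# with eta(theta) = -theta, T(x) = x, B(theta) = -log(theta), h(x) = 1 on x >= 0.
def density(x):
    return np.exp(-theta * x - (-np.log(theta)))

total, _ = quad(density, 0, np.inf)
print(total)  # ~1.0: B(theta) is exactly the log of the normalising factor
```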
Link Functions
Within generalized linear modelling we are trying to regress stochastic variables onto predictors. For normally distributed variables we can do this with ordinary least squares regression, either in closed form or with various numerical optimization methods. However, to extend the regression framework to differently distributed variables, we require link functions that allow them to dovetail with the existing machinery.
Link functions are one-to-one, continuously differentiable transformations, $g(\cdot)$. We apply them to the mean, $\mu$, of a target variable.
\[\eta = g(\mu)\]
Examples of link functions include the identity, log, reciprocal, logit (the log of the odds) and probit (the quantile function of the normal distribution).
We assume that the transformed mean is a linear function of the predictors, and accordingly refer to it as the linear predictor: it is the expected value of the response, passed through the link function and expressed as a linear function of the data.
\[\eta = g(\mu) = \mathbf{x}\boldsymbol{\beta}\]
This simultaneously yields a familiar simple model for the linear predictor and the ability to recover the mean before transformation by inverting the link function, which we can do since $g$ is bijective (i.e. one-to-one and onto, and therefore invertible).
\[\mu = g^{-1}(\mathbf{x}\boldsymbol{\beta})\]
It is very important to realise that it is the expected value, $\mu$, of the response variable, $y$, that is modelled, not the response variable itself. A model where $\log(\mathbb{E}[y]) = \log(\mu)$ is a linear function of the vector of predictors $\mathbf{x}$ is not equivalent to a model where $\log(y)$ is linear in the same $\mathbf{x}$.
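A quick numerical illustration of the distinction (a sketch using numpy; the log-normal is chosen only because both quantities are known in closed form for it):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# log of the mean (what a GLM with a log link models): log E[y] = 0.5 here
print(np.log(y.mean()))
# mean of the logs (what OLS on log(y) models): E[log y] = 0.0 here
print(np.log(y).mean())
```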
Vanilla Regression uses Gaussian Error and the Identity Link
The standard regression model can be described as a generalized linear model where the error is normally distributed and the link function is the identity, giving
\[\eta = \mu\]
We saw that for the Gaussian distribution we have $\mu = \eta = \theta$, where $\theta$ is the canonical parameter appearing in the expression for the density of the Exponential Family.
A link function which maps the linear predictor, $\eta$, onto the canonical parameter of the Exponential Family density, $\theta$, is specifically referred to as the canonical link.
The canonical links for some common probability distributions are given below.
| Error | Link |
|---|---|
| Normal | Identity |
| Binomial | Logit |
| Poisson | Log |
Apropos, remember that when we mention probability distributions here, we are referring to the way in which the response variables (or, equivalently, their errors under a null model) are distributed.
Canonical links bring the advantage that a minimal sufficient statistic for $\beta$ exists, which is to say that all the information about $\beta$ is contained in a function of the data of the same dimensionality as $\beta$.
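For what it's worth, statsmodels (assumed here, as above) defaults each family to its canonical link, which we can inspect directly; the exact link class names may vary across versions:

```python
import statsmodels.api as sm

# Each family's default link in statsmodels is its canonical link.
for family in (sm.families.Gaussian(), sm.families.Binomial(), sm.families.Poisson()):
    print(type(family).__name__, '->', type(family.link).__name__)
# Expected output (names may differ by version):
# Gaussian -> Identity
# Binomial -> Logit
# Poisson -> Log
```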
References
Most of this post is a rehashing of the various lecture notes referenced below, which form part of courses on Bayesian statistics or general probability.
- Generalized Linear Model Theory (Appendix B) from Generalized Linear Models by Germán Rodríguez
- Chapter 8. The exponential family: Basics from Bayesian Modeling and Inference (Stat 260/CS 294 at UC Berkeley) taught by Michael Jordan
- Lecture 2 (September 24) from Theory of Statistics Fall 2015 (Stanford STATS 300A) taught by Lester Mackey
- The Exponential Family and Statistical Applications by Anirban DasGupta
[^1]: J. A. Nelder, R. W. M. Wedderburn (1972). Generalized Linear Models. Journal of the Royal Statistical Society: Series A (General), Volume 135, Issue 3, pp. 370-384. Accessible at https://repository.rothamsted.ac.uk/download/25425465aa52d05e1a9e553b2daddeeffe15d0ba40f5f9b8937aaab5c3d29e1d/4410096/Nelder%201972.pdf

[^2]: Expression of the densities in this general form also emphasises sufficient statistics, i.e. optimal data reduction.