Some Information Theory
Sketches of some concepts from Information Theory. Readers are referred to Shannon’s original 1948 paper A Mathematical Theory of Communication.
- Entropy
- Kullback-Leibler Divergence (Relative Entropy)
- Relationship between Cross-entropy and Log-Likelihood
- A Quick Note on Perplexity
- References and Further Reading
Entropy
Entropy is a measurement of the uncertainty inherent in a random variable $X$, usually denoted $H(X)$:

$$H(X) = -\sum_{x} p(x) \log p(x) = \sum_{x} p(x) \log \frac{1}{p(x)} = \mathbb{E}\left[\log \frac{1}{p(X)}\right]$$
The rightmost expression most clearly presents the entropy of a random variable as an expectation of (a monotonic function of) the reciprocal of that variable's probability distribution.
A few remarks:
- entropy is a weighted sum of the information content associated with each possible outcome of the random variable
- information content, self-information or surprisal is the information conveyed in the log-probability-reciprocals in the above rightmost formula: $I(x) = \log \frac{1}{p(x)} = -\log p(x)$
- entropy is a weighted sum of non-negative quantities, so entropy is zero or positive (i.e. non-negative)
- entropy is higher for systems with probability dispersed over more possible states: the entropy of a dice roll is higher than that of a coin flip, which is higher than that of a sure thing (see the numerical sketch after this list)
- the entropy of a certain event is zero: for the sure-to-happen event we see that $H = 1 \cdot \log \frac{1}{1} = 0$
- a degenerate distribution, a point mass, has entropy $H = 0$
- use of the logarithm makes the self-information of independent events additive, analogous to how the probability of independent events is multiplicative: $I(x, y) = \log \frac{1}{p(x)p(y)} = \log \frac{1}{p(x)} + \log \frac{1}{p(y)} = I(x) + I(y)$
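To make these remarks concrete, here is a minimal numerical sketch (the helper function and its name are my own) comparing the three cases:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, in bits by default.

    Zero-probability outcomes contribute nothing (0 log 0 := 0).
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([1.0]))        # sure thing: 0.0 bits
print(entropy([0.5, 0.5]))   # fair coin:  1.0 bit
print(entropy([1/6] * 6))    # fair die:   ~2.585 bits
```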
Here is a graph of the information content of an event, $I(x) = \log \frac{1}{p(x)}$, as a function of its probability $p(x)$: the information content decreases monotonically, unbounded for vanishingly rare events and zero for a certain event.
Entropy in turn gives rise to the principle of maximum entropy, which asserts that the probability distribution best suited to representing the system is the one with the largest entropy, for example when determining priors in the context of Bayesian inference.
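As a quick supporting derivation (a standard result, not from the original text): among distributions over $n$ outcomes, entropy is maximised by the uniform distribution. This follows from Jensen's inequality applied to the concave $\log$:

$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log \sum_{x} p(x) \cdot \frac{1}{p(x)} = \log n,$$

with equality exactly when $p(x) = 1/n$ for all $x$.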
Here is another graph, showing the entropy of a binary random variable, $H(X) = -p \log p - (1-p) \log(1-p)$, as a function of the probability of success $p$.
A couple of observations:
- the entropy is highest for $p = 0.5$, the situation in which we would be least comfortable placing a bet on the outcome
- the entropy is zero at both $p = 0$ and $p = 1$, when we are sure of the outcome
- the parameterisation that maximises the entropy of the underlying model (distribution) is the parameterisation we would use in a state of greatest ignorance
Example of Entropy for Binary Random Variables: Formally, we arrived at this graph by applying the formula for entropy to a Bernoulli random variable taking value $1$ with probability $p$:

$$H(X) = -\sum_x p(x) \log p(x) = -p \log p - (1-p) \log(1-p)$$
Zero-probability events: Note above that the value $0 \log 0$ is taken to be $0$, consistent with the limit $\lim_{p \to 0^+} p \log p = 0$; zero-probability outcomes contribute nothing to the entropy.
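Here is a minimal sketch of the binary entropy function in code (numpy-based; the function name is my own), with the zero-probability convention handled explicitly:

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable in bits, with 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

print(binary_entropy(0.5))   # 1.0    (maximum: a fair coin)
print(binary_entropy(0.9))   # ~0.469
print(binary_entropy(0.0))   # 0.0    (a sure thing)
```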
Names: Entropy is also referred to as Shannon entropy in the context of information theory. For continuous random variables, Shannon entropy is known as differential entropy.
Kullback-Leibler Divergence (Relative Entropy)
The Kullback-Leibler divergence (“KL divergence”), $D_{\mathrm{KL}}(P \,\|\, Q)$, measures how much a probability distribution $P$ diverges from a reference distribution $Q$.
The Kullback-Leibler divergence is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

with $p$ and $q$ the probability mass functions of $P$ and $Q$ (for continuous distributions, the sum becomes an integral over the densities).
Note that the KL divergence is neither symmetric nor does it satisfy the triangle inequality, so it does not constitute a valid distance metric, although it is non-negative.
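A minimal numerical sketch of the definition (the helper name is my own), which also exhibits the asymmetry; as a cross-check, scipy.stats.entropy(p, q) computes the same quantity in nats:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q) in nats

def kl_divergence(p, q):
    """D_KL(p || q) in nats, for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.5])   # a fair coin
q = np.array([0.9, 0.1])   # a heavily biased coin

print(kl_divergence(p, q))   # ~0.511
print(kl_divergence(q, p))   # ~0.368 -- not equal: the divergence is asymmetric
print(entropy(p, q))         # matches kl_divergence(p, q)
```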
Bayesian Context
The Kullback-Leibler divergence, or relative entropy, between two distributions can be thought of in the Bayesian context with prior $Q$ and posterior $P$: it measures the information gained when beliefs are revised from the prior to the posterior in the light of data.
The KL divergence can be written out as a function of the prior and posterior distributions:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}, \qquad p = \text{posterior}, \; q = \text{prior}$$
This is exactly the usual formulation, just with the prior substituted in for $Q$ and the posterior for $P$.
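As an illustrative sketch (the two-hypothesis setup is my own invention, not from the original text): after observing a single coin flip, the KL divergence from the posterior to the prior quantifies the information gained:

```python
import numpy as np

# Two hypotheses about a coin: it is fair, or it lands heads 90% of the time
prior = np.array([0.5, 0.5])               # P(hypothesis) before any data
likelihood_heads = np.array([0.5, 0.9])    # P(heads | hypothesis)

# Bayes' rule after observing a single head
posterior = prior * likelihood_heads
posterior /= posterior.sum()

# Information gained by the observation: D_KL(posterior || prior), in nats
info_gain = np.sum(posterior * np.log(posterior / prior))
print(posterior)   # [0.3571... 0.6428...]
print(info_gain)   # ~0.041 nats
```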
Frequentist Context
In the frequentist Maximum Likelihood Estimation (MLE) context, maximising the log-likelihood of observing the data is equivalent to minimising the KL divergence between the data-generating distribution and the model distribution.
The KL divergence from the model $q_\theta$ to the data distribution $p$ decomposes as

$$D_{\mathrm{KL}}(p \,\|\, q_\theta) = \sum_x p(x) \log \frac{p(x)}{q_\theta(x)} = -\sum_x p(x) \log q_\theta(x) + \sum_x p(x) \log p(x) = H(p, q_\theta) - H(p)$$
The left-hand term is the cross-entropy, $H(p, q_\theta)$, and is equivalent[^1] to the negative log-likelihood for discrete chance variables[^2].
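A quick numerical check of this decomposition (the two distributions are arbitrary toy choices):

```python
import numpy as np

p = np.array([0.5, 0.5])   # the "true" distribution
q = np.array([0.9, 0.1])   # the model distribution

cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
entropy_p     = -np.sum(p * np.log(p))       # H(p)
kl            =  np.sum(p * np.log(p / q))   # D_KL(p || q)

print(np.isclose(kl, cross_entropy - entropy_p))   # True
```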
Relationship between Cross-entropy and Log-Likelihood
The cross-entropy of two (discrete) distributions, $H(p, q) = -\sum_x p(x) \log q(x)$, is equivalent, up to a constant factor, to the negative log-likelihood.
We can see this if we re-express the log-likelihood function in terms of the empirical distribution of the data.

We can express the likelihood of a given sample, $x_1, \dots, x_N$, under our model $q_\theta$ as

$$L(\theta) = \prod_{i=1}^{N} q_\theta(x_i), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log q_\theta(x_i)$$

Conversely, there will be some true distribution (unknown to us) which determines the data. We can denote the likelihood of a data point $x$ under that true distribution as $p(x)$, and estimate it by the empirical distribution $\hat{p}(x) = n_x / N$, where $n_x$ counts the occurrences of $x$ in the sample. Grouping the log-likelihood by distinct values then gives

$$\ell(\theta) = \sum_x n_x \log q_\theta(x) = N \sum_x \hat{p}(x) \log q_\theta(x) = -N \, H(\hat{p}, q_\theta)$$

This gives the relation that maximisation of the log-likelihood yields the same parameters as minimisation of the cross-entropy:

$$\arg\max_\theta \, \ell(\theta) = \arg\min_\theta \, H(\hat{p}, q_\theta)$$
Now since the entropy of the data source is fixed with respect to our model parameters[^3], i.e. we cannot do anything about this term via optimisation, we have that the argmin over model parameters of the cross-entropy coincides with the argmin of the KL divergence:

$$\arg\min_\theta \, H(\hat{p}, q_\theta) = \arg\min_\theta \left[ D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta) + H(\hat{p}) \right] = \arg\min_\theta \, D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)$$
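A numerical sketch of the equivalence (the toy sample and model are my own): the per-sample negative log-likelihood equals the cross-entropy between the empirical distribution and the model.

```python
import numpy as np

# A toy sample of coin flips (0 = tails, 1 = heads) and a candidate model
sample  = np.array([1, 1, 1, 0, 1, 0, 1, 1])   # 6 heads, 2 tails
q_theta = np.array([0.3, 0.7])                 # model: P(0) = 0.3, P(1) = 0.7

# Average negative log-likelihood of the sample under the model
nll = -np.mean(np.log(q_theta[sample]))

# Cross-entropy H(p_hat, q_theta) against the empirical distribution p_hat
p_hat = np.bincount(sample, minlength=2) / len(sample)   # [0.25, 0.75]
cross_entropy = -np.sum(p_hat * np.log(q_theta))

print(np.isclose(nll, cross_entropy))   # True
```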
A Quick Note on Perplexity
Perplexity is a measure of how well a distribution or model predicts a sample or random variable.
A low perplexity indicates that the {model, distribution} is good at predicting the {sample, random variable}.
Perplexity for a distribution $P$ is computed as the exponentiated entropy,

$$\mathrm{PPL}(P) = 2^{H(P)},$$

with $H$ measured in bits; for a model $q$ evaluated on a sample $x_1, \dots, x_N$, it is the exponentiated cross-entropy, $2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}$.
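A minimal sketch computing perplexity as exponentiated cross-entropy (the per-token probabilities are illustrative):

```python
import numpy as np

# Probabilities a model assigned to each token of an observed sequence
probs = np.array([0.25, 0.5, 0.125, 0.5])

# Perplexity = 2 ** (average negative log2-probability), i.e. 2 ** cross-entropy
perplexity = 2 ** (-np.mean(np.log2(probs)))
print(perplexity)   # ~3.36; a uniform model over 4 outcomes would score 4.0
```

A perplexity of about 3.4 here can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 3.4 outcomes per token.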
References and Further Reading
- Shannon (1948) A Mathematical Theory of Communication. The Bell System Technical Journal. Vol. 27, pp. 379–423, 623–656. Reprint with corrections http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.
- Mike Morais (2018) Lecture 8: Information Theory and Maximum Entropy. From NEU 560: Statistical Modeling and Analysis of Neural Data (Spring 2018).
- Daniel Commenges (2015) Information Theory and Statistics: an overview. https://arxiv.org/pdf/1511.00860.pdf
- Derivation of the approximation of the Binomial distribution’s entropy using the De Moivre–Laplace theorem https://math.stackexchange.com/questions/244455/entropy-of-a-binomial-distribution.
- Section 3.13 on Information Theory of the Deep Learning Book - thanks to Vincenzo for pointing to this resource.
- Some other important/interesting concepts: Mutual information, Pointwise mutual information, Conditional entropy
[^1]: This is stated in the section Relation to log-likelihood under Wikipedia's entry for Cross entropy.

[^2]: For chance variables, you can read random variables. Feeling inspired, I'm going with the term as used by Shannon (1948) A Mathematical Theory of Communication. The Bell System Technical Journal. Vol. 27, pp. 379–423, 623–656. Reprint with corrections http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.

[^3]: In fact, it is with respect to our whole model, not just the parameters. The model imposes a general inductive bias: we expect our data to be sampled from the same family of distributions overall. Then we would like the parameters to lead to the best approximation of the truth given the model.