Some Information Theory
Sketches of some concepts from Information Theory. Readers are referred to Shannon’s original 1948 paper A Mathematical Theory of Communication.
- Entropy
- Kullback-Leibler Divergence (Relative Entropy)
- Relationship between Cross-entropy and Log-Likelihood
- A Quick Note on Perplexity
- References and Further Reading
Entropy
Entropy is a measurement of the uncertainty inherent in a random variable $X$, usually denoted $H(X)$:

$$H(X) = -\sum_{x} p(x) \log p(x) = \sum_{x} p(x) \log \frac{1}{p(x)} = \mathbb{E}\left[\log \frac{1}{p(X)}\right]$$
The rightmost expression most clearly presents the entropy of a random variable as an expectation of (a monotonic function of) the reciprocal of that variable's probability distribution.
A few remarks:
- entropy is a weighted sum of the information content associated with each possible outcome of the random variable
- information content, self-information or surprisal is the information conveyed in the log-probability-reciprocals in the above rightmost formula: $I(x) = \log \frac{1}{p(x)} = -\log p(x)$
- entropy is a weighted sum of non-negative quantities, so entropy is zero or positive (i.e. non-negative)
- entropy is higher for systems with probability dispersed over more possible states: the entropy of a dice roll is higher than that of a coin flip, which is higher than that of a sure thing (see the numerical sketch after this list)
- the entropy of a certain event is zero: for the sure-to-happen event we see that $H = 1 \cdot \log \frac{1}{1} = 0$
- a degenerate distribution, a point mass, has entropy $H = 0$
- use of the logarithm makes the self-information of independent events additive, analogous to how the probability of independent events is multiplicative: $I(x, y) = \log \frac{1}{p(x)p(y)} = \log \frac{1}{p(x)} + \log \frac{1}{p(y)} = I(x) + I(y)$
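To make these remarks concrete, here is a minimal numerical sketch (the helper function and its name are my own) comparing the three cases:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, in bits by default.

    Zero-probability outcomes contribute nothing (0 log 0 := 0).
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([1.0]))        # sure thing: 0.0 bits
print(entropy([0.5, 0.5]))   # fair coin:  1.0 bit
print(entropy([1/6] * 6))    # fair die:   ~2.585 bits
```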
Here is a graph of the information content of an event, $I(x) = \log \frac{1}{p(x)}$, as a function of its probability $p(x)$: the information content decreases monotonically, unbounded for vanishingly rare events and zero for a certain event.
Entropy in turn gives rise to the principle of maximum entropy, which asserts that the probability distribution best suited to representing the system is the one with the largest entropy, for example when determining priors in the context of Bayesian inference.
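As a quick supporting derivation (a standard result, not from the original text): among distributions over $n$ outcomes, entropy is maximised by the uniform distribution. This follows from Jensen's inequality applied to the concave $\log$:

$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log \sum_{x} p(x) \cdot \frac{1}{p(x)} = \log n,$$

with equality exactly when $p(x) = 1/n$ for all $x$.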
Here is another graph, showing the entropy of a binary random variable, $H(X) = -p \log p - (1-p) \log(1-p)$, as a function of the probability of success $p$.
A couple of observations:
- the entropy is highest for $p = 0.5$, the situation in which we would be least comfortable placing a bet on the outcome
- the entropy is zero at both $p = 0$ and $p = 1$, when we are sure of the outcome
- the parameterisation that maximises the entropy of the underlying model (distribution) is the parameterisation we would use in a state of greatest ignorance
Example of Entropy for Binary Random Variables: Formally, we arrived at this graph by applying the formula for entropy to a Bernoulli random variable taking value $1$ with probability $p$:

$$H(X) = -\sum_x p(x) \log p(x) = -p \log p - (1-p) \log(1-p)$$
Zero-probability events: Note above that the value $0 \log 0$ is taken to be $0$, consistent with the limit $\lim_{p \to 0^+} p \log p = 0$; zero-probability outcomes contribute nothing to the entropy.
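Here is a minimal sketch of the binary entropy function in code (numpy-based; the function name is my own), with the zero-probability convention handled explicitly:

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable in bits, with 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

print(binary_entropy(0.5))   # 1.0    (maximum: a fair coin)
print(binary_entropy(0.9))   # ~0.469
print(binary_entropy(0.0))   # 0.0    (a sure thing)
```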
Names: Entropy is also referred to as Shannon entropy in the context of information theory. For continuous random variables, Shannon entropy is known as differential entropy.
Kullback-Leibler Divergence (Relative Entropy)
The Kullback-Leibler divergence (“KL divergence”), $D_{\mathrm{KL}}(P \,\|\, Q)$, measures how much a probability distribution $P$ diverges from a reference distribution $Q$.
The Kullback-Leibler divergence is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

with $p$ and $q$ the probability mass functions of $P$ and $Q$ (for continuous distributions, the sum becomes an integral over the densities).
Note that the KL divergence is neither symmetric nor does it satisfy the triangle inequality, so it does not constitute a valid distance metric, although it is non-negative.
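A minimal numerical sketch of the definition (the helper name is my own), which also exhibits the asymmetry; as a cross-check, scipy.stats.entropy(p, q) computes the same quantity in nats:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q) in nats

def kl_divergence(p, q):
    """D_KL(p || q) in nats, for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.5])   # a fair coin
q = np.array([0.9, 0.1])   # a heavily biased coin

print(kl_divergence(p, q))   # ~0.511
print(kl_divergence(q, p))   # ~0.368 -- not equal: the divergence is asymmetric
print(entropy(p, q))         # matches kl_divergence(p, q)
```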
Bayesian Context
The Kullback-Leibler divergence, or relative entropy, between two distributions can be thought of in the Bayesian context with prior $Q$ and posterior $P$: it measures the information gained when beliefs are revised from the prior to the posterior in the light of data.
The KL divergence can be written out as a function of the prior and posterior distributions:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}, \qquad p = \text{posterior}, \; q = \text{prior}$$
This is exactly the usual formulation, just with the prior substituted in for $Q$ and the posterior for $P$.
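As an illustrative sketch (the two-hypothesis setup is my own invention, not from the original text): after observing a single coin flip, the KL divergence from the posterior to the prior quantifies the information gained:

```python
import numpy as np

# Two hypotheses about a coin: it is fair, or it lands heads 90% of the time
prior = np.array([0.5, 0.5])               # P(hypothesis) before any data
likelihood_heads = np.array([0.5, 0.9])    # P(heads | hypothesis)

# Bayes' rule after observing a single head
posterior = prior * likelihood_heads
posterior /= posterior.sum()

# Information gained by the observation: D_KL(posterior || prior), in nats
info_gain = np.sum(posterior * np.log(posterior / prior))
print(posterior)   # [0.3571... 0.6428...]
print(info_gain)   # ~0.041 nats
```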
Frequentist Context
In the frequentist Maximum Likelihood Estimation (MLE) context, maximising the log-likelihood of observing the data is equivalent to minimising the KL divergence between the data-generating distribution and the model distribution.
The KL divergence from the model $q_\theta$ to the data distribution $p$ decomposes as

$$D_{\mathrm{KL}}(p \,\|\, q_\theta) = \sum_x p(x) \log \frac{p(x)}{q_\theta(x)} = -\sum_x p(x) \log q_\theta(x) + \sum_x p(x) \log p(x) = H(p, q_\theta) - H(p)$$
The left-hand term is the cross-entropy, $H(p, q_\theta)$, and is equivalent[^1] to the negative log-likelihood for discrete chance variables[^2].
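A quick numerical check of this decomposition (the two distributions are arbitrary toy choices):

```python
import numpy as np

p = np.array([0.5, 0.5])   # the "true" distribution
q = np.array([0.9, 0.1])   # the model distribution

cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
entropy_p     = -np.sum(p * np.log(p))       # H(p)
kl            =  np.sum(p * np.log(p / q))   # D_KL(p || q)

print(np.isclose(kl, cross_entropy - entropy_p))   # True
```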
Relationship between Cross-entropy and Log-Likelihood
The cross-entropy of two (discrete) distributions, $H(p, q) = -\sum_x p(x) \log q(x)$, is equivalent, up to a constant factor, to the negative log-likelihood.
We can see this if we re-express the log-likelihood function in terms of the empirical distribution of the data.

We can express the likelihood of a given sample, $x_1, \dots, x_N$, under our model $q_\theta$ as

$$L(\theta) = \prod_{i=1}^{N} q_\theta(x_i), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log q_\theta(x_i)$$

Conversely, there will be some true distribution (unknown to us) which determines the data. We can denote the likelihood of a data point $x$ under that true distribution as $p(x)$, and estimate it by the empirical distribution $\hat{p}(x) = n_x / N$, where $n_x$ counts the occurrences of $x$ in the sample. Grouping the log-likelihood by distinct values then gives

$$\ell(\theta) = \sum_x n_x \log q_\theta(x) = N \sum_x \hat{p}(x) \log q_\theta(x) = -N \, H(\hat{p}, q_\theta)$$

This gives the relation that maximisation of the log-likelihood yields the same parameters as minimisation of the cross-entropy:

$$\arg\max_\theta \, \ell(\theta) = \arg\min_\theta \, H(\hat{p}, q_\theta)$$
Now since the entropy of the data source is fixed with respect to our model parameters[^3], i.e. we cannot do anything about this term via optimisation, we have that the argmin over model parameters of the cross-entropy coincides with the argmin of the KL divergence:

$$\arg\min_\theta \, H(\hat{p}, q_\theta) = \arg\min_\theta \left[ D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta) + H(\hat{p}) \right] = \arg\min_\theta \, D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)$$
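A numerical sketch of the equivalence (the toy sample and model are my own): the per-sample negative log-likelihood equals the cross-entropy between the empirical distribution and the model.

```python
import numpy as np

# A toy sample of coin flips (0 = tails, 1 = heads) and a candidate model
sample  = np.array([1, 1, 1, 0, 1, 0, 1, 1])   # 6 heads, 2 tails
q_theta = np.array([0.3, 0.7])                 # model: P(0) = 0.3, P(1) = 0.7

# Average negative log-likelihood of the sample under the model
nll = -np.mean(np.log(q_theta[sample]))

# Cross-entropy H(p_hat, q_theta) against the empirical distribution p_hat
p_hat = np.bincount(sample, minlength=2) / len(sample)   # [0.25, 0.75]
cross_entropy = -np.sum(p_hat * np.log(q_theta))

print(np.isclose(nll, cross_entropy))   # True
```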
A Quick Note on Perplexity
Perplexity is a measure of how well a distribution or model predicts a sample or random variable.
A low perplexity indicates that the {model, distribution} is good at predicting the {sample, random variable}.
Perplexity for a distribution $P$ is computed as the exponentiated entropy,

$$\mathrm{PPL}(P) = 2^{H(P)},$$

with $H$ measured in bits; for a model $q$ evaluated on a sample $x_1, \dots, x_N$, it is the exponentiated cross-entropy, $2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}$.
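A minimal sketch computing perplexity as exponentiated cross-entropy (the per-token probabilities are illustrative):

```python
import numpy as np

# Probabilities a model assigned to each token of an observed sequence
probs = np.array([0.25, 0.5, 0.125, 0.5])

# Perplexity = 2 ** (average negative log2-probability), i.e. 2 ** cross-entropy
perplexity = 2 ** (-np.mean(np.log2(probs)))
print(perplexity)   # ~3.36; a uniform model over 4 outcomes would score 4.0
```

A perplexity of about 3.4 here can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 3.4 outcomes per token.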
References and Further Reading
- Shannon (1948) A Mathematical Theory of Communication. The Bell System Technical Journal. Vol. 27, pp. 379–423, 623–656. Reprint with corrections http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.
- Mike Morais (2018) Lecture 8: Information Theory and Maximum Entropy. From NEU 560: Statistical Modeling and Analysis of Neural Data (Spring 2018).
- Daniel Commenges (2015) Information Theory and Statistics: an overview. https://arxiv.org/pdf/1511.00860.pdf
- Derivation of the approximation of the Binomial distribution’s entropy using the De Moivre–Laplace theorem https://math.stackexchange.com/questions/244455/entropy-of-a-binomial-distribution.
- Section 3.13 on Information Theory of the Deep Learning Book - thanks to Vincenzo for pointing to this resource.
- Some other important/interesting concepts: Mutual information, Pointwise mutual information, Conditional entropy
[^1]: This is stated in the section Relation to log-likelihood under Wikipedia's entry for Cross entropy.

[^2]: For chance variables, you can read random variables. Feeling inspired, I'm going with the term as used by Shannon (1948) A Mathematical Theory of Communication. The Bell System Technical Journal. Vol. 27, pp. 379–423, 623–656. Reprint with corrections http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.

[^3]: In fact, it is with respect to our whole model, not just the parameters. The model imposes a general inductive bias: we expect our data to be sampled from the same family of distributions overall. Then we would like the parameters to lead to the best approximation of the truth given the model.