3 minute read
Published: May 25, 2020
The evidence lower bound is a quantity that lies at the core of a number of important algorithms used in statistical inference, including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.
Introduction
The evidence lower bound (ELBO) is an important quantity that lies at the core of a number of algorithms in probabilistic inference, such as expectation-maximization and variational inference. To understand these algorithms, it is helpful to understand the ELBO.
Before digging in, let’s review the probabilistic inference task for a latent variable model. In a latent variable model, we posit that our observed data $x$ is a realization of some random variable $X$. Moreover, we posit the existence of another random variable $Z$, where $X$ and $Z$ are distributed according to a joint distribution $p(X, Z ; \theta)$, where $\theta$ parameterizes the distribution. Unfortunately, our data is only a realization of $X$, not $Z$, and therefore $Z$ remains unobserved (i.e. latent).
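To make this setup concrete, here is a minimal sketch (not from the post itself) of one such model: a two-component Gaussian mixture in which $Z$ selects a component and $X$ is drawn from that component. The parameter values (`pi`, `mu`, `sigma`) are arbitrary choices for illustration.

```python
# A hypothetical latent variable model: a two-component Gaussian mixture.
# Z picks a component and X is drawn from that component; in practice we
# only ever observe x, while z remains latent.
import numpy as np

rng = np.random.default_rng(0)

# theta = (pi, mu, sigma): mixture weights, component means, component std devs
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])

n = 1000
z = rng.choice(2, size=n, p=pi)      # latent variable (unobserved in practice)
x = rng.normal(mu[z], sigma[z])      # observed data
```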
There are two predominant tasks that we may be interested in accomplishing:
- Given some fixed value for $\theta$, compute the posterior distribution $p(z \mid x ; \theta)$
- Given that $\theta$ is unknown, find the maximum likelihood estimate of $\theta$:

$$\hat{\theta} := \underset{\theta}{\text{arg max}} \ \ell(\theta)$$

where $\ell(\theta)$ is the log-likelihood function:

$$\ell(\theta) := \log p(x ; \theta) = \log \int p(x, z ; \theta) \, dz$$
Variational inference is used for Task 1 and expectation-maximization is used for Task 2. Both of these algorithms rely on the ELBO.
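As a sketch of what Task 2's objective looks like in practice, the snippet below (continuing the hypothetical mixture model above) evaluates the log-likelihood $\ell(\theta)$ by marginalizing out the discrete latent variable; for a continuous latent variable this sum would become an integral that typically has no closed form.

```python
# Log-likelihood of the hypothetical Gaussian mixture: for each x_i,
# p(x_i; theta) = sum_k pi_k * N(x_i; mu_k, sigma_k), i.e. the latent z
# is marginalized out; log-sum-exp keeps the computation numerically stable.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(x, pi, mu, sigma):
    # log p(x_i, z = k; theta) for every data point i and component k
    log_joint = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    # marginalize z, then sum the per-point log-probabilities
    return logsumexp(log_joint, axis=1).sum()
```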
What is the ELBO?
To understand the evidence lower bound, we must first understand what we mean by “evidence”. The evidence, quite simply, is just a name given to the log-likelihood function evaluated at a fixed $\theta$:

$$\text{evidence} := \log p(x ; \theta)$$
Why is this quantity called the “evidence”? Intuitively, if we have chosen the right model and parameters $\theta$, then we would expect that the marginal probability of our observed data, $p(x ; \theta)$, would be high. Thus, a higher value of $\log p(x ; \theta)$ indicates, in some sense, that we may be on the right track with the model and parameters that we have chosen. That is, this quantity is “evidence” that we have chosen the right model for the data.
If we happen to also know (or posit) that $Z$ follows some distribution denoted by $q(z)$ (and that $q(z) > 0$ wherever $p(x, z ; \theta) > 0$), then the evidence lower bound is, well, just a lower bound on the evidence that makes use of the known (or posited) $q$. Specifically,

$$\log p(x ; \theta) \geq E_{Z \sim q}\left[ \log \frac{p(x, Z ; \theta)}{q(Z)} \right]$$
where the ELBO is simply the right-hand side of the above inequality:

$$\text{ELBO}(q) := E_{Z \sim q}\left[ \log \frac{p(x, Z ; \theta)}{q(Z)} \right]$$
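As an illustration (again using the hypothetical mixture model from above), the ELBO for a single observation $x$ can be computed directly, because the latent variable is discrete and the expectation over $Z \sim q$ reduces to a finite sum; with a continuous latent variable one would typically resort to a Monte Carlo estimate instead.

```python
# ELBO(q) = E_{Z~q}[ log p(x, Z; theta) - log q(Z) ] for one observation x
# of the hypothetical mixture model; the expectation is an exact sum here
# because Z takes only two values.
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])

def elbo(x, q, pi, mu, sigma):
    log_joint = np.log(pi) + norm.logpdf(x, mu, sigma)   # log p(x, z; theta)
    return np.sum(q * (log_joint - np.log(q)))

x = 2.5
q = np.array([0.5, 0.5])           # any distribution over the latent values
print(elbo(x, q, pi, mu, sigma))   # never exceeds the evidence log p(x; theta)
```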
Derivation
We derive this lower bound as follows:
$$\begin{align*} \log p(x ; \theta) &= \log \int p(x, z ; \theta) \, dz \\ &= \log \int p(x, z ; \theta) \frac{q(z)}{q(z)} \, dz \\ &= \log E_{Z \sim q}\left[ \frac{p(x, Z ; \theta)}{q(Z)} \right] \\ &\geq E_{Z \sim q}\left[ \log \frac{p(x, Z ; \theta)}{q(Z)} \right] \end{align*}$$

This final inequality follows from Jensen’s Inequality.
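For reference, the form of Jensen’s Inequality invoked in that last step says that a concave function of an expectation is at least the expectation of the concave function:

$$f\left(E[Y]\right) \geq E\left[f(Y)\right] \quad \text{for concave } f, \qquad \text{here with } f = \log \text{ and } Y = \frac{p(x, Z ; \theta)}{q(Z)}, \; Z \sim q$$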
The gap between the evidence and the ELBO
It turns out that the gap between the evidence and the ELBO is precisely the Kullback-Leibler divergence between $q(z)$ and the posterior $p(z \mid x ; \theta)$. This fact forms the basis of the variational inference algorithm for approximate Bayesian inference!
This can be derived as follows:
$$\begin{align*} KL\left(q(z) \; \| \; p(z \mid x ; \theta)\right) &:= E_{Z \sim q}\left[ \log \frac{q(Z)}{p(Z \mid x ; \theta)} \right] \\ &= E_{Z \sim q}\left[ \log q(Z) \right] - E_{Z \sim q}\left[ \log \frac{p(x, Z ; \theta)}{p(x ; \theta)} \right] \\ &= E_{Z \sim q}\left[ \log q(Z) \right] - E_{Z \sim q}\left[ \log p(x, Z ; \theta) \right] + E_{Z \sim q}\left[ \log p(x ; \theta) \right] \\ &= \log p(x ; \theta) - E_{Z \sim q}\left[ \log \frac{p(x, Z ; \theta)}{q(Z)} \right] \\ &= \text{evidence} - \text{ELBO} \end{align*}$$
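As a quick numerical sanity check of this identity (continuing the hypothetical mixture example from above, with an arbitrary choice of $q$), we can compare the evidence against the sum of the ELBO and the KL divergence; the two agree up to floating-point error, and since the KL divergence is non-negative this also confirms that the ELBO is indeed a lower bound on the evidence.

```python
# Numerically check evidence = ELBO + KL for one observation of the
# hypothetical mixture model, using an arbitrary q(z) over the two components.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])

x = 2.5
q = np.array([0.2, 0.8])                             # an arbitrary q(z)

log_joint = np.log(pi) + norm.logpdf(x, mu, sigma)   # log p(x, z; theta)
evidence = logsumexp(log_joint)                      # log p(x; theta)
posterior = np.exp(log_joint - evidence)             # p(z | x; theta)

elbo = np.sum(q * (log_joint - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print(evidence, elbo + kl)   # the two values agree (up to floating point)
```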