Variational Bayesian (VB) methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).
This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).
This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).
## Table of Contents

1. Preliminaries and Notation
2. Problem Formulation
3. Variational Lower Bound for Mean-field Approximation
4. Forward KL vs. Reverse KL
5. Connections to Deep Learning
## Preliminaries and Notation
This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here’s a refresher if you forgot some stuff. Machine Learning & Statistics notation isn’t standardized very well, so it’s helpful to be really precise with notation in this post:
- Uppercase X denotes a random variable.
- Uppercase P(X) denotes the probability distribution over that variable.
- Lowercase x∼P(X) denotes a value x sampled (∼) from the probability distribution P(X) via some generative process.
- Lowercase p(X) is the density function of the distribution of X. It is a scalar function over the measure space of X.

In this post, X denotes the observed variables (for example, an image) and Z denotes the hidden variables (for example, whether or not the image contains a cat). Bayes' Theorem relates the posterior over the hidden variable to the likelihood, the prior, and the evidence:

$$P(Z|X) = \frac{P(X|Z)\,P(Z)}{P(X)}$$

The posterior P(Z|X) acts as a cat classifier: given an image, how probable is it that the image contains a cat? Read the other way, P(X|Z) is a generative model: if we can sample z∼P(Z) and then x∼P(X|Z), then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [\[1\]](http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html), [\[2\]](http://blog.evjang.com/2016/06/understanding-and-implementing.html).
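To make the generative reading concrete, here is a minimal sketch of ancestral sampling under a hypothetical toy model (not from the cat example itself): Z is a binary cat / not-cat label with a Bernoulli prior, and X is a single Gaussian feature standing in for an image.

```python
# Minimal sketch of ancestral sampling from a generative model P(Z) P(X|Z).
# The "cat" model is a toy stand-in: Z is a binary label, X is a 1-D feature.
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_z(p_cat=0.5):
    """Sample the hidden variable z ~ P(Z) (Bernoulli prior over cat / not-cat)."""
    return rng.random() < p_cat

def sample_x_given_z(z):
    """Sample an observation x ~ P(X|Z): a class-conditional Gaussian."""
    mean = 1.0 if z else -1.0
    return rng.normal(loc=mean, scale=0.5)

for _ in range(5):
    z = sample_z()
    x = sample_x_given_z(z)
    print(f"is_cat={z}, x={x:+.2f}")
```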
### Hidden Variables as Priors
*This is an aside for interested readers. Skip to the [next section](#aproblem) to continue with the tutorial.*
The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like.
We can re-write Bayes' Theorem by swapping the terms:

$$P(X|Z)\,P(Z) = P(Z|X)\,P(X)$$

so the roles of "hidden" and "observed" can be exchanged. Moreover, the prior P(Z) may itself depend on other random variables via P(Z|θ), which have prior distributions P(θ) of their own, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, [it's priors all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down).
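To make the hierarchy explicit, here is one illustrative factorization (the hyper-parameter θ and its hyper-prior are generic placeholders, not tied to any particular model):

$$p(x, z, \theta) = p(x|z)\,p(z|\theta)\,p(\theta)$$

Each extra layer of conditioning is just another prior; in principle you can keep stacking them.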
## Problem Formulation
The key problem we are interested in is *posterior inference*: computing functions of the hidden variable Z given some observed variable X. We typically assume we know how to evaluate the likelihood p(X|Z) and the prior p(Z), and what we want is the posterior P(Z|X). For complicated models, though, we often can't sample from P(Z|X) directly; or we might know the form of p(Z|X), but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like [MCMC](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), but these are slow to converge.
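To see where the difficulty comes from, it helps to write Bayes' Theorem with the normalizer made explicit (this is just the standard identity, not specific to any particular model): the numerator is usually easy to evaluate, but the evidence p(x) integrates over every configuration of the hidden variable, which is intractable for all but the simplest models.

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}, \qquad p(x) = \int p(x|z)\,p(z)\,dz$$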
## Variational Lower Bound for Mean-field Approximation
The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution Qϕ(Z|X) (like a Gaussian) for which we know how to do posterior inference, and adjust the variational parameters ϕ so that Qϕ is as close as possible to the true posterior P(Z|X). Closeness is measured by the reverse KL divergence KL(Qϕ(Z|X)||P(Z|X)), and expanding its definition shows that the log evidence decomposes as log p(x) = L + KL(Qϕ(Z|X)||P(Z|X)), where

$$L = \mathbb{E}_{Q}\left[\log \frac{p(x,z)}{q_\phi(z|x)}\right]$$

Since KL divergence is non-negative, L is a lower bound on log p(x), and maximizing L over ϕ is equivalent to minimizing the reverse KL. Factoring the joint density p(x,z) = p(x|z)p(z) gives

$$
\begin{aligned}
L &= \mathbb{E}_{Q}\left[\log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)}\right] && \\
&= \mathbb{E}_{Q}\left[\log p(x|z)\right] + \sum_{Q} q(z|x) \log \frac{p(z)}{q_\phi(z|x)} && \text{Definition of expectation} \\
&= \mathbb{E}_{Q}\left[\log p(x|z)\right] - KL\left(Q(Z|X)\,\|\,P(Z)\right) && \text{Definition of KL divergence} \qquad (3)
\end{aligned}
$$

L is known as the *variational lower bound*, and it is tractable as long as we can evaluate p(x|z), p(z), and qϕ(z|x).
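Equation (3) can be checked numerically. Below is a toy sketch assuming a hypothetical model: a discrete hidden variable z ∈ {0, 1, 2} with a uniform prior and unit-variance Gaussian likelihoods, where the means and the observation are made up for illustration. Because z is discrete, the exact posterior is available for comparison, and maximizing L over the variational parameters drives q toward it (i.e. minimizes the reverse KL).

```python
# Toy illustration of equation (3): fit a categorical q(z|x) by maximizing
# L = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
# The model (uniform prior, unit-variance Gaussian likelihoods) is hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from scipy.stats import norm

K = 3
log_prior = np.full(K, np.log(1.0 / K))        # log p(z), uniform over K states
means = np.array([-2.0, 0.0, 3.0])             # p(x|z) = N(x; means[z], 1)
x_obs = 2.5                                    # a single (made-up) observation

log_lik = norm.logpdf(x_obs, loc=means, scale=1.0)   # log p(x|z) for each z

def negative_elbo(logits):
    """-L, with the variational distribution q = softmax(logits)."""
    q = softmax(logits)
    expected_log_lik = np.sum(q * log_lik)               # E_q[log p(x|z)]
    kl = np.sum(q * (np.log(q + 1e-12) - log_prior))     # KL(q(z|x) || p(z))
    return -(expected_log_lik - kl)

# Maximize L over the variational parameters (the logits play the role of phi).
result = minimize(negative_elbo, x0=np.zeros(K), method="Nelder-Mead")
q_opt = softmax(result.x)

# With a discrete z the exact posterior is a finite softmax, so we can verify
# that maximizing L recovers p(z|x).
exact_posterior = softmax(log_lik + log_prior)
print("variational q(z|x):", np.round(q_opt, 3))
print("exact p(z|x):      ", np.round(exact_posterior, 3))
```

In the continuous mean-field setting the recipe is the same: qϕ is parameterized by, say, a mean and variance per latent dimension, and the expectation term is typically estimated by sampling from qϕ.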
## Connections to Deep Learning
Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:
1. Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.
2. Variational Bayes gives us a framework for re-writing statistical inference problems as optimization problems.
Combining deep learning and VB methods allows us to perform inference on *extremely* complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize exactly the same mean-field variational lower bound derived in this post!
Thanks for reading, and stay tuned!