Variational Bayesian (VB) methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).

## Table of Contents

  1. Preliminaries and Notation
  2. Problem Formulation
  3. Variational Lower Bound for Mean-field Approximation
  4. Forward KL vs. Reverse KL
  5. Connections to Deep Learning

## Preliminaries and Notation

This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here’s a refresher if you forgot some stuff. Machine Learning & Statistics notation isn’t standardized very well, so it’s helpful to be really precise with notation in this post:

  • Uppercase $X$ denotes a random variable.

  • Uppercase $P(X)$ denotes the probability distribution over that variable.

  • Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.

  • Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.

  • $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at the particular value $x$. (The short sketch after this list makes these distinctions concrete.)

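To make these four objects concrete, here is a minimal Python sketch; the standard normal is just an arbitrary choice for $P(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# P(X): the distribution itself -- here, an arbitrary choice: the standard normal N(0, 1).
# p(X): its density function, a scalar function over the real line.
def p(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

# x ~ P(X): a value sampled from the distribution via some generative process.
x = rng.normal(loc=0.0, scale=1.0)

# p(X=x), shorthand p(x): the density evaluated at that particular value.
print(f"sampled x = {x:.3f}, density p(x) = {p(x):.3f}")
```
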
We model systems as a collection of random variables, where some variables ($X$) are "observed" and other variables ($Z$) are "hidden". The hidden variable is related to the observed one via the conditional distribution $P(X|Z)$.

Here's a more concrete example: $X$ might represent the "raw pixel values of an image", while $Z$ is a binary variable such that $Z=1$ means "image $X$ is of a cat". For example, if $X$ is an image that clearly contains a cat, then $P(Z=1)=1$; if $X$ clearly does not contain a cat, then $P(Z=1)=0$; and for an ambiguous image we might have $P(Z=1)=0.1$.

Bayes' Theorem gives us a general relationship between any pair of random variables:

$$p(Z|X) = \frac{p(X|Z)\,p(Z)}{p(X)}$$

The various pieces of this formula are associated with common names:

$p(Z|X)$ is the **posterior probability**: "given the image, what is the probability that it is of a cat?" If we can sample $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not.

$p(X|Z)$ is the **likelihood**: given a value of $Z$, this computes how "probable" the image $X$ is under that category ("is-a-cat" / "is-not-a-cat"). If we can sample $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [\[1\]](http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html), [\[2\]](http://blog.evjang.com/2016/06/understanding-and-implementing.html).

$p(Z)$ is the **prior probability**. This captures any prior information we know about $Z$ — for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$.

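To make Bayes' Theorem concrete for the cat example, here is a minimal numeric sketch; the prior $p(Z=1) = \frac{1}{3}$ is the one above, while the likelihood values for the particular image are made-up numbers chosen purely to illustrate the arithmetic:

```python
# Posterior inference for a binary hidden variable Z ("is-a-cat") via Bayes' Theorem.
# Prior taken from the text; the likelihoods below are hypothetical values for one image x.
p_z1 = 1.0 / 3.0       # prior p(Z=1)
p_z0 = 2.0 / 3.0       # prior p(Z=0)

p_x_given_z1 = 0.8     # hypothetical likelihood p(x | Z=1)
p_x_given_z0 = 0.2     # hypothetical likelihood p(x | Z=0)

# Evidence p(x) = sum_z p(x|z) p(z): marginalize the joint over Z.
p_x = p_x_given_z1 * p_z1 + p_x_given_z0 * p_z0

# Posterior p(Z=1|x) = p(x|Z=1) p(Z=1) / p(x).
p_z1_given_x = p_x_given_z1 * p_z1 / p_x
print(f"p(Z=1 | x) = {p_z1_given_x:.3f}")   # -> 0.667
```
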
### Hidden Variables as Priors

*This is an aside for interested readers. Skip to the [next section](#aproblem) to continue with the tutorial.*

The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like. We can re-write Bayes' Theorem by swapping the terms:

$$\frac{p(Z|X)\,p(X)}{p(Z)} = p(X|Z)$$

The "posterior" in question is now $P(X|Z)$.

Hidden variables can be interpreted from a Bayesian Statistics framework as *prior beliefs* attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of that Gaussian. The distribution over parameters $P(Z)$ then acts as a prior distribution to $P(X)$.

You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be "mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly.

You can even "add" variables to your system: the prior itself might depend on other random variables via $P(Z|\theta)$, which have prior distributions of their own, $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, [it's priors all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down).

[![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_m700wD57Tjb2kAPaaB1aFUMxMUtwrQuXNfIodJ5V_CI5EEjqGP6qHQikJTgi9Oz-EpHYczWdWqnJU_TgDjJE-vZazN7ha_1MjrMCJoHen1js5Yj2t7I7r6SJsEWTSwQTzRCFZbYRcOM/s320/Untitled+presentation+%25281%2529.png)](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_m700wD57Tjb2kAPaaB1aFUMxMUtwrQuXNfIodJ5V_CI5EEjqGP6qHQikJTgi9Oz-EpHYczWdWqnJU_TgDjJE-vZazN7ha_1MjrMCJoHen1js5Yj2t7I7r6SJsEWTSwQTzRCFZbYRcOM/s1600/Untitled+presentation+%25281%2529.png)

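Here is a minimal sketch of this idea as an ancestral-sampling chain $\theta \to Z \to X$; the particular Gamma and Gaussian choices are arbitrary and purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical hierarchy theta -> Z -> X, sampled top-down ("ancestral sampling").
# Each level is just a prior over the level below it.
theta = rng.gamma(shape=2.0, scale=1.0)        # theta ~ P(theta), a hyper-prior
z_mean = rng.normal(loc=0.0, scale=theta)      # Z ~ P(Z | theta): a prior over the mean of X
x = rng.normal(loc=z_mean, scale=1.0, size=5)  # X ~ P(X | Z): the observed data

print(f"theta = {theta:.2f}, z_mean = {z_mean:.2f}, x = {np.round(x, 2)}")
```
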
## Problem Formulation

The key problem we are interested in is *posterior inference*, or computing functions on the hidden variable $Z$. Some examples of posterior inference:

  • Given this surveillance footage $X$, did the suspect show up in it?

  • Given this twitter feed $X$, is the author depressed?

  • Given historical stock prices $X_{1:t-1}$, what will $X_t$ be?

We usually assume that we know how to compute functions on the likelihood $P(X|Z)$ and the prior $P(Z)$.

The problem is that for complicated tasks like the ones above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like [MCMC](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), but these are slow to converge.

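To get a feel for why this is hard, note that even the evidence $p(x) = \sum_z p(x|z)\,p(z)$ requires summing over every joint configuration of the hidden variables. Here is a minimal sketch with an arbitrary toy model of $N$ binary hidden variables; the sum has $2^N$ terms, which is only feasible for small $N$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

N = 16                     # number of binary hidden variables; the sum below has 2**N terms
x = rng.normal(size=3)     # some observed data (arbitrary toy values)

def log_joint(x, z):
    # log p(x, z) for an arbitrary toy model: independent Bernoulli(0.5) priors on z,
    # and a Gaussian likelihood whose mean depends on how many z_i are "on".
    log_prior = N * np.log(0.5)
    mean = np.sum(z) / N
    log_lik = -0.5 * np.sum((x - mean) ** 2) - 0.5 * len(x) * np.log(2.0 * np.pi)
    return log_prior + log_lik

# Exact evidence: log p(x) = log sum_z exp(log p(x, z)) over all 2**N configurations.
log_terms = np.array([log_joint(x, np.array(z)) for z in itertools.product([0, 1], repeat=N)])
m = log_terms.max()
log_px = m + np.log(np.exp(log_terms - m).sum())    # log-sum-exp for numerical stability
print(f"enumerated {2 ** N} terms, log p(x) = {log_px:.3f}")
# With N = 300 hidden variables the same sum would have about 2e90 terms -- hopeless.
```
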
## Variational Lower Bound for Mean-field Approximation

The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible.

What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common flavor of VB) uses the *reverse KL divergence* as the distance metric between the two distributions:

$$KL(Q_\phi(Z|X)\,\|\,P(Z|X)) = \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z|x)}$$

Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log 2}$ bits) required to "distort" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$.

By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original KL expression, and then distribute:

$$
\begin{aligned}
KL(Q\|P) &= \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)\,p(x)}{p(z,x)} \\
&= \sum_{z \in Z} q_\phi(z|x) \left( \log \frac{q_\phi(z|x)}{p(z,x)} + \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \sum_z q_\phi(z|x) \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \log p(x) \sum_z q_\phi(z|x) \right) \\
&= \log p(x) + \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) \qquad \text{(1)}
\end{aligned}
$$

where the last step uses the fact that $\sum_z q_\phi(z|x) = 1$.

To minimize $KL(Q\|P)$ with respect to the variational parameters $\phi$, we just have to minimize $\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)}$, since $\log p(x)$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$:

$$
\begin{aligned}
\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} &= \mathbb{E}_{z \sim Q_\phi(Z|X)}\left[ \log \frac{q_\phi(z|x)}{p(z,x)} \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - \log p(x,z) \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - (\log p(x|z) + \log p(z)) \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - \log p(x|z) - \log p(z) \right]
\end{aligned}
$$

Minimizing this quantity is equivalent to maximizing its negation:

$$
\begin{aligned}
\text{maximize } \mathcal{L} &= -\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \\
&= \mathbb{E}_Q\left[ -\log q_\phi(z|x) + \log p(x|z) + \log p(z) \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \qquad \text{(2)}
\end{aligned}
$$

In the literature, $\mathcal{L}$ is known as the *variational lower bound*, and it is computationally tractable as long as we can evaluate $p(x|z)$, $p(z)$, and $q(z|x)$. We can further re-arrange the terms in a way that yields an intuitive formula:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] + \sum_z q_\phi(z|x) \log \frac{p(z)}{q_\phi(z|x)} && \text{definition of expectation} \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z)) && \text{definition of KL divergence} \quad \text{(3)}
\end{aligned}
$$

If sampling $z \sim Q(Z|X)$ is an "encoding" process that converts an observation $x$ into a latent code $z$, then sampling $x \sim Q(X|Z)$ is a "decoding" process that reconstructs the observation from $z$.

It follows that $\mathcal{L}$ is the sum of the expected "decoding" likelihood (how well our variational distribution can decode a sample of $Z$ back into a sample of $X$) and the negative KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then the prior on $Z$ is often chosen to be a diagonal Gaussian with mean 0 and standard deviation 1.

Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have:

$$
\begin{aligned}
KL(Q\|P) &= \log p(x) - \mathcal{L} \\
\log p(x) &= \mathcal{L} + KL(Q\|P) \qquad \text{(4)}
\end{aligned}
$$

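Eq. (4) is easy to verify numerically before we unpack it. In the sketch below, the binary $Z$, the single observation, the probability tables, and the choice of $Q$ are all made-up values for illustration; $\mathcal{L}$ is computed via Eq. (3):

```python
import numpy as np

# Toy discrete model with made-up numbers: binary hidden z, one observed value x.
p_z = np.array([2.0 / 3.0, 1.0 / 3.0])   # prior p(z)
p_x_given_z = np.array([0.2, 0.8])       # likelihood p(x | z) at the observed x

p_xz = p_x_given_z * p_z                 # joint p(x, z)
p_x = p_xz.sum()                         # evidence p(x)
p_z_given_x = p_xz / p_x                 # exact posterior p(z | x)

q = np.array([0.5, 0.5])                 # an arbitrary variational distribution q(z | x)

# L = E_q[log p(x|z)] - KL(q(z|x) || p(z)),  per Eq. (3)
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))

# KL(q(z|x) || p(z|x)): the gap between the bound and log p(x)
kl_q_post = np.sum(q * np.log(q / p_z_given_x))

print(f"log p(x)        = {np.log(p_x):.4f}")
print(f"ELBO + KL(Q||P) = {elbo + kl_q_post:.4f}")   # identical, per Eq. (4)
print(f"ELBO            = {elbo:.4f}   (a lower bound on log p(x))")
```
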
The meaning of Eq. (4), in plain language, is that $\log p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$ plus an error term $KL(Q\|P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$.

Since $KL(Q\|P) \geq 0$, $\log p(x)$ must be greater than or equal to $\mathcal{L}$. Therefore $\mathcal{L}$ is a *lower bound* for $\log p(x)$. $\mathcal{L}$ is also referred to as the *evidence lower bound* (ELBO), via the alternate formulation:

$$\mathcal{L} = \log p(x) - KL(Q(Z|X)\,\|\,P(Z|X)) = \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z))$$

Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL divergence terms in total in $\log p(x)$.

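This is where the "inference as optimization" duality pays off: rather than computing the posterior directly, we maximize $\mathcal{L}$ over the variational parameters. Here is a minimal sketch on a toy discrete model with made-up probability tables; the variational distribution is $q_\phi(z{=}1|x) = \phi$, and a brute-force grid search stands in for gradient ascent. The optimum recovers the exact posterior because this tiny variational family happens to contain it:

```python
import numpy as np

# Toy model with made-up numbers: binary hidden z, one observed value x.
p_z = np.array([2.0 / 3.0, 1.0 / 3.0])   # prior p(z)
p_x_given_z = np.array([0.2, 0.8])       # likelihood p(x | z) at the observed x
p_z_given_x = p_x_given_z * p_z / np.sum(p_x_given_z * p_z)   # exact posterior, for comparison

def elbo(phi):
    # Variational distribution q_phi(z | x) = [1 - phi, phi].
    q = np.array([1.0 - phi, phi])
    # L = E_q[log p(x|z) + log p(z) - log q(z|x)]
    return np.sum(q * (np.log(p_x_given_z) + np.log(p_z) - np.log(q)))

# Maximize L over phi by brute-force grid search (a stand-in for gradient ascent).
grid = np.linspace(0.001, 0.999, 999)
phi_star = grid[np.argmax([elbo(phi) for phi in grid])]

print(f"optimal q(z=1|x) = {phi_star:.3f}")
print(f"exact  p(z=1|x)  = {p_z_given_x[1]:.3f}")   # the optimum matches the true posterior
```
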
## Forward KL vs. Reverse KL

KL divergence is *not* a symmetric distance function, i.e. $KL(P\|Q) \neq KL(Q\|P)$ (except when $Q \equiv P$). The first is known as the "forward KL", while the latter is the "reverse KL". So why do we use the reverse KL? Minimizing the forward KL would require taking expectations with respect to $p(Z|X)$, which is exactly the distribution we don't know how to compute in the first place.

Let's consider the forward KL first. As we saw in the derivations above, we can write a KL divergence as the expectation of a "penalty" function $\log \frac{p(z)}{q(z)}$ over a weighting function $p(z)$:

$$KL(P\|Q) = \sum_z p(z) \log \frac{p(z)}{q(z)} = \mathbb{E}_{p(z)}\left[ \log \frac{p(z)}{q(z)} \right]$$

The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$, so the forward KL blows up wherever $Q(Z)$ fails to "cover up" $P(Z)$.

Therefore, the forward KL is minimized when we ensure that $q(z) > 0$ wherever $p(z) > 0$. The optimized variational distribution $Q(Z)$ is known as "zero-avoiding": its density avoids being zero wherever $p(Z)$ is nonzero.

Minimizing the reverse KL behaves quite differently:

$$KL(Q\|P) = \sum_z q(z) \log \frac{q(z)}{p(z)} = \mathbb{E}_{q(z)}\left[ \log \frac{q(z)}{p(z)} \right]$$

Now the weighting function is $q(z)$. If $p(Z) = 0$ somewhere, we must ensure that the weighting function $q(Z) = 0$ wherever the denominator $p(Z) = 0$, otherwise the KL blows up. This is known as "zero-forcing".

So, in summary: minimizing the forward KL "stretches" the variational distribution $Q(Z)$ to cover the entire support of $P(Z)$ like a tarp, while minimizing the reverse KL "squeezes" $Q(Z)$ under $P(Z)$.

It's important to keep the implications of the reverse KL in mind when using the mean-field approximation on machine learning problems: if we fit a unimodal distribution to a multi-modal one, we end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$).

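The zero-avoiding vs. zero-forcing behavior is easy to see numerically. Here is a minimal sketch with arbitrary numbers: a single discretized Gaussian $q$ is fit to a bimodal target $p$ by grid-searching the mean and scale that minimize either the forward or the reverse KL (a tiny floor keeps $p$ strictly positive so both divergences stay finite):

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 121)

def normalize(w):
    return w / w.sum()

def discrete_gaussian(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

# Bimodal target p: two well-separated modes, plus a tiny floor so p > 0 everywhere.
p = normalize(discrete_gaussian(-3.0, 0.5) + discrete_gaussian(3.0, 0.5) + 1e-8)

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Fit a single (unimodal) Gaussian q by brute-force search over (mu, sigma).
candidates = [(mu, sigma) for mu in np.linspace(-4, 4, 81) for sigma in np.linspace(0.3, 4.0, 75)]
fwd_mu, fwd_sigma = min(candidates, key=lambda c: kl(p, discrete_gaussian(*c)))   # forward KL(P||Q)
rev_mu, rev_sigma = min(candidates, key=lambda c: kl(discrete_gaussian(*c), p))   # reverse KL(Q||P)

print(f"forward KL fit: mu = {fwd_mu:.2f}, sigma = {fwd_sigma:.2f}  (stretched over both modes)")
print(f"reverse KL fit: mu = {rev_mu:.2f}, sigma = {rev_sigma:.2f}  (locked onto a single mode)")
```
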
## Connections to Deep Learning

Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:

  1. Deep Learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.
  2. Variational Bayes gives us a framework with which we can re-write statistical inference problems as optimization problems.

Combining Deep Learning and VB methods allows us to perform inference on *extremely* complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize exactly the same mean-field variational lower bound derived in this post!

Thanks for reading, and stay tuned!