Variational Bayesian (VB) methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).

## Table of Contents

  1. Preliminaries and Notation
  2. Problem Formulation
  3. Variational Lower Bound for Mean-field Approximation
  4. Forward KL vs. Reverse KL
  5. Connections to Deep Learning

## Preliminaries and Notation

This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here’s a refresher if you forgot some stuff. Machine Learning & Statistics notation isn’t standardized very well, so it’s helpful to be really precise with notation in this post:

  • Uppercase $X$ denotes a random variable.

  • Uppercase $P(X)$ denotes the probability distribution over that variable.

  • Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.

  • Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.

  • $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at the particular value $x$. (The short sketch after this list makes these distinctions concrete.)

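To make these four objects concrete, here is a minimal Python sketch; the standard normal is just an arbitrary choice for $P(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# P(X): the distribution itself -- here, an arbitrary choice: the standard normal N(0, 1).
# p(X): its density function, a scalar function over the real line.
def p(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

# x ~ P(X): a value sampled from the distribution via some generative process.
x = rng.normal(loc=0.0, scale=1.0)

# p(X=x), shorthand p(x): the density evaluated at that particular value.
print(f"sampled x = {x:.3f}, density p(x) = {p(x):.3f}")
```
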
We model systems as a collection of random variables, where some variables ($X$) are "observed" and other variables ($Z$) are "hidden". The hidden variable is related to the observed one via the conditional distribution $P(X|Z)$.

Here's a more concrete example: $X$ might represent the "raw pixel values of an image", while $Z$ is a binary variable such that $Z=1$ means "image $X$ is of a cat". For example, if $X$ is an image that clearly contains a cat, then $P(Z=1)=1$; if $X$ clearly does not contain a cat, then $P(Z=1)=0$; and for an ambiguous image we might have $P(Z=1)=0.1$.

Bayes' Theorem gives us a general relationship between any pair of random variables:

$$p(Z|X) = \frac{p(X|Z)\,p(Z)}{p(X)}$$

The various pieces of this formula are associated with common names:

$p(Z|X)$ is the **posterior probability**: "given the image, what is the probability that it is of a cat?" If we can sample $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not.

$p(X|Z)$ is the **likelihood**: given a value of $Z$, this computes how "probable" the image $X$ is under that category ("is-a-cat" / "is-not-a-cat"). If we can sample $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [\[1\]](http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html), [\[2\]](http://blog.evjang.com/2016/06/understanding-and-implementing.html).

$p(Z)$ is the **prior probability**. This captures any prior information we know about $Z$ — for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$.

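To make Bayes' Theorem concrete for the cat example, here is a minimal numeric sketch; the prior $p(Z=1) = \frac{1}{3}$ is the one above, while the likelihood values for the particular image are made-up numbers chosen purely to illustrate the arithmetic:

```python
# Posterior inference for a binary hidden variable Z ("is-a-cat") via Bayes' Theorem.
# Prior taken from the text; the likelihoods below are hypothetical values for one image x.
p_z1 = 1.0 / 3.0       # prior p(Z=1)
p_z0 = 2.0 / 3.0       # prior p(Z=0)

p_x_given_z1 = 0.8     # hypothetical likelihood p(x | Z=1)
p_x_given_z0 = 0.2     # hypothetical likelihood p(x | Z=0)

# Evidence p(x) = sum_z p(x|z) p(z): marginalize the joint over Z.
p_x = p_x_given_z1 * p_z1 + p_x_given_z0 * p_z0

# Posterior p(Z=1|x) = p(x|Z=1) p(Z=1) / p(x).
p_z1_given_x = p_x_given_z1 * p_z1 / p_x
print(f"p(Z=1 | x) = {p_z1_given_x:.3f}")   # -> 0.667
```
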
### Hidden Variables as Priors

*This is an aside for interested readers. Skip to the [next section](#aproblem) to continue with the tutorial.*

The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like. We can re-write Bayes' Theorem by swapping the terms:

$$\frac{p(Z|X)\,p(X)}{p(Z)} = p(X|Z)$$

The "posterior" in question is now $P(X|Z)$.

Hidden variables can be interpreted from a Bayesian Statistics framework as *prior beliefs* attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of that Gaussian. The distribution over parameters $P(Z)$ then acts as a prior distribution to $P(X)$.

You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be "mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly.

You can even "add" variables to your system: the prior itself might depend on other random variables via $P(Z|\theta)$, which have prior distributions of their own, $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, [it's priors all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down).

[![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_m700wD57Tjb2kAPaaB1aFUMxMUtwrQuXNfIodJ5V_CI5EEjqGP6qHQikJTgi9Oz-EpHYczWdWqnJU_TgDjJE-vZazN7ha_1MjrMCJoHen1js5Yj2t7I7r6SJsEWTSwQTzRCFZbYRcOM/s320/Untitled+presentation+%25281%2529.png)](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_m700wD57Tjb2kAPaaB1aFUMxMUtwrQuXNfIodJ5V_CI5EEjqGP6qHQikJTgi9Oz-EpHYczWdWqnJU_TgDjJE-vZazN7ha_1MjrMCJoHen1js5Yj2t7I7r6SJsEWTSwQTzRCFZbYRcOM/s1600/Untitled+presentation+%25281%2529.png)

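Here is a minimal sketch of this idea as an ancestral-sampling chain $\theta \to Z \to X$; the particular Gamma and Gaussian choices are arbitrary and purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical hierarchy theta -> Z -> X, sampled top-down ("ancestral sampling").
# Each level is just a prior over the level below it.
theta = rng.gamma(shape=2.0, scale=1.0)        # theta ~ P(theta), a hyper-prior
z_mean = rng.normal(loc=0.0, scale=theta)      # Z ~ P(Z | theta): a prior over the mean of X
x = rng.normal(loc=z_mean, scale=1.0, size=5)  # X ~ P(X | Z): the observed data

print(f"theta = {theta:.2f}, z_mean = {z_mean:.2f}, x = {np.round(x, 2)}")
```
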
## Problem Formulation

The key problem we are interested in is *posterior inference*, or computing functions on the hidden variable $Z$. Some examples of posterior inference:

  • Given this surveillance footage $X$, did the suspect show up in it?

  • Given this twitter feed $X$, is the author depressed?

  • Given historical stock prices $X_{1:t-1}$, what will $X_t$ be?

We usually assume that we know how to compute functions on the likelihood $P(X|Z)$ and the prior $P(Z)$.

The problem is that for complicated tasks like the ones above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like [MCMC](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), but these are slow to converge.

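To get a feel for why this is hard, note that even the evidence $p(x) = \sum_z p(x|z)\,p(z)$ requires summing over every joint configuration of the hidden variables. Here is a minimal sketch with an arbitrary toy model of $N$ binary hidden variables; the sum has $2^N$ terms, which is only feasible for small $N$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

N = 16                     # number of binary hidden variables; the sum below has 2**N terms
x = rng.normal(size=3)     # some observed data (arbitrary toy values)

def log_joint(x, z):
    # log p(x, z) for an arbitrary toy model: independent Bernoulli(0.5) priors on z,
    # and a Gaussian likelihood whose mean depends on how many z_i are "on".
    log_prior = N * np.log(0.5)
    mean = np.sum(z) / N
    log_lik = -0.5 * np.sum((x - mean) ** 2) - 0.5 * len(x) * np.log(2.0 * np.pi)
    return log_prior + log_lik

# Exact evidence: log p(x) = log sum_z exp(log p(x, z)) over all 2**N configurations.
log_terms = np.array([log_joint(x, np.array(z)) for z in itertools.product([0, 1], repeat=N)])
m = log_terms.max()
log_px = m + np.log(np.exp(log_terms - m).sum())    # log-sum-exp for numerical stability
print(f"enumerated {2 ** N} terms, log p(x) = {log_px:.3f}")
# With N = 300 hidden variables the same sum would have about 2e90 terms -- hopeless.
```
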
## Variational Lower Bound for Mean-field Approximation

The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible.

What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common flavor of VB) uses the *reverse KL divergence* as the distance metric between the two distributions:

$$KL(Q_\phi(Z|X)\,\|\,P(Z|X)) = \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z|x)}$$

Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log 2}$ bits) required to "distort" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$.

By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original KL expression, and then distribute:

$$
\begin{aligned}
KL(Q\|P) &= \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)\,p(x)}{p(z,x)} \\
&= \sum_{z \in Z} q_\phi(z|x) \left( \log \frac{q_\phi(z|x)}{p(z,x)} + \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \sum_z q_\phi(z|x) \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \log p(x) \sum_z q_\phi(z|x) \right) \\
&= \log p(x) + \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) \qquad \text{(1)}
\end{aligned}
$$

where the last step uses the fact that $\sum_z q_\phi(z|x) = 1$.

To minimize $KL(Q\|P)$ with respect to the variational parameters $\phi$, we just have to minimize $\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)}$, since $\log p(x)$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$:

$$
\begin{aligned}
\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} &= \mathbb{E}_{z \sim Q_\phi(Z|X)}\left[ \log \frac{q_\phi(z|x)}{p(z,x)} \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - \log p(x,z) \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - (\log p(x|z) + \log p(z)) \right] \\
&= \mathbb{E}_Q\left[ \log q_\phi(z|x) - \log p(x|z) - \log p(z) \right]
\end{aligned}
$$

Minimizing this quantity is equivalent to maximizing its negation:

$$
\begin{aligned}
\text{maximize } \mathcal{L} &= -\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \\
&= \mathbb{E}_Q\left[ -\log q_\phi(z|x) + \log p(x|z) + \log p(z) \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \qquad \text{(2)}
\end{aligned}
$$

In the literature, $\mathcal{L}$ is known as the *variational lower bound*, and it is computationally tractable as long as we can evaluate $p(x|z)$, $p(z)$, and $q(z|x)$. We can further re-arrange the terms in a way that yields an intuitive formula:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] + \sum_z q_\phi(z|x) \log \frac{p(z)}{q_\phi(z|x)} && \text{definition of expectation} \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z)) && \text{definition of KL divergence} \quad \text{(3)}
\end{aligned}
$$

If sampling $z \sim Q(Z|X)$ is an "encoding" process that converts an observation $x$ into a latent code $z$, then sampling $x \sim Q(X|Z)$ is a "decoding" process that reconstructs the observation from $z$.

It follows that $\mathcal{L}$ is the sum of the expected "decoding" likelihood (how well our variational distribution can decode a sample of $Z$ back into a sample of $X$) and the negative KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then the prior on $Z$ is often chosen to be a diagonal Gaussian with mean 0 and standard deviation 1.

Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have:

$$
\begin{aligned}
KL(Q\|P) &= \log p(x) - \mathcal{L} \\
\log p(x) &= \mathcal{L} + KL(Q\|P) \qquad \text{(4)}
\end{aligned}
$$

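Eq. (4) is easy to verify numerically before we unpack it. In the sketch below, the binary $Z$, the single observation, the probability tables, and the choice of $Q$ are all made-up values for illustration; $\mathcal{L}$ is computed via Eq. (3):

```python
import numpy as np

# Toy discrete model with made-up numbers: binary hidden z, one observed value x.
p_z = np.array([2.0 / 3.0, 1.0 / 3.0])   # prior p(z)
p_x_given_z = np.array([0.2, 0.8])       # likelihood p(x | z) at the observed x

p_xz = p_x_given_z * p_z                 # joint p(x, z)
p_x = p_xz.sum()                         # evidence p(x)
p_z_given_x = p_xz / p_x                 # exact posterior p(z | x)

q = np.array([0.5, 0.5])                 # an arbitrary variational distribution q(z | x)

# L = E_q[log p(x|z)] - KL(q(z|x) || p(z)),  per Eq. (3)
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))

# KL(q(z|x) || p(z|x)): the gap between the bound and log p(x)
kl_q_post = np.sum(q * np.log(q / p_z_given_x))

print(f"log p(x)        = {np.log(p_x):.4f}")
print(f"ELBO + KL(Q||P) = {elbo + kl_q_post:.4f}")   # identical, per Eq. (4)
print(f"ELBO            = {elbo:.4f}   (a lower bound on log p(x))")
```
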
The meaning of Eq. (4), in plain language, is that $\log p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$ plus an error term $KL(Q\|P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$.

Since $KL(Q\|P) \geq 0$, $\log p(x)$ must be greater than or equal to $\mathcal{L}$. Therefore $\mathcal{L}$ is a *lower bound* for $\log p(x)$. $\mathcal{L}$ is also referred to as the *evidence lower bound* (ELBO), via the alternate formulation:

$$\mathcal{L} = \log p(x) - KL(Q(Z|X)\,\|\,P(Z|X)) = \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z))$$

Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL divergence terms in total in $\log p(x)$.

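This is where the "inference as optimization" duality pays off: rather than computing the posterior directly, we maximize $\mathcal{L}$ over the variational parameters. Here is a minimal sketch on a toy discrete model with made-up probability tables; the variational distribution is $q_\phi(z{=}1|x) = \phi$, and a brute-force grid search stands in for gradient ascent. The optimum recovers the exact posterior because this tiny variational family happens to contain it:

```python
import numpy as np

# Toy model with made-up numbers: binary hidden z, one observed value x.
p_z = np.array([2.0 / 3.0, 1.0 / 3.0])   # prior p(z)
p_x_given_z = np.array([0.2, 0.8])       # likelihood p(x | z) at the observed x
p_z_given_x = p_x_given_z * p_z / np.sum(p_x_given_z * p_z)   # exact posterior, for comparison

def elbo(phi):
    # Variational distribution q_phi(z | x) = [1 - phi, phi].
    q = np.array([1.0 - phi, phi])
    # L = E_q[log p(x|z) + log p(z) - log q(z|x)]
    return np.sum(q * (np.log(p_x_given_z) + np.log(p_z) - np.log(q)))

# Maximize L over phi by brute-force grid search (a stand-in for gradient ascent).
grid = np.linspace(0.001, 0.999, 999)
phi_star = grid[np.argmax([elbo(phi) for phi in grid])]

print(f"optimal q(z=1|x) = {phi_star:.3f}")
print(f"exact  p(z=1|x)  = {p_z_given_x[1]:.3f}")   # the optimum matches the true posterior
```
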
## Forward KL vs. Reverse KL

KL divergence is *not* a symmetric distance function, i.e. $KL(P\|Q) \neq KL(Q\|P)$ (except when $Q \equiv P$). The first is known as the "forward KL", while the latter is the "reverse KL". So why do we use the reverse KL? Minimizing the forward KL would require taking expectations with respect to $p(Z|X)$, which is exactly the distribution we don't know how to compute in the first place.

Let's consider the forward KL first. As we saw in the derivations above, we can write a KL divergence as the expectation of a "penalty" function $\log \frac{p(z)}{q(z)}$ over a weighting function $p(z)$:

$$KL(P\|Q) = \sum_z p(z) \log \frac{p(z)}{q(z)} = \mathbb{E}_{p(z)}\left[ \log \frac{p(z)}{q(z)} \right]$$

The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$, so the forward KL blows up wherever $Q(Z)$ fails to "cover up" $P(Z)$.

Therefore, the forward KL is minimized when we ensure that $q(z) > 0$ wherever $p(z) > 0$. The optimized variational distribution $Q(Z)$ is known as "zero-avoiding": its density avoids being zero wherever $p(Z)$ is nonzero.

Minimizing the reverse KL behaves quite differently:

$$KL(Q\|P) = \sum_z q(z) \log \frac{q(z)}{p(z)} = \mathbb{E}_{q(z)}\left[ \log \frac{q(z)}{p(z)} \right]$$

Now the weighting function is $q(z)$. If $p(Z) = 0$ somewhere, we must ensure that the weighting function $q(Z) = 0$ wherever the denominator $p(Z) = 0$, otherwise the KL blows up. This is known as "zero-forcing".

So, in summary: minimizing the forward KL "stretches" the variational distribution $Q(Z)$ to cover the entire support of $P(Z)$ like a tarp, while minimizing the reverse KL "squeezes" $Q(Z)$ under $P(Z)$.

It's important to keep the implications of the reverse KL in mind when using the mean-field approximation on machine learning problems: if we fit a unimodal distribution to a multi-modal one, we end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$).

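The zero-avoiding vs. zero-forcing behavior is easy to see numerically. Here is a minimal sketch with arbitrary numbers: a single discretized Gaussian $q$ is fit to a bimodal target $p$ by grid-searching the mean and scale that minimize either the forward or the reverse KL (a tiny floor keeps $p$ strictly positive so both divergences stay finite):

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 121)

def normalize(w):
    return w / w.sum()

def discrete_gaussian(mu, sigma):
    return normalize(np.exp(-0.5 * ((xs - mu) / sigma) ** 2))

# Bimodal target p: two well-separated modes, plus a tiny floor so p > 0 everywhere.
p = normalize(discrete_gaussian(-3.0, 0.5) + discrete_gaussian(3.0, 0.5) + 1e-8)

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Fit a single (unimodal) Gaussian q by brute-force search over (mu, sigma).
candidates = [(mu, sigma) for mu in np.linspace(-4, 4, 81) for sigma in np.linspace(0.3, 4.0, 75)]
fwd_mu, fwd_sigma = min(candidates, key=lambda c: kl(p, discrete_gaussian(*c)))   # forward KL(P||Q)
rev_mu, rev_sigma = min(candidates, key=lambda c: kl(discrete_gaussian(*c), p))   # reverse KL(Q||P)

print(f"forward KL fit: mu = {fwd_mu:.2f}, sigma = {fwd_sigma:.2f}  (stretched over both modes)")
print(f"reverse KL fit: mu = {rev_mu:.2f}, sigma = {rev_sigma:.2f}  (locked onto a single mode)")
```
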
## Connections to Deep Learning

Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:

  1. Deep Learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.
  2. Variational Bayes gives us a framework with which we can re-write statistical inference problems as optimization problems.

Combining Deep Learning and VB methods allows us to perform inference on *extremely* complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize exactly the same mean-field variational lower bound derived in this post!

Thanks for reading, and stay tuned!