7.2 Metropolis-Hastings | Advanced Statistical Computing
Excerpt
The book covers material taught in the Johns Hopkins Biostatistics Advanced Statistical Computing course.
Let q(Y∣X) be a transition density for p-dimensional X and Y from which we can easily simulate and let π(X) be our target density (i.e. the stationary distribution that our Markov chain will eventually converge to). The Metropolis-Hastings procedure is an iterative algorithm where at each stage, there are three steps. Suppose we are currently in the state x and we want to know how to move to the next state in the state space.
-
Simulate a candidate value y∼q(Y∣x). Note that the candidate value depends on our current state x.
-
Let α(y∣x)=min{π(y)q(x∣y)π(x)q(y∣x),1} α(y∣x) is referred to as the acceptance ratio.
-
Simulate u∼Unif(0,1). If u≤α(y∣x), then the next state is equal to y. Otherwise, the next state is still x (we stay in the same place).
This three step process represents the transition kernel for our Markov chain from which we are simulating. Recall that the hope is that our Markov chain will, after many simulations, converge to the stationary distribution. Eventually, we can be reasonably sure that the samples that we draw from this process are draws from the stationary distribution, i.e. π(X).
Why does this work? Recall that we need to have the Markov chain generated by this transition kernel be time reversible. If K(y∣x) is the transition kernel embodied by the three steps above, then we need to show that
π(y)K(x∣y)=π(x)K(y∣x). We can write the transition kernel K(y∣x) as
K(y∣x)=α(y∣x)q(y∣x)+1{y=x}[1−∫α(s∣x)q(s∣x)ds] The function K(y∣x) can be decomposed into two parts and we can treat each separately. First we can show that K(y∣x)π(x)=K(x∣y)π(y)α(y∣x)q(y∣x)π(x)=α(x∣y)q(x∣y)π(y)min{π(y)π(x)q(x∣y)q(y∣x),1}q(y∣x)π(x)=min{π(x)π(y)q(y∣x)q(x∣y),1}q(x∣y)π(y)min(π(y)q(x∣y),q(y∣x)π(x))=min(π(x)q(y∣x),q(x∣y)π(y))
The last line is trivially true because on both sides of the equation we are taking the minimum of the same quantities.
For the second part, let r(x)=[1−∫α(s∣x)q(s∣x)ds]. Then we need to show that 1{y=x}r(x)π(x)=1{x=y}r(y)π(y) over the set where y=x. But this is trivially true because if y=x then every quantity in the above equation is the same regardless of an x or a y appears in it.
Random Walk Metropolis-Hastings
Let q(y∣x) be defined as
y=x+ε where ε∼g and g is a probability density symmetric about 0. Given this definition, we have
q(y∣x)=g(ε)
and
q(x∣y)=g(−ε)=g(ε)
Because q(y∣x) is symmetric in x and y, the Metropolis-Hastings acceptance ratio α(y∣x) simplifies to
α(y∣x)=min{π(y)q(x∣y)π(x)q(y∣x),1}=min{π(y)π(x),1}
Given our current state x, the random walk Metropolis-Hastings algorithm proceeds as follows:
-
Simulate ε∼g and let y=x+ε.
-
Compute α(y∣x)=min{π(y)π(x),1}
-
Simulate u∼Unif(0,1). If u≤α(y∣x) then accept y as the next state, otherwise stay at x.
It should be noted that this form of the Metropolis-Hastings algorithm was the original form of the Metropolis algorithm.
Example: Sampling Normal Variates
As a simple example, we can show how random walk Metropolis-Hastings can be used to sample from a standard Normal distribution. Let g be a uniform distribution over the interval (−δ,δ), where δ is small and >0 (its exact value doesn’t matter). Then we can do
-
Simulate ε∼Unif(−δ,δ) and let y=x+ε.
-
Compute α(y∣x)=min{φ(y)φ(x),1} where φ is the standard Normal density.
-
Simulate u∼Unif(0,1). If u≤α(y∣x) then accept y as the next state, otherwise stay at x.
We can see what this looks like but running the iteration many times.
<span id="cb1-1">delta <span><-</span> <span>0.5</span></span>
<span id="cb1-2">N <span><-</span> <span>500</span></span>
<span id="cb1-3">x <span><-</span> <span>numeric</span>(N)</span>
<span id="cb1-4">x[<span>1</span>] <span><-</span> <span>0</span></span>
<span id="cb1-5"><span>set.seed</span>(<span>2018-06-04</span>)</span>
<span id="cb1-6"><span>for</span>(i <span>in</span> <span>2</span><span>:</span>N) {</span>
<span id="cb1-7"> eps <span><-</span> <span>runif</span>(<span>1</span>, <span>-</span>delta, delta)</span>
<span id="cb1-8"> y <span><-</span> x[i<span>-1</span>] <span>+</span> eps</span>
<span id="cb1-9"> alpha <span><-</span> <span>min</span>(<span>dnorm</span>(y, <span>log =</span> <span>TRUE</span>) <span>-</span> <span>dnorm</span>(x[i<span>-1</span>], <span>log =</span> <span>TRUE</span>), <span>0</span>)</span>
<span id="cb1-10"> u <span><-</span> <span>runif</span>(<span>1</span>, <span>0</span>, <span>1</span>)</span>
<span id="cb1-11"> <span>if</span>(<span>log</span>(u) <span><=</span> alpha)</span>
<span id="cb1-12"> x[i] <span><-</span> y</span>
<span id="cb1-13"> <span>else</span></span>
<span id="cb1-14"> x[i] <span><-</span> x[i<span>-1</span>]</span>
<span id="cb1-15">}</span>
<span id="cb1-16"><span>summary</span>(x)</span>
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.1314 -0.6135 -0.1485 -0.1681 0.3034 1.8465
We can take a look at a histogram of the samples to see if they look like a Normal distribution.
We can also look at a trace plot of the samples to see how the chain moved around.
<span id="cb4-1"><span>library</span>(ggplot2)</span>
<span id="cb4-2"><span>qplot</span>(<span>1</span><span>:</span>N, x, <span>geom =</span> <span>"line"</span>, <span>xlab =</span> <span>"Iteration"</span>)</span>
What would happen if we changed the value of δ? Here we make δ bigger and we can see the effect on the trace plot.
<span id="cb5-1">delta <span><-</span> <span>2</span></span>
<span id="cb5-2">N <span><-</span> <span>500</span></span>
<span id="cb5-3">x <span><-</span> <span>numeric</span>(N)</span>
<span id="cb5-4">x[<span>1</span>] <span><-</span> <span>0</span></span>
<span id="cb5-5"><span>set.seed</span>(<span>2018-06-04</span>)</span>
<span id="cb5-6"><span>for</span>(i <span>in</span> <span>2</span><span>:</span>N) {</span>
<span id="cb5-7"> eps <span><-</span> <span>runif</span>(<span>1</span>, <span>-</span>delta, delta)</span>
<span id="cb5-8"> y <span><-</span> x[i<span>-1</span>] <span>+</span> eps</span>
<span id="cb5-9"> alpha <span><-</span> <span>min</span>(<span>dnorm</span>(y, <span>log =</span> <span>TRUE</span>) <span>-</span> <span>dnorm</span>(x[i<span>-1</span>], <span>log =</span> <span>TRUE</span>), <span>0</span>)</span>
<span id="cb5-10"> u <span><-</span> <span>runif</span>(<span>1</span>, <span>0</span>, <span>1</span>)</span>
<span id="cb5-11"> <span>if</span>(<span>log</span>(u) <span><=</span> alpha)</span>
<span id="cb5-12"> x[i] <span><-</span> y</span>
<span id="cb5-13"> <span>else</span></span>
<span id="cb5-14"> x[i] <span><-</span> x[i<span>-1</span>]</span>
<span id="cb5-15">}</span>
<span id="cb5-16"><span>summary</span>(x)</span>
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.60714 -0.72944 -0.05603 -0.07395 0.53416 2.51142
Now look at the trace plot.
<span id="cb7-1"><span>library</span>(ggplot2)</span>
<span id="cb7-2"><span>qplot</span>(<span>1</span><span>:</span>N, x, <span>geom =</span> <span>"line"</span>, <span>xlab =</span> <span>"Iteration"</span>)</span>
You can see in this trace plot that there are a number of iterations where there is no movement (so that the candidate is rejected, whereas in the above trace plot with the smaller value of δ there was much more movement, although the steps themselves were smaller. In particular, the first chain exhibited a bit more autocorrelation. With a larger δ, we are proposing candidates that are much further from the current state, and that may not comport with the Normal distribution. As a result, the candidates are more likely to be rejected.
This raises an important in Markov chain samplers: tuning. The theory does not depend on a specific value of δ; both chains above should converge to a Normal distribution. However, the speed with which they converge and amount of the sample space that is explored does depend on δ. Hence, the sampler can be tuned to improve its efficiency.
Independence Metropolis Algorithm
The independence Metropolis algorithm defines a transition density as q(y∣x)=q(y). In other words, the candidate proposals do not depend on the current state x. Otherwise, the algorithm works the same as the original Metropolis-Hastings algorithm, with a modified acceptance ratio,
α(y∣x)=min{π(y)q(x)π(x)q(y),1}
The independence sampler seems to work well in situations where rejection sampling might be reasonable, i.e. relatively low-dimensional problems. In particular, it has the interesting property that if
C=supxπ(x)q(x)<∞
then
‖πn−π‖≤kρn for 0<ρ<1 and some constant k>0. So if we have the same conditions under which rejection sampling is allowed, then we can prove that convergence of the chain to the stationary distribution is geometric with rate ρ. The value of ρ depends on C; if C is close to 1, then ρ will be small. This is an interesting property of this sampler, but it is largely of theoretical interest.
Slice Sampler
Given the current state x, generate
y∼Unif(0,π(x))
Given y, generate a new state x⋆
x⋆∼Unif({x:π(x)≥y})
Slice Sampler
Note that π(x)=∫1{0≤y≤π(x)}dy. Therefore, f(x,y)=1{0≤y≤π(x)} is a joint density. The slice sampler method creates an auxiliary variable y and the intergrates it out later to give back our original target density π(x).
As depicted in the illustration, the slice sampler can be useful for densities that might have multiple modes where it’s easy to get stuck around one model. The sampling of x⋆ allows one to jump easily between modes. However, sampling x⋆ may be difficult depending on the complexity of π(x) and ultimately may not be worth the effort. If π(x) is very wiggly or is high-dimensional, then calculating the region {x:π(x)≥y} will be very difficult.
Hit and Run Sampler
The hit and run sampler combines ideas from line search optimization methods with MCMC sampling. Here, suppose we have the current state x in p-dimensions and we want to propose a new state. Let e be a random p-dimensional vector that indicates a random direction in which to travel. Then construct the density p(r)∝π(x+re). where r is a scalar (and hence p(r) is a 1-dimensional density). From this density, sample a value for r and set the proposal to be x⋆=x+re. This process has teh advantage that it chooses more directions than a typical Gibbs sampler (see below) and does not require a multi-dimensional proposal distribution. However, sampling from the density p(r) may not be obvious as there is no guaranteed closed form solution.
Single Component Metropolis-Hastings
The standard Metropolis algorithm updates the entire parameter vector at once with a single step. Hence, a p-dimensional parameter vector must have a p-dimensional proposal distribution q. However, it is sometimes simpler to update individual components of the parameter vector one at a time, in an algortihm known as single component Metropolis-Hastings (SCMH).
Let x(n)=(x1(n),x2(n),…,xp(n)) be a p-dimensional vector representing our parameters of interest at iteration n. Define x−i=(x1,…,xi−1,xi+1,…,xp) as the vector x with the ith component removed. The SCMH algorithm updates each paramater in x one at a time. All that is needed to update the ith component of x(n) is a proposal distribution q(y∣xi(n),x−i(n)) for proposing a new value of xi given the current value of xi and all the other components.
At iteration n, SCMH updates the ith component via the following steps.
-
Sample yi∼qi(y∣xi(n),x−i(n)) as a proposal for component i.
-
Let α(yi∣x(n))=min(π(yi∣x−i(n))q(xi∣yi,x−i(n))π(xi∣x−i(n))q(yi∣xi,x−i(n)),1)
-
Accept yi for component i with probablity α(yi∣x(n)).
-
Repeat 1–3 p times until every component is updated.