Introduction to EM: Gaussian Mixture Models

Excerpt

Last updated: 2019-06-12


Last updated: 2019-06-12

Checks: 7 0

Knit directory: fiveMinuteStats/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/bernoulli_poisson_process_cache/

Untracked files:
    Untracked:  _workflowr.yml
    Untracked:  analysis/CI.Rmd
    Untracked:  analysis/gibbs_structure.Rmd
    Untracked:  analysis/libs/
    Untracked:  analysis/results.Rmd
    Untracked:  analysis/shiny/tester/
    Untracked:  docs/MH_intro_files/
    Untracked:  docs/citations.bib
    Untracked:  docs/figure/The Ancestral Process.Rmd/
    Untracked:  docs/figure/The Coalescent Process.Rmd/
    Untracked:  docs/hmm_files/
    Untracked:  docs/libs/
    Untracked:  docs/shiny/tester/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

FileVersionAuthorDateMessage
Rmd7da8091Matthew Stephens2019-06-12workflowr::wflow_publish(“analysis/intro_to_em.Rmd”)
html5f62ee6Matthew Stephens2019-03-31Build site.
html34bcc51John Blischak2017-03-06Build site.
Rmd5fbc8b5John Blischak2017-03-06Update workflowr project with wflow_update (version 0.4.0).
Rmd391ba3cJohn Blischak2017-03-06Remove front and end matter of non-standard templates.
htmlfb0f6e3stephens9992017-03-03Merge pull request #33 from mdavy86/f/review
html0713277stephens9992017-03-03Merge pull request #31 from mdavy86/f/review
Rmdd674141Marcus Davy2017-02-27typos, refs
htmlc3b365aJohn Blischak2017-01-02Build site.
Rmd67a8575John Blischak2017-01-02Use external chunk to set knitr chunk options.
Rmd5ec12c7John Blischak2017-01-02Use session-info chunk.
Rmd3bb3b73mbonakda2016-02-24add two mixture model vignettes + merge redundant info in markov chain vignettes

Pre-requisites

This document assumes basic familiarity with mixture models.

Overview

In this note we introduced mixture models. Recall that if our observations come from a mixture model with mixture components, the marginal probability distribution of is of the form:

where is the latent variable representing the mixture component for , is the mixture component, and is the mixture proportion representing the probability that belongs to the -th mixture component.

In this note, we will introduce the expectation-maximization (EM) algorithm in the context of Gaussian mixture models. Let denote the probability distribution function for a normal random variable. In this scenario, we have that the conditional distribution so that the marginal distribution of is:

Similarly, the joint probability of observations is therefore:

This note describes the EM algorithm which aims to obtain the maximum likelihood estimates of and given a data set of observations .

Review: MLE of Normal Distribution

Suppose we have observations from a Gaussian distribution with unknown mean and known variance . To find the maximum likelihood estimate for , we find the log-likelihood , take the derivative with respect to , set it equal zero, and solve for :

Setting this equal to zero and solving for , we get that . Note that applying the log function to the likelihood helped us decompose the product and removed the exponential function so that we could easily solve for the MLE.

MLE of Gaussian Mixture Model

Now we attempt the same strategy for deriving the MLE of the Gaussian mixture model. Our unknown parameters are , and so from the first section of this note, our likelihood is:

So our log-likelihood is:

Taking a look at the expression above, we already see a difference between this scenario and the simple setup in the previous section. We see that the summation over the components “blocks” our log function from being applied to the normal densities. If we were to follow the same steps as above and differentiate with respect to and set the expression equal to zero, we would get:

Now we’re stuck because we can’t analytically solve for . However, we make one important observation which provides intuition for whats to come: if we knew the latent variables , then we could simply gather all our samples such that and simply use the estimate from the previous section to estimate .

EM, informally

Intuitively, the latent variables should help us find the MLEs. We first attempt to compute the posterior distribution of given the observations:

Now we can rewrite equation (1), the derivative of the log-likelihood with respect to , as follows:

Even though depends on , we can cheat a bit and pretend that it doesn’t. Now we can solve for in this equation to get:

.

Where we set . We can think of as the effective number of points assigned to component . We see that is therefore a weighted average of the data with weights . Similarly, if we apply a similar method to finding and , we find that:

Again, remember that depends on the unknown parameters, so these equations are not closed-form expressions. This looks like a vicious circle. But, as Cosma Shalizi says, “one man’s vicious circle is another man’s successive approximation procedure.”

We are now in the following situation:

  1. If we knew the parameters, we could compute the posterior probabilities
  2. If we knew the posteriors , we could easily compute the parameters

The EM algorithm, motivated by the two observations above, proceeds as follows:

  1. Initialize the ’s, ’s and ’s and evaluate the log-likelihood with these parameters.

  2. E-step: Evaluate the posterior probabilities using the current values of the ’s and ’s with equation (2)

  3. M-step: Estimate new parameters , and with the current values of using equations (3), (4) and (5).

  4. Evaluate the log-likelihood with the new parameter estimates. If the log-likelihood has changed by less than some small , stop. Otherwise, go back to step 2.

The EM algorithm is sensitive to the initial values of the parameters, so care must be taken in the first step. However, assuming the initial values are “valid,” one property of the EM algorithm is that the log-likelihood increases at every step. This invariant proves to be useful when debugging the algorithm in practice.

EM, formally

The EM algorithm attempts to find maximum likelihood estimates for models with latent variables. In this section, we describe a more abstract view of EM which can be extended to other latent variable models.

Let be the entire set of observed variables and the entire set of latent variables. The log-likelihood is therefore:

where we’ve simply marginalized out of the joint distribution.

As we noted above, the existence of the sum inside the logarithm prevents us from applying the log to the densities which results in a complicated expression for the MLE. Now suppose that we observed both and . We call the complete data set, and we say is incomplete. As we noted previously, if we knew , the maximization would be easy.

We typically don’t know , but the information we do have about is contained in the posterior . Since we don’t know the complete log-likelihood, we consider its expectation under the posterior distribution of the latent variables. This corresponds to the E-step above. In the M-step, we maximize this expectation to find a new estimate for the parameters.

In the E-step, we use the current value of the parameters to find the posterior distribution of the latent variables given by . This corresponds to the in the previous section. We then use this to find the expectation of the complete data log-likelihood, with respect to this posterior, evaluated at an arbitrary . This expectation is denoted and it equals:

In the M-step, we determine the new parameter by maximizing Q:

Now we derive the relevant quantities for Gaussian mixture models and compare it to our “informal” derivation above. The complete likelihood takes the form

so the complete log-likelihood takes the form:

Note that for the complete log-likelihood, the logarithm acts directly on the normal density which leads to a simpler solution for the MLE. As we said, in practice, we do not observe the latent variables, so we consider the expectation of the complete log-likelihood with respect to the posterior of the latent variables.

The expected value of the complete log-likelihood is therefore:

Since , we see that this is simply which we computed in the previous section. Hence, we have

EM proceeds as follows: first choose initial values for and use these in the E-step to evaluate the . Then, with fixed, maximize the expected complete log-likelihood above with respect to and . This leads to the closed form solutions we derived in the previous section.

Example

In this example, we will assume our mixture components are fully specified Gaussian distributions (i.e the means and variances are known), and we are interested in finding the maximum likelihood estimates of the ’s.

Assume we have components, so that:

The true mixture proportions will be and . First we simulate data from this mixture model:

# mixture components
mu.true    = c(5, 10)
sigma.true = c(1.5, 2)

# determine Z_i
Z = rbinom(500, 1, 0.75)
# sample from mixture model

X <- rnorm(10000, mean=mu.true[Z+1], sd=sigma.true[Z+1])
hist(X,breaks=15)

VersionAuthorDate
5f62ee6Matthew Stephens2019-03-31
c3b365aJohn Blischak2017-01-02

Now we write a function to compute the log-likelihood for the incomplete data, assuming the parameters are known. This will be used to determine convergence:

compute.log.lik <- function(L, w) {
  L[,1] = L[,1]*w[1]
  L[,2] = L[,2]*w[2]
  return(sum(log(rowSums(L))))
}

Since the mixture components are fully specified, for each sample we can compute the likelihood and . We store these values in the columns of L:

L = matrix(NA, nrow=length(X), ncol= 2)
L[, 1] = dnorm(X, mean=mu.true[1], sd = sigma.true[1])
L[, 2] = dnorm(X, mean=mu.true[2], sd = sigma.true[2])

Finally, we implement the E and M step in the EM.iter function below. The mixture.EM function is the driver which checks for convergence by computing the log-likelihoods at each step.

mixture.EM <- function(w.init, L) {
  
  w.curr <- w.init
  
  # store log-likehoods for each iteration
  log_liks <- c()
  ll       <- compute.log.lik(L, w.curr)
  log_liks <- c(log_liks, ll)
  delta.ll <- 1
  
  while(delta.ll > 1e-5) {
    w.curr   <- EM.iter(w.curr, L)
    ll       <- compute.log.lik(L, w.curr)
    log_liks <- c(log_liks, ll)
    delta.ll <- log_liks[length(log_liks)]  - log_liks[length(log_liks)-1]
  }
  return(list(w.curr, log_liks))
}

EM.iter <- function(w.curr, L, ...) {
  
  # E-step: compute E_{Z|X,w0}[I(Z_i = k)]
  z_ik <- L
  for(i in seq_len(ncol(L))) {
    z_ik[,i] <- w.curr[i]*z_ik[,i]
  }
  z_ik     <- z_ik / rowSums(z_ik)
  
  # M-step
  w.next   <- colSums(z_ik)/sum(z_ik)
  return(w.next)
}
#perform EM
ee <- mixture.EM(w.init=c(0.5,0.5), L)
print(paste("Estimate = (", round(ee[[1]][1],2), ",", round(ee[[1]][2],2), ")", sep=""))
[1] "Estimate = (0.29,0.71)"

Finally, we inspect the evolution of the log-likelihood and note that it is strictly increases:

plot(ee[[2]], ylab='incomplete log-likelihood', xlab='iteration')

VersionAuthorDate
5f62ee6Matthew Stephens2019-03-31
c3b365aJohn Blischak2017-01-02
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.4.0 Rcpp_1.0.1      digest_0.6.19   rprojroot_1.3-2
 [5] backports_1.1.4 git2r_0.25.2    magrittr_1.5    evaluate_0.14  
 [9] stringi_1.4.3   fs_1.3.1        whisker_0.3-2   rmarkdown_1.13 
[13] tools_3.5.2     stringr_1.4.0   glue_1.3.1      xfun_0.7       
[17] yaml_2.2.0      compiler_3.5.2  htmltools_0.3.6 knitr_1.23     

This site was created with R Markdown