Last updated: 2018-08-28


The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

Matthew Stephens has an introduction to EM: Stephens fiveMinuteStats EM

Given a statistical model that generates a set \(\mathbf{X}\) of observed data, a set of unobserved latent data or missing values \(\mathbf{Z}\), and a vector of unknown parameters \(\boldsymbol{\theta}\), along with a likelihood function \(L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta)\), the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data \[ {\displaystyle L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} |{\boldsymbol {\theta }})=\int p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})d\mathbf {Z} } \]

E step: Define \(Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})\) as the expected value of the log likelihood function of \({\boldsymbol{\theta}}\), with respect to the current conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) and the current estimates of the parameters \(\boldsymbol\theta^{(t)}\): \[ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \]

M step: Find the parameters that maximize this quantity: \[ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \]

Note: the trick is that we replace the unknown \(\mathbf{Z}\) with its posterior expectation, i.e., we average the complete log likelihood over the conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) and the current parameter estimates.
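A minimal sketch of the generic EM loop in R, assuming hypothetical model-specific helpers `e_step()`, `m_step()`, and `loglik()` (concrete versions for a Gaussian mixture are sketched in the example below):

```r
# Generic EM loop (sketch). e_step(), m_step() and loglik() are hypothetical,
# model-specific functions: e_step() returns the posterior of Z given X and the
# current theta, m_step() maximizes Q(theta | theta^(t)), and loglik() evaluates
# the observed-data log likelihood l(theta; x) used to monitor convergence.
run_em <- function(x, theta, loglik, e_step, m_step, tol = 1e-8, maxit = 1000) {
  ll_old <- loglik(theta, x)
  for (it in seq_len(maxit)) {
    q     <- e_step(theta, x)   # E step
    theta <- m_step(q, x)       # M step
    ll    <- loglik(theta, x)
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(theta = theta, loglik = ll, iterations = it)
}
```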

Example

Consider a Gaussian mixture model \[ X \sim \sum_{k=1}^{K} \pi_{k}N(\boldsymbol\mu_{k}, \Sigma_{k}) \] with parameters \(\boldsymbol\theta = (\pi_{1}, \cdots, \pi_{K}, \boldsymbol\mu_{1}, \cdots, \boldsymbol\mu_{K}, \Sigma_{1}, \cdots, \Sigma_{K})\).

The incomplete likelihood is \[ L(\boldsymbol\theta; X) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_{k}N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \]

The complete likelihood is \[ L(\boldsymbol\theta; X, \mathbf{z}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\pi_{k}N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right]^{\mathbb{I}(z_{i} = k)} \]

E step:

Given our current estimate of the parameters \(\boldsymbol\theta^{(t)}\), the conditional distribution of \(z_{i}\) is determined by Bayes' theorem: it is proportional to the normal density of component \(k\) at \(\mathbf{x}_{i}\), weighted by \(\pi_{k}\): \[ {\displaystyle \gamma_{z_{i}}^{(t)}(k):=\operatorname {P} (z_{i}=k|X_{i}=\mathbf {x} _{i};\boldsymbol\theta ^{(t)})={\frac {\pi_{k}^{(t)}\ N(\mathbf{x} _{i};{\boldsymbol {\mu }}_{k}^{(t)},\Sigma _{k}^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}N(\mathbf {x} _{i}; \boldsymbol\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}}} \]
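A sketch of this E step in R, for the one-dimensional special case (each component is \(N(\mu_k, \sigma_k^2)\), so base R's `dnorm()` can play the role of the normal density); the layout of the parameter list `theta` is my own convention:

```r
# E step for a univariate Gaussian mixture: gamma[i, k] = P(z_i = k | x_i, theta).
# theta is assumed to be a list with elements pi (K mixture weights),
# mu (K means) and sigma (K standard deviations).
e_step <- function(theta, x) {
  K <- length(theta$pi)
  # unnormalized responsibilities: pi_k * N(x_i; mu_k, sigma_k^2)
  num <- sapply(seq_len(K), function(k) {
    theta$pi[k] * dnorm(x, mean = theta$mu[k], sd = theta$sigma[k])
  })
  num / rowSums(num)  # normalize rows so each observation's responsibilities sum to 1
}
```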

\[ {\displaystyle {\begin{aligned}Q(\theta |\theta ^{(t)})&=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}[\log L(\theta ;\mathbf {X} ,\mathbf {z} )]\\ &=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}\left[\log \prod _{i=1}^{n}L(\theta ;\mathbf {x} _{i},z_{i})\right]\\ &=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}\left[\sum _{i=1}^{n}\log L(\theta ;\mathbf {x} _{i},z_{i})\right]\\ &=\sum _{i=1}^{n}\operatorname {E} _{\mathbf {z} |\mathbf {X} ;\boldsymbol\theta ^{(t)}}[\log L(\theta ;\mathbf {x} _{i},z_{i})]\\ &=\sum _{i=1}^{n}\sum_{k=1}^{K} \operatorname {E} _{\mathbf {z} |\mathbf {X} ;\boldsymbol\theta ^{(t)}}\left[ \mathbb{I}(z_{i} = k)\left(\log \pi_{k} + \log N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right) \right]\\ &=\sum _{i=1}^{n}\sum _{k=1}^{K}P(z_{i}=k|X_{i}=\mathbf {x} _{i};\boldsymbol\theta ^{(t)}) \left(\log \pi_{k} + \log N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right)\\ &=\sum _{i=1}^{n}\sum _{k=1}^{K}\gamma_{z_{i}}^{(t)}(k){\big [}\log \pi_{k}-{\tfrac {1}{2}}\log |\Sigma _{k}|-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{k})^{\top }\Sigma _{k}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{k})-{\tfrac {d}{2}}\log(2\pi ){\big ]}\end{aligned}}} \]

M step:

\[ \pi_{k}^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{z_{i}}^{(t)}(k) \] \[ \boldsymbol\mu_{k}^{(t+1)} = \frac{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k) \mathbf{x}_{i} }{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k)} \]

\[ \Sigma_{k}^{(t+1)} = \frac{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k) (\mathbf{x}_{i} - \boldsymbol\mu_{k}^{(t+1)}) (\mathbf{x}_{i} - \boldsymbol\mu_{k}^{(t+1)})^{\top }}{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k)} \]
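Continuing the one-dimensional sketch (and reusing `run_em()` and `e_step()` from above), the M step updates and a small simulated example might look like this; the starting values and simulated data are arbitrary:

```r
# M step for a univariate Gaussian mixture: weighted MLE updates of the formulas above.
m_step <- function(gamma, x) {
  nk <- colSums(gamma)                                       # effective counts per component
  mu <- colSums(gamma * x) / nk                              # mu_k^(t+1): weighted means
  sigma <- sqrt(colSums(gamma * outer(x, mu, "-")^2) / nk)   # sigma_k^(t+1)
  list(pi = nk / length(x), mu = mu, sigma = sigma)
}

# Incomplete (observed-data) log likelihood, used to monitor convergence.
loglik <- function(theta, x) {
  dens <- sapply(seq_along(theta$pi), function(k) {
    theta$pi[k] * dnorm(x, theta$mu[k], theta$sigma[k])
  })
  sum(log(rowSums(dens)))
}

# Small simulated example: two well-separated components.
set.seed(1)
x_sim  <- c(rnorm(150, mean = -2, sd = 1), rnorm(100, mean = 3, sd = 0.7))
theta0 <- list(pi = c(0.5, 0.5), mu = c(-1, 1), sigma = c(1, 1))  # crude starting values
fit <- run_em(x_sim, theta0, loglik, e_step, m_step)
fit$theta$mu  # should end up near the true means -2 and 3 (up to label switching)
```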

Theory

If \(\mathbf{z}\) were observed, then finding the MLE would amount to maximizing the quantity \[ l(\theta; x,z) = \log p(x,z|\theta), \] which is the complete log likelihood.

Since \(\mathbf{z}\) is not observed, the probability of the data \(x\) is a marginal probability, and the (incomplete) log likelihood is \[ l(\theta; x) = \log p(x|\theta) = \log \sum_{z} p(x,z|\theta) \]

The complete log likelihood is a random quantity. Suppose we average over \(z\) to remove the randomness, using an ‘averaging distribution’ \(q(z|x)\). We define the expected complete log likelihood \[ \langle l(\theta; x,z)\rangle_{q} = \sum_{z} q(z|x) \log p(x,z|\theta), \] which is linear in the complete log likelihood. If \(q\) is chosen well, then perhaps the expected complete log likelihood will not be too far from the log likelihood.

An averaging distribution \(q(z|x)\) can be used to provide a lower bound on the log likelihood: \[ {\begin{align*} l(\theta;x) &= \log p(x|\theta) = \log \sum_{z} p(x,z|\theta) \\ &= \log \sum_{z} q(z|x) \frac{p(x,z|\theta)}{q(z|x)} \\ &\geq \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} = \mathcal{L}(q,\theta), \end{align*}} \] where the inequality is Jensen's inequality (the log is concave). \(\mathcal{L}(q,\theta)\) is the auxiliary function. For an arbitrary distribution \(q(z|x)\), the auxiliary function \(\mathcal{L}(q,\theta)\) is a lower bound for the log likelihood.
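This bound can be checked numerically. The sketch below uses a toy two-component univariate mixture with a single observation (so the sum over \(z\) has only two terms) and an arbitrary averaging distribution \(q\); all of the specific numbers are made up for illustration.

```r
# Numerical check that L(q, theta) <= l(theta; x) for an arbitrary q(z|x).
# Toy model: a single observation x from a two-component univariate normal mixture.
x  <- 0.3
w  <- c(0.6, 0.4)                 # mixture weights pi_k
mu <- c(-1, 2)
s  <- c(1, 1)

joint <- w * dnorm(x, mu, s)      # p(x, z = k | theta), k = 1, 2
l_obs <- log(sum(joint))          # l(theta; x) = log p(x | theta)

q     <- c(0.9, 0.1)              # an arbitrary averaging distribution q(z|x)
lower <- sum(q * log(joint / q))  # auxiliary function L(q, theta)

lower <= l_obs                    # TRUE: the auxiliary function is a lower bound
```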

The EM algorithm is a coordinate ascent algorithm on the function \(\mathcal{L}(q,\theta)\).

E step: \(q^{(t+1)} = \arg \max_{q} \mathcal{L}(q,\theta^{(t)})\)

M step: \(\theta^{(t+1)} = \arg \max_{\theta} \mathcal{L}(q^{(t+1)},\theta)\)

The lower bound can be decomposed as \[ \begin{align*} \mathcal{L}(q,\theta) &= \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x) \log p(x,z|\theta) - \sum_{z} q(z|x) \log q(z|x) \\ &= \langle l(\theta; x,z)\rangle_{q} - \sum_{z} q(z|x) \log q(z|x), \end{align*} \] where the second term (the entropy of \(q\)) is independent of \(\theta\). Thus, maximizing \(\mathcal{L}(q,\theta)\) with respect to \(\theta\) is equivalent to maximizing \(\langle l(\theta; x,z)\rangle_{q}\) with respect to \(\theta\).

Maximizing \(\mathcal{L}(q,\theta^{(t)})\) with respect to the averaging distribution \(q\), the posterior \(q^{(t+1)}(z|x) = p(z|x,\theta^{(t)})\) yields the maximum: \[ \begin{align*} \mathcal{L}(p(z|x,\theta^{(t)}),\theta^{(t)}) &= \sum_{z} p(z|x, \theta^{(t)}) \log \frac{p(x,z|\theta^{(t)})}{p(z|x, \theta^{(t)})} \\ &= \sum_{z} p(z|x, \theta^{(t)}) \log p(x|\theta^{(t)})\\ &= \log p(x|\theta^{(t)}) = l(\theta^{(t)}; x) \end{align*} \] Given that \(l(\theta^{(t)}; x)\) is an upper bound of \(\mathcal{L}(q,\theta^{(t)})\), this shows that \(\mathcal{L}(q,\theta^{(t)})\) is maximized by setting \(q(z|x)\) equal to \(p(z|x,\theta^{(t)})\).

In the M step, we increase the lower bound; in the E step, we close the gap between the log likelihood and the bound by setting \(q\) to the posterior of \(z\), so that \(l(\theta^{(t)}; x) = \mathcal{L}(q^{(t+1)}, \theta^{(t)})\).

Alternative view: KL divergence

\[ \begin{align*} l(\theta; x) - \mathcal{L}(q, \theta) &= l(\theta; x) - \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x)\log p(x|\theta) - \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x)\log p(x|\theta) - \sum_{z} q(z|x) \log \frac{p(z|x,\theta)}{q(z|x)} - \sum_{z} q(z|x) \log p(x|\theta)\\ &= \sum_{z} q(z|x) \log \frac{q(z|x)}{p(z|x,\theta)} = D_{KL}(q(z|x)\,||\,p(z|x,\theta)) \geq 0, \end{align*} \] using \(p(x,z|\theta) = p(z|x,\theta)\,p(x|\theta)\) in the third line. At \(\theta = \theta^{(t)}\), the KL divergence is uniquely minimized (to zero) by letting \(q(z|x) = p(z|x,\theta^{(t)})\).

Minimizing the KL divergence is equivalent to maximizing \(\mathcal{L}(q, \theta)\).
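This equivalence can be verified on the same toy setup: the gap \(l(\theta;x) - \mathcal{L}(q,\theta)\) equals the KL divergence for any \(q\), and both are zero when \(q\) is the posterior \(p(z|x,\theta)\). Again, the specific numbers are arbitrary.

```r
# Check that l(theta; x) - L(q, theta) = KL(q || p(z | x, theta)) on a toy mixture.
x  <- 0.3
w  <- c(0.6, 0.4)                        # mixture weights pi_k
mu <- c(-1, 2)
s  <- c(1, 1)

joint     <- w * dnorm(x, mu, s)         # p(x, z = k | theta)
l_obs     <- log(sum(joint))             # l(theta; x)
posterior <- joint / sum(joint)          # p(z | x, theta)

gap_and_kl <- function(q) {
  lower <- sum(q * log(joint / q))       # auxiliary function L(q, theta)
  kl    <- sum(q * log(q / posterior))   # D_KL(q || p(z | x, theta))
  c(gap = l_obs - lower, kl = kl)
}

gap_and_kl(c(0.9, 0.1))   # the gap equals the KL divergence and is positive
gap_and_kl(posterior)     # both are (numerically) zero: the bound is tight at the posterior
```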

Let \(\tilde{p}(x)\) denote the empirical distribution of the data. Then \[ \begin{align*} D_{KL}(\tilde{p}(x)||p(x|\theta)) &= \sum_{x} \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x|\theta)} = -\sum_{x} \tilde{p}(x)\log p(x|\theta) + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &\leq -\sum_{x} \tilde{p}(x) \mathcal{L}(q,\theta) + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &= -\sum_{x} \tilde{p}(x) \sum_{z}q(z|x)\log \frac{p(x,z|\theta)}{q(z|x)} + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &= \sum_{x} \tilde{p}(x) \sum_{z}q(z|x)\log \frac{\tilde{p}(x) q(z|x)}{p(x,z|\theta)} \\ &= D_{KL}(\tilde{p}(x) q(z|x) || p(x,z|\theta)) \end{align*} \]

The KL divergence between the empirical distribution and the model (the quantity that we wish to minimize) is upper bounded by a “complete KL divergence”, a KL divergence between joint distributions on (x,z).

Minimizing the complete KL divergence with respect to q and \(\theta\) is equivalent to maximizing the auxiliary function \(\mathcal{L}(q, \theta)\) with respect to these variables.

Define \(D(q||\theta) = D_{KL}(\tilde{p}(x)q(z|x) \,||\, p(x,z|\theta))\).

E step: \(q^{(t+1)}(z|x) = \arg \min_{q} D(q || \theta^{(t)})\)

M step: \(\theta^{(t+1)} = \arg \min_{\theta} D(q^{(t+1)}||\theta)\)


This reproducible R Markdown analysis was created with workflowr 1.1.1