The expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
Matthew Stephens has an introduction to EM: Stephens fiveMinuteStats EM
Given the statistical model which generates a set \(\mathbf{X}\) of observed data, a set of unobserved latent data or missing values \(\mathbf{Z}\), and a vector of unknown parameters \(\boldsymbol{\theta}\), along with a likelihood function \(L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta)\), the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data \[ {\displaystyle L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} |{\boldsymbol {\theta }})=\int p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})d\mathbf {Z} } \]
E step: Define \(Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})\) as the expected value of the log likelihood function of \({\boldsymbol{\theta}}\), with respect to the current conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) and the current estimates of the parameters \(\boldsymbol\theta^{(t)}\): \[ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \]
M step: Find the parameters that maximize this quantity: \[ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \]
Note: the trick is that we replace the unknown \(\mathbf{Z}\) with an expectation taken under its posterior distribution given \(\mathbf{X}\) and the current parameter estimates \(\boldsymbol\theta^{(t)}\).
Consider a \(K\)-component Gaussian mixture model \[ X \sim \sum_{k=1}^{K} \pi_{k}N(\boldsymbol\mu_{k}, \Sigma_{k}) \] with parameters \(\boldsymbol\theta = (\pi_{1}, \cdots, \pi_{K}, \boldsymbol\mu_{1}, \cdots, \boldsymbol\mu_{K}, \Sigma_{1}, \cdots, \Sigma_{K})\).
The incomplete likelihood is \[ L(\boldsymbol\theta; X) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_{k}N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \]
The complete likelihood, with latent component labels \(z_{i} \in \{1, \ldots, K\}\), is \[ L(\boldsymbol\theta; X, \mathbf{z}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[\pi_{k}N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right]^{\mathbb{I}(z_{i} = k)} \]
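As a running example, here is a minimal 1-D sketch (two components, scalar standard deviations rather than full covariance matrices); the simulation settings and the names `x`, `theta0`, and `loglik` are mine, chosen only for illustration.

```r
set.seed(1)
# Simulate n = 500 points from a two-component 1-D Gaussian mixture:
# pi = (0.3, 0.7), mu = (-2, 3), sd = (1, 1.5)
n <- 500
z <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
x <- rnorm(n, mean = c(-2, 3)[z], sd = c(1, 1.5)[z])

# Incomplete (observed-data) log likelihood:
# sum_i log sum_k pi_k N(x_i; mu_k, sd_k)
loglik <- function(x, pi, mu, sd) {
  dens <- sapply(seq_along(pi), function(k) pi[k] * dnorm(x, mu[k], sd[k]))
  sum(log(rowSums(dens)))
}

# Evaluate at a deliberately poor initial guess
theta0 <- list(pi = c(0.5, 0.5), mu = c(-1, 1), sd = c(1, 1))
loglik(x, theta0$pi, theta0$mu, theta0$sd)
```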
E step:
Given our current estimate of the parameters \(\boldsymbol\theta^{(t)}\), the conditional distribution of \(z_{i}\) follows from Bayes' theorem: the height of the \(k\)-th normal density at \(\mathbf{x}_{i}\), weighted by \(\pi_{k}\) and normalized over components: \[ {\displaystyle \gamma_{z_{i}}^{(t)}(k):=\operatorname {P} (z_{i}=k|X_{i}=\mathbf {x} _{i};\boldsymbol\theta ^{(t)})={\frac {\pi_{k}^{(t)}\ N(\mathbf{x} _{i};{\boldsymbol {\mu }}_{k}^{(t)},\Sigma _{k}^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}N(\mathbf {x} _{i}; \boldsymbol\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}}} \]
\[ {\displaystyle {\begin{aligned}Q(\boldsymbol\theta |\boldsymbol\theta ^{(t)})&=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}[\log L(\boldsymbol\theta ;\mathbf {X} ,\mathbf {z} )]\\ &=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}\Big[\log \prod _{i=1}^{n}L(\boldsymbol\theta ;\mathbf {x} _{i},z_{i})\Big]\\ &=\operatorname {E} _{\mathbf {z} |\mathbf {X} ,\boldsymbol\theta ^{(t)}}\Big[\sum _{i=1}^{n}\log L(\boldsymbol\theta ;\mathbf {x} _{i},z_{i})\Big]\\ &=\sum _{i=1}^{n}\operatorname {E} _{z_{i}|\mathbf {x} _{i},\boldsymbol\theta ^{(t)}}[\log L(\boldsymbol\theta ;\mathbf {x} _{i},z_{i})]\\ &=\sum _{i=1}^{n}\sum _{k=1}^{K}\operatorname {E} _{z_{i}|\mathbf {x} _{i},\boldsymbol\theta ^{(t)}}[\mathbb{I}(z_{i}=k)]\left(\log \pi_{k} + \log N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right)\\ &=\sum _{i=1}^{n}\sum _{k=1}^{K}P(z_{i}=k|X_{i}=\mathbf {x} _{i};\boldsymbol\theta ^{(t)}) \left(\log \pi_{k} + \log N(\mathbf{x}_{i}; \boldsymbol\mu_{k}, \Sigma_{k}) \right)\\ &=\sum _{i=1}^{n}\sum _{k=1}^{K}\gamma_{z_{i}}^{(t)}(k){\big [}\log \pi_{k}-{\tfrac {1}{2}}\log |\Sigma _{k}|-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{k})^{\top }\Sigma _{k}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{k})-{\tfrac {d}{2}}\log(2\pi ){\big ]}\end{aligned}}} \]
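Continuing the 1-D sketch, the responsibilities \(\gamma_{z_{i}}^{(t)}(k)\) can be computed directly from the Bayes-rule formula above (the helper name `e_step` is mine):

```r
# E step: gamma[i, k] = P(z_i = k | x_i, theta), obtained by normalizing
# pi_k * N(x_i; mu_k, sd_k) over the K components
e_step <- function(x, pi, mu, sd) {
  dens <- sapply(seq_along(pi), function(k) pi[k] * dnorm(x, mu[k], sd[k]))
  dens / rowSums(dens)   # each row of the n x K matrix now sums to 1
}

gamma0 <- e_step(x, theta0$pi, theta0$mu, theta0$sd)
head(round(gamma0, 3))
```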
M step:
\[ \pi_{k}^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{z_{i}}^{(t)}(k) \] \[ \boldsymbol\mu_{k}^{(t+1)} = \frac{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k) \mathbf{x}_{i} }{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k)} \]
\[ \Sigma_{k}^{(t+1)} = \frac{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k) (\mathbf{x}_{i} - \boldsymbol\mu_{k}^{(t+1)}) (\mathbf{x}_{i} - \boldsymbol\mu_{k}^{(t+1)})^{\top }}{\sum_{i=1}^{n}\gamma_{z_{i}}^{(t)}(k)} \]
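In the 1-D sketch these updates (with a standard deviation in place of \(\Sigma_{k}\)) and the full EM loop look roughly as follows; `m_step` and the loop structure are illustrative, not taken from any package:

```r
# M step: closed-form updates for pi, mu, sd given the responsibilities
m_step <- function(x, gamma) {
  nk <- colSums(gamma)                          # effective number of points per component
  pi <- nk / length(x)
  mu <- colSums(gamma * x) / nk
  sd <- sqrt(colSums(gamma * outer(x, mu, "-")^2) / nk)
  list(pi = pi, mu = mu, sd = sd)
}

# Alternate E and M steps; the observed-data log likelihood should be
# non-decreasing (this is the coordinate-ascent argument made below)
theta <- theta0
ll <- loglik(x, theta$pi, theta$mu, theta$sd)
for (t in 1:50) {
  gamma <- e_step(x, theta$pi, theta$mu, theta$sd)   # E step
  theta <- m_step(x, gamma)                          # M step
  ll <- c(ll, loglik(x, theta$pi, theta$mu, theta$sd))
}
all(diff(ll) > -1e-8)   # monotone increase, up to numerical error
theta                   # close to the simulating values (0.3, 0.7), (-2, 3), (1, 1.5)
```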
If \(\mathbf{z}\) were observed, then the MLE would amount to maximizing the quantity \[ l(\theta; x,z) = \log p(x,z|\theta) \] which is the complete log likelihood.
Since \(\mathbf{z}\) is not observed, the probability of the data \(X\) is a marginal probability \[ l(\theta; x) = \log p(x|\theta) = \log \sum_{z} p(x,z|\theta) \]
The complete log likelihood is a random quantity, because it depends on the unobserved \(\mathbf{z}\). Suppose we average over \(z\) to remove the randomness, using an ‘averaging distribution’ \(q(z|x)\). We define the expected complete log likelihood \[ \langle l(\theta; x,z)\rangle_{q} = \sum_{z} q(z|x) \log p(x,z|\theta) \] which is linear in the complete log likelihood. If \(q\) is chosen well, then perhaps the expected complete log likelihood will not be too far from the log likelihood.
An averaging distribution \(q(z|x)\) can be used to provide a lower bound on the log likelihood; the inequality below is Jensen's inequality applied to the concave \(\log\). \[ {\begin{align*} l(\theta;x) &= \log p(x|\theta) = \log \sum_{z} p(x,z|\theta) \\ &= \log \sum_{z} q(z|x) \frac{p(x,z|\theta)}{q(z|x)} \\ &\geq \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} = \mathcal{L}(q,\theta) \end{align*}} \] \(\mathcal{L}(q,\theta)\) is the auxiliary function. For an arbitrary distribution \(q(z|x)\), the auxiliary function \(\mathcal{L}(q,\theta)\) is a lower bound for the log likelihood.
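For the mixture sketch this lower bound can be evaluated directly. The helper `elbo` below represents \(q(z|x)\) as an \(n \times K\) matrix of probabilities (one row per observation) and assumes no entries are exactly zero; the name and representation are mine.

```r
# Auxiliary function (lower bound):
#   L(q, theta) = sum_i sum_k q[i, k] * ( log(pi_k * N(x_i; mu_k, sd_k)) - log q[i, k] )
elbo <- function(x, q, pi, mu, sd) {
  logjoint <- sapply(seq_along(pi),
                     function(k) log(pi[k]) + dnorm(x, mu[k], sd[k], log = TRUE))
  sum(q * (logjoint - log(q)))
}
```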
The EM algorithm is a coordinate ascent algorithm on the function \(\mathcal{L}(q,\theta)\).
E step: \(q^{(t+1)} = \arg \max_{q} \mathcal{L}(q,\theta^{(t)})\)
M step: \(\theta^{(t+1)} = \arg \max_{\theta} \mathcal{L}(q^{(t+1)},\theta)\)
The lower bound can be decomposed as \[ \begin{align*} \mathcal{L}(q,\theta) &= \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x) \log p(x,z|\theta) - \sum_{z} q(z|x) \log q(z|x) \\ &= \langle l(\theta; x,z)\rangle_{q} - \sum_{z} q(z|x) \log q(z|x) \end{align*} \] The second term (the entropy of \(q\)) is independent of \(\theta\). Thus, maximizing \(\mathcal{L}(q,\theta)\) with respect to \(\theta\) is equivalent to maximizing \(\langle l(\theta; x,z)\rangle_{q}\) with respect to \(\theta\).
Maximizing \(\mathcal{L}(q,\theta^{(t)})\) with respect to the averaging distribution \(q\), the maximum is attained at \(q^{(t+1)}(z|x) = p(z|x,\theta^{(t)})\): \[ \begin{align*} \mathcal{L}(p(z|x,\theta^{(t)}),\theta^{(t)}) &= \sum_{z} p(z|x, \theta^{(t)}) \log \frac{p(x,z|\theta^{(t)})}{p(z|x, \theta^{(t)})} \\ &= \sum_{z} p(z|x, \theta^{(t)}) \log p(x|\theta^{(t)})\\ &= \log p(x|\theta^{(t)}) = l(\theta^{(t)}; x) \end{align*} \] Since \(l(\theta^{(t)}; x)\) is an upper bound of \(\mathcal{L}(q,\theta^{(t)})\) for any \(q\), this shows that \(\mathcal{L}(q,\theta^{(t)})\) is maximized by setting \(q(z|x)\) equal to \(p(z|x,\theta^{(t)})\).
We increase the lower bound in the M step. In the E step, we close the gap between the log likelihood and the bound by setting \(q\) to the posterior of \(\mathbf{z}\), so that \(l(\theta^{(t)}; x) = \mathcal{L}(q^{(t+1)}, \theta^{(t)})\).
\[ \begin{align*} l(\theta; x) - \mathcal{L}(q, \theta) &= l(\theta; x) - \sum_{z} q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x)\log p(x|\theta) - \sum_{z} q(z|x) \log \frac{p(z|x,\theta)\,p(x|\theta)}{q(z|x)} \\ &= \sum_{z} q(z|x)\left[\log p(x|\theta) - \log \frac{p(z|x,\theta)}{q(z|x)} - \log p(x|\theta)\right]\\ &= \sum_{z} q(z|x) \log \frac{q(z|x)}{p(z|x,\theta)} = D_{KL}(q(z|x)||p(z|x,\theta)) \geq 0 \end{align*} \] where the second line uses the factorization \(p(x,z|\theta) = p(z|x,\theta)p(x|\theta)\). At \(\theta = \theta^{(t)}\), the KL divergence is uniquely minimized (driven to zero) by letting \(q(z|x) = p(z|x,\theta^{(t)})\).
Minimizing the KL divergence with respect to \(q\) is therefore equivalent to maximizing \(\mathcal{L}(q, \theta)\) with respect to \(q\).
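A quick numerical sanity check of this decomposition, using the 1-D sketch and the illustrative helpers `loglik`, `e_step`, and `elbo` defined above: for an arbitrary \(q\) the gap \(l(\theta; x) - \mathcal{L}(q,\theta)\) is positive, and it vanishes when \(q\) is the posterior.

```r
theta_chk <- theta0
ll_chk <- loglik(x, theta_chk$pi, theta_chk$mu, theta_chk$sd)

# An arbitrary q(z|x): the gap equals D_KL(q || p(z|x, theta)) >= 0
p1 <- runif(length(x))
q_arb <- cbind(p1, 1 - p1)
ll_chk - elbo(x, q_arb, theta_chk$pi, theta_chk$mu, theta_chk$sd)   # > 0

# q(z|x) set to the posterior: the gap closes (up to rounding)
q_post <- e_step(x, theta_chk$pi, theta_chk$mu, theta_chk$sd)
ll_chk - elbo(x, q_post, theta_chk$pi, theta_chk$mu, theta_chk$sd)  # ~ 0
```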
\[ \begin{align*} D_{KL}(\tilde{p}(x)||p(x|\theta)) &= \sum_{x} \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x|\theta)} = -\sum_{x} \tilde{p}(x)\log p(x|\theta) + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &\leq -\sum_{x} \tilde{p}(x) \mathcal{L}(q,\theta) + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &= -\sum_{x} \tilde{p}(x) \sum_{z}q(z|x)\log \frac{p(x,z|\theta)}{q(z|x)} + \sum_{x} \tilde{p}(x) \log \tilde{p}(x) \\ &= \sum_{x} \tilde{p}(x) \sum_{z}q(z|x)\log \frac{\tilde{p}(x) q(z|x)}{p(x,z|\theta)} \\ &= D_{KL}(\tilde{p}(x) q(z|x) || p(x,z|\theta)) \end{align*} \]
The KL divergence between the empirical distribution and the model (the quantity that we wish to minimize) is upper bounded by a “complete KL divergence”, a KL divergence between joint distributions on (x,z).
Minimizing the complete KL divergence with respect to q and \(\theta\) is equivalent to maximizing the auxiliary function \(\mathcal{L}(q, \theta)\) with respect to these variables.
Define \(D(q||\theta) = D_{KL}(\tilde{p}(x)q(z|x) || p(x,z|\theta))\).
E step: \(q^{(t+1)}(z|x) = \arg \min_{q} D(q || \theta^{(t)})\)
M step: \(\theta^{(t+1)} = \arg \min_{\theta} D(q^{(t+1)}||\theta)\)