Expectation Maximization

If a probabilistic model contains only observable variables, maximum likelihood estimation or Bayesian methods can be applied directly to estimate the model parameters. However, a probabilistic model may also contain unobservable variables, called latent variables: variables we cannot observe, but whose existence and influence on each random trial we know. In that case, direct maximization (e.g., via Lagrange multipliers) is hard to carry out because the log-likelihood contains a “log of sum” term.

Given observed samples \(\mathrm{X} = \{\x^{(1)}, \x^{(2)}, ..., \x^{(m)}\}\) (with corresponding unobservable latent variable samples \(\mathrm{Z}\)), MLE tries to find the best parameters \(\hat \theta\) such that \[ \hat \theta = \arg\max_{\theta}\log p(\mathrm{X};\theta) \]

The log-likelihood function is \[ \begin{aligned} l(\theta) &= \log p(\mathrm{X};\theta) = \sum_{\x \in \mathrm{X}} \log p(\x; \theta) \\ &= \sum_{\x \in \mathrm{X}} \log \sum_{\z} q(\z) \frac{p(\x, \z; \theta)}{q(\z)} = \sum_{\x \in \mathrm{X}} \log \E_{\z \sim q} \frac{p(\x, \z; \theta)}{q(\z)} \\ &\Downarrow_\text{by Jensen's Inequality} \\ &\ge \sum_{\x \in \mathrm{X}} \E_{\z \sim q} \log \frac{p(\x, \z; \theta)}{q(\z)} \triangleq B(q, \theta) \\ &\text{where $q$ is an arbitrary reference probability measure over $\z$} \end{aligned} \]
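The inequality above can be checked numerically. The snippet below is an illustrative sketch, not from the text: it assumes a tiny two-component mixture \(p(\x, \z) = \pi_z \, \mathcal{N}(\x; \mu_z, 1)\) with made-up values for \(\pi\) and \(\mu\), then verifies that \(B(q, \theta) \le l(\theta)\) for random choices of \(q\), and that the bound is tight when \(q\) is the posterior \(p(\z \mid \x)\).

```python
import numpy as np

# Hypothetical two-component 1-D Gaussian mixture (illustrative values):
# p(x, z) = pi_z * N(x; mu_z, 1), observed x = 0.5.
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
x = 0.5

npdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
joint = pi * npdf(x, mu)          # p(x, z) for z = 0, 1
log_px = np.log(joint.sum())      # the true log-likelihood log p(x; theta)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])                # an arbitrary measure over z
    B = np.sum(q * (np.log(joint) - np.log(q)))  # E_{z~q} log p(x,z)/q(z)
    assert B <= log_px + 1e-12                   # Jensen: B(q, theta) <= l(theta)

# Setting q(z) = p(z | x; theta) makes the bound tight.
q_post = joint / joint.sum()
B_tight = np.sum(q_post * (np.log(joint) - np.log(q_post)))
```

The gap between `log_px` and `B` is exactly the KL divergence from \(q\) to the posterior, which is why the bound closes when \(q\) equals the posterior.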

If we increase \(B(q, \theta)\), the lower bound of \(l(\theta)\) is lifted, which gives a gentle guarantee that \(l(\theta)\) may increase as well.

Expectation maximization lifts this bound iteratively. To begin, we choose a random initial estimate for \(\theta\), say \(\theta^0\). \(B(q, \theta)\) is a function of \((q, \theta)\), and we specifically set \(q(\z) = p(\z | \x; \theta^{t})\), where \(\theta^t\) is the estimate at the current iteration. \(B(q, \theta)\) then becomes \[ B(q, \theta) = \sum_{\x \in \mathrm{X}} \E_{\z \sim p(\cdot | \x; \theta^{t})} \log \frac{p(\x, \z; \theta)}{p(\z | \x; \theta^{t})} \] Because the term \(\log \frac{1}{p(\z | \x; \theta^{t})}\) does not depend on \(\theta\), the problem simplifies to maximizing \[ Q(\theta^t, \theta) \triangleq \sum_{\x \in \mathrm{X}} \E_{\z \sim p(\cdot | \x; \theta^{t})} \log p(\x, \z; \theta) \label{q-function} \] This step is called the expectation step because we are choosing a probability measure for the expectation term in \(B(q, \theta)\). The next step is the maximization step, where we fix \(\theta^t\) and maximize the \(Q\)-function in \(\eqref{q-function}\) w.r.t. \(\theta\), yielding the estimate for the next iteration: \[ \theta^{t+1} = \arg\max_{\theta} Q(\theta^t, \theta) \] Alternating these two steps constitutes the whole expectation maximization algorithm.
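The two steps can be sketched concretely for a model where both have closed forms. The function below is an illustrative implementation for a 1-D Gaussian mixture with unit variance held fixed for brevity; the name `em_gmm` and all parameter choices are assumptions for this sketch, not from the text.

```python
import numpy as np

def em_gmm(x, k=2, iters=100, seed=0):
    """Illustrative EM for a 1-D Gaussian mixture with fixed unit variance.

    Estimates the mixing weights pi and component means mu from data x.
    """
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)  # random initial theta^0
    for _ in range(iters):
        # E step: responsibilities r[i, z] = p(z | x_i; theta_t),
        # computed via Bayes' rule from the joint pi_z * N(x_i; mu_z, 1).
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: maximize Q(theta_t, theta); closed form for this model.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
    return pi, mu
```

On data drawn from two well-separated Gaussians, the recovered means land close to the true component means; the closed-form M step is what makes this model a standard showcase for EM.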

The reason we choose \(q(\z) = p(\z | \x; \theta^{t})\), which is conditioned on \(\theta^t\), is twofold. First, it injects dynamics into the algorithm: if we chose some static measure \(q\) unrelated to \(\theta^t\), the bound would never move and the algorithm would stop after the very first iteration. Second, this choice makes Jensen's inequality an equality at \(\theta = \theta^t\), i.e. \(B(q, \theta^t) = l(\theta^t)\), so maximizing \(B\) over \(\theta\) guarantees \(l(\theta^{t+1}) \ge l(\theta^t)\). Conveniently, in many latent models \(p(\z | \x; \theta^t)\) is also easy to compute from the joint \(p(\x, \z; \theta^t)\) via Bayes' rule.
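The monotone-improvement guarantee can be observed directly. The snippet below is a self-contained sketch, assuming a unit-variance two-component Gaussian mixture with made-up synthetic data (not from the text): it runs the E and M updates in a loop and asserts that the log-likelihood \(l(\theta^t)\) never decreases from one iteration to the next.

```python
import numpy as np

# Synthetic data from two well-separated Gaussians (illustrative).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

pi = np.array([0.5, 0.5])   # theta^0: arbitrary starting point
mu = np.array([0.0, 1.0])
prev = -np.inf
for _ in range(30):
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    ll = np.log((pi * dens).sum(axis=1)).sum()  # l(theta^t)
    assert ll >= prev - 1e-9                    # never decreases across iterations
    prev = ll
    # E step: q(z) = p(z | x; theta^t)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M step: theta^{t+1} = argmax Q(theta^t, theta)
    nk = r.sum(axis=0)
    pi, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
```

The assertion inside the loop is exactly the consequence of the tight bound: each M step maximizes a function that touches \(l(\theta)\) at \(\theta^t\) from below.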
