Expectation Maximization

If a probabilistic model contains only observable variables, maximum likelihood estimation or Bayesian methods can be applied directly to estimate the model parameters. However, a probabilistic model may also contain unobservable variables, called latent variables: variables we cannot observe, but whose existence and influence on each random trial we know. In that case, direct maximization (e.g., via Lagrange multipliers) is hard to carry out because the log-likelihood contains a “log of sum” term.

Given observed samples \(\mathrm{X} = \{\x^{(1)}, \x^{(2)}, ..., \x^{(m)}\}\) (with corresponding unobservable latent variable samples \(\mathrm{Z}\)), MLE tries to find the best parameters \(\hat \theta\) such that \[ \hat \theta = \arg\max_{\theta}\log p(\mathrm{X};\theta) \]

The log-likelihood function is \[ \begin{aligned} l(\theta) &= \log p(\mathrm{X};\theta) = \sum_{\x \in \mathrm{X}} \log p(\x; \theta) \\ &= \sum_{\x \in \mathrm{X}} \log \sum_{\z} q(\z) \frac{p(\x, \z; \theta)}{q(\z)} = \sum_{\x \in \mathrm{X}} \log \E_{\z \sim q} \frac{p(\x, \z; \theta)}{q(\z)} \\ &\Downarrow_\text{by Jensen's Inequality} \\ &\ge \sum_{\x \in \mathrm{X}} \E_{\z \sim q} \log \frac{p(\x, \z; \theta)}{q(\z)} \triangleq B(q, \theta) \\ &\text{where $q$ is an arbitrary reference probability measure over $\z$} \end{aligned} \]
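The inequality above can be checked numerically. The snippet below is an illustrative sketch, not from the text: it assumes a tiny two-component mixture \(p(\x, \z) = \pi_z \, \mathcal{N}(\x; \mu_z, 1)\) with made-up values for \(\pi\) and \(\mu\), then verifies that \(B(q, \theta) \le l(\theta)\) for random choices of \(q\), and that the bound is tight when \(q\) is the posterior \(p(\z \mid \x)\).

```python
import numpy as np

# Hypothetical two-component 1-D Gaussian mixture (illustrative values):
# p(x, z) = pi_z * N(x; mu_z, 1), observed x = 0.5.
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
x = 0.5

npdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
joint = pi * npdf(x, mu)          # p(x, z) for z = 0, 1
log_px = np.log(joint.sum())      # the true log-likelihood log p(x; theta)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])                # an arbitrary measure over z
    B = np.sum(q * (np.log(joint) - np.log(q)))  # E_{z~q} log p(x,z)/q(z)
    assert B <= log_px + 1e-12                   # Jensen: B(q, theta) <= l(theta)

# Setting q(z) = p(z | x; theta) makes the bound tight.
q_post = joint / joint.sum()
B_tight = np.sum(q_post * (np.log(joint) - np.log(q_post)))
```

The gap between `log_px` and `B` is exactly the KL divergence from \(q\) to the posterior, which is why the bound closes when \(q\) equals the posterior.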

If we increase \(B(q, \theta)\), the lower bound of \(l(\theta)\) is lifted, which gives a gentle guarantee that \(l(\theta)\) may increase as well.

Expectation maximization lifts this bound iteratively. To begin, we choose a random initial estimate for \(\theta\), say \(\theta^0\). \(B(q, \theta)\) is a function of \((q, \theta)\), and we specifically set \(q(\z) = p(\z | \x; \theta^{t})\), where \(\theta^t\) is the estimate at the current iteration. \(B(q, \theta)\) then becomes \[ B(q, \theta) = \sum_{\x \in \mathrm{X}} \E_{\z \sim p(\cdot | \x; \theta^{t})} \log \frac{p(\x, \z; \theta)}{p(\z | \x; \theta^{t})} \] Because the term \(\log \frac{1}{p(\z | \x; \theta^{t})}\) does not depend on \(\theta\), the problem simplifies to maximizing \[ Q(\theta^t, \theta) \triangleq \sum_{\x \in \mathrm{X}} \E_{\z \sim p(\cdot | \x; \theta^{t})} \log p(\x, \z; \theta) \label{q-function} \] This step is called the expectation step because we are choosing a probability measure for the expectation term in \(B(q, \theta)\). The next step is the maximization step, where we fix \(\theta^t\) and maximize the \(Q\)-function in \(\eqref{q-function}\) w.r.t. \(\theta\), yielding the estimate for the next iteration: \[ \theta^{t+1} = \arg\max_{\theta} Q(\theta^t, \theta) \] Alternating these two steps constitutes the whole expectation maximization algorithm.
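The two steps can be sketched concretely for a model where both have closed forms. The function below is an illustrative implementation for a 1-D Gaussian mixture with unit variance held fixed for brevity; the name `em_gmm` and all parameter choices are assumptions for this sketch, not from the text.

```python
import numpy as np

def em_gmm(x, k=2, iters=100, seed=0):
    """Illustrative EM for a 1-D Gaussian mixture with fixed unit variance.

    Estimates the mixing weights pi and component means mu from data x.
    """
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)  # random initial theta^0
    for _ in range(iters):
        # E step: responsibilities r[i, z] = p(z | x_i; theta_t),
        # computed via Bayes' rule from the joint pi_z * N(x_i; mu_z, 1).
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: maximize Q(theta_t, theta); closed form for this model.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
    return pi, mu
```

On data drawn from two well-separated Gaussians, the recovered means land close to the true component means; the closed-form M step is what makes this model a standard showcase for EM.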

The reason we choose \(q(\z) = p(\z | \x; \theta^{t})\), which is conditioned on \(\theta^t\), is twofold. First, it injects dynamics into the algorithm: if we chose some static measure \(q\) unrelated to \(\theta^t\), the bound would never move and the algorithm would stop after the very first iteration. Second, this choice makes Jensen's inequality an equality at \(\theta = \theta^t\), i.e. \(B(q, \theta^t) = l(\theta^t)\), so maximizing \(B\) over \(\theta\) guarantees \(l(\theta^{t+1}) \ge l(\theta^t)\). Conveniently, in many latent models \(p(\z | \x; \theta^t)\) is also easy to compute from the joint \(p(\x, \z; \theta^t)\) via Bayes' rule.
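The monotone-improvement guarantee can be observed directly. The snippet below is a self-contained sketch, assuming a unit-variance two-component Gaussian mixture with made-up synthetic data (not from the text): it runs the E and M updates in a loop and asserts that the log-likelihood \(l(\theta^t)\) never decreases from one iteration to the next.

```python
import numpy as np

# Synthetic data from two well-separated Gaussians (illustrative).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

pi = np.array([0.5, 0.5])   # theta^0: arbitrary starting point
mu = np.array([0.0, 1.0])
prev = -np.inf
for _ in range(30):
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    ll = np.log((pi * dens).sum(axis=1)).sum()  # l(theta^t)
    assert ll >= prev - 1e-9                    # never decreases across iterations
    prev = ll
    # E step: q(z) = p(z | x; theta^t)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M step: theta^{t+1} = argmax Q(theta^t, theta)
    nk = r.sum(axis=0)
    pi, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
```

The assertion inside the loop is exactly the consequence of the tight bound: each M step maximizes a function that touches \(l(\theta)\) at \(\theta^t\) from below.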
