KL-divergence
KL-divergence
KL-divergence, denoted as \(D_{KL}(p\|q)\), is statistical distance, measuring how the probability distribution \(q\) is different from the reference probability distribution \(p\), both defined on \(X \in \mathcal{X}\).
In information theory, it measures the relative entropy from \(q\) to \(p\), which is the average number of extra bits required to represent a message with \(q\) instead of \(p\).
In discrete form, \[ D_\text{KL}(p \| q) = -\sum_{x \in \mathcal{X}} p(x)\log \frac{q(x)}{p(x)} = \sum_{x \in \mathcal{X}} p(x)\log \frac{p(x)}{q(x)} \] In continuous form, \[ D_\text{KL}(p \| q) = -\int_{x \in \mathcal{X}} p(x) \log \frac{q(x)}{p(x)}dx = \int_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}dx \]
Variational Lower-bound
One property of KL-divergence is \[ D_\text{KL}(p || q) = \sup_{T: \Omega \to \R} \E_{p} [T] - \log (\E_q[e^T]) \]
The proof is as follows. Given a distribution \(q\) and a function \(T\), construct the Gibbs distribution \(g\) such that \(g(x) = \frac{q(x)e^{T(x)}}{Z}\) where \(Z = \E_{q(x)} e^{T(x)}\). Then \[ \begin{aligned} \E&_{p(x)} T(x) - \log Z = \E_{p(x)} [T(x) - \log Z] \\ &= \E_{p(x)} [\log e^{T(x)} - \log \E_{q(x)} e^{T(x)}] \\ &= \E_{p(x)} \log \frac{e^{T(x)}} {\E_{q(x)} e^{T(x)}} \\ &= \E_{p(x)} \log \frac{q(x) e^{T(x)}} {q(x) \E_{q(x)} e^{T(x)}} \\ &= \E_{p(x)} \log \frac{g(x)} {q(x)} \\ \end{aligned} \] Finally KL-divergence minus above gives \[ \begin{aligned} &D_\text{KL}(p || q) - (\E_{p(x)} T(x) - \log Z) \\ &= \E_{p(x)} \log \frac{p(x)} {q(x)} - \E_{p(x)} \log \frac{g(x)} {q(x)} \\ &= \E_{p(x)} \log \frac{p(x)} {g(x)} \triangleq D_\text{KL}(p || g) \ge 0 \end{aligned} \]
Gaussian Case
Suppose \(X\) and \(Y\) are random variables, both of some \(n\)-dimensional Gaussian distribution. Then the KL-divergence between them can be formulated as: \[ \begin{aligned} D_\text{KL}(p_X || p_Y) &= \int p_X(x) \log \frac{p_X(x)} {p_Y(x)} \d x = \int p_X(x) \log [ \sqrt \frac{|\Sigma_X|}{|\Sigma_Y|} \frac { e^{-\frac 1 2 (x-\mu_X)^T \Sigma_X^{-1} (x-\mu_X)} } { e^{-\frac 1 2 (x-\mu_Y)^T \Sigma_Y^{-1} (x-\mu_Y)} } ] \d x \\ &= \frac 1 2 \log \frac{\Sigma_X}{\Sigma_Y} - \frac 1 2 \int p_X(x) [ (x-\mu_X)^T \Sigma_X^{-1} (x-\mu_X) + (x-\mu_Y)^T \Sigma_Y^{-1} (x-\mu_Y) ] \d x \\ &= \frac 1 2 \log \frac{\Sigma_X}{\Sigma_Y} - \frac 1 2 \int p_X(x) (x-\mu_X)^T \Sigma_X^{-1} (x-\mu_X) \d x \\ &\quad -\frac 1 2 \int p_X(x) (x-\mu_Y)^T \Sigma_Y^{-1} (x-\mu_Y) \d x \\ &= \frac 1 2 \log \frac{\Sigma_X}{\Sigma_Y} - \frac 1 2 \int p_X(x) x^T \Sigma_X^{-1} x \d x + \frac 1 2 \mu_X^T \Sigma_X^{-1} \mu_X \\ &\quad - \frac 1 2 \int p_X(x) x^T \Sigma_Y^{-1} x \d x + \frac 1 2 \mu_Y^T \Sigma_Y^{-1} \mu_Y \d x \\ &= \frac 1 2 \log \frac{\Sigma_X}{\Sigma_Y} - \frac 1 2 \tr(\Sigma_X^{-1} \Sigma_X) - \frac 1 2 \mu_X \Sigma_X^{-1} \mu_X + \frac 1 2 \mu_X^T \Sigma_X^{-1} \mu_X \\ &\quad - \frac 1 2 \tr(\Sigma_Y^{-1} \Sigma_X) - \frac 1 2 \mu_X \Sigma_Y^{-1} \mu_X + \frac 1 2 \mu_Y^T \Sigma_Y^{-1} \mu_Y \\ &= \frac 1 2 [\log \frac{\Sigma_X}{\Sigma_Y} - n - \tr(\Sigma_X^{-1} \Sigma_Y) + \mu_Y^T \Sigma_Y^{-1} \mu_Y - \mu_X \Sigma_Y^{-1} \mu_X] \\ &= \frac 1 2 [\log \frac{\Sigma_X}{\Sigma_Y} - n - \tr(\Sigma_X^{-1} \Sigma_Y)] \\ &\quad + \frac 1 2 [\mu_Y^T \Sigma_Y^{-1} \mu_Y - \mu^T_X \Sigma_Y^{-1} \mu_X + \mu_X^T \Sigma_Y^{-1} \mu_Y - \mu_Y^T \Sigma_Y^{-1} \mu_X] \\ &= \frac 1 2 [\log \frac{\Sigma_X}{\Sigma_Y} - n - \tr(\Sigma_X^{-1} \Sigma_Y) + (\mu_Y^T - \mu_X^T) \Sigma_Y^{-1} (\mu_Y - \mu_X)] \\ \end{aligned} \]