Overview

Both Cross Entropy and KL-divergence describe the relationship between two distributions.

Both conditional entropy and mutual information, as well as joint entropy, describe the relationship between two random variables.

Since entropy is the basis of all the other concepts in this section and is most naturally defined for discrete random variables, these quantities are most meaningful in the discrete case, though they usually extend to the continuous case in a straightforward way.

KL-divergence, Entropy and Cross Entropy

By definition,
\[
\begin{align}
D_\text{KL}(p||q) &= -\mathrm{E}_{x \sim p} \log \frac{q(x)}{p(x)} \\
&= -\mathrm{E}_{x \sim p} \log q(x) - (-\mathrm{E}_{x \sim p} \log p(x)) \\
&= H(p||q) - H(p)
\end{align}
\]
where \(H(p||q)\) denotes the cross entropy and \(H(p)\) the entropy. By Jensen's inequality applied to the convex function \(-\log\),
\[
\begin{align}
D_\text{KL}(p||q) &= \mathrm{E}_{x \sim p} \left[-\log \frac{q(x)}{p(x)}\right] \\
&\ge -\log\left(\mathrm{E}_{x \sim p} \frac{q(x)}{p(x)}\right) \\
&= -\log \sum_x p(x) \frac{q(x)}{p(x)} \\
&= -\log \sum_x q(x) = -\log 1 = 0
\end{align}
\]
That is,
\[
H(p||q) \ge H(p)
\]
with equality if and only if \(p = q\).
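As a quick numeric sanity check, here is a minimal sketch (assuming NumPy and two made-up distributions \(p\) and \(q\) over three outcomes) that verifies \(D_\text{KL}(p||q) = H(p||q) - H(p) \ge 0\):

```python
# Minimal numeric check of D_KL(p || q) = H(p || q) - H(p) >= 0
# for two hypothetical discrete distributions p and q on the same support.
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution p
q = np.array([0.4, 0.4, 0.2])   # approximating distribution q

entropy_p     = -np.sum(p * np.log(p))   # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p||q), cross entropy
kl            =  np.sum(p * np.log(p / q))  # D_KL(p||q)

print(f"H(p)       = {entropy_p:.6f}")
print(f"H(p||q)    = {cross_entropy:.6f}")
print(f"D_KL(p||q) = {kl:.6f}")
assert np.isclose(kl, cross_entropy - entropy_p)  # the identity above
assert kl >= 0                                    # Gibbs' inequality
```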

KL-divergence and Mutual Information

For two discrete random variables \(X\) and \(Y\),
\[
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
&= \sum_{x,y} p(x) p(y|x) \log \frac{p(y|x)}{p(y)} \\
&= \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y)} \\
&= \sum_{x} p(x) \, D_\text{KL}(p(y|x) \,||\, p(y)) \\
&= \mathrm{E}_{x} \, D_\text{KL}(p(y|x) \,||\, p(y)) \\
&= \mathrm{E}_{y} \, D_\text{KL}(p(x|y) \,||\, p(x))
\end{aligned}
\]
The last line follows by symmetry, since \(p(x,y) = p(y)p(x|y)\) as well.
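A minimal sketch of this identity, computing \(I(X;Y)\) two ways on a hypothetical \(2 \times 2\) joint distribution (directly from the definition, and as the expected KL divergence over \(x\)):

```python
# Mutual information computed two ways on a hypothetical 2x2 joint
# distribution p(x, y): directly, and as E_x[ D_KL( p(y|x) || p(y) ) ].
import numpy as np

p_xy = np.array([[0.30, 0.10],   # rows index x, columns index y
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)           # marginal p(x)
p_y = p_xy.sum(axis=0)           # marginal p(y)

# Direct definition: sum_{x,y} p(x,y) log p(x,y) / (p(x) p(y))
mi_direct = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# As an expected KL divergence over x
p_y_given_x = p_xy / p_x[:, None]                           # p(y|x), one row per x
kl_per_x = np.sum(p_y_given_x * np.log(p_y_given_x / p_y), axis=1)
mi_expected_kl = np.sum(p_x * kl_per_x)                     # E_x D_KL(p(y|x) || p(y))

print(f"I(X;Y) directly       = {mi_direct:.6f}")
print(f"I(X;Y) as expected KL = {mi_expected_kl:.6f}")
assert np.isclose(mi_direct, mi_expected_kl)
```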

Joint Entropy and Conditional Entropy

Joint entropy is just the entropy of the pair \((X, Y)\) viewed as a single random variable, with that variable broken into two components.
\[
\begin{aligned}
H(X, Y) &= -\sum_{x,y} p(x,y) \log p(x,y) \\
&= -\sum_{x,y} p(x,y) \log p(y|x) p(x) \\
&= -\sum_{x,y} p(x,y) \log p(y|x) - \sum_{x,y} p(x,y) \log p(x) \\
&= H(Y|X) + H(X)
\end{aligned}
\]
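The chain rule \(H(X, Y) = H(Y|X) + H(X)\) can be checked numerically with the same kind of hypothetical \(2 \times 2\) joint distribution as above:

```python
# Minimal sketch verifying the chain rule H(X, Y) = H(Y|X) + H(X)
# on a hypothetical 2x2 joint distribution.
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]      # conditional p(y|x)

h_joint     = -np.sum(p_xy * np.log(p_xy))          # H(X, Y)
h_x         = -np.sum(p_x * np.log(p_x))            # H(X)
h_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))   # H(Y|X)

print(f"H(X,Y)        = {h_joint:.6f}")
print(f"H(X) + H(Y|X) = {h_x + h_y_given_x:.6f}")
assert np.isclose(h_joint, h_x + h_y_given_x)
```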
