Overview
Both cross entropy and KL-divergence describe the relationship between two different probability distributions over the same event space.
Both conditional entropy and mutual information, as well as joint entropy, describe the relationship between two different random variables.
Since entropy is the basis of all the other concepts in this section and is most naturally defined for discrete random variables, these quantities are most meaningful in the discrete case, though they usually extend straightforwardly to the continuous case.
KL-divergence, Entropy and Cross Entropy
By definition, \[ \begin{aligned} D_\text{KL}(p||q) &= -\mathrm{E}_{x \sim p} \log \frac{q(x)}{p(x)} \\ &= -\mathrm{E}_{x \sim p} \log q(x) - [-\mathrm{E}_{x \sim p} \log p(x)] \\ &= H(p||q) - H(p), \end{aligned} \] where \(H(p||q)\) denotes the cross entropy and \(H(p)\) the entropy. By the convexity of \(-\log\) and Jensen's inequality, \[ \begin{aligned} D_\text{KL}(p||q) &= \mathrm{E}_{x \sim p} \Big[-\log \frac{q(x)}{p(x)}\Big] \\ &\ge -\log\Big(\mathrm{E}_{x \sim p} \frac{q(x)}{p(x)}\Big) \\ &= -\log \sum_x q(x) = -\log 1 = 0. \end{aligned} \] That is, \[ H(p||q) \ge H(p). \] The equality holds if and only if \(p = q\).
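As a quick numerical check, here is a minimal sketch, assuming NumPy and strictly positive probabilities over a small finite event space (the distributions `p` and `q` are made-up examples), that computes entropy, cross entropy, and KL-divergence and confirms \(D_\text{KL}(p||q) = H(p||q) - H(p) \ge 0\):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross entropy H(p||q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL-divergence D_KL(p||q) = sum_x p(x) log(p(x)/q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two hypothetical distributions over the same 3-element event space.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))                # non-negative
print(cross_entropy(p, q) - entropy(p))   # equals D_KL(p||q)
print(cross_entropy(p, q) >= entropy(p))  # True
```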
KL-divergence and Mutual Information
For discrete random variables \(X\) and \(Y\) with joint distribution \(p(x, y)\), \[ \begin{aligned} I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\ &= \sum_{x,y} p(x) p(y|x) \log \frac{p(y|x)}{p(y)} \\ &= \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y)} \\ &= \sum_{x} p(x)\, D_\text{KL}[p(y|x) \,||\, p(y)] \\ &= \mathrm{E}_{x}\, D_\text{KL}[p(y|x) \,||\, p(y)] \\ &= \mathrm{E}_{y}\, D_\text{KL}[p(x|y) \,||\, p(x)], \end{aligned} \] where the last line follows from the symmetry of \(I(X;Y)\) in \(X\) and \(Y\).
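This identity can also be checked numerically. Below is a small sketch, assuming NumPy and a made-up joint distribution table `p_xy` with strictly positive entries, that computes \(I(X;Y)\) both from the definition and as \(\mathrm{E}_{x}\, D_\text{KL}[p(y|x) \,||\, p(y)]\):

```python
import numpy as np

def kl(p, q):
    """D_KL(p||q) for discrete distributions with matching support."""
    return np.sum(p * np.log(p / q))

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# Mutual information directly from the definition.
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Mutual information as E_x[ D_KL( p(y|x) || p(y) ) ].
p_y_given_x = p_xy / p_x[:, None]
mi_kl = np.sum(p_x * np.array([kl(p_y_given_x[i], p_y) for i in range(len(p_x))]))

print(mi, mi_kl)         # the two values agree
```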
Joint Entropy and Conditional Entropy
Joint entropy is just entropy, with the random variable broken into two components \(X\) and \(Y\). \[ \begin{aligned} H(X, Y) &= -\sum_{x,y} p(x,y) \log p(x,y) \\ &= -\sum_{x,y} p(x,y) \log \big[ p(y|x)\, p(x) \big] \\ &= -\sum_{x,y} p(x,y) \log p(y|x) - \sum_{x,y} p(x,y) \log p(x) \\ &= H(Y|X) + H(X), \end{aligned} \] since summing \(p(x,y)\) over \(y\) in the last term gives \(-\sum_x p(x) \log p(x) = H(X)\). This is the chain rule for entropy.
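A minimal sketch of the chain rule \(H(X,Y) = H(Y|X) + H(X)\), again assuming NumPy and a made-up joint distribution with strictly positive entries:

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]      # conditional p(y|x)

h_xy = -np.sum(p_xy * np.log(p_xy))                # H(X, Y)
h_x = -np.sum(p_x * np.log(p_x))                   # H(X)
h_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))  # H(Y|X)

print(h_xy, h_y_given_x + h_x)         # chain rule: the two agree
```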