Jensen-Shannon Divergence

In probability theory and statistics, the Jensen-Shannon divergence (JSD) is another way of measuring the distance between two probability distributions. It is built on the KL-divergence, with some notable differences. The KL-divergence is a poor measure of distance between distributions: first, it is not symmetric; second, it is not bounded from above. The Jensen-Shannon divergence overcomes both of these problems.
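
To see both issues concretely, here is a minimal NumPy sketch (the distributions and the small kl helper are illustrative choices, not anything prescribed above): swapping the arguments changes the value, and the divergence grows without bound as \(q\) places vanishing mass where \(p\) does not.

```python
import numpy as np

def kl(p, q):
    """Discrete KL-divergence D_KL(p || q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

# Asymmetry: D_KL(p || q) != D_KL(q || p)
print(kl(p, q), kl(q, p))   # ~0.368 vs ~0.511

# Unboundedness: shrink the mass q puts where p has weight 0.1
for eps in [1e-2, 1e-4, 1e-8]:
    q_eps = np.array([1 - eps, eps])
    print(eps, kl(p, q_eps))  # grows without bound as eps -> 0
```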

Given two distributions \(p\) and \(q\) defined on the same sample space, the Jensen-Shannon divergence is defined as
\[ \newcommand{\JSD}{\mathop{\text{JSD}}} \JSD(p;q) \coloneq \frac{1}{2} \big( D_\text{KL}(p || m) + D_\text{KL}(q || m) \big), \quad \text{where } m = \frac{p+q}{2}. \]
Since the JSD is a sum of two KL-divergences, it is non-negative by the non-negativity of KL-divergence, and it equals \(0\) if and only if \(p = q\). On the other hand,
\[ \begin{aligned} \JSD(p;q) &= \frac{1}{2} \Big( \E_{x \sim p} \log \frac{p(x)}{(p(x)+q(x))/2} + \E_{x \sim q} \log \frac{q(x)}{(p(x)+q(x))/2} \Big) \\ &= \frac{1}{2} \Big( \E_{x \sim p} \log \frac{2}{1 + e^{\ln \frac{q(x)}{p(x)}}} + \E_{x \sim q} \log \frac{2}{1 + e^{\ln \frac{p(x)}{q(x)}}} \Big). \end{aligned} \]
By the concavity of \(f(x) = \log \frac{2}{1 + e^x}\) and Jensen's inequality,
\[ \begin{aligned} \JSD(p;q) &= \frac{1}{2} \Big( \E_{x \sim p} \log \frac{2}{1 + e^{\ln \frac{q(x)}{p(x)}}} + \E_{x \sim q} \log \frac{2}{1 + e^{\ln \frac{p(x)}{q(x)}}} \Big) \\ &\le \frac{1}{2} \Big( \log \frac{2}{1 + e^{\E_{x \sim p} \ln \frac{q(x)}{p(x)}}} + \log \frac{2}{1 + e^{\E_{x \sim q} \ln \frac{p(x)}{q(x)}}} \Big) \\ &\le \frac{1}{2} \cdot 2 \log \frac{2}{1 + e^{\frac{1}{2} ( \E_{x \sim p} \ln \frac{q(x)}{p(x)} + \E_{x \sim q} \ln \frac{p(x)}{q(x)} )}} \\ &= \log \frac{2}{1 + e^{-\frac{1}{2} ( D_\text{KL}(p||q) + D_\text{KL}(q||p) )}}, \end{aligned} \]
where the first inequality applies Jensen's inequality inside each expectation and the second applies concavity again to the average of the two terms. This upper bound is attributed to Crooks. Since \(e^{-x} > 0\), the right-hand side is always less than \(\log 2\), approaching it as the KL-divergences go to infinity; hence \(\JSD(p;q) \le \log 2\). The term \(\newcommand{\J}{\mathop{\text{J}}} \J(p;q) \coloneq \frac{1}{2} ( D_\text{KL}(p||q) + D_\text{KL}(q||p) )\) is also known as the Jeffreys divergence (some references omit the coefficient \(\frac{1}{2}\)). More precisely, the upper bound can be rewritten as (in this note)

\[ \JSD(p;q) \le \min \Big( \frac{1}{4} \J(p;q),\ \log \frac{2}{1 + e^{-\J(p;q)}} \Big) \]
A lower bound in terms of the Jeffreys divergence can also be derived (in this note):

\[ \JSD(p;q) \ge \frac{1}{4} \ln(1 + 2\J(p;q)) \]
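As a sanity check, the sketch below (plain NumPy; the helper names and the example distribution pairs are my own, purely for illustration) computes the JSD, the Jeffreys divergence \(\J\) as defined above, and the combined upper bound for a few discrete pairs.

```python
import numpy as np

def kl(p, q):
    """Discrete KL-divergence D_KL(p || q) in nats."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """JSD(p; q) = (KL(p||m) + KL(q||m)) / 2 with m = (p + q) / 2."""
    m = (p + q) / 2
    return 0.5 * (kl(p, m) + kl(q, m))

def jeffreys(p, q):
    """Jeffreys divergence J(p; q) = (KL(p||q) + KL(q||p)) / 2, as defined in this note."""
    return 0.5 * (kl(p, q) + kl(q, p))

pairs = [
    (np.array([0.6, 0.4]), np.array([0.4, 0.6])),
    (np.array([0.9, 0.1]), np.array([0.1, 0.9])),
    (np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])),
]

for p, q in pairs:
    d, j = jsd(p, q), jeffreys(p, q)
    upper = min(j / 4, np.log(2 / (1 + np.exp(-j))))  # combined upper bound above
    print(f"JSD={d:.4f}  J={j:.4f}  upper bound={upper:.4f}  log(2)={np.log(2):.4f}")
    assert d <= upper <= np.log(2) + 1e-12            # holds for these example pairs
```

For these pairs the linear term \(\frac{1}{4}\J(p;q)\) happens to be the tighter of the two; for far-apart distributions (large \(\J\)) the \(\log \frac{2}{1 + e^{-\J(p;q)}}\) term takes over, saturating at \(\log 2\).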

Remarkably, the square root of the JSD between two distributions satisfies the metric axioms; it is often called the Jensen-Shannon distance.
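
As a quick illustration (not a proof), the snippet below checks the triangle inequality for \(\sqrt{\JSD}\) on three example distributions. It assumes SciPy (version 1.2 or later) is installed; its scipy.spatial.distance.jensenshannon returns this square root directly, computed with the natural logarithm by default.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # returns sqrt(JSD)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.6, 0.3])
r = np.array([0.3, 0.3, 0.4])

d_pq = jensenshannon(p, q)
d_pr = jensenshannon(p, r)
d_rq = jensenshannon(r, q)

# Triangle inequality for the Jensen-Shannon distance sqrt(JSD)
print(d_pq, d_pr + d_rq)
assert d_pq <= d_pr + d_rq
```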

Relation with Entropy

By definition,
\[ \begin{aligned} \JSD(p;q) &= \frac{1}{2} \Big( D_\text{KL}(p || \tfrac{p+q}{2}) + D_\text{KL}(q || \tfrac{p+q}{2}) \Big) \\ &= \frac{1}{2} \big[ H(p || \tfrac{p+q}{2}) - H(p) + H(q || \tfrac{p+q}{2}) - H(q) \big] \\ &= -\frac{1}{2} \Big[ \E_{x \sim p} \log \frac{p(x)+q(x)}{2} + \E_{x \sim q} \log \frac{p(x)+q(x)}{2} \Big] - \frac{1}{2} \big[ H(p) + H(q) \big] \\ &= -\E_{x \sim \frac{p+q}{2}} \log \frac{p(x)+q(x)}{2} - \frac{1}{2} \big[ H(p) + H(q) \big] \\ &= H\big(\tfrac{p+q}{2}\big) - \frac{1}{2} \big[ H(p) + H(q) \big], \end{aligned} \]
where \(H(p || m) \coloneq -\E_{x \sim p} \log m(x)\) denotes the cross-entropy. In other words, the JSD is the entropy of the mixture \(\frac{p+q}{2}\) minus the average entropy of \(p\) and \(q\).
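
A short numerical check of this identity (a sketch assuming SciPy is available; scipy.stats.entropy returns the Shannon entropy \(H(p)\) for one argument and the KL-divergence for two, both in nats by default, and the example distributions here are arbitrary):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p) = H(p); entropy(p, m) = D_KL(p || m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.3, 0.6])
m = (p + q) / 2

jsd_from_kl = 0.5 * (entropy(p, m) + entropy(q, m))              # definition of the JSD
jsd_from_entropy = entropy(m) - 0.5 * (entropy(p) + entropy(q))  # entropy form derived above

print(jsd_from_kl, jsd_from_entropy)
assert np.isclose(jsd_from_kl, jsd_from_entropy)
```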
