\(f\)-divergence
\(f\)-divergence can be treated as a generalization of the KL-divergence. For continuous random variables, it is defined as
\[ D_f (p||q) = \int f(\frac{p(x)}{q(x)}) q(x)\ \d x \]
where \(f\) must be a convex function with \(f(1) = 0\). These two constraints guarantee that
\[ \begin{gather} D_f (p||q) = 0 \text{ when $p=q$} \\ \forall p, q,\ D_f (p||q) \ge 0 \end{gather} \]
The first property follows directly from \(f(1) = 0\). The second follows from Jensen's inequality, since \(f\) is convex and \(q\) is a probability density:
\[ D_f (p||q) = \int f(\frac{p(x)}{q(x)}) q(x)\ \d x \ge f(\int \frac{p(x)}{q(x)} q(x)\ \d x) = f(1) = 0 \]
When \(f(x) = x\log x\), the \(f\)-divergence becomes the KL-divergence (verified below). The log sum inequality can be derived with the same convexity argument, taking \(f(x) = x\log x\), \(a = \sum_{i=1}^n a_i\), and \(b = \sum_{i=1}^n b_i\):
\[ \begin{aligned} \sum_{i=1}^n a_i \log \frac{a_i}{b_i} &= \sum_{i=1}^n b_i \underbrace{\frac{a_i}{b_i} \log \frac{a_i}{b_i}}_{f(\frac{a_i}{b_i})} \\ &= b \sum_{i=1}^n \frac{b_i}{b} f(\frac{a_i}{b_i}) \\ &\ge b f(\sum_{i=1}^n \frac{b_i}{b} \frac{a_i}{b_i}) \\ &= b f(\frac{a}{b}) \\ &= a \log \frac{a}{b} \end{aligned} \]
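To verify the KL special case mentioned above, substitute \(f(x) = x\log x\) into the definition:
\[ D_f (p||q) = \int \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)}\, q(x)\ \d x = \int p(x) \log \frac{p(x)}{q(x)}\ \d x = D_{KL} (p||q) \]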
Variational \(f\)-divergence
When the densities \(p\) and \(q\) have no closed-form expression (for example, only samples from them are available), the \(f\)-divergence is difficult to compute directly. Therefore in practice it is estimated with a variational expression:
\[ D_f (p||q) = \sup_{T:\mathcal X \to \R} \{ \E_p[T(x)] - \E_q[f^* \circ T(x)] \} \]
where \(f^*(t) = \sup_u \{ut - f(u)\}\) is the convex conjugate of \(f\). Since \(f\) is convex (and lower semicontinuous), \(f^{**} = f\), and the derivation is as follows:
\[ \begin{aligned} D_f (p||q) &= \int f(\frac{p(x)}{q(x)}) q(x)\ \d x \\ &= \int f^{**}(\frac{p(x)}{q(x)}) q(x)\ \d x \\ &= \int \sup_t [\frac{p(x)}{q(x)} t - f^*(t)] q(x)\ \d x \\ &= \int \sup_t[p(x) t - f^*(t) q(x)]\ \d x \\ &\Downarrow_{T(x) = \arg \sup_t[p(x) t - f^*(t) q(x)]} \\ &= \sup_{T:\mathcal X \to \R} \int [p(x) T(x) - f^*(T(x)) q(x)]\ \d x \\ &= \sup_{T:\mathcal X \to \R} \{ \E_p[T(x)] - \E_q[f^* \circ T(x)] \} \end{aligned} \]
Restricting \(T\) to a parametric family instead of all measurable functions turns this equality into a lower bound, which can be maximized over the parameters using samples from \(p\) and \(q\).
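As a concrete example, for the KL case \(f(u) = u\log u\) the convex conjugate has a closed form. Setting the derivative of \(ut - u\log u\) with respect to \(u\) to zero gives \(t - \log u - 1 = 0\), i.e. \(u = e^{t-1}\), so
\[ f^*(t) = e^{t-1} t - e^{t-1}(t-1) = e^{t-1} \]
and the variational expression becomes
\[ D_{KL} (p||q) = \sup_{T:\mathcal X \to \R} \{ \E_p[T(x)] - \E_q[e^{T(x)-1}] \} \]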