Bounding Mutual Information
\(I_\text{BA}\)
The very basic bound on the Mutual Information is based on the non-negativity of KL-divergence. \[ \begin{aligned} I(X;Y) = &\E_{p(x,y)} \log \frac{p(x|y)}{p(x)} \\ = &\E_{p(x,y)} \log \frac{q(x|y)p(x|y)}{p(x)q(x|y)} \\ = &\E_{p(x,y)} \log \frac{q(x|y)}{p(x)} + \\ &\underbrace{\E_{p(x,y)} \log \frac{p(x|y)}{q(x|y)}}_{\E_{p(y)} D_{KL}(p(x|y) || q(x|y)} \\ \ge &\E_{p(x,y)} \log \frac{q(x|y)}{p(x)} \\ = &\E_{p(x,y)} \log q(x|y) + H(X) \triangleq I_\text{BA} \end{aligned} \] This bound is not usually tractable since \(p(x)\) and thus \(H(X)\) has no closed-form expression.
This bound is tight when \(q(x|y) = p(x|y)\).
\(I_\text{UBA}\)
When \(q(x|y)\) is replaced with an unnormalized model, that is \[ q(x|y) = \frac{p(x)}{Z(y)} e^{f(x,y)}, \text{where } Z(y) = \E_{p(x)} e^{f(x,y)} \] Substituting this back to the \(I_\text{BA}\) will give \[ \begin{aligned} I(X;Y) &\ge \E_{p(x,y)} \log \frac{p(x) e^{f(x,y)}}{p(x) Z(y)} \\ &= \E_{p(x,y)} f(x,y) - \E_{p(y)} \log Z(y) \triangleq I_\text{UBA} \end{aligned} \]
Note that by scaling \(q(x|y)\) with \(p(x)\), the \(p(x)\) term is canceled.
This bound is tight when \(q(x|y) = \frac{p(x)}{Z(y)} e^{f(x,y)} = p(x|y)\). That is \[ \begin{aligned} e^{f(x,y)} &= \frac{p(x, y)}{p(x)p(y)} Z(y) \\ e^{f(x,y)} &= \frac{Z(y)}{p(y)} p(y|x) \\ f(x,y) &= \ln p(y|x) + \underbrace{\ln \frac{Z(y)}{p(y)}}_{c(y)} \end{aligned} \]
\(I_\text{DV}\)
By further applying Jensen’s inequality to the \(\E_{p(y)} Z(y)\) term in \(I_\text{UBA}\), we get \[ \begin{aligned} I(X;Y) &\ge \E_{p(x,y)} f(x,y) - \E_{p(y)} \log Z(y) \\ &\ge \E_{p(x,y)} f(x,y) - \log\E_{p(y)} [Z(y)] \triangleq I_\text{DV} \end{aligned} \]
The original \(I_\text{DV}\) bound is in effect derived from the variational lower bound of KL-divergence: \[ \begin{aligned} I&(X;Y) = D_\text{KL}(p_{(x,y)} || p(x) \otimes p(y)) \\ &\ge \E_{p(x,y)} f(x,y) - \log [\E_{p(x) \otimes p(y)} f(x,y)] \\ &= \E_{p(x,y)} f(x,y) - \log [\E_{p(y)} \E_{p(x)} e^{f(x,y)}] \\ &= \E_{p(x,y)} f(x,y) - \log \E_{p(y)} Z(y) \\ \end{aligned} \]
\(I_\text{TUBA}\)
\(\log\) function also has the following property: \[ \begin{aligned} \forall x,a>0,\log(x) &\le \frac{x}{a} + \log(a) - 1 &\iff \\ a + a\log(x) &\le x + a\log(a) &\iff \\ a\log(x) - x &\le a\log(a) - a \\ \end{aligned} \] Therefore, we can insert the inequality \(\log Z(y) \le \frac{Z(y)}{a(y)} + \log a(y) - 1\) into the \(I_\text{UBA}\) to get \[ \begin{aligned} I(X;Y) &\ge \E_{p(x,y)} f(x,y) - \E_{p(y)} \log Z(y) \\ &\ge \E_{p(x,y)} f(x,y) - \E_{p(y)} [\frac{Z(y)}{a(y)} + \log a(y) - 1 \big)] \triangleq I_\text{TUBA} \end{aligned} \]
This bound is tight when
- the tight condition of \(I_\text{UBA}\) holds: \(f(x,y) = \log p(y|x) + \underbrace{\log \frac{Z(y)}{p(y)}}_{c(y)}\), and
- the tight condition of \(I_\text{TUBA}\) holds: \(a(y) = Z(y)\).
\(I_\text{NWJ}\)
By setting \(a(y) = e\) in \(I_\text{TUBA}\), we get \[ \begin{aligned} I(X;Y) &\ge \E_{p(x,y)} f(x,y) - \E_{p(y)} [\frac{Z(y)}{a(y)} + \log a(y) - 1 \big)] \\ &\ge \E_{p(x,y)} f(x,y) - e^{-1}\E_{p(y)} Z(y)\triangleq I_\text{NWJ} \end{aligned} \] Since \(I_\text{NWJ}\) is a special case of \(I_\text{TUBA}\), its bound is tight when \(Z(y)\) self-normalizes to \(e\). In this case \[ \begin{aligned} f(x,y) &= \log p(y|x) + \log \frac{e}{p(y)} \\ &= 1 + \log \frac{p(y|x)}{p(y)} \\ &= 1 + \log \frac{p(x|y)}{p(x)} \end{aligned} \]
\(I_\text{NCE}\)
Our goal is to estimate the \(I(X_1;Y)\) given one sample from \(p(x_1)p(y|x_1)\) and \(K-1\) additional independent samples \(x_{2:K} \sim p^{K-1}(x_{2:K})\). For any random variable \(Z\) that is independent from \(X_1\) and \(Y\), \[ \begin{aligned} I(X_1,Z;Y) &= \E_{p(x_1,z,y)} \frac{p(x_1,z,y)}{p(x_1,z) p(y)} \\ &= \E_{p(x_1,z,y)} \frac{p(x_1,y) p(z)}{p(x_1) p(z) p(y)} \\ &= I(X_1;Y) \end{aligned} \] Therefore, \(I(X_1;Y) = I(X_1,X_{2:K};Y)\). Bounding \(I(X_1;Y)\) becomes bounding \(I(X_1,X_{2:K};Y)\), which can be estimated using any of the preceding methods.
Set the critic to \(f(x_{1:K},y) = 1 + \overbrace{\log \frac{e^{g(x,y)}} {a(y;x_{1:K})}}^{h(x_{1:K},y)}\), where \(x\) is the sample among \(x_{1:K}\) that is paired with \(y\). Usually \((x,y)_{1:K}\) are sampled from the same marginal distribution of \(\tilde p(x,y)\). \(y\) is then uniformly drawn among \(y_{1:K}\). Substitute these to the \(I_\text{NWJ}\), we have \[ \label{infonce} \begin{aligned} I(X_{1:K};Y) &\ge \E_{p(x_{1:K},y)} f(x_{1:K},y) - e^{-1}\E_{p(y)} Z(y) \\ &= \E_{p(x_{1:K},y)} [1 + \log \frac{e^{g(x,y)}}{a(y,x_{1:K})}] - e^{-1} \E_{p(y)} [\E_{p(x_{1:K})} e^{1 + \log \frac{e^{g(x,y)}}{a(y,x_{1:K})}}] \\ &= 1 + \E_{p(x_{1:K})p(y|x_{1:K})} [\log \frac{e^{g(x,y)}}{a(y,x_{1:K})}] - e^{-1} \E_{p(y)} [e\E_{p(x_{1:K})} \frac{e^{g(x,y)}}{a(y,x_{1:K})}] \\ &= 1 + \E_{p(x_{1:K})p(y|x_{1:K})} [\log \frac{e^{g(x,y)}}{a(y,x_{1:K})}] - \E_{p(x_{1:K})p(y)} \frac{e^{g(x,y)}}{a(y,x_{1:K})} \\ \end{aligned} \] Further set \(a(y;x_{1:K}) = \frac{1}{K} \sum_{i=1}^K e^{g(x_i, y)}\). Substitute this into the last term in equation \(\eqref{infonce}\) will give \[ \begin{aligned} \E_{p(x_{1:K})p(y)} \frac{e^{g(x,y)}}{a(y,x_{1:K})} &= \E_{p(y)} \frac{\E_{p(x_{1:K})} e^{g(x,y)}}{\frac{1}{K} \sum_{i=1}^K e^{g(x_i, y)} } \\ &\stackrel{P}{\to} \E_{p(y)} \frac{\E_{p(x_{1:K})} e^{g(x,y)}} {\E_{p(x_{1:K})} e^{g(x,y)} } \\ &= 1 \end{aligned} \] Therefore, \[ \begin{aligned} I(X_{1:K};Y) &\ge \E_{p(x_{1:K})p(y|x_{1:K})} [\log \frac{e^{g(x,y)}} {\frac{1}{K} \sum_{i=1}^K e^{g(x_i, y)} }] \\ &\approx \E_{p(x_{1:K})} [\frac{1}{K} \sum_{j=1}^K \log \frac{e^{g(x_j,y_j)}} {\frac{1}{K} \sum_{i=1}^K e^{g(x_i, y_j)} }] \triangleq I_\text{NCE} \end{aligned} \]
On the other hand, \(I_\text{NCE}\) is tightly bounded by \(\log K\): \[ \begin{aligned} I_\text{NCE} &= \E_{p(x_{1:K})} [\frac{1}{K} \sum_{j=1}^K \log \frac{e^{g(x_j,y_j)}} {\frac{1}{K} \sum_{i=1}^K e^{g(x_i, y_j)}} ] \\ &= \E_{p(x_{1:K})} [\frac{1}{K} \sum_{j=1}^K (\log \frac{e^{g(x_j,y_j)}} {\sum_{i=1}^K e^{g(x_i, y_j)}} )] + \log K \\ &\le \E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \frac{e^{g(x_j,y_j)}} {\sum_{i=1}^K e^{g(x_i, y_j)}} ]} + \log K \\ &= -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \frac{\sum_{i=1}^K e^{g(x_i, y_j)}} {e^{g(x_j,y_j)}} ]} + \log K \end{aligned} \] The equality is reached when \(f(x_{1:K},y) = 1 + {h(x_{1:K},y)} = 1 + \log \frac{p(x|y)}{p(x)}\) as in \(I_\text{NWJ}\). In this case, it can be derived that \(g(x,y) = g^\star(x,y) = \frac{p(y|x)}{p(y)}\). And then, \[ \begin{aligned} I_\text{NCE} &\le -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \frac{\frac{p(y_j|x_j)}{p(y_j)} + \sum_{i=1,i \ne j}^K \frac{p(y_j|x_i)}{p(y_j)}} {\frac{p(y_j|x_j)}{p(y_j)}} ]} + \log K \\ &\le -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K (1 + \frac{p(y_j)}{p(y_j|x_j)} \sum_{i=1,i \ne j}^K \frac{p(y_j|x_i)}{p(y_j)}) ]} + \log K \\ &\approx -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \big(1 + \frac{p(y_j)}{p(y_j|x_j)} (K-1) \E_{p(y)} \frac{p(y|x_i)}{p(y)} \big) ]} + \log K \\ &= -\E_{p(x_{1:K})} \log {[1 + \frac{1}{K} \sum_{j=1}^K \big( \frac{p(y_j)}{p(y_j|x_j)} (K-1) \big) ]} + \log K \\ &\approx -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \big( \frac{p(y_j)}{p(y_j|x_j)} (K-1) \big) ]} + \log K \\ &= -\E_{p(x_{1:K})} \log {[\frac{1}{K} \sum_{j=1}^K \frac{p(y_j)}{p(y_j|x_j)} ]} - \log(K-1) + \log K \\ &= I(X_1;Y) - \log(K-1) + \log K \end{aligned} \] This derivation is much looser than that in the InfoNCE’s original paper.
With \(I_\text{UBA}\)
There is another approach to the derivation of \(I_\text{NCE}\), that stems from an estimate of \(I_\text{UBA}\) (which I won’t call empirical estimate) and that may be more intuitive: \[ \begin{aligned} &I(X;Y) \ge \E_{p(x,y)} f(x,y) - \E_{p(y)} \log Z(y) \\ &= \E_{p^N(x,y)} f(x_1,y_1) - \E_{p^N(x,y)} \log Z(y_1) \\ &= \E_{p^N(x,y)} \log e^{f(x_1,y_1)} \\ &\quad\;- \E_{p^N(x,y)} \log \E_{p(x')} e^{f(x',y_1)} \\ &= \E_{p^N(x,y)} \log \frac{e^{f(x_1,y_1)}}{\E_{p(x')} e^{f(x',y)}} \\ &\simeq \E_{p^N(x,y)} \log \frac{e^{f(x_1,y_1)}} {\frac{1}{N} \sum_i e^{f(x_i,y_1)}} \\ &= \E_{p^N(x,y)} \log \frac{e^{f(x_1,y_1)}} {\sum_i e^{f(x_i,y_1)}} + \log N \\ &\triangleq I_\text{NCE} \le \log N \end{aligned} \]