Mutual Information

Mutual information of two random variables \(X\) and \(Y\) is a measure of the mutual independence between them. It quantifies the amount of information obtained about one random variable by observing the other random variable. It is defined as \[ I(X;Y) = \sum_{(x,y) \in X \times Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \] By its definition, \[ \begin{gather} I(X;Y) = I(Y;X) \\ I(X;Y) = D_{KL}(p_{(X, Y)}|| p_X \otimes p_Y) \\ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \end{gather} \]

Gaussian Case

To better illustrate the formula of mutual information between two Gaussian-distributed random variables \(X\) and \(Y\). We can concatenate them to form, say an \(n\)-dimensional random variable \(Z\), which is also Gaussian-distributed. Then the mutual information between \(X\) and \(Y\) can be computed as: \[ I(X;Y) = \frac 1 2 \log \frac{\det \Sigma_X \det \Sigma_Y}{\det \Sigma_Z} \] The key to the derivation is that mutual information is the KL-divergence between the joint distribution and the product of the marginal distributions.

The joint can be described as \[ p_{X:Y} = N(\underbrace{\mu_X:\mu_Y}_\mu, \underbrace{ \begin{bmatrix} \ \Sigma_{X} & \Cov_{XY} \\ \Cov_{YX} & \Sigma_{Y} \\ \end{bmatrix} }_\Sigma ) \]

The product of marginals can be described as \[ p_{X} \times p_{Y} = N(\mu_x:\mu_y, \begin{bmatrix} \ \Sigma_{xx} & 0 \\ 0 & \Sigma_{yy} \\ \end{bmatrix} ) \] The probability density function of an \(n'\)-dimensional Gaussian distribution is \(p(x') = \frac{1}{\sqrt{|2\pi \Sigma'|}}e^{-\frac{1}{2}(x'-\mu')^T\Sigma'^{-1}(x'-\mu')}\). The entropy of this Gaussian distribution is \(\frac 1 2 n' + \frac 1 2 \log |2\pi\Sigma'|\). In view of above, \[ \begin{aligned} I&(X;Y) = D_{KL}(p_{X:Y} || p_X \times p_Y) = \int p_{X:Y}(\underbrace{x:y}_z) \log \frac{p_{X:Y}(x:y)} {p_X(x) p_{Y}(y)} \d z \\ &= \int p_{X:Y}(\underbrace{x:y}_z) \log p_{X:Y}(x:y) \d z - \int p_{X:Y}(\underbrace{x:y}_z) \log p_X(x) \d z \\ &\quad\quad\quad -\int p_{X:Y}(\underbrace{x:y}_z) \log p_Y(y) \d z \\ &= \int p_{Z}(z) \log p_{Z}(z) \d z - \int p_{X}(x) \log p_X(x) \d x - \int p_{Y}(y) \log p_Y(y) \d y \\ &= -(\log \sqrt{\det(2\pi \Sigma)} + \frac n 2) + (\log \sqrt{\det(2\pi \Sigma_{X}}) + \frac {n_X} 2) \\ &\quad\quad\quad +(\log \sqrt{\det(2\pi \Sigma_{Y}}) + \frac {n_Y} 2) \\ &= \frac 1 2 \log \frac{\det \Sigma_X \det \Sigma_Y}{\det \Sigma_Z} \end{aligned} \]

Kronecker Gaussian

Consider the multivariate Gaussian distribution random vector \(X_k\) and \(Y_k\) of the same length \(k\). Suppose they are both independent internally and they have the component-wise correlation coefficient \(\text{corr\_coef}(X_i, Y_j) = \delta_{ij} \rho\), where \(\rho \in (-1, 1)\) (open to ensure the covariance matrix is invertible), \(1 \le i, j \le k\); \(\delta_{ij}\) is the Kronecker’s delta: \[ \delta_{ij} = \begin{cases} 0, & i \ne j \\ 1, & i = j \end{cases} \] Let \(Z_k\) be the vector concatenated by \(X_k\) and \(Y_k\). It is easy to draw its covariance matrix \(\Sigma_{Z_k}\) like \[ \begin{pmatrix} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{pmatrix} \] The mutual information between the \(X_k\) and \(Y_k\) can be calculated as \[ I(X;Y) = \frac 1 2 \log \frac{\det \Sigma_{X_k} \det \Sigma_{Y_k}}{\det \Sigma_{Z_k}} = -\frac 1 2 \log \det \Sigma_{Z_k} \] The problem remains as how to compute \(\det \Sigma_{Z_{k}}\). After applying the Laplacian expansion along the first column, it remains to deal with the determinants of following two matrices (dashed lines rule out the row/column to be deleted): \[ \begin{gather} A_k = \underbrace{ \left( \begin{array}{c:ccccccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right) }_\text{$2k$ columns}, B_k = \underbrace{ \left( \begin{array}{c:ccccccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right) }_\text{$2k$ columns}, \\ \det Z_k = \left. \begin{cases} \det A_k - \rho \det B_k, & \text{$k$ is odd} \\ \det A_k + \rho \det B_k, & \text{$k$ is even} \\ \end{cases} \right\} \Rightarrow \det Z_k = \det A_k + (-1)^k \rho \det B_k \end{gather} \] Applying the Laplacian expansion along the \(k\)-th row of \(A_k\), we have \[ \det A_k = \underbrace{ \left| \begin{array}{c:ccc:c:ccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right| }_\text{$2k$ columns} = \det Z_{k-1} \] Applying the Laplacian expansion along the first row of \(B_k\), we have \[ \det B_k = (-1)^k \rho \underbrace{ \left| \begin{array}{c:ccc:c:ccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right| }_\text{$2k$ columns} = (-1)^k \rho \det Z_{k-1} \] In all, \(\det Z_k = \det Z_{k-1} - (-1)^{2k} \rho^2 \det Z_{k-1} = (1 - \rho^2) Z_{k-1}\). Because \(\det Z_1 = 1 - \rho^2\), we finally have \[ \det Z_k = (1 - \rho^2)^k \] Therefore, \[ I(X;Y) = -\frac 1 2 \log \det \Sigma_{Z_k} = -\frac k 2 \log (1 - \rho^2) \]

External Materials

Mutual information between subsets of variables in the multivariate normal distribution - Cross Validated (stackexchange.com)

Information Theory and Predictability Lecture 7: Gaussian Case

Previous
Next