Mutual Information

The mutual information of two random variables \(X\) and \(Y\) is a measure of the mutual dependence between them: it quantifies the amount of information obtained about one random variable by observing the other. For discrete \(X\) and \(Y\) it is defined as \[ I(X;Y) = \sum_{(x,y) \in X \times Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \] Directly from the definition, \[ \begin{gather} I(X;Y) = I(Y;X) \\ I(X;Y) = D_{KL}(p_{(X,Y)} || p_X \otimes p_Y) \\ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \end{gather} \]
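As a concrete sanity check, here is a minimal numerical sketch in Python (the 2×2 joint pmf below is a made-up example) that evaluates the definition directly and confirms \(I(X;Y) = H(X) - H(X|Y)\):

```python
import numpy as np

# Hypothetical joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# H(X) - H(X|Y), with H(X|Y) = -sum_{x,y} p(x,y) log p(x|y)
h_x = -np.sum(p_x * np.log(p_x))
h_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))

print(mi, h_x - h_x_given_y)   # the two values agree
```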

Gaussian Case

To illustrate the formula for the mutual information between two jointly Gaussian random vectors \(X\) and \(Y\), we can concatenate them into an \(n\)-dimensional random vector \(Z\), which is also Gaussian-distributed. The mutual information between \(X\) and \(Y\) can then be computed as \[ I(X;Y) = \frac 1 2 \log \frac{\det \Sigma_X \det \Sigma_Y}{\det \Sigma_Z} \] The key to the derivation is that mutual information is the KL divergence between the joint distribution and the product of the marginal distributions.
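A minimal sketch of this formula in code (the helper name gaussian_mi, the block split, and the example covariance are illustrative assumptions, not part of the text above): for scalar \(X\) and \(Y\) with correlation \(\rho\) it reproduces the familiar value \(-\frac 1 2 \log(1-\rho^2)\).

```python
import numpy as np

def gaussian_mi(cov, n_x):
    """MI between the first n_x coordinates and the rest of a joint Gaussian."""
    cov = np.asarray(cov, dtype=float)
    cov_x = cov[:n_x, :n_x]
    cov_y = cov[n_x:, n_x:]
    _, logdet_z = np.linalg.slogdet(cov)      # log det Σ_Z, numerically stable
    _, logdet_x = np.linalg.slogdet(cov_x)    # log det Σ_X
    _, logdet_y = np.linalg.slogdet(cov_y)    # log det Σ_Y
    return 0.5 * (logdet_x + logdet_y - logdet_z)

# 1-D sanity check: for scalar X, Y with correlation rho, I = -1/2 log(1 - rho^2)
rho = 0.8
print(gaussian_mi([[1.0, rho], [rho, 1.0]], n_x=1))
print(-0.5 * np.log(1 - rho**2))
```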

The joint distribution can be written as \[ p_{X:Y} = N(\underbrace{\mu_X:\mu_Y}_\mu, \underbrace{ \begin{bmatrix} \Sigma_{X} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{Y} \\ \end{bmatrix} }_\Sigma ) \]

The product of the marginals can be written as \[ p_{X} \times p_{Y} = N(\mu_X:\mu_Y, \begin{bmatrix} \Sigma_{X} & 0 \\ 0 & \Sigma_{Y} \\ \end{bmatrix} ) \] The probability density function of an \(n'\)-dimensional Gaussian distribution is \(p(x') = \frac{1}{\sqrt{|2\pi \Sigma'|}}e^{-\frac{1}{2}(x'-\mu')^T\Sigma'^{-1}(x'-\mu')}\), and its differential entropy is \(\frac 1 2 n' + \frac 1 2 \log |2\pi\Sigma'|\). In view of the above, \[ \begin{aligned} I&(X;Y) = D_{KL}(p_{X:Y} || p_X \times p_Y) = \int p_{X:Y}(\underbrace{x:y}_z) \log \frac{p_{X:Y}(x:y)} {p_X(x) p_{Y}(y)} \d z \\ &= \int p_{X:Y}(\underbrace{x:y}_z) \log p_{X:Y}(x:y) \d z - \int p_{X:Y}(\underbrace{x:y}_z) \log p_X(x) \d z \\ &\quad\quad\quad -\int p_{X:Y}(\underbrace{x:y}_z) \log p_Y(y) \d z \\ &= \int p_{Z}(z) \log p_{Z}(z) \d z - \int p_{X}(x) \log p_X(x) \d x - \int p_{Y}(y) \log p_Y(y) \d y \\ &= -(\log \sqrt{\det(2\pi \Sigma)} + \frac n 2) + (\log \sqrt{\det(2\pi \Sigma_{X})} + \frac {n_X} 2) \\ &\quad\quad\quad +(\log \sqrt{\det(2\pi \Sigma_{Y})} + \frac {n_Y} 2) \\ &= \frac 1 2 \log \frac{\det \Sigma_X \det \Sigma_Y}{\det \Sigma_Z} \end{aligned} \] where the last step uses \(n = n_X + n_Y\), \(\det(2\pi\Sigma') = (2\pi)^{n'}\det\Sigma'\), and \(\Sigma = \Sigma_Z\).
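To double-check the derivation numerically, the sketch below (with an arbitrary, hypothetical 3-dimensional joint covariance) computes \(H(X) + H(Y) - H(Z)\) from the Gaussian entropy formula above and compares it with \(\frac 1 2 \log \frac{\det \Sigma_X \det \Sigma_Y}{\det \Sigma_Z}\):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(mu, cov): 1/2 n + 1/2 log det(2*pi*cov)."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return 0.5 * n + 0.5 * logdet

# Hypothetical 3-dimensional joint: X is 2-dimensional, Y is 1-dimensional.
cov_z = np.array([[1.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.0]])
cov_x, cov_y = cov_z[:2, :2], cov_z[2:, 2:]

lhs = gaussian_entropy(cov_x) + gaussian_entropy(cov_y) - gaussian_entropy(cov_z)
rhs = 0.5 * (np.linalg.slogdet(cov_x)[1] + np.linalg.slogdet(cov_y)[1]
             - np.linalg.slogdet(cov_z)[1])
print(lhs, rhs)   # both equal I(X;Y)
```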

Kronecker Gaussian

Consider two multivariate Gaussian random vectors \(X_k\) and \(Y_k\) of the same length \(k\). Suppose the components within each vector are independent with unit variance, and that the cross-correlation is component-wise: \(\operatorname{corr}(X_i, Y_j) = \delta_{ij} \rho\), where \(\rho \in (-1, 1)\) (the open interval ensures the covariance matrix is invertible), \(1 \le i, j \le k\), and \(\delta_{ij}\) is the Kronecker delta: \[ \delta_{ij} = \begin{cases} 0, & i \ne j \\ 1, & i = j \end{cases} \] Let \(Z_k\) be the vector obtained by concatenating \(X_k\) and \(Y_k\). Its covariance matrix \(\Sigma_{Z_k}\) is \[ \begin{pmatrix} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{pmatrix} \] The mutual information between \(X_k\) and \(Y_k\) can then be computed as \[ I(X;Y) = \frac 1 2 \log \frac{\det \Sigma_{X_k} \det \Sigma_{Y_k}}{\det \Sigma_{Z_k}} = -\frac 1 2 \log \det \Sigma_{Z_k} \] where the second equality holds because \(\Sigma_{X_k} = \Sigma_{Y_k} = I_k\) have determinant \(1\). It remains to compute \(\det \Sigma_{Z_k}\), abbreviated \(\det Z_k\) below. Applying the Laplace expansion along the first column, whose only nonzero entries are the \(1\) in row \(1\) and the \(\rho\) in row \(k+1\), we are left with the determinants of the following two minors (dashed lines mark the row/column to be deleted): \[ \begin{gather} A_k = \underbrace{ \left( \begin{array}{c:ccccccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right) }_\text{$2k$ columns}, B_k = \underbrace{ \left( \begin{array}{c:ccccccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right) }_\text{$2k$ columns}, \\ \det Z_k = \left. \begin{cases} \det A_k - \rho \det B_k, & \text{$k$ is odd} \\ \det A_k + \rho \det B_k, & \text{$k$ is even} \\ \end{cases} \right\} \Rightarrow \det Z_k = \det A_k + (-1)^k \rho \det B_k \end{gather} \]
Applying the Laplace expansion along the \(k\)-th row of \(A_k\), whose only remaining nonzero entry is the \(1\) in column \(k\) (cofactor sign \((-1)^{k+k} = 1\)), we have \[ \det A_k = \underbrace{ \left| \begin{array}{c:ccc:c:ccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right| }_\text{$2k$ columns} = \det Z_{k-1} \] Applying the Laplace expansion along the first row of \(B_k\), whose only remaining nonzero entry is the \(\rho\) in column \(k\) (cofactor sign \((-1)^{1+k}\)), we have \[ \det B_k = (-1)^{k+1} \rho \underbrace{ \left| \begin{array}{c:ccc:c:ccc} 1 & 0 & \cdots & 0 & \rho & 0 & \cdots & 0 \\ \hdashline 0 & 1 & \cdots & 0 & 0 & \rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & \rho \\ \hdashline \rho & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \hdashline 0 & \rho & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho & 0 & 0 & \cdots & 1 \\ \end{array} \right| }_\text{$2k$ columns} = (-1)^{k+1} \rho \det Z_{k-1} \]
Combining the two expansions, \(\det Z_k = \det Z_{k-1} + (-1)^k \rho \cdot (-1)^{k+1} \rho \det Z_{k-1} = (1 - \rho^2) \det Z_{k-1}\). Because \(\det Z_1 = 1 - \rho^2\), we finally have \[ \det Z_k = (1 - \rho^2)^k \] Therefore, \[ I(X;Y) = -\frac 1 2 \log \det \Sigma_{Z_k} = -\frac k 2 \log (1 - \rho^2) \]
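The closed form is easy to verify numerically; the sketch below (with arbitrary example values of \(k\) and \(\rho\)) assembles \(\Sigma_{Z_k}\) from identity and \(\rho I\) blocks and checks both \(\det \Sigma_{Z_k} = (1 - \rho^2)^k\) and the resulting mutual information:

```python
import numpy as np

k, rho = 5, 0.6                       # arbitrary example values
eye = np.eye(k)
# Σ_{Z_k}: identity diagonal blocks, ρ·I off-diagonal blocks
sigma_z = np.block([[eye, rho * eye],
                    [rho * eye, eye]])

det_z = np.linalg.det(sigma_z)
print(det_z, (1 - rho**2) ** k)                 # both ≈ (1 - ρ²)^k

mi = -0.5 * np.log(det_z)                       # Σ_X = Σ_Y = I, so their dets are 1
print(mi, -0.5 * k * np.log(1 - rho**2))        # both equal -(k/2) log(1 - ρ²)
```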

External Materials

Mutual information between subsets of variables in the multivariate normal distribution - Cross Validated (stackexchange.com)

Information Theory and Predictability Lecture 7: Gaussian Case
