Independent Component Analysis
Assumption and Derivation
Suppose observations are the linear combination of signals. Also suppose that the number of signal sources is equal to the number of linearly mixed channel. Given observations (along the time axis \(i\)) \(\mathcal{D} = \{x^{(i)}\}_{i=1}^M, x^{(i)} \in \R^{N \times 1}\), independent component analysis finds the \(N \times N\) mixing matrix \(A\) such that \(X = AS\). Columns of \(A\) are the propagation weights from a signal source to observers. Therefore, they are considered independent (and are thus called independent components).
\(A\) is called mixing matrix and \(W = A^{-1}\) is called unmixing matrix because \(S = WX\).
Suppose \(A = U\Sigma V^T\) by SVD. Assume observations are pre-centered:
\[ \sum_{i=1}^Mx^{(i)} = 0 \] Assume that signals are independent and the variance of each signal is \(1\) (taking the advantage of scale/sign ambiguity): \[ \frac{1}{M}\sum_{i=1}^Ms^{(i)}(s^{(i)})^T = I \] Then the covariance of \(X\) can be calculated as \[ \begin{aligned} C &= \frac{1}{M}x^{(i)}(x^{(i)})^T \\ &= \frac{1}{M}As^{(i)}(As^{(i)})^T \\ &= \frac{1}{M}U\Sigma V^Ts^{(i)}(U\Sigma V^Ts^{(i)})^T \\ &= \frac{1}{M}U\Sigma V^Ts^{(i)}(s^{(i)})^TV\Sigma^T U^T \\ &= U\Sigma V^TIV\Sigma^T U^T \\ &= U\Sigma^2U^T \text{ (by property of SVD, note that $\Sigma$ is $n \times n$ diagonal)} \\ \end{aligned} \]
If our assumptions are correct, then the \(U\) is the concatenated eigenvectors of the matrix \(CC^T\). \(\Sigma^2\) is the diagonal matrix consisting of corresponding eigenvalues of the matrix \(C^TC\). Since \(C\) is symmetric, \(CC^T = C^TC = C^2\).
In other words, \(U\) and \(\Sigma\) can be solved by eigen decomposition. Define \(\hat x^{(i)} = \Sigma^{-1}U^Tx^{(i)}\): \[ \begin{aligned} \hat C &= \frac{1}{M}\hat x^{(i)}(\hat x^{(i)})^T \\ &= \frac{1}{M}\Sigma^{-1}U^Tx^{(i)}(\Sigma^{-1}U^Tx^{(i)})^T \\ &= \frac{1}{M}\Sigma^{-1}U^Tx^{(i)}(x^{(i)})^TU\Sigma^{-1} \\ &= \Sigma^{-1}U^T(\frac{1}{M}x^{(i)}(x^{(i)})^T)U\Sigma^{-1} \\ &= \Sigma^{-1}U^TU\Sigma^2U^TU\Sigma^{-1} \\ &= I \\ \end{aligned} \]
\[ S = WX = A^{-1}X = V\Sigma^{-1}U^TX = V\hat X \]
The remaining job is to find the \(V\). We resort to multi-information, a generalization of mutual-information from Information Theory to judge how close a distribution is to statistical independence for multiple variables. \[ I(s) = \sum_{s \in \mathcal{S}}p(s) \log \frac{p(s)}{\prod_jp(s_j)} \] It is a non-negative quantity that reaches the minimum if and only if all variables are statistically independent.
The multi-information can be written as a function of entropy, which is defined as \[ H(s) = -\sum_{s \in \mathcal{S}}p(s) \log p(s) \] The multi-information can be written as the difference between the sum of entropies of marginal distribution and the entropy of joint distribution \[ \begin{aligned} I(s) &= \sum_jH(s_j) - H(s) \\ &= \sum_jH((V\hat x)_j) - H(V\hat x) \\ &= \sum_jH((V\hat x)_j) - (H(\hat x) + log|V|) \\ &= \sum_jH((V\hat x)_j) - (H(\hat x) + log1) \\ &= \sum_jH((V\hat x)_j) - H(\hat x) \\ \end{aligned} \] Because we are assuming signal are independent from each other, \[ \begin{aligned} V^\star &= \arg \min_V\sum_jH((V\hat x)_j) - H(\hat x) \\ &= \arg \min_V\sum_jH((V\hat x)_j) \end{aligned} \] Calculating entropy and then taking derivative is no easy task and therefore ICA algorithms focus on approximations or equivalences to the above equation. For example, it can be approximated by finding the rotation that maximizes the expected log-likelihood of the observed data by assuming that the source distributions are known.
Non-Gaussian and PCA
The key assumption of ICA is that signals are non-Gaussian. We are trying to recover the \(S\) (by finding \(W\)) to be as non-Gaussian as possible. \(\hat X\) consists of independent components because its covariance matrix \(\hat C\) is shown to be an identity matrix (and thus diagonal). \(S\) is reconstructed by the linear combination of independent components. Thus it will be the most non-Gaussian when each variable is formed by exactly one component of \(\hat X\) by Central Limit Theorem, i.e. \(S\)’s components are independent.
Consider the case when \(S\) consists of \(N\) independent Gaussian, which invalidate the above analysis. If we forcibly calculate \(V^\star\), due to the property of multi-variate Gaussian that its iso-density maps are spherical, then any \(N \times N\) rotation matrix will be a solution to Equation (9), including the left singular matrix of \(X\), i.e. the concatenated eigenvectors of \(XX^T\), which is exactly the projection basis obtained in PCA. If any \(N \times N\) rotation matrix can be a solution, there are infinite many solutions to \(A\), and thus ICA will just fail.
The conclusion drawn above is based on the ideal case, i.e. many enough observations. Even in the real case where signals are Gaussian, we may get a unique solution to \(V^\star\).