Principal Component Analysis
Given a data matrix \(X = [x^{(1)}, \dots, x^{(M)}] \in \mathbb R^{N \times M}\), the goal of PCA is to identify the directions of maximum variance contained in the data (compared with directional derivative, this is directional variance) and project data onto those directions.
Linear PCA
Let \(v \in \mathbb R^{N \times 1}\) and \(||v||^2 = 1\), the variance in the direction \(v\) is given by \[ \frac{1}{M}\sum_{i=1}^M(v^Tx^{(i)} - \mu)^2, \text{where }\mu = \frac{1}{M}\sum_{i=1}^Mv^Tx^{(i)} = v^T\bar x \] Under the assumption that data is pre-centered so that \(\bar x = \frac{1}{M}\sum_{i=1}^Mx^{(i)} = 0\): \[ \begin{aligned} &\frac{1}{M}\sum_{i=1}^M(v^Tx^{(i)} - \mu)^2 = \frac{1}{M}\sum_{i=1}^M(v^Tx^{(i)} - v^T\bar x)^2 \\ &= \frac{1}{M}\sum_{i=1}^Mv^T(x^{(i)} - \bar x)(x^{(i)} - \bar x)^Tv \\ &= \frac{1}{M}v^T(\sum_{i=1}^M(x^{(i)} - \bar x)(x^{(i)} - \bar x)^T)v \\ &= \frac{1}{M}v^T(\sum_{i=1}^Mx^{(i)}(x^{(i)})^T)v \\ &\Downarrow_{\text{by column-row expansion}} \\ &= \frac{1}{M}v^TXX^Tv \\ \end{aligned} \] Suppose we want to identify the direction \(v\) of the maximum variance, we can formulate the problem as \[ \begin{gather} \max_v \quad \frac{1}{M}v^TXX^Tv \\ s.t.\quad v^Tv = 1 \end{gather} \] Let \(\Sigma = \frac{1}{M}XX^T\), which is real symmetric, we can form the Lagrangian function: \[ L(v, \lambda) = v^T\Sigma v + \lambda (1 - v^Tv) \] We represent the constraint as \(1 - v^Tv = 0\) instead of \(v^Tv - 1 = 0\). The reason is if we take derivative w.r.t \(v\), we have \[ \frac{\partial L}{\partial v} = 2\Sigma v - 2\lambda v \] which is more intuitive than \(\frac{\partial L}{\partial v} = 2\Sigma v + 2\lambda v\). Let the derivative be \(0\) to have \(\Sigma v = \lambda v\). Since \(v^T v = 1\), \(v \ne 0\), this means the optimal solution \((\lambda^\star, v^\star)\) must be a pair of eigenvalue and eigenvector of \(\Sigma\):
\[ v^\star = v_i, \lambda^\star = \lambda_i \] where \(v_i\) is rescaled such that \(v_i^Tv_i = 1\). Substitute the result back to the objective to give
\[ \begin{aligned} \frac{1}{M}v^TXX^Tv &= (v_i)^T\Sigma v_i \\ &= (v_i)^T\lambda_iv_i \\ &= \lambda_i(v_i)^Tv_i \\ &= \lambda_i \end{aligned} \] So the \(N\) largest directions of variance will be the \(\Sigma\)’s eigenvectors \(v_1, \dots, v_N\), corresponding to the eigenvalues \(\lambda_1 > \lambda_2 > ... > \lambda_N\).
Compared with SVD
Intuitively, principal components can be obtained by spectral decomposition of \(X\)’s covariance matrix. But in practice, principal components are usually solved with SVD, which is more efficient in computation and can handle sparse representation. One thing worth notice is that, before applying SVD, PCA centers the data firstly. That said, we cannot equalize PCA and SVD. For a more detailed discussion on the relation between PCA and SVD, please refer to this blog and this post.
Another view on PCA
Finding \(K\) different \(v\) to \[ \max_{v} \frac{1}{M}\sum_{i=1}^M(v^Tx^{(i)})^2 \\ \] is equivalent to finding \(K\) different \(v\) to \[ \min_{v} \frac{1}{M}\sum_{i=1}^M||x^{(i)} - v^Tx^{(i)}v||^2 \] both subject to \(||v||^2 = 1\) and inter-orthogonality. This second notation is essentially the projection of \(x^{(i)}\) on \(v\) because \(||v|| = 1\) and thus the objective indicates minimizing the overall approximation error.
Kernel PCA
We can map the data to a higher-dimension space: \(x^{(i)} \to \phi(x^{(i)})\). Let \(\mathcal{X} = [\phi(x^{(1)}), ..., \phi(x^{(M)})]\). Kernel tricks allow us not to explicitly define \(\phi\), but to only focus on the inner product of mapped data: \(\mathcal K(x^{(i)}, x^{(j)}) = \phi(x^{i})^T\phi(x^{j})\).
Similarly, the variance along direction \(v\), where \(v^Tv=1\), is given by \[ \frac{1}{M}\sum_{i=1}^M(v^T\phi (x^{(i)}) - \mu)^2, \text{where }\mu = \frac{1}{M}\sum_{i=1}^Mv^T\phi (x^{(i)}) = v^T\bar\phi(x) \] Under the assumption that data is pre-centered so that \(\bar\phi(x) = \frac{1}{M}\sum_{i=1}^M\phi(x^{(i)}) = 0\): \[ \begin{aligned} &\frac{1}{M}\sum_{i=1}^M(v^T\phi (x^{(i)}) - \mu)^2 = \frac{1}{M}\sum_{i=1}^M(v^T\phi (x^{(i)}) - v^T\bar\phi(x))^2 \\ &= \frac{1}{M}\sum_{i=1}^Mv^T(\phi(x^{(i)}) - \bar{\phi(x)})(\phi (x^{(i)}) - \bar\phi(x))^Tv \\ &= \frac{1}{M}v^T(\sum_{i=1}^M(\phi(x^{(i)}) - \bar\phi(x))(\phi (x^{(i)}) - \bar\phi(x))^T)v \\ &= \frac{1}{M}v^T(\sum_{i=1}^M\phi(x^{(i)})(\phi(x^{(i)})^T)v \\ &\Downarrow_{\text{by column-row expansion}} \\ &= \frac{1}{M}v^T\mathcal{X}\mathcal{X}^Tv \\ \end{aligned} \] Then comes the standard form of linear PCA. Let \(\Sigma = \frac{1}{M}\mathcal{X}\mathcal{X}^T\), we would like to solve \[ \begin{gather} \max_v\quad \frac{1}{M}v^T\mathcal{X}\mathcal{X}^Tv \\ s.t.\quad v^T v=1 \end{gather} \] This is equivalent to solving \(\Sigma\)’s eigenvectors. Unfortunately, this cannot be directly solved like in linear PCA since we don’t know \(\phi\). However note that
\[ \begin{gather} \begin{aligned} \Sigma v &= \lambda v \\ v &= \frac{1}{\lambda}\Sigma v \end{aligned} \\ \begin{aligned} v &= \frac{1}{M\lambda}\sum_{i=1}^M\phi(x^{(i)})[(\phi(x^{(i)})^Tv] \\ &= \frac{1}{M\lambda}\sum_{i=1}^M[(\phi(x^{(i)})^Tv]\phi(x^{(i)}) \end{aligned} \end{gather} \] \(v\) is some linear combination of \(\phi(x^{(i)})\)’s and can be written as \[ v = \mathcal{X}\alpha, \alpha \in \R^N \] Substitute this back to \(\Sigma v = \lambda v\) to give \[ \begin{aligned} \frac{1}{M}\mathcal{X}\mathcal{X}^T\mathcal{X}\alpha &= \lambda\mathcal{X}\alpha \\ \frac{1}{M}\mathcal{X}^T\mathcal{X}\mathcal{X}^T\mathcal{X}\alpha &= \lambda\mathcal{X}^T\mathcal{X}\alpha \\ \mathcal{X}^T\mathcal{X}\mathcal{X}^T\mathcal{X}\alpha &= M\lambda\mathcal{X}^T\mathcal{X}\alpha \\ \end{aligned} \] Let \(K = \mathcal{X}^T\mathcal{X}\) which is invertible. Then, \[ \begin{gathered} KK\alpha = M\lambda K\alpha \\ K\alpha = M\lambda\alpha \end{gathered} \] We can solve \(\alpha\) as \(K\)’s eigenvector. Kernel trick can be applied since \(K_{ij} = (\mathcal{X}^T\mathcal{X})_{ij} = \phi(x^{(i)})^T\phi(x^{(j)})\).
\(M\lambda\) is \(K\)’s eigenvalue, which is proportional to \(\lambda\), i.e. \(\Sigma\)’s eigenvalue and thus the objective. So \(K\)’s largest \(L\) eigenvalues correspond to \(L\) largest objective one-by-one. \(K\)’s \(L\) eigenvectors \(\alpha_1, ..., \alpha_L\) has to be solved with constraint \(\alpha_i^T\alpha_i = \frac{1}{M\lambda}\) so that for \(\mathcal{X}\)’s eigenvector \(v\): \[ \begin{aligned} v^Tv &= (\mathcal{X}\alpha)^T\mathcal{X}\alpha \\ &= \alpha^T(\mathcal{X}^T\mathcal{X})\alpha \\ &= \alpha^TK\alpha \\ &= M\lambda\alpha^T\alpha \\ &= 1 \end{aligned} \]
Given a sample \(x^*\), we compute its dot-products with \(v_1,\dots,v_L\) in the kernel space to get the projected coordinates. To get \(j\)-th coordinate of transformed sample, \[ \phi(x^*)^T \phi(v_j) = \phi(x^*)^T \mathcal X \alpha_j = [\mathcal K(x^*, x^{(1)}), \mathcal K(x^*, x^{(2)}), \dots, \mathcal K(x^*, x^{(M)})]\alpha_j \]
Centralizing in Kernel Trick
Let \(\tilde\phi(x^{(i)}) = \phi(x^{(i)}) - \bar\phi(x), \tilde{\mathcal{X}} = [\tilde\phi(x^{(1)}), \tilde\phi(x^{(2)}), ..., \tilde\phi(x^{(M)})]\). We have assumed that \(\phi(x^{(i)})\) is pre-centered by subtracting \(\bar\phi(x)\). However, as said, we didn’t explicitly define \(\phi\), so there is no way to calculate \(\phi(x^{(i)})\), let alone \(\bar\phi(x)\).
From another perspective, the objective of pre-centering is to get \(\tilde K = \tilde{\mathcal{X}}\tilde{\mathcal{X}}^T = \sum_{i=1}^M\tilde\phi(x^{(i)})\tilde\phi(x^{(i)})^T\).
Let \(\mathbb{1}_{M \times 1}\) represent a \(M \times 1\) column vector with all elements being \(1\): \[ \bar\phi(x) = \frac{1}{M}(\phi(x^{(1)}) + \phi(x^{(2)}) + ... + \phi(x^{(M)})) = \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1} \] Let \(\mathbb{1}_M\) represents a \(M \times M\) matrix with all elements being \(\frac{1}{M}\). Pre-center all data to give \[ \begin{aligned} \tilde{\mathcal{X}} &= [\tilde\phi(x^{(1)}), \tilde\phi(x^{(2)}), ..., \tilde\phi(x^{(M)})]\\ &= [\phi(x_1) - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}, \phi(x^{(2)}) - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}, ..., \phi(x^{(M)}) - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}] \\ &= [\phi(x^{(1)}), \phi(x^{(2)}), ..., \phi(x^{(M)})] - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}\mathbb{1}_{M \times 1}^T \\ &= \mathcal{X} - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}\mathbb{1}_{M \times 1}^T \\ &= \mathcal{X} - \mathcal{X}\mathbb{1}_M \\ \tilde K &= \tilde{\mathcal{X}}^T\tilde{\mathcal{X}} \\ &= (\mathcal{X} - \mathcal{X}\mathbb{1}_M)^T(\mathcal{X} - \mathcal{X}\mathbb{1}_M) \\ &= (\mathcal{X}(I - \mathbb{1}_M))^T\mathcal{X}(I - \mathbb{1}_M) \\ &= (I - \mathbb{1}_M)^T\mathcal{X}^T\mathcal{X}(I - \mathbb{1}_M) \\ &= (I - \mathbb{1}_M)K(I - \mathbb{1}_M) \\ \tilde K_{ij} &= (\tilde{\mathcal{X}}^T\tilde{\mathcal{X}})_{ij} \\ &= \tilde\phi(x^{(i)})^T\tilde\phi(x^{(j)})\\ &= (\phi(x^{(i)}) - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1})^T(\phi(x^{(j)}) - \frac{1}{M}\mathcal{X}\mathbb{1}_{M \times 1}) \\ \end{aligned} \]
Explained Variance
It remains a question that in real application, how many principal components to choose to represent the original data. Explained variance can be a good measure on this. We can choose a number of principal components such that they “explains” a certain percentage \(\tau\) of the original data.
Specifically, \(\frac{1}{M}v_i^TXX^Tv_i\) is the directional variance explained along \(v_i\), or the variance of the first dimension of transformed data samples. We may choose the first \(k\) principal components such that \[ \frac{1}{M}\sum_{i=1}^k v_i^TXX^Tv_i \ge \tau \sum_{i=1}^N \sigma_i^2 \] ## External Materials