Dimension Reduction

Unsupervised Dimension Reduction

Dimensionality reduction

  • reduces computational cost;
  • de-noises data by projecting it onto a lower-dimensional space and back to the original space;
  • makes results easier to interpret by reducing collinearity.

Compared to feature selection,

  • the goal of feature selection is to remove features that are not informative with respect to the class label, which obviously reduces the dimensionality of the feature space;
  • dimensionality reduction can find a meaningful lower-dimensional feature space even when every feature dimension carries some information, so that none can simply be discarded;
  • the dimensionality reduction discussed here is unsupervised, while feature selection is supervised.

Compared to data compression,

  • dimensionality reduction can be seen as a simplistic form of data compression, but the two are not equivalent: the goal of data compression is to reduce the entropy of the representation, which is not limited to reducing dimensionality.

Linear Dimensionality Reduction

Linear dimensionality reduction projects data onto a lower-dimensional space by representing the data in a new basis consisting of a few major components. Mathematically, \[ \begin{gather} x^{(i)} = \sum_{k=1}^K z^{(i)}_k b^{(k)}, \text{ where $z^{(i)} \in \R^K$ is the weight vector and $b^{(k)} \in \R^N$ is a basis vector,} \\ X = BZ, \text{ where $X = [x^{(1)}, \dots, x^{(M)}], B = [b^{(1)}, \dots, b^{(K)}], Z = [z^{(1)}, \dots, z^{(M)}]$.} \end{gather} \] The objective can be set to minimize the error when recovering \(X\) from \(B, Z\), i.e. \[ \min_{B,Z} ||X - BZ||^2_F = \min_{B,Z} \sum_{ij}(X - BZ)^2_{ij} \]
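To make the column conventions concrete, here is a minimal numpy sketch of the Frobenius reconstruction error \(||X - BZ||^2_F\); the toy dimensions and random matrices are arbitrary choices for illustration.

```python
import numpy as np

# Toy setup: N features, M samples, K components (illustrative sizes).
N, M, K = 5, 100, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))   # data matrix, columns are samples x^(i)
B = rng.normal(size=(N, K))   # basis vectors b^(k) as columns
Z = rng.normal(size=(K, M))   # weight vectors z^(i) as columns

# Reconstruction error ||X - BZ||_F^2 = sum_ij (X - BZ)_ij^2
err = np.sum((X - B @ Z) ** 2)
print(err, np.linalg.norm(X - B @ Z, "fro") ** 2)  # the two agree
```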

Alternating Least Squares

By leveraging the idea of coordinate descent and the OLS solution to linear regression, we can optimize \(B\) and \(Z\) alternately until convergence.

Fix \(Z\), take the derivative w.r.t. \(B\) and set it to zero: \[ \begin{gather} 2(X - BZ)(-Z^T) = 0 \\ BZZ^T = XZ^T \\ B = XZ^T(ZZ^T)^{-1} \end{gather} \] Fix \(B\), take the derivative w.r.t. \(Z^T\) and set it to zero: \[ \begin{gather} \frac{\partial||X - BZ||^2_F}{\partial Z^T} = \frac{\partial||X^T - Z^TB^T||^2_F}{\partial Z^T} = 2(X^T - Z^TB^T)(-B) = 0 \\ Z^TB^TB = X^TB \\ Z^T = X^TB(B^TB)^{-1} \\ Z = (B^TB)^{-1}B^TX \end{gather} \]

Assume we have run ALS to convergence and obtained globally optimal parameters \[ B^\star, Z^\star = \arg \min_{B,Z} ||X - BZ||^2_F \] Let \(R \in \R^{K \times K}\) be any invertible matrix. We can construct a pair \(\tilde B = B^\star R, \tilde Z = R^{-1}Z^\star\). Then \[ ||X - \tilde B \tilde Z||^2_F = ||X - B^\star RR^{-1} Z^\star||^2_F = ||X - B^\star Z^\star||^2_F \] Thus the global optimum is not unique. We may impose regularization or constraints on \(B, Z\) to obtain a unique optimum.
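Below is a minimal numpy sketch of the ALS updates derived above. The function name `als`, the fixed iteration count, and the toy data are illustrative choices; no regularization is applied, so the returned factors are only determined up to an invertible \(R\).

```python
import numpy as np

def als(X, K, n_iters=100, seed=0):
    """Alternating least squares sketch for min_{B,Z} ||X - BZ||_F^2 (no regularization)."""
    N, M = X.shape
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(N, K))
    Z = rng.normal(size=(K, M))
    for _ in range(n_iters):
        # Fix Z, solve B = X Z^T (Z Z^T)^{-1}   (via lstsq for numerical stability)
        B = np.linalg.lstsq(Z @ Z.T, Z @ X.T, rcond=None)[0].T
        # Fix B, solve Z = (B^T B)^{-1} B^T X
        Z = np.linalg.lstsq(B.T @ B, B.T @ X, rcond=None)[0]
    return B, Z

X = np.random.default_rng(1).normal(size=(20, 200))
B, Z = als(X, K=3)
print(np.linalg.norm(X - B @ Z, "fro") ** 2)  # reconstruction error after ALS
```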

Principal Component Analysis

Suppose the SVD for \(X\) is \(X = U\Sigma V^T\), where \[ \begin{gather} \text{$U$ is $N \times N$, $\Sigma$ is diagonal, $V$ is $M \times M$} \\ UU^T = I \\ VV^T = I \\ \Sigma = diag_{N \times M}(\sigma_1, \sigma_2, ..., \sigma_p) \\ \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_p \ge 0 \\ p = \min\{N,M\} \end{gather} \] By preserving only the \(K \le \rank(\Sigma)\) most prominent singular values, \(X\) can be approximated as \(U_K\Sigma_KV^T_K\), where \[ \begin{gather} \text{$U_K \in \R^{N \times K}$ is the first $K$ columns of $U$} \\ \text{$V_K \in \R^{M \times K}$ is the first $K$ columns of $V$} \\ \Sigma_K \in \R^{K \times K} = diag(\sigma_1, \sigma_2, ..., \sigma_K) \\ \end{gather} \] The Eckart–Young–Mirsky theorem states that \(U_K\Sigma_KV^T_K\) is the best approximation of \(X\) among all \(N \times M\) matrices of rank at most \(K\) in terms of the Frobenius norm.

We may then obtain \(B^\star, Z^\star\) as \[ B^\star = U_K, \quad Z^\star = \Sigma_K V^T_K \]

If the data is centered in advance, \(B^\star\) contains exactly the principal components of PCA, and \(Z^\star\) is \(X\) expressed in the basis formed by \(B^\star\).
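A minimal numpy sketch of this recipe, assuming the same \(N \times M\) (features \(\times\) samples) layout as above; the function name `pca_svd` and the toy data are illustrative.

```python
import numpy as np

def pca_svd(X, K):
    """Rank-K PCA via SVD of the column-centered data matrix X (N features x M samples)."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center each feature dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = U[:, :K]                                # principal directions (basis B*)
    Z = np.diag(s[:K]) @ Vt[:K]                 # coordinates of the data in that basis (Z*)
    return B, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
B, Z = pca_svd(X, K=3)
# By Eckart-Young-Mirsky, B @ Z is the best rank-3 approximation of the centered X.
print(np.linalg.norm((X - X.mean(axis=1, keepdims=True)) - B @ Z, "fro"))
```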

Random Projection

The above methods all minimize the Frobenius norm of the reconstruction error. Optimizing this objective may become time-consuming when the input dimension grows large. We may instead use randomly generated vectors as the basis for the projection. This greatly saves time at the expense of accuracy. We can evaluate such a projection by checking whether the structure of the data is preserved, e.g. the distances between points.

The Johnson–Lindenstrauss lemma states that a random map \(f(x) = \frac{1}{\sqrt{K}}Ax\), where \(A\) is a \(K \times N\) matrix with i.i.d. entries sampled from a standard Gaussian, preserves the distance between any two points within error \(\epsilon\): \[ (1 - \epsilon)||x^{(i)} - x^{(j)}||^2_2 \le ||f(x^{(i)}) - f(x^{(j)})||^2_2 \le (1 + \epsilon)||x^{(i)} - x^{(j)}||^2_2 \] with probability at least \(\frac{1}{M}\), as long as \[ K \ge \frac{8\log M}{\epsilon^2} \] In practice this requirement is quite conservative.
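An empirical sanity check of this statement, sketched in numpy; the values of M, N, and \(\epsilon\) are arbitrary, and points are stored as rows here for convenience.

```python
import numpy as np

# Gaussian random projection f(x) = (1/sqrt(K)) A x and an empirical check
# of pairwise-distance distortion.
rng = np.random.default_rng(0)
M, N, eps = 50, 10_000, 0.3
K = int(np.ceil(8 * np.log(M) / eps**2))     # the (conservative) JL bound

X = rng.normal(size=(M, N))                  # M points in R^N, one per row
A = rng.normal(size=(K, N))                  # i.i.d. standard Gaussian entries
f_X = X @ A.T / np.sqrt(K)                   # projected points, one per row

# Ratio of squared distances after/before projection for all pairs i < j.
i, j = np.triu_indices(M, k=1)
d_orig = np.sum((X[i] - X[j]) ** 2, axis=1)
d_proj = np.sum((f_X[i] - f_X[j]) ** 2, axis=1)
ratios = d_proj / d_orig
print(ratios.min(), ratios.max())            # typically within [1 - eps, 1 + eps]
```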

Non-linear Dimensionality Reduction

We can introduce a feature mapping to handle non-linearity in dimensionality reduction. As with non-linear regression, we can apply the kernel trick in this case, yielding Kernel PCA.
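A rough numpy sketch of Kernel PCA with an RBF kernel, assuming the same columns-as-samples convention; the function name `kernel_pca` and the bandwidth `gamma` are illustrative choices.

```python
import numpy as np

def kernel_pca(X, K, gamma=1.0):
    """Kernel PCA sketch with an RBF kernel (columns of X are samples)."""
    M = X.shape[1]
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * X.T @ X        # pairwise squared distances
    Kmat = np.exp(-gamma * D2)                          # M x M RBF kernel matrix
    # Center the kernel matrix in feature space: K <- (I - 1/M) K (I - 1/M)
    J = np.eye(M) - np.ones((M, M)) / M
    Kc = J @ Kmat @ J
    # Top-K eigenvectors, scaled by sqrt of eigenvalues, give the projected coordinates.
    w, V = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:K]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # M x K embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Z = kernel_pca(X, K=2)
print(Z.shape)  # (200, 2)
```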

Supervised Dimensionality Reduction

There is no guarantee that unsupervised dimensionality reduction yields features that are useful for a classification problem; it may or may not help. Supervised dimensionality reduction, such as Fisher’s linear discriminant, instead finds a lower-dimensional space so as to minimize class overlap.
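A minimal numpy sketch of the two-class Fisher's linear discriminant direction; the function name, toy data, and the absence of any regularization on the scatter matrix are illustrative choices.

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher's linear discriminant: direction w maximizing between-class
    separation relative to within-class scatter, w = S_W^{-1} (mu1 - mu0).
    Rows of X0, X1 are samples of class 0 and class 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)  # within-class scatter
    w = np.linalg.solve(S_W, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(100, 5))
X1 = rng.normal(loc=1.0, size=(100, 5))
w = fisher_lda_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w          # 1-D projections with small class overlap
print(z0.mean(), z1.mean())
```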
