Singular Value Decomposition

Theorem

Any \(m \times n\) matrix \(A\) can be decomposed into \(U\Sigma V^T\), where \[ \begin{gathered} \text{$U$ is $m \times m$, $V$ is $n \times n$} \\ U U^T = I \\ V V^T = I \\ \Sigma = \diag_{m \times n} (\sigma_1, \sigma_2, ..., \sigma_r) \\ \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0 \\ r = \rank A \end{gathered} \]
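
As a quick numerical illustration of the theorem (a minimal sketch, assuming numpy and an arbitrary example matrix), the factors returned by `np.linalg.svd` satisfy exactly these properties:

```python
import numpy as np

# An arbitrary 4 x 3 example matrix (m = 4, n = 3).
A = np.array([[3., 1., 1.],
              [1., 3., 1.],
              [0., 0., 0.],
              [2., 2., 2.]])

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4 x 4, Vt is 3 x 3
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)         # place singular values on the m x n diagonal

print(np.allclose(U @ U.T, np.eye(4)))    # U U^T = I
print(np.allclose(Vt.T @ Vt, np.eye(3)))  # V V^T = I
print(np.all(np.diff(s) <= 0))            # sigma_1 >= sigma_2 >= ...
print(np.allclose(U @ Sigma @ Vt, A))     # A = U Sigma V^T
```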

Proof

This is shown by construction; a numerical sketch of the full construction is given after the proof.

  1. Construction of \(V\) and \(\Sigma\)

    \(A^T A\) is an \(n \times n\) real symmetric matrix, so it can be diagonalized by an orthogonal matrix. Let \(V\) be this orthogonal matrix and \(\Lambda\) the corresponding diagonal matrix of \(A^TA\)’s eigenvalues: \[ V^{-1} (A^T A) V = \Lambda \] Arrange \(\Lambda\) and \(V\) so that the eigenvalues along \(\Lambda\)’s diagonal are in decreasing order. Because \(V\)’s columns form an orthonormal basis, \(V^T V = I\). Therefore, \[ V^T (A^T A) V = \Lambda \] Because \(\rank(A) = r\), \(\rank(A^T A) = r\). And because \(A^T A\) is positive semi-definite, it has \(n\) non-negative real eigenvalues, \(r\) of which are positive. Let \(\sigma_i = \sqrt{\lambda_i}, i = 1, 2, ..., n\) (these are defined as the singular values of \(A\)).

    \[ \begin{gather} \lambda_1 \ge \lambda_2 \ge ... \ge \lambda_r > 0, \quad \lambda_{r+1} = \lambda_{r+2} = ... = \lambda_{n} = 0 \\ \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0, \quad \sigma_{r+1} = \sigma_{r+2} = ... = \sigma_{n} = 0 \end{gather} \] Let \(V_1 = [v_1, v_2, ..., v_r], V_2 = [v_{r+1}, v_{r+2}, ..., v_n], V = [V_1, V_2]\). Also let

    \[ \Sigma_1 = \begin{bmatrix} \sigma_1 \\ & \sigma_2 \\ & & \ddots \\ & & & \sigma_r \end{bmatrix} \] Pad \(\Sigma_1\) with zeros to get the \(m \times n\) matrix: \[ \Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \\ \end{bmatrix} \] Since \(\rank(A^T A) = r\), \(\Nul(A^T A)\) is of dimension \(n-r\). Because \(V_2\) consists of \(A^TA\)’s eigenvectors whose eigenvalues are \(0\), \(V_2\) has \(n - r\) independent columns and \(A^T A V_2 = 0\). Therefore, the columns of \(V_2\) span \(\Nul(A^TA)\), which equals \(\Nul(A)\) (if \(A^T A x = 0\), then \(\|Ax\|^2 = x^T A^T A x = 0\), so \(Ax = 0\)). \[ \begin{gather} I = VV^T = [V_1, V_2][V_1, V_2]^T = V_1 V_1^T + V_2 V_2^T \\ A = A I = A V_1 V_1^T + A V_2 V_2^T = A V_1 V_1^T \end{gather} \]

  2. Construction of \(U\)

    Let \[ \begin{gather} u_i \coloneq \frac{1}{\sigma_i} A v_i, i = 1, 2, ..., r \\ U_1 \coloneq [u_1, u_2, ..., u_r] \end{gather} \] Thus \(AV_1 = U_1\Sigma_1\). Also, \(U_1\)’s columns are orthonormal: \[ \begin{aligned} u_i^T u_j &= (\frac{1}{\sigma_i} A v_i)^T (\frac{1}{\sigma_j} A v_j) \\ &= \frac{1}{\sigma_i\sigma_j} v_i^T A^T A v_j \\ &= \frac{1}{\sigma_i\sigma_j} v_i^T \lambda_j v_j \\ &= \frac{\sigma_j}{\sigma_i} v_i^T v_j \\ &= \begin{cases} 0 & i \ne j \\ 1 & i = j \end{cases} \end{aligned} \]

    Since the \(u_i\) lie in \(\Col(A)\) and are orthonormal as shown, and since \(\rank(A) = r\), the \(u_i\) span \(\Col(A)\).

    Let \(\Col(A)^\perp\) be \(\Col(A)\)’s orthogonal complement; we have \(\Col(A)^\perp = \Nul(A^T)\). Let \(\{ u_{r+1}, u_{r+2}, \dots, u_{m} \}\) be an orthonormal basis of \(\Nul(A^T)\); since every vector in \(\Nul(A^T)\) is orthogonal to \(\Col(A)\), these vectors are automatically orthogonal to \(U_1\)’s columns. We construct \(U\) by:

    \[ \begin{gather} U_2 \coloneq [u_{r+1}, u_{r+2}, ..., u_{m}] \\ U \coloneq [U_1, U_2] \end{gather} \]

  3. Proof of \(U \Sigma V^T = A\) \[ \begin{aligned} U \Sigma V^T &= [U_1, U_2] \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \\ \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \\ \end{bmatrix} \\ &= [U_1 \Sigma_1, 0] \begin{bmatrix} V_1^T \\ V_2^T \\ \end{bmatrix} \\ &= U_1 \Sigma_1 V_1^T \\ &= A V_1 V_1^T \\ &= A \end{aligned} \]
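
The construction above can be replayed numerically (a minimal sketch, assuming numpy; the example matrix is arbitrary with rank \(r = 2\), and the basis completing \(U\) is taken here from a full QR decomposition of \(U_1\)):

```python
import numpy as np

# Arbitrary 4 x 3 example with rank 2 (row 1 equals row 3 + 2 * row 4).
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

# Step 1: diagonalize A^T A and sort eigenvalues in decreasing order.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))   # clip tiny negative round-off
Sigma = np.zeros((m, n))
np.fill_diagonal(Sigma, sigma)

# Step 2: u_i = A v_i / sigma_i for the r positive singular values,
# then complete U with an orthonormal basis of Col(A)^perp = Nul(A^T).
U1 = A @ V[:, :r] / sigma[:r]
Q, _ = np.linalg.qr(U1, mode='complete')   # Q[:, r:] spans Col(U1)^perp
U = np.hstack([U1, Q[:, r:]])

# Step 3: verify that U is orthogonal and U Sigma V^T = A.
print(np.allclose(U @ U.T, np.eye(m)))
print(np.allclose(U @ Sigma @ V.T, A))
```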

Note

Note that SVD reveals that \(A\) is essentially the sum of \(r\) rank-\(1\) matrices: \[ U \Sigma V^T = \sum_{i=1}^r \sigma_i u_i v_i^T \] We have shown the construction of \(A\) as \(U \Sigma V^T\) in the above proof, i.e. the existence of such a decomposition, with which we can investigate \(U\), \(\Sigma\) and \(V\). Notice that \[ \begin{aligned} A A^T &= U \Sigma V^T V \Sigma^T U^T \\ A A^T &= U \Sigma \Sigma^T U^T \\ A A^T U &= U \Sigma \Sigma^T U^T U \\ A A^T U &= U \Sigma \Sigma^T \end{aligned} \] Since \(\Sigma \Sigma^T\) is a diagonal \(m \times m\) matrix, we can conclude that \(U\) contains the eigenvectors of \(A A^T\), and \(\Sigma \Sigma^T\) contains the eigenvalues of \(A A^T\) as its diagonal entries (the squared singular values, padded with zeros). Similarly \[ \begin{aligned} A^T A &= V \Sigma^T U^T U \Sigma V^T \\ A^T A &= V \Sigma^T \Sigma V^T \\ A^T A V &= V \Sigma^T \Sigma V^T V \\ A^T A V &= V \Sigma^T \Sigma \end{aligned} \] \(V\) contains the eigenvectors of \(A^T A\), and the diagonal \(n \times n\) matrix \(\Sigma^T \Sigma\) contains the eigenvalues of \(A^T A\) as its diagonal entries.
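
These relations are easy to check numerically (a minimal sketch, assuming numpy; the example is a random full-column-rank \(5 \times 3\) matrix):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))   # rank 3 almost surely
U, s, Vt = np.linalg.svd(A)
V = Vt.T

# Columns of V are eigenvectors of A^T A with eigenvalues sigma_i^2.
print(np.allclose(A.T @ A @ V, V * s**2))

# Columns of U are eigenvectors of A A^T; the trailing ones have eigenvalue 0.
eigvals_full = np.concatenate([s**2, np.zeros(5 - 3)])
print(np.allclose(A @ A.T @ U, U * eigvals_full))

# A is the sum of r rank-1 matrices sigma_i u_i v_i^T.
A_sum = sum(s[i] * np.outer(U[:, i], V[:, i]) for i in range(len(s)))
print(np.allclose(A_sum, A))
```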

It is easy to verify that \(A A^T\) and \(A^T A\) have the same non-zero eigenvalues. And if \(v\) is an eigenvector of \(A^T A\) with a non-zero eigenvalue, then \(A v\) is an eigenvector of \(A A^T\) with the same eigenvalue. The columns of \(U\) are called the left-singular vectors of \(A\) and the columns of \(V\) are called the right-singular vectors of \(A\).

The time complexity of SVD is \(O(\min\{ m^2 n, n^2 m \})\), depending on whether the eigendecomposition of \(A A^T\) or of \(A^T A\) is used to obtain the squared singular values.
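
For a tall matrix (\(m \gg n\)) the cheaper route is therefore the \(n \times n\) Gram matrix \(A^T A\); a minimal sketch of recovering the singular values that way (assuming numpy):

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(2000, 50))   # m >> n

# Forming A^T A costs O(n^2 m); its eigendecomposition costs only O(n^3).
lam = np.linalg.eigvalsh(A.T @ A)                      # ascending eigenvalues
sigma = np.sqrt(np.clip(lam, 0.0, None))[::-1]         # descending singular values

print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))
```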

Eckart-Young-Mirsky Theorem

Given an \(m \times n\) matrix \(X\) of rank \(r \le \min(m,n)\) and its SVD \(X = U \Sigma V^T\), where \(\Sigma = \diag(\sigma_1, \sigma_2, ..., \sigma_r)\) (padded with zeros to \(m \times n\)), among all \(m \times n\) matrices of rank at most \(k \le r\), the best approximation to \(X\) is \(Y^\star = U \Sigma_k V^T\), where \(\Sigma_k = \diag(\sigma_1, \sigma_2, ..., \sigma_k)\) (likewise padded), when the distance between \(X\) and \(Y\) is measured by the Frobenius norm: \[ ||X - Y||_F = \sqrt{\sum_{ij}(X - Y)_{ij}^2} \] Notice that in this case, \[ Y^\star = \sum_{i=1}^k \sigma_i u_i v_i^T \] First we prove that, if \(U\) is orthogonal, then \(||U A||_F = ||A U||_F = ||A||_F\): \[ \begin{aligned} ||UA||_F^2 &= \tr((U A)^T U A) \\ &= \tr(A^T U^T U A) \\ &= \tr(A^T A) \\ &= ||A||_F^2 \end{aligned} \] The same can be derived for \(||A U||_F\). Then for any \(m \times n\) matrix \(Y\) of rank at most \(k \le r\),

\[ \begin{aligned} ||X - Y||_F^2 &= ||U \Sigma V^T - Y||_F^2 \\ &= ||U^T (U \Sigma V^T - Y) V||_F^2 \\ &= ||\Sigma - U^T Y V||_F^2 \\ \end{aligned} \] Let \(Z = U^T Y V\), which is also of rank at most \(k\). \[ \begin{aligned} ||X - Y||_F^2 &= ||\Sigma - Z||_F^2 = \sum_{ij}(\Sigma_{ij} - Z_{ij})^2 \\ &= \sum_{i=1}^r (\sigma_{i} - Z_{ii})^2 + \sum_{i>r}^{\min(m,n)} Z_{ii}^2 + \sum_{i \ne j} Z_{ij}^2 \\ \end{aligned} \] Restricting attention to diagonal \(Z\) (it can be shown that an optimal \(Z\) may be taken diagonal), the rank constraint allows at most \(k\) non-zero diagonal entries, so the minimum is achieved by matching the \(k\) largest singular values: \[ \begin{gather} Z_{ii} = \begin{cases} \sigma_i, & 1 \le i \le k \\ 0, & k < i \le \min(m,n) \end{cases} \\ Z_{ij} = 0, \quad 1 \le i \le m, 1 \le j \le n, i \ne j \end{gather} \] which gives \(||X - Y||_F^2 = \sum_{i=k+1}^r \sigma_i^2\). Such a \(Z\) is attained when \(Y = U \Sigma_k V^T\). Q.E.D.
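
A numerical illustration (a minimal sketch, assuming numpy and an arbitrary random matrix): the truncated SVD attains the error \(\sqrt{\sum_{i>k}\sigma_i^2}\), and keeping a different set of \(k\) singular values gives a strictly worse rank-\(k\) approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(X)
k, r = 2, len(s)

# Best rank-k approximation: truncate to the k largest singular values.
Y_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_star = np.linalg.norm(X - Y_star, 'fro')

# Its Frobenius error equals sqrt(sum of the discarded sigma_i^2).
print(np.isclose(err_star, np.sqrt(np.sum(s[k:] ** 2))))

# Keeping the k smallest singular values instead does worse.
Y_other = U[:, r - k:r] @ np.diag(s[r - k:r]) @ Vt[r - k:r, :]
print(np.linalg.norm(X - Y_other, 'fro') > err_star)
```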

Note that the Eckart-Young-Mirsky theorem also holds for the spectral norm, in which case the minimal approximation error is \(\sigma_{k+1}\).

SVD and Diagonalization

It is easy to mix up SVD with diagonalization. Notably, diagonalization factorizes only square matrices, while SVD factorizes any matrix. Also, not all square matrices can be diagonalized, but every matrix admits an SVD. Finally, SVD can be interpreted as rotation-scaling-rotation because \(U\) and \(V^T\) are orthogonal, whereas in a diagonalization \(A = P \Lambda P^{-1}\), \(P\) and \(P^{-1}\) are not necessarily orthogonal, unless \(A\) is real symmetric.
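
A small numerical contrast (a minimal sketch, assuming numpy): a nilpotent matrix has no diagonalization but still has an SVD, and for a symmetric matrix with a negative eigenvalue the singular values are \(|\lambda_i|\) rather than \(\lambda_i\).

```python
import numpy as np

# A nilpotent matrix: not diagonalizable, yet it still has an SVD.
N = np.array([[0., 1.],
              [0., 0.]])
U, s, Vt = np.linalg.svd(N)
print(s)                                    # singular values: [1. 0.]
print(np.allclose(U @ np.diag(s) @ Vt, N))

# A symmetric matrix with a negative eigenvalue: orthogonally diagonalizable,
# but its singular values are the absolute values of its eigenvalues.
S = np.array([[2., 0.],
              [0., -3.]])
print(np.linalg.eigvalsh(S))                # [-3.  2.]
print(np.linalg.svd(S, compute_uv=False))   # [3. 2.]
```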

External materials

奇异值分解(SVD)原理与在降维中的应用 (Principles of SVD and its application in dimensionality reduction)

奇异值的物理意义是什么? - 知乎 (zhihu.com) (What is the physical meaning of singular values? - Zhihu)

Singular Value Decomposition.pdf
