Non-linear Regression
The basic idea about non-linear regression is to perform the feature mapping \(\phi(x)\), followed by linear regression in the new feature space.
Polynomial Regression
When \(x\) is a scalar, the feature mapping in \(p\)-th order Polynomial Regression is \[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} \] The regression function and the parameters are then like those in linear regression: \[ f(x) = [w_0,w_1,w_2,\dots,w^p]\begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} = w^T\phi(x) \] \(\phi(x)\) captures all features in each degree from \(1\) up to \(p\). In higher-dimensional input feature space, there are many more entries for feature mapping in each degree. Take a 2-D input for example.
- degree 1: \([x_1, x_2]^T\)
- degree 2: \([x_1^2, x_1x_2, x_2^2]^T\)
- degree 3: \([x_1^3, x_1^2x_2, x_1x_2^2, x_2^3]^T\)
- …
Kernel Trick in Non-linear Regression
Many Linear Regression algorithms learning algorithms depend on the calculation of inner products between feature vectors, either during training or in prediction, without directly depending on the feature vector. We can transform the feature vector by applying a feature mapping \(\phi\) to do the Non-linear Regression task.
However, it is not necessary to explicitly define the feature mapping \(\phi\) when only the inner products are needed. Instead, we can define a function \(\mathcal K: \R^N \times \R^N \mapsto \R\) that directly calculate the inner products of pseudo-transformed features: \[ \mathcal K(x,x^\prime) = \phi_{pseudo}(x)^T\phi_{pseudo}(x^\prime) \] This will be more flexible and computationally-efficient.
Kernel Ridge Regression
Recall Ridge Regression: \[ W^\star = \min_{W}\frac{1}{2}||Y - X^TW||^2_2 + \frac{\alpha}{2}||W||^2_2 \iff (XX^T + \alpha I_{N+1})W^\star = XY \] If \((XX^T + \alpha I_{N+1})\) is invertible, \(W^\star = (XX^T + \alpha I_{N+1})^{-1}XY\). By matrix identity \[ (P^{-1}+B^TR^{-1}B)^{-1}B^TR^{-1} = PB^T(BPB^T+R)^{-1} \] we can obtain that \(W^\star = X(X^TX + \alpha I_M)^{-1}Y\). \(X \in \R^{(N+1)\times M}, (X^TX + \alpha I_M)^{-1}Y \in \R^{M}\). \(W^\star\) can be rewritten as a linear combination of columns of \(X\), i.e. \(W^\star\) lies in the span of input vectors: \[ W^\star = \sum_{i=1}^M\lambda_ix^{(i)} \] For a new data \(x^*\), we make prediction by \[ y^* = (x^*)^TW = (x^*)^TX(X^TX + \alpha I_M)^{-1}Y \] which totally depends on the inner products.
Support Vector Regression
Support Vector Regression learns a hyper-plane that incorporates within its margin as many points as possible. As a comparison, Support Vector Machine learns a hyper-plane that excludes outside its margin as many points as possible. Its objective is \[ \begin{aligned} \min_{w,\xi,\xi^*} \frac{1}{2}||w||^2_2 + C\sum_{i=1}^M(\xi_i + \xi^*_i) \\ s.t.\quad y^{(i)} - w^Tx^{(i)} - b \le \epsilon + \xi_i, i=1,\dots,M \\ y^{(i)} - w^Tx^{(i)} - b \ge \epsilon + \xi^*_i, i=1,\dots,M \\ \xi_i, \xi^*_i \ge 0, i=1,\dots,M \end{aligned} \] \(\epsilon\) is a hyper-parameter to be determined. Transform the problem into the standard optimization form: \[ \begin{aligned} \min_{w,\xi,\xi^*} \frac{1}{2}||w||^2_2 + C\sum_{i=1}^M(\xi_i + \xi^*_i) \\ s.t.\quad y^{(i)} - w^Tx^{(i)} - b - \epsilon - \xi_i \le 0, i=1,\dots,M \\ w^Tx^{(i)} + b - y^{(i)} + \epsilon + \xi^*_i \le 0, i=1,\dots,M \\ -\xi_i, -\xi^*_i \ge 0, i=1,\dots,M \end{aligned} \] The Lagrangian function will be \[ \begin{aligned} L(w,\xi,\xi^*,\lambda,\mu,\nu,\nu^*) = &\frac{1}{2}||w||^2_2 + C\sum_{i=1}^M(\xi_i + \xi^*_i) \\ &+ \sum_{i=1}^M\lambda_i(y^{(i)} - w^Tx^{(i)} - b - \epsilon - \xi_i) \\ &+ \sum_{i=1}^M\mu_i(w^Tx^{(i)} + b - y^{(i)} + \epsilon + \xi^*_i) \\ &- \sum_{i=1}^M\nu_i\xi_i -\sum_{i=1}^M\nu^*_i\xi^*_i \end{aligned} \]