Linear Regression
Given a dataset \(\mathcal D = \{(x^{(i)}, y^{(i)}),i=1,\dots,M\}\) where \(x^{(i)} \in \R^N\) is the feature vector and \(y^{(i)} \in R\) is the output, learn a linear function \(f = w^Tx + b: \R^N \mapsto \R\), where \(w \in \R^N, b \in \R\), that best predicts the \(y\) for any feature vector \(x\). The “best” is usually measured by the mean square error: \[ \mathrm{MSE}(f,\mathcal D) = \frac{1}{M}\sum_{i=1}^M(y^{(i)} - f(x^{(i)}))^2 \]
As an aside, it is very convenient to standardize the feature (so that the solution won’t depend to unit used in the measurement) and pre-center the label (so that the intercept term, or the bias term \(b\) can be omitted).
Ordinary Least Squares
OLS selects the linear regression parameters \(w, b\) that minimizes the mean squared error: \[ w^\star, b^\star = \arg \min_{w,b}\frac{1}{M}\sum_{i=1}^M(y^{(i)} - w^Tx^{(i)} - b)^2 \] This is equivalent to the least squares problem. Let \[ \begin{gathered} X = \left [ \begin{array}{c|c} (x^{(1)})^T & 1 \\ (x^{(2)})^T & 1 \\ \vdots & \vdots \\ (x^{(M)})^T & 1 \end{array} \right ], W = \begin{bmatrix} w \\ b \\ \end{bmatrix}, Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(M)} \end{bmatrix} \end{gathered} \] Least squares problem aims to find the best approximation to \(Y\) in \(X\)’s column space, which is \(Y\)’s projection in \(\Col(X)\). Therefore we solve \(W\) by \[ \begin{aligned} X\hat W = \Pi_{\Col(X)}Y \Rightarrow Y &- X\hat W \in \Nul(X^T) \\ \Downarrow\\ X^T(Y - X\hat W) &= 0 \\ X^TX\hat W &= X^TY \end{aligned} \] Columns of \(X\) are independent \(\iff\) \(X^TX\) is invertible because they share the same nullspace. When \(X^TX\) is not invertible, there are infinite many solutions to \(\hat W\). We can get one specific \(\hat W\) by enforcing regularization on \(\hat W\) or adding more samples when \(M < N + 1\). When \(X^TX\) is invertible, there is a unique solution that \(\hat W = (X^TX)^{-1}X^TY\).
This gives the same result with minimizing the MSE: \[ \begin{gather} W^\star = \arg \min_{W} \mathrm{MSE}(W) = \frac{1}{M}(Y - XW)^T(Y - XW) \label{target} \\ \Downarrow \notag \\ \notag \\ \begin{aligned} \frac{\partial \mathrm{MSE}}{\partial W} &= \frac{\partial(Y^TY - Y^TXW - W^TX^TY + W^TX^TXW)}{\partial W} \\ &= \frac{\partial(Y^TY - Y^TXW - Y^TXW + (XW)^TXW)}{\partial W} \\ &= -2X^TY + 2X^TXW \\ \end{aligned} \notag \\ \Downarrow_{\text{making it zero}} \notag \\ \notag \\ 0 = -2X^TY + 2X^TXW^\star \notag \\ X^TXW^\star = X^TY \notag \end{gather} \]
There are chances that \(N\) is too large, making equation \(\eqref{target}\) much computationally expensive. In this case, we can use gradient descent. The update rule will be \[ \begin{aligned} W^{(t+1)} &= W^{(t)} - \frac{\eta}{2} \nabla \mathrm{MSE}(W^{(t)}) \\ &= W^{(t)} - \eta(X^TXW^{(t)} -X^TY) \end{aligned} \]
A Probabilistic View
Maximum Likelihood Estimation
In classification task, \(x\) is the feature, \(y\) is the label; in regression task, \(x\) is the “label”, \(y\) is the “feature”. From a probabilistic point of view, we would like estimate \(p(\text{feature}|\text{label})\). In this case, we treat \(y\) as the “feature” composed of a deterministic function and a noise sampled from an identical and independent Gaussian distribution, i.e., for random variable \(\mathcal X, \mathcal Y\) (to distinguish from matrix \(X\) and \(Y\)), \[ \mathcal Y = \mathcal XW + \epsilon, \text{ where }\epsilon \sim \mathcal N(0, \sigma^2) \] Thus, \[ p(Y|X;W,\sigma^2) = p(\epsilon=Y - XW;\sigma^2) = \mathcal N(Y - XW;0,\sigma^2I) \] Then OLS can be attacked by maximum likelihood estimation. The log-likelihood function will be
\[ \begin{aligned} l(W,\sigma^2) &= \log(L(W, \sigma^2)) = \log(p(Y|X;W,\sigma^2)) = \log \mathcal N(Y - XW;0,\sigma^2I) \\ &= \log(\frac{1}{\sqrt{(2\pi)^M|\sigma^2I|}}e^{-\frac{1}{2\sigma^2}(Y - XW)^T(Y - XW)}) \\ &= -\frac{1}{2\sigma^2}(Y - XW)^T(Y - XW) - \frac{M}{2}\log \sigma^2 - \frac{M}{2}\log 2\pi \end{aligned} \]
\[ \begin{aligned} \arg \max_{W,\sigma^2}l(W, \sigma^2) &= \arg \max_{W,\sigma^2}-\frac{1}{2\sigma^2}(Y - XW)^T(Y - XW) - \frac{M}{2}\log \sigma^2 - \frac{M}{2}\log 2\pi \\ &= \arg \min_{W,\sigma^2}\frac{1}{2\sigma^2}(Y - XW)^T(Y - XW) + \frac{M}{2}\log \sigma^2 \\ \end{aligned} \]
Take derivative w.r.t. \(W\) and make it \(0\) to give \(W^\star\): \[ \begin{gathered} \frac{1}{\sigma^2}(X^TXW^\star - X^TY) = 0 \\ X^TXW^\star = X^TY \end{gathered} \] Substitute \(W^\star\) back, take derivative w.r.t. \(\sigma^2\) and make it \(0\) to give \(\sigma^{\star2}\): \[ \begin{gathered} \frac{M}{2\sigma^{\star2}} - \frac{(Y - XW^\star)^T(Y - XW^\star)}{2(\sigma^{\star2})^2} = 0 \\ \sigma^{\star2} = \frac{1}{M}(Y - XW^\star)^T(Y - XW^\star) \end{gathered} \]
We may further analyze the efficacy of \(W^\star\) as a point estimator. Suppose \(X^T X\) is invertible. We have \[ \begin{aligned} &W^\star = (X^T X)^{-1} X^T Y \\ &= (X^T X)^{-1} X^T (X W_\text{real} + Z) \\ &= W_\text{real} + (X^T X)^{-1} X^T Z \end{aligned} \] where \(Z\) contains the noise term for each sample. \(W^\star\) is unbiased because \[ \begin{aligned} &\E[W^\star] = \E[W_\text{real} + (X^T X)^{-1} X^T Z] \\ &= W_\text{real} + (X^T X)^{-1} X^T \E[Z] = W_\text{real} \end{aligned} \]
The covariance matrix of \(W^\star\) will be: \[ \begin{aligned} &\Cov[W^\star] = \Cov[W_\text{real} + (X^T X)^{-1} X^T Z] \\ &= 0 + (X^T X)^{-1} X^T \Cov[Z] X (X^T X)^{-1} \\ &= (X^T X)^{-1} X^T {\sigma^\star}^2 I X (X^T X)^{-1} \\ &= {\sigma^\star}^2 (X^T X)^{-1} \end{aligned} \]
Maximum a Posteriori
If we take the Bayesian view and add a prior to \(W\) such that \(W \sim \mathcal N(0,\frac{C}{2}I)\), or rather \(p(W) \propto \exp(-\frac{1}{C}W^TW)\), \[ \begin{aligned} &p(W|X, Y) = \frac{p(W, X, Y)}{p(X, Y)} \\ &= \frac{p(Y|X, W)P(X, W)}{P(X, Y)} \\ &= \frac{p(Y|X, W)P(W)P(X)}{P(X, Y)} \\ &= \frac{p(Y|X, W)P(W)}{P(Y|X)} \\ &\propto p(Y|X,W)p(W) \end{aligned} \]
\[ W^\star = \arg \max_W p(W|X,Y) \]
Regularized Variants
Linear regression tends to have high variance, due to the sum of random variables in its inference formula. This is why we tend to restrict the magnitude of coefficients.
Ridge Regression
By adding a regularization term to OLS objective we obtain the ridge regression: \[ \min_{W} \frac{1}{2}||Y - XW||^2_2 + \frac{\alpha}{2}||W||^2_2 \] By regularization, we are “shrink” the amount of \(||W||_2\), whose magnitude is controlled by the \(\alpha\). We prefer smaller weights because:
- smaller weights are more robust to the perturbations of input;
- there are better chances to zero out some dimensions of the input feature, which may be uninformative.
The constant term before the sum of square errors that before \(\alpha\) are not of significance. We choose them to be \(\frac{1}{2}\) because this gives a nicer closed-form solution to \(W\). Take derivative of the objective w.r.t. \(W\) and make it \(0\) to give \[ \begin{gathered} -X^TY + X^TXW^\star + \alpha W^\star = 0 \\ (X^TX + \alpha I)W^\star = X^TY \end{gathered} \] The “ridge” comes from the \(\alpha I\) term, which is added to the diagonal of \(X^TX\).
Lasso Regression
In ridge regression, there is still chance that some weights are small but not zero, because the regularization term is small so far as the weights are fairly small. Think in this way: \(0.01^2 = 0.0001 \ll 0.01\), though it is not rigid.
To get better regularization, we can change the regularization term to \(l_1\)-norm: \[ \min_{W} \frac{1}{2}||Y - XW||^2_2 + \alpha||W||_1 \] The reason to choose \(l_1\) norm is that, \(1\) is the least number that preserves convexity among the \(l_p\) norms. Interestingly, when the penalty term on the \(l_1\) norm is large enough, the resulting estimation will be zero (refer to this). The solution to the lasso regression can be efficiently approached using coordinate descent or least angle regression.
External Materials
Lasso回归算法: 坐标轴下降法与最小角回归法小结 - 刘建平Pinard - 博客园 (cnblogs.com)
Statistical Learning with Sparsity: the Lasso and Generalizations (su.domains)