Least Angle Regression

Forward Selection

Suppose we want to solve for \(W\) such that \(Y = XW\), where \(Y \in \R^M, X \in \R^{M \times N}, W \in \R^N\). We can do this in an iterative, greedy way.
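To make the sketches below concrete, here is a minimal setup (the sizes \(M = 50\), \(N = 5\) are arbitrary, and a random \(W\) is used only so that \(Y\) lies exactly in the column space of \(X\)):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 5                       # hypothetical sizes
X = rng.standard_normal((M, N))    # columns are x^(1), ..., x^(N)
W_true = rng.standard_normal(N)
Y = X @ W_true                     # Y lies in the column space of X
```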

  1. \(Y_{res} = Y, C = \{\mathrm x^{(1)}, \mathrm x^{(2)}, \dots, \mathrm x^{(N)}\}\) initially.

  2. Select from \(C\) the \(\mathrm x^{(k)}\) that has the largest cosine similarity to \(Y_{res}\) (equivalently, the smallest angle). Then update:

\[ \begin{gathered} \hat Y = \frac{Y_{res} \cdot \mathrm x^{(k)}}{||\mathrm x ^{(k)}||_2^2} \mathrm x ^{(k)} \\ Y_{res} = Y_{res} - \hat Y \end{gathered} \]

  3. Remove \(\mathrm x^{(k)}\) from \(C\). Go to step 2 until \(Y_{res}\) reaches \(0\) or \(C\) is empty (see the sketch after this list).
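A minimal NumPy sketch of this loop, reusing the \(X\) and \(Y\) from the setup above (`forward_selection` and its `tol` parameter are hypothetical names, not from the original text):

```python
def forward_selection(X, Y, tol=1e-10):
    """Greedy loop: pick the unused column most correlated with the
    residual, subtract the residual's full projection onto it, and
    never revisit that column."""
    M, N = X.shape
    W = np.zeros(N)
    Y_res = Y.astype(float).copy()
    unused = set(range(N))
    while unused and np.linalg.norm(Y_res) > tol:
        # column with the largest |cosine similarity| to the residual
        k = max(unused, key=lambda j: abs(Y_res @ X[:, j]) / np.linalg.norm(X[:, j]))
        coef = (Y_res @ X[:, k]) / (X[:, k] @ X[:, k])  # projection coefficient
        W[k] = coef
        Y_res = Y_res - coef * X[:, k]
        unused.remove(k)
    return W
```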

This algorithm is clearly efficient: each column is used at most once, so it terminates within \(N\) iterations. However, it does not guarantee an exact solution, even when \(Y\) lies in the column space of \(X\), because an early greedy choice can never be corrected.

Forward Stagewise

Like forward selection, forward stagewise selects the \(\mathrm x^{(k)}\) that has the largest cosine similarity to \(Y_{res}\). Unlike forward selection, forward stagewise does not subtract the whole projection from \(Y_{res}\); instead, it takes a small step along \(\mathrm x^{(k)}\).

  1. \(Y_{res} = Y, C = \{\mathrm x^{(1)}, \mathrm x^{(2)}, \dots, \mathrm x^{(N)}\}\) initially, and \(\eta\) is a small constant.

  2. Select from \(C\) the \(\mathrm x^{(k)}\) that has the largest cosine similarity to \(Y_{res}\). Then update:

\[ \begin{gathered} \hat Y = \eta \frac{Y_{res} \cdot \mathrm x^{(k)}}{||\mathrm x ^{(k)}||_2^2} \mathrm x ^{(k)} \\ Y_{res} = Y_{res} - \hat Y \end{gathered} \]

  3. Go to step 2 until \(Y_{res}\) is sufficiently small. Note that \(\mathrm x^{(k)}\) stays in \(C\): the same column may be selected many times (see the sketch after this list).
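A matching NumPy sketch (again with hypothetical names; `eta`, `tol`, and `max_iter` are illustrative parameters):

```python
def forward_stagewise(X, Y, eta=0.01, tol=1e-6, max_iter=100_000):
    """Same selection rule as forward selection, but each iteration
    moves only the fraction eta of the projection along the chosen
    column, and every column stays eligible for reselection."""
    M, N = X.shape
    W = np.zeros(N)
    Y_res = Y.astype(float).copy()
    for _ in range(max_iter):
        if np.linalg.norm(Y_res) <= tol:
            break
        k = max(range(N), key=lambda j: abs(Y_res @ X[:, j]) / np.linalg.norm(X[:, j]))
        coef = eta * (Y_res @ X[:, k]) / (X[:, k] @ X[:, k])  # shrunken step
        W[k] += coef
        Y_res = Y_res - coef * X[:, k]
    return W
```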

As \(\eta \to 0\), forward stagewise approaches the exact solution, but it needs many more iterations and is therefore far more time-consuming.

Least Angle Regression

LARS is a compromise between forward selection and forward stagewise. Like both, it selects the \(\mathrm x^{(k)}\) that has the largest cosine similarity to \(Y_{res}\). It proceeds in steps like forward stagewise, but determines the step size analytically.

  1. \(Y_{res} = Y, C = \{\mathrm x^{(1)}, \mathrm x^{(2)}, \dots, \mathrm x^{(N)}\}, d = \arg\max_{\mathrm x \in C}\frac{Y \cdot \mathrm x }{||\mathrm x||_2}\) initially; remove this column from \(C\).

  2. Then update: \[ \begin{gathered} \hat Y = \eta \frac{Y_{res} \cdot d}{||d||_2^2} d \\ Y_{res} = Y_{res} - \hat Y \end{gathered} \] \(\eta\) is determined to be the smallest step such that the projection of \(Y_{res}\) onto some \(\mathrm x^{(k)} \in C\) equals its projection onto \(d\). In other words, the updated \(Y_{res}\) lies on the bisector hyperplane between \(\mathrm x^{(k)}\) and \(d\), making equal angles with both. Remove \(\mathrm x^{(k)}\) from \(C\) and let \(d\) be along this bisector (the equiangular direction).

  3. Go to step 2 until \(Y_{res}\) is sufficiently small (see the sketch after this list).
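Rather than re-deriving the equiangular step here, a quick way to see LARS in action is scikit-learn's `lars_path` (a sketch assuming scikit-learn is installed; `method="lar"` selects plain LARS rather than the LASSO variant):

```python
from sklearn.linear_model import lars_path

# alphas: correlation level at each breakpoint, in decreasing order;
# active: the order in which columns enter the equiangular direction;
# coefs[:, -1]: the final coefficient vector W
alphas, active, coefs = lars_path(X, Y, method="lar")
print("entry order:", active)
print("W:", coefs[:, -1])
```

With the exact-fit setup above, the final coefficients should match `W_true` up to numerical error.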
