Gradient Descent

If a convex function \(f\) is differentiable and satisfies certain “regularity conditions”, we can get a nice guarantee that gradient descent will converge to its minimum.

\(L\)-smoothness

Qualitatively, smoothness means that the gradient of \(f\) changes in a controlled, bounded manner. Quantitatively, smoothness means that \(f\) has a Lipschitz gradient: there exists an \(L > 0\) such that \[ ||\nabla f(x) - \nabla f(y)||_2 \le L||x - y||_2, \quad \forall x,y \in \mathrm{dom}(f) \] We say that such a function is \(L\)-smooth or strongly smooth. \(L\)-smoothness is equivalent to \[ f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{L}{2}||y - x||^2_2 \]


Further, \(L\)-smoothness is equivalent to \(\frac{L}{2}||x||^2_2 - f(x)\) being convex.
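
For a concrete example, consider the quadratic \(f(x) = \frac{1}{2}x^T A x + b^T x\) with \(A\) symmetric positive semidefinite. Then \(\nabla f(x) = Ax + b\) and \[ ||\nabla f(x) - \nabla f(y)||_2 = ||A(x - y)||_2 \le \lambda_{\max}(A)||x - y||_2 \] so \(f\) is \(L\)-smooth with \(L = \lambda_{\max}(A)\); equivalently, \(\frac{L}{2}||x||^2_2 - f(x) = \frac{1}{2}x^T(LI - A)x - b^Tx\) is convex because \(LI - A \succeq 0\).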

With smoothness, i.e. the Lipschitz gradient, we can bound \(f(y)\) from above by a quadratic.

We will assume \(L\)-smoothness in all cases below. Meanwhile, a local minimum (in the convex case, the global minimum) is denoted by \(x^\star\).

Convergence with Convexity

With convexity, we can bound \(f(y)\) from below with a linear approximation: \[ f(y) \ge f(x) + (y - x)^T \nabla f(x) \]

Convergence Rate

Fixed Step Size

Now consider an \(L\)-smooth function \(f\). We adopt a fixed step size \(\alpha = \frac{1}{L}\), so the update in each iteration is \[ x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k) \] By the definition of \(L\)-smoothness, \[ \begin{equation} \label{diff} \begin{aligned} f(x_{k+1}) &\le f(x_k) + (-\frac{1}{L}\nabla f(x_k))^T \nabla f(x_k) + \frac{L}{2}||-\frac{1}{L}\nabla f(x_k)||^2_2 \\ &= f(x_k) - \frac{1}{2L}||\nabla f(x_k)||^2_2 \end{aligned} \end{equation} \] so \(f(x_{k+1}) \le f(x_k)\) holds in each iteration.
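
As an illustrative sketch (not part of the derivation itself), the update with fixed step \(\frac{1}{L}\) takes only a few lines of Python. The least-squares objective, the helper name `grad_descent_fixed_step`, and the random data below are assumptions made purely for demonstration; the final assertion checks the per-iteration descent property \(f(x_{k+1}) \le f(x_k)\) derived above.

```python
import numpy as np

def grad_descent_fixed_step(grad, x0, L, num_iters):
    """Run gradient descent with the fixed step size 1/L and return all iterates."""
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(num_iters):
        x = x - grad(x) / L          # x_{k+1} = x_k - (1/L) * grad f(x_k)
        iterates.append(x.copy())
    return iterates

# Illustrative problem: least squares f(x) = 0.5 * ||Ax - b||^2, which is convex
# and L-smooth with L = lambda_max(A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()

xs = grad_descent_fixed_step(grad, np.zeros(5), L, 200)
vals = [f(x) for x in xs]
# The objective is non-increasing, as guaranteed by L-smoothness with step 1/L.
assert all(v1 <= v0 + 1e-12 for v0, v1 in zip(vals, vals[1:]))
```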

By the convexity of \(f\), \[ \begin{gather} f(x^\star) \ge f(x_k) + (x^\star - x_k)^T \nabla f(x_k) \\ \label{convexity} f(x_k) \le f(x^\star) + (x_k - x^\star)^T \nabla f(x_k) \end{gather} \] Substituting this into the right-hand side of equation \(\eqref{diff}\), and using \(\nabla f(x_k) = L(x_k - x_{k+1})\) from the update rule, \[ \begin{aligned} f(x_{k+1}) &\le f(x^\star) + (x_k - x^\star)^T \nabla f(x_k) - \frac{1}{2L}||\nabla f(x_k)||^2_2 \\ f(x_{k+1}) - f(x^\star) &\le (x_k - x^\star)^T L(x_k - x_{k+1}) - \frac{1}{2L}||L(x_k - x_{k+1})||^2_2 \\ f(x_{k+1}) - f(x^\star) &\le \frac{L}{2}(2(x_k - x^\star)^T(x_k - x_{k+1}) - ||x_k - x_{k+1}||^2_2) \\ \end{aligned} \] By using the fact that \[ \begin{aligned} ||a - b||^2_2 &= ||a||^2_2 - 2a^Tb + ||b||^2_2 \\ 2a^Tb - ||b||^2_2 &= ||a||^2_2 - ||a - b||^2_2 \end{aligned} \] with \(a = x_k - x^\star\) and \(b = x_k - x_{k+1}\), we have \[ \label{closer} f(x_{k+1}) - f(x^\star) \le \frac{L}{2}(||x_k - x^\star||^2_2 - ||x_{k+1} - x^\star||^2_2) \]

This indicates that \(x_{k+1}\) is no farther from the global minimum \(x^\star\) than \(x_k\). In addition, summing and telescoping, \[ \begin{aligned} \sum_{i=0}^k [f(x_{i+1}) - f(x^\star)] &\le \sum_{i=0}^k \frac{L}{2}(||x_i - x^\star||^2_2 - ||x_{i+1} - x^\star||^2_2) \\ &= \frac{L}{2}(||x_0 - x^\star||^2_2 - ||x_{k+1} - x^\star||^2_2) \\ &\le \frac{L}{2}||x_0 - x^\star||^2_2 \end{aligned} \]

Because \(f(x_{i+1}) \le f(x_i), \forall i=0,\dots,k\), we have \(f(x_{k+1}) \le f(x_{i+1}), \forall i=0,\dots,k\), so each of the \(k+1\) terms in the sum above is at least \(f(x_{k+1}) - f(x^\star)\). Then \[ \begin{aligned} (k+1)(f(x_{k+1}) - f(x^\star)) &\le \frac{L}{2}||x_0 - x^\star||^2_2 \\ f(x_{k+1}) - f(x^\star) &\le \frac{L}{2(k+1)}||x_0 - x^\star||^2_2 \end{aligned} \]
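
Rearranging the bound (a restatement, not a new result): to guarantee \(f(x_{k+1}) - f(x^\star) \le \epsilon\) it suffices that \[ k + 1 \ge \frac{L||x_0 - x^\star||^2_2}{2\epsilon} \] so the fixed-step method needs \(O(1/\epsilon)\) iterations, which is the usual way the \(O(1/k)\) rate is quoted.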

Variant Step Size

In real applications it may be difficult to know \(L\) exactly. However, as long as we choose \(\alpha_k \le \frac{1}{L}\) in each iteration, convergence and a similar rate still hold. \[ \begin{aligned} f(x_{k+1}) &\le f(x_k) + (-\alpha_k \nabla f(x_k))^T \nabla f(x_k) + \frac{L}{2}||-\alpha_k \nabla f(x_k)||^2_2 \\ &= f(x_k) - (\alpha_k - \frac{L}{2} \alpha_k^2)||\nabla f(x_k)||^2_2 \end{aligned} \] Note that \(\alpha_k \le \frac{1}{L}\) guarantees \(\alpha_k - \frac{L}{2}\alpha_k^2 \ge \frac{\alpha_k}{2} > 0\), so the objective still decreases.

\[ \begin{aligned} f(x_{k+1}) - f(x^\star) &\le f(x_k) - f(x^\star) - (\alpha_k - \frac{L}{2} \alpha_k^2)||\nabla f(x_k)||^2_2 \\ \end{aligned} \]

Let \(\eta_k = f(x_k) - f(x^\star) \ge 0\); the above becomes \(\eta_{k+1} \le \eta_{k} - (\alpha_k - \frac{L}{2} \alpha_k^2)||\nabla f(x_k)||^2_2\). From equation \(\eqref{convexity}\) and the Cauchy-Schwarz inequality, \[ \begin{gather} \eta_k = f(x_k) - f(x^\star) \le (x_k - x^\star)^T \nabla f(x_k) \le ||x_k - x^\star||_2||\nabla f(x_k)||_2 \\ 0 \le \frac{\eta_k}{||x_k - x^\star||_2} \le ||\nabla f(x_k)||_2 \\ -||\nabla f(x_k)||^2_2 \le -\frac{\eta^2_k}{||x_k - x^\star||^2_2} \\ \end{gather} \] Substituting this back gives \[ \eta_{k+1} \le \eta_k - \frac{(\alpha_k - \frac{L}{2}\alpha^2_k) \eta^2_k}{||x_k - x^\star||^2_2} \] From equation \(\eqref{closer}\) we already have \[ ||x_{k+1} - x^\star||^2_2 \le ||x_k - x^\star||^2_2 \le \dots \le ||x_0 - x^\star||^2_2 \] Thus (dividing by \(\eta_k \eta_{k+1}\) in the second line), \[ \begin{aligned} \eta_{k+1} &\le \eta_k - \frac{(\alpha_k - \frac{L}{2}\alpha^2_k) \eta^2_k}{||x_0 - x^\star||^2_2} \\ \frac{1}{\eta_k} &\le \frac{1}{\eta_{k+1}} - \frac{(\alpha_k - \frac{L}{2}\alpha^2_k)}{||x_0 - x^\star||^2_2} \cdot \frac{\eta_k}{\eta_{k+1}} \\ \frac{1}{\eta_{k+1}} - \frac{1}{\eta_k} &\ge \frac{(\alpha_k - \frac{L}{2}\alpha^2_k)}{||x_0 - x^\star||^2_2} \cdot \frac{\eta_k}{\eta_{k+1}} \\ \frac{1}{\eta_{k+1}} - \frac{1}{\eta_k} &\ge \frac{(\alpha_k - \frac{L}{2}\alpha^2_k)}{||x_0 - x^\star||^2_2} \text{, since } \eta_k \ge \eta_{k+1} \\ \end{aligned} \]

\[ \begin{aligned} \sum_{i=0}^k (\frac{1}{\eta_{i+1}} - \frac{1}{\eta_i}) &\ge \sum_{i=0}^k \frac{(\alpha_i - \frac{L}{2}\alpha^2_i)}{||x_0 - x^\star||^2_2} \\ \frac{1}{\eta_{k+1}} - \frac{1}{\eta_0} &\ge \frac{\sum_{i=0}^k (\alpha_i - \frac{L}{2}\alpha^2_i)}{||x_0 - x^\star||^2_2} \\ \frac{1}{\eta_{k+1}} &\ge \frac{\sum_{i=0}^k (\alpha_i - \frac{L}{2}\alpha^2_i)}{||x_0 - x^\star||^2_2} \\ f(x_{k+1}) - f(x^\star) &\le \frac{||x_0 - x^\star||^2_2}{\sum_{i=0}^k (\alpha_i - \frac{L}{2}\alpha^2_i)} \end{aligned} \]

If \(\sum_{i=1}^\infty \alpha_i = \infty\) and \(\sum_{i=1}^\infty \alpha^2_i < \infty\), gradient descent with variant step sizes can be shown to converge, since the denominator above diverges. In particular, when \(\alpha_k = \frac{1}{k}\), we have \[ f(x_{k+1}) - f(x^\star) \le \Theta(\frac{1}{\log k})||x_0 - x^\star||^2_2 \]
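
To see where the \(\Theta(\frac{1}{\log k})\) factor comes from, plug \(\alpha_i = \frac{1}{i}\) into the denominator of the previous bound (for the early iterations where \(\frac{1}{i} > \frac{1}{L}\), the step can be capped at \(\frac{1}{L}\), which only affects constants): the first sum is the harmonic series and the second converges, \[ \sum_{i=1}^k \left(\alpha_i - \frac{L}{2}\alpha^2_i\right) = \sum_{i=1}^k \frac{1}{i} - \frac{L}{2}\sum_{i=1}^k \frac{1}{i^2} = \ln k + O(1) \] so the bound is \(\frac{||x_0 - x^\star||^2_2}{\ln k + O(1)} = \Theta(\frac{1}{\log k})||x_0 - x^\star||^2_2\).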

Convergence with Strong Convexity

Strong Convexity

A function \(f\) is strongly convex if there exists an \(l > 0\) such that \[ f(y) \ge f(x) + (y - x)^T \nabla f(x) + \frac{l}{2}||y - x||^2_2, \quad \forall x,y \in \mathrm{dom}(f) \] Further, strong convexity is equivalent to \(f(x) - \frac{l}{2}||x||^2_2\) being convex.

With strong convexity, we can bound \(f(y)\) from below with a convex quadratic approximation.
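
Continuing the quadratic example from the smoothness section, \(f(x) = \frac{1}{2}x^T A x + b^T x\) with \(A\) symmetric positive definite is strongly convex with \(l = \lambda_{\min}(A)\), since \(f(x) - \frac{l}{2}||x||^2_2 = \frac{1}{2}x^T(A - lI)x + b^Tx\) is convex (\(A - lI \succeq 0\)). Such a quadratic is therefore both \(L\)-smooth and strongly convex with \(L = \lambda_{\max}(A)\) and \(l = \lambda_{\min}(A)\), and the ratio \(L/l\) is its condition number.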

Convergence Rate

Fixed Step Size

We have assumed above that \(f\) is \(L\)-smooth. That is, \[ \begin{equation} \label{sm} f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{L}{2}||y - x||^2_2, \quad \forall x,y \in \mathrm{dom}(f) \end{equation} \] In this section we further assume that \(f\) is \(L\)-smooth as well as strongly convex: \[ \begin{equation} \label{sc} f(y) \ge f(x) + (y - x)^T \nabla f(x) + \frac{l}{2}||y - x||^2_2, \quad \forall x,y \in \mathrm{dom}(f) \end{equation} \] Let \(F(y) = f(x) + (y - x)^T \nabla f(x) + \frac{l}{2}||y - x||^2_2\). Taking the derivative of \(F\) with respect to \(y\) and setting it to zero gives \[ \begin{aligned} \nabla f(x) + l(y - x) &= 0 \\ y_0 - x &= -\frac{\nabla f(x)}{l} \end{aligned} \] Because the Hessian of \(F(y)\) is \(lI \succ 0\), \(F(y)\) attains its global minimum at \(y_0\). Substituting \(y_0\) back into \(F(y)\) gives \[ f(y) \ge F(y) \ge F(y_0) = f(x) - \frac{1}{2l}||\nabla f(x)||^2_2 \] In particular, this holds for \(y = x^\star\). That is, \[ \begin{aligned} f(x^\star) &\ge f(x) - \frac{1}{2l}||\nabla f(x)||^2_2 \\ ||\nabla f(x)||^2_2 &\ge 2l(f(x) - f(x^\star)) \end{aligned} \] The above inequality is referred to as the Polyak-Łojasiewicz inequality. Combining it with \(\eqref{diff}\), where we implicitly assume a fixed step size \(\alpha = \frac{1}{L}\), we have \[ \begin{aligned} f(x_{k+1}) &\le f(x_k) - \frac{1}{2L}||\nabla f(x_k)||^2_2 \\ f(x_{k+1}) - f(x^\star) &\le f(x_k) - f(x^\star) + \frac{1}{2L}(-||\nabla f(x_k)||^2_2) \\ f(x_{k+1}) - f(x^\star) &\le f(x_k) - f(x^\star) + \frac{1}{2L}(-2l(f(x_k) - f(x^\star))) \\ &= (1 - \frac{l}{L}) (f(x_k) - f(x^\star)) \end{aligned} \]

\[ \begin{aligned} f(x_{k+1}) - f(x^\star) &\le (1 - \frac{l}{L}) (f(x_k) - f(x^\star)) \\ &\le (1 - \frac{l}{L})(1 - \frac{l}{L}) (f(x_{k-1}) - f(x^\star)) \\ &\le \dots \\ &\le (1 - \frac{l}{L})^{k+1} (f(x_0) - f(x^\star)) \end{aligned} \]

From equation \(\eqref{sm}\) and equation \(\eqref{sc}\), \[ \begin{aligned} f(x) + (y - x)^T \nabla f(x) + \frac{l}{2}||y - x||^2_2 \le f(y) &\le f(x) + (y - x)^T \nabla f(x) + \frac{L}{2}||y - x||^2_2 \\ l &\le L \\ \end{aligned} \] Usually \(l < L\), so \(0 < 1 - \frac{l}{L} < 1\) and the error contracts by a constant factor per iteration (linear, i.e. geometric, convergence).
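
As a quick numerical illustration of the geometric decay (a sketch in the same spirit as the earlier code, with the quadratic problem and names chosen purely for demonstration), the following runs fixed-step gradient descent on a strongly convex quadratic and checks the bound \((1 - \frac{l}{L})^k(f(x_0) - f(x^\star))\).

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.standard_normal((5, 5))
A = Q.T @ Q + 0.1 * np.eye(5)        # symmetric positive definite
b = rng.standard_normal(5)

f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
eigs = np.linalg.eigvalsh(A)
l, L = eigs.min(), eigs.max()        # strong convexity / smoothness constants
x_star = np.linalg.solve(A, -b)      # unique minimizer: A x + b = 0

x = np.zeros(5)
gap0 = f(x) - f(x_star)
for k in range(1, 101):
    x = x - grad(x) / L              # fixed step size 1/L
    # Suboptimality stays below the geometric envelope (1 - l/L)^k * gap0.
    assert f(x) - f(x_star) <= (1 - l / L) ** k * gap0 + 1e-12
```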

Non-convex Case

Without the convexity condition, we can still exploit \(L\)-smoothness. Summing both sides of equation \(\eqref{diff}\) over \(i = 0,\dots,k\), we have \[ \begin{aligned} \sum_{i=0}^k f(x_{i+1}) &\le \sum_{i=0}^k \left( f(x_i) - \frac{1}{2L}||\nabla f(x_i)||^2_2 \right) \\ \sum_{i=0}^k \left( f(x_{i+1}) - f(x_i) \right) &\le -\frac{1}{2L} \sum_{i=0}^k ||\nabla f(x_i)||^2_2 \\ f(x_{k+1}) &\le f(x_0) - \frac{1}{2L}\sum_{i=0}^k ||\nabla f(x_i)||^2_2 \end{aligned} \] Because \(f(x^\star) \le f(x_{k+1})\) (here \(x^\star\) denotes a global minimizer, or \(f(x^\star)\) any lower bound on \(f\)), \[ \begin{aligned} f(x^\star) \le f(x_{k+1}) &\le f(x_0) - \frac{1}{2L}\sum_{i=0}^k ||\nabla f(x_i)||^2_2 \\ f(x^\star) - f(x_0) &\le - \frac{1}{2L}\sum_{i=0}^k ||\nabla f(x_i)||^2_2 \\ \sum_{i=0}^k ||\nabla f(x_i)||^2_2 &\le 2L(f(x_0) - f(x^\star)) \\ \end{aligned} \] In this case, the error is measured by the gradient norm rather than the objective gap. Let \(\tilde x = \arg \min_{0 \le i \le k} ||\nabla f(x_i)||^2_2\) be the iterate whose gradient is closest to \(0\); then \[ \begin{aligned} (k+1)||\nabla f(\tilde x)||^2_2 &\le 2L(f(x_0) - f(x^\star)) \\ ||\nabla f(\tilde x)||^2_2 &\le \frac{2L(f(x_0) - f(x^\star))}{k+1} \\ \end{aligned} \]
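
Rearranged the same way as before (a restatement of the bound, not an extra assumption): to find a point with \(||\nabla f(\tilde x)||^2_2 \le \epsilon\) it suffices to run \[ k + 1 \ge \frac{2L(f(x_0) - f(x^\star))}{\epsilon} \] iterations, i.e. \(O(1/\epsilon)\) iterations for a small squared gradient norm. Note that this only guarantees an approximately stationary point, not an approximate minimizer.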
