Subgradient
Definition
Subgradient of a function \(f: \R^n \mapsto \R\) at \(x\) is defined as \[ \partial f(x) = \{g|f(y) \ge f(x) + g^T(y - x), \forall y \in dom(f) \} \] Or put it another way, the subgradient defines a hyper-plane passing through \((x,f(x))\): $$ ^T ( - ) = 0,
y ^n, t \ \[ And \] ^T ( - ) , (y, t) epi(f) \[ This is because by definition, \] \[\begin{aligned} g^T(y - x) - (f(y) - f(x)) &\le 0, \forall y &\Rightarrow \\ g^T(y - x) - (t - f(x)) &\le 0, \forall y,\forall t \ge f(y) &\Rightarrow \\ \left[ \begin{array} \\ g \\ -1 \\ \end{array} \right]^T \Big( \left[ \begin{array} \\ y \\ t \\ \end{array} \right] - \left[ \begin{array} \\ x \\ f(x) \\ \end{array} \right] \Big) &\le 0, \forall (y, t) \in epi(f) \end{aligned}\]$$
where \(\left[\begin{array}\\ g \\ -1 \\\end{array}\right]\) is the normal of the tangent plane of \(f\) at \(x\).
Properties
The subgradient \(\partial f(x)\) is a set, instead of a single number like the gradient.
Subgradient is a monotonic operator: \[ (u - v)^T(y - x) \ge 0, \forall u \in \partial f(y), \forall v \in \partial f(x) \] This can be shown by: \[ \begin{aligned} \left. \begin{array} \\ f(y) \ge f(x) + v^T(y - x) \\ f(x) \ge f(y) + u^T(x - y) \\ \end{array} \right\} &\Rightarrow f(y) + f(x) \ge f(x) + f(y) - (u - v)^T(y - x) \\ &\Rightarrow (u - v)^T(y - x) \ge 0 \end{aligned} \]
\(x^\star = \arg \min_x f(x) \iff 0 \in \partial f(x^\star)\).
If \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\).
Suppose \(p \ne \nabla f(x) \and p \in \partial f(x)\), then for \(y = x + r(p - \nabla f(x))\), we have \[ \begin{aligned} f(y) - f(x) &\ge p^T(y - x) \\ f(x + r(p - \nabla f(x))) - f(x) &\ge rp^T(p - \nabla f(x)) \\ \frac{f(x + r(p - \nabla f(x))) - f(x)}{r} &\ge p^T(p - \nabla f(x)) \\ \end{aligned} \]
\[ \begin{aligned} (p - \nabla f(x))^T(p - \nabla f(x)) &\le \frac{f(x + r(p - \nabla f(x))) - f(x)}{r} - \nabla f(x)^T(p - \nabla f(x)) \\ ||p - \nabla f(x)||^2_2 &\le \frac{f(x + r(p - \nabla f(x))) - f(x)- \nabla f(x)^Tr(p - \nabla f(x))}{r} \\ \end{aligned} \]
Take limit on \(r\) on both sides to give \[ \lim_{r \to 0}||p - \nabla f(x)||^2_2 \le \lim_{r \to 0}\Big( \frac{f(x + r(p - \nabla f(x))) - f(x)- \nabla f(x)^Tr(p - \nabla f(x))}{r} \Big) \\ \] By the definition of gradient, \[ \begin{aligned} f&(x + r(p - \nabla f(x))) - f(x)- \nabla f(x)^Tr(p - \nabla f(x)) \\ &= \mathcal{O}(||r(p - \nabla f(x))||^2_2) \\ &= ||p - \nabla f(x)||^2\mathcal{O}(r^2) \\ &= \mathcal{O}(r^2) \end{aligned} \]
\[ \lim_{r \to 0}\Big( \frac{f(x + r(p - \nabla f(x))) - f(x)- \nabla f(x)^Tr(p - \nabla f(x))}{r} \Big) = \lim_{r \to 0}\frac{\mathcal{O}(r^2)}{r} = 0 \]
Therefore, \[ \begin{gather} ||p - \nabla f(x)||^2_2 \le 0 \\ 0 \le ||p - \nabla f(x)||^2_2 \le 0 \\ ||p - \nabla f(x)||^2_2 = 0 \end{gather} \] contradicting the assumption that \(p \ne \nabla f(x)\). Thus, \(p = \nabla f(x)\).
If \(f(x) = \alpha_1 f_1(x) + \alpha_2 f_2(x)\), then \(\partial f(x) = \alpha_1 \partial f_1(x) + \alpha_2 f_2(x)\).