6-online-learning
In standard supervised learning, we consider the training data are given in a batch randomly sampled from a fixed distribution. However, in real-world applications, data may come in an one-by-one fashion, while the underlying distribution evolves as the time goes.
In online learning, we suppose the learning task is formed as a game between the learner and nature players:
- at every iteration \(t\), nature reveals the input \(x_t \in \mathcal{X}\) to the learner;
- learner outputs a prediction \(p_t \in \mathcal{Y}\);
- nature reveals the true label \(y_t \in \mathcal{Y}\), and Learner will suffer a loss \(l(y_t, p_t)\);
- learner updates its prediction model.
In the online learning framework, we no longer have two different phases of training and testing. An implicit goal in online learning is to distribute the computations over all the iterations as uniformly as possible.
Online Learning
Definition: regret of an online learner. Given an expert \(h: \mathcal{X} \to \mathcal{Y}\), the regret of the online learner is defined as the extra cumulative loss of the learner with respect to expert \(h\): \[ \newcommand{\reg}{{\mathrm{Regret}}} \reg(h) \triangleq \sum_{t=1}^T[\ell(y_t, p_t)] - \sum_{t=1}^T[\ell(y_t, h(x_t))] \] Next, for a set of experts \(\mathcal{H}\), we define the learner’s regret as the worst-case regret for any expert \(h \in \mathcal{H}\): \[ \reg(\mathcal{H}) \triangleq \max_{h \in \mathcal{H}} \reg(h) = \sum_{i=1}^T[\ell(y_t, p_t)] - \min_{h \in \mathcal{H}} \sum_{i=1}^T[\ell(y_t, h(x_t))] \]
The overall objective is to have a regret that is sublinear in \(T\).
Examples of Learning Strategies
Consider a binary classification task with \(\mathcal{Y} = \{ 0,1 \}\), zero/one loss and an adversary nature that always generates the opposite label to the learner’s prediction.
Consider the expert system \(\mathcal{H} = \{ h_0, h_1 \}\) where \(h_i(x) = i\). We claim that the regret w.r.t. \(\mathcal{H}\) will be at least \(\frac{T}{2}\) at iteration \(T\). \[ \begin{aligned} \reg(h_i) &= \sum_{t=1}^T[\ell(y_t, p_t)] - \sum_{t=1}^T[\ell(y_t, h_i(x_t))] \\ &= T - \sum_{t=1}^T[\ell(y_t, h_i(x_t))] \\ \end{aligned} \] But \(\reg(h_0) + \reg(h_1) = T\). As a result, \(\reg(\mathcal{H}) = \max_{h \in \{ h_0, h_1 \}} \reg(h) \ge \frac{T}{2}\).
Consider a binary classification task with \(\mathcal{Y} = \{ 0,1 \}\), zero/one loss and a realizable scenario where nature generates the label according to an expert \(h^* \in \mathcal{H}\). In this scenario, the cumulative loss of a learning algorithm is equal to its regret w.r.t \(\mathcal{H}\).
Consider the follow-the-best algorithm where at each iteration \(t\) we arbitrarily choose among the experts with the best score up to now for prediction. The regret w.r.t. \(\mathcal{H}\) can be as large as \(|\mathcal{H}|-1\). This is because every time an expert makes a mistake, it is excluded from candidates.
Consider the majority algorithm where we vote for the label with the majority vote among experts in \(\mathcal{H}\). The regret w.r.t. \(\mathcal{H}\) is upper-bounded by \(\log_2|\mathcal{H}|\). This is because every time there is a mistake, half of considered experts are excluded from consideration.
Online Convex Optimization
Adversary nature and realizable scenario represents the two extreme wings of situations. To analyze the general non-realizable situations, we introduce a framework called online convex optimization where
learner chooses model parameters \(w_t\) from a convex set \(S\) at iteration \(t\);
nature chooses a convex loss function \(f_t\);
the regret w.r.t. model parameter \(u\) will be \[ \reg(u) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u) \\ \reg(S) = \sum_{t=1}^T f_t(w_t) - \min_{u \in S} \sum_{t=1}^T f_t(u) \]
Example: Online Linear Regression
The setting is as follows:
nature reveals input vector \(x_t \in \R^d\);
learner chooses model parameters \(w_t \in \R^d\);
nature reveals output \(y_t \in \R\) and the loss value at iteration \(t\) is \[ f_t(w_t) = (w_t^\top x_t - y_t)^2 \]
Convexification
Consider a general loss \(\ell\) (which doesn’t have to be convex) and a finite set of experts \(\mathcal{H} = \{ h_1,\dots,h_m \}\). To convexify the problem, the learner searches for a probability distribution over the \(m\) experts which means \(w_t \in \Delta_m\) where \(\Delta_m\) is the set of all categorical distributions over \(\{ 1,\dots,m \}\). The online learning algorithm will be as follows:
nature reveals input vector \(x_t \in \R^d\);
learner chooses model parameters \(w_t \in \R^m\);
nature reveals output \(y_t \in \R\) and the loss value at iteration \(t\) is \[ f_t(w_t) = w_t^\top \underbrace{[\ell(h_1(x_t), y_t), \dots, \ell(h_m(x_t), y_t)]}_{L_t} \\ \]
We will assume a online convex optimization setting from now on.
Follow-the-leader Strategy
The follow-the-leader (FTL) strategy seems a natural choice to the online learning: \[ w_t = \arg \min_{w \in S} \sum_{i=1}^{t-1} f_i(w) \] Note that now it is a convex optimization problem.
Lemma: regret bound for FTL. Given that \(w_t\) is chosen according to a FTL strategy, we have the following upper bound on the FTL learner’s regret at iteration \(T\): \[ \reg(S) \le \sum_{i=1}^T [f_t(w_t) - f_{t}(w_{t+1})] \] Proof. Recall that \(\reg(S) = \max_{u \in S} \reg(u)\). We need to show for every \(u \in S\), \[ \reg(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \\ \iff \\ \sum_{t=1}^T f_t(w_{t+1}) \le \sum_{t=1}^T f_t(u) \] \(w_{t+1}\) is the one-step-ahead expert (relative to \(f_t\)); so it should perform better than any other choice. We use induction to show it. When \(T=1\), we trivially have \(\forall u, f_1(u) \ge f_1(w_2)\). Suppose when \(T=k\), the above holds. Then, for every \(u \in S\), \[ \sum_{i=1}^k f_t(w_{t+1}) \le \sum_{i=1}^k f_t(u) \] As a result, \[ \begin{aligned} \sum_{i=1}^{k+1} f_t(w_{t+1}) &\le f_{k+1}(w_{k+2}) + \sum_{i=1}^k f_t(u) \\ &\Downarrow_{u=w_{k+2}} \\ \sum_{i=1}^{k+1} f_t(w_{t+1}) &\le \sum_{i=1}^{k+1} f_t(w_{k+2}) \\ &= \min_{w \in S} \sum_{i=1}^{k+1} f_t(w) \\ &\Downarrow \\ \sum_{i=1}^{k+1} f_t(w_{t+1}) &\le \sum_{i=1}^{k+1} f_t(u) \end{aligned} \] which holds for every \(u \in S\).
Example I
Consider a quadratic loss function where nature chooses the input \(z_t\) satisfying the norm bound \(\|z_t\|_2 \le M\): \[ f_t(w) = \frac{1}{2} \|w-z_t\|_2^2 \] Let \(S = \{ x:\|x\|_2 \le M \}\). Then the FTL strategy will choose \[ w_{t+1} = \arg \min_{w \in S} \sum_{i=1}^t \frac{1}{2} \|w-z_i\|_2^2 = \frac{1}{t} \sum_{i=1}^t z_i \] Note that \[ \begin{gathered} w_{t+1} = \frac{t-1}{t} w_t + \frac{1}{t} z_t \\ w_{t+1} - z_t = \frac{t-1}{t}(w_t - z_t) \end{gathered} \] The regret bound will be \[ \begin{aligned} &\reg(S) \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \\ &= \sum_{t=1}^T \frac{1}{2} [\|w_t-z_t\|_2^2 - \|w_{t+1}-z_t\|_2^2] \\ &= \sum_{t=1}^T \frac{1}{2} [\|w_t-z_t\|_2^2 - (1-\frac{1}{t})^2 \|w_t-z_t\|_2^2] \\ &= \sum_{t=1}^T \frac{1}{2} [(\frac{2}{t} - \frac{1}{t^2})\|w_t-z_t\|_2^2] \\ &\le \sum_{t=1}^T \frac{1}{t} \|w_t-z_t\|_2^2 \\ &\le \sum_{t=1}^T \frac{1}{t} 4M^2 \\ &\le 4M^2 (\log T + 1) \end{aligned} \]
Example II
There are cases where FTL could also fail. We give an example here. Consider a linear loss function where nature chooses the input \(z_t\) satisfying the norm bound \(\|z_t\|_2 \le M\):
\[ f_t(w) = w^\top z_t \] Let \(S = \{ x:\|x\|_2 \le M \}\). Then the FTL strategy will choose \[ w_{t+1} = \arg \min_{w \in S} \sum_{i=1}^t w^\top z_t = \arg \min_{w \in S} w^\top \sum_{i=1}^t z_t = \frac{M}{\|\sum_{i=1}^t z_t\|_2} \sum_{i=1}^t z_t \] But consider the one-dimensional input sequence \(-0.5, 1, -1, 1, -1, \dots\) The parameters learner gives would be \(M, -M, M, -M, \dots\) The loss incurred would be \(\sum_{i=1}^T w_i z_i = (T-1)M = M \cdot O(T)\). But the strategy that sets \(w=0\) will give a loss of \(0\).
The crux of the problem is the sudden change of the nature.
Follow-the-regularized-leader Strategy
Continuing the discussion of Example II, one makeup for it is the follow-the-regularized-leader (FTRL) strategy. That is, the learner returns \[ w_{t+1} = \arg \min_{w \in S} \psi(w) + \sum_{i=1}^{t} f_i(w) \] Let \(\psi(x)\) be the common choice \(\frac{\lambda}{2}\|x\|_2^2\) and let \(f_i\) still be the linear loss. Then, \[ \begin{aligned} w_{t+1} &= \arg \min_{w \in S} \frac{\lambda}{2} \|w\|_2^2 + w^\top \sum_{i=1}^{t} z_i \\ &\Downarrow_{v_t = -\frac{1}{\lambda} \sum_{i=1}^t z_i} \\ &= \arg \min_{w \in S} \frac{\lambda}{2} \|w - v_t\|_2^2 - \frac{\lambda}{2} \|v_t\|_2^2 \\ &= \arg \min_{w \in S} \frac{\lambda}{2} \|w - v_t\|_2^2 \\ &= \Pi_S (v_t) \\ \end{aligned} \] If \(S=\R^d\), \[ \begin{gathered} w_{t+1} = -\frac{1}{\lambda} v_t \\ w_{t+1} - w_t = -\frac{1}{\lambda} z_t = -\frac{1}{\lambda} \nabla f_t(w_t) \\ \end{gathered} \] The update rule resembles the formulation of gradient descent. If \(\lambda\) is too large, the update of learner will be too stable and learner is not able to adapt to the environment. If \(\lambda\) is too small, learner can easily overfit to the noise.
But actually, we don’t need to update at every step. By the closed-form formula of \(w_{t+1}\), we can accumulate via \(v_t\) and do the projection at the last step \(T\).
Lemma: regret bound for FTRL. Suppose that \(f_t(w) = w^\top z_t\) is linear loss, \(S\) is a convex set and \(\psi(w) = \frac{\lambda}{2} \|w\|_2^2\). We have the following upper bound on the FTRL learner’s regret at iteration \(T\): \[ \reg(u) \le \frac{\lambda}{2} \|u\|_2^2 + \frac{1}{\lambda} \sum_{i=1}^T \|z_t\|_2^2 \] Proof. A natural idea is to reuse the conclusion from FTL. We may synthesize an iteration \(0\) for FTL such that \(f_0(w) = \psi(w)\). By doing so, \[ w_{t+1}^\text{FTL} = \arg \min_{w \in S} \sum_{i=0}^T f_t(w) = \arg \min_{w \in S} [\psi(w) + \sum_{i=1}^T f_t(w)] = w_{t+1}^\text{FTRL} \] Thus, for every \(u\), \(f_t(w_t^\text{FTRL}) - f(u) = f_t(w_t^\text{FTL}) - f(u)\) for \(t=1,\dots,T\) and \[ \begin{aligned} \reg_{0:T}^\text{FTL}(u) &= \psi(w_0) - \psi(u) + \reg_{1:T}^\text{FTL}(u) \\ &= \psi(w_0) - \psi(u) + \reg_{1:T}^\text{FTRL}(u) \end{aligned} \] Note that by FTL’s bound, \[ \begin{aligned} &\reg_{0:T}^\text{FTL}(u) = \sum_{t=0}^T [f_t(w_t) - f_t(u)] \\ &= \psi(w_0) - \psi(u) + \sum_{t=1}^T [f_t(w_t) - f_t(u)] \\ &\Downarrow_\text{one step ahead} \\ &\le \psi(w_0) - \psi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \\ \end{aligned} \] As a result, \[ \begin{aligned} &\reg_{1:T}^\text{FTRL}(u) \le \psi(u) - \psi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] + \psi(w_1) - \psi(w_0) \\ &= \psi(u) - \psi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \\ &= \psi(u) - \psi(w_1) + \sum_{t=1}^T z_t^\top (w_t - w_{t+1}) \\ &\le \psi(u) - \psi(w_1) + \sum_{t=1}^T \|z_t\|_2 \|w_t - w_{t+1}\|_2 \\ &= \frac{\lambda}{2} \|u\|_2^2 - \frac{\lambda}{2} \|w_1\|_2^2 + \sum_{t=1}^T \|z_t\|_2 \|\Pi_S (-\frac{1}{\lambda} \sum_{i=1}^{t-1} z_i) - \Pi_S (-\frac{1}{\lambda} \sum_{i=1}^t z_i)\|_2 \\ &\Downarrow_\text{by the shrinking property of projection of Hilbert norm, which is $\ell_2$ here ??} \\ &\le \frac{\lambda}{2} \|u\|_2^2 - \frac{\lambda}{2} \|w_1\|_2^2 + \sum_{t=1}^T \|z_t\|_2 \frac{1}{\lambda} \|z_t\|_2 \\ &\le \frac{\lambda}{2} \|u\|_2^2 + \frac{1}{\lambda} \sum_{t=1}^T \|z_t\|_2^2 \end{aligned} \]
Corollary: best choice of \(\lambda\) for FTRL. Given above inequality, suppose that \(\|u\|_2 \le B\) for every \(u \in S\) and \(\|z_t\| \le M\). Then, \[ \reg(S) \le \frac{\lambda B^2}{2} + \frac{TM^2}{\lambda} \] Minimizing the upper bound over \(\lambda > 0\) results in \(\lambda^* = \frac{M}{B} \sqrt{2T}\), under which \[ \reg(S) \le MB \sqrt{2T} \]
The above is the conclusion drawn for \(\ell_2\)-norm regularizer and linear loss. We would like to extend them to general case.
Online Gradient Descent (General Loss)
When \(f\) is not a linear function, we can “linearize” it using the Taylor expansion: \[ \begin{aligned} f(w) &\approx f(w_t) + \nabla f(w_t)^\top (w-w_t) = [f(w_t) - \nabla f(w_t)^\top w_t] + \underbrace{\nabla f(w_t)^\top}_{z_t} w \end{aligned} \] As a result, \[ \begin{aligned} w_{t+1} &= \arg \min_{w \in S} \psi(w) + \sum_{i=1}^{t} f_i(w) \\ &\approx \arg \min_{w \in S} \psi(w) + \sum_{i=1}^{t} \nabla f_{i}(w_{i})^\top w \\ \end{aligned} \] Note that \(f_{t+1}\) first appears in the derivation of \(w_{t+1}\). The latest model learner output is \(w_t\) so we just expand there. This gives rise to the online gradient descent algorithm. The update rule will be
Algorithm: online gradient descent.
Pick an initial point \(w_1 \in S\).
For \(t=1\) to \(T\) do:
- Output \(w_t\) (in fact, this \(w_t\) is computed in last timestep and formally should have been \(w_{t-1}\) in the previous context) and receive \(f_t\)
- Compute \(z_t = \nabla f_t(w_t)\)
- Pick a stepsize \(\alpha_t > 0\) (which is usually fixed w.r.t. \(t\))
- Update \(w_{t+\frac{1}{2}} \leftarrow w_t - \alpha_t z_t\)
- Project \(w_{t+\frac{1}{2}}\) onto \(S\) as \(w_{t+1} \leftarrow \Pi_S(w_{t+\frac{1}{2}})\)
Theorem: regret bound for OGD learner. In the online convex optimization setting, we have the following regret bound for the OGD learner initialized at \(w_1 = 0\) with respect to every \(u \in S\): \[ \reg(u) \le \frac{\|u\|_2^2}{2\alpha} + \frac{\alpha}{2} \sum_{i=1}^T \|z_t\|_2^2 \] If we assume that every \(f_t\) is \(\rho\)-Lipschitz (and thus \(\|z_t\|_2^2 = \|\nabla f_t(w_t)\|_2^2 \le \rho^2\)) and \(\|u\|_2 \le B\) for every \(u \in S\), we will further have the following for \(\alpha^* = \frac{B}{\rho \sqrt{T}}\): \[ \reg(S) \le B \rho \sqrt{T} \] Proof. Because \(w_{t+1}\) is the projection of \(w_{t+\frac{1}{2}}\) onto \(S\), for every \(u \in S\), we have \[ \begin{aligned} &\|w_{t+\frac{1}{2}} - u\|_2^2 = \|w_{t+\frac{1}{2}} - w_{t+1} + w_{t+1} - u\|_2^2 \\ &= \underbrace{\|w_{t+\frac{1}{2}} - w_{t+1}\|_2^2}_{\ge 0} + \|w_{t+1} - u\|_2^2 \\ &\quad\quad + 2\underbrace{(w_{t+\frac{1}{2}} - w_{t+1})^\top (w_{t+1} - u)}_{\ge 0 \text{ due to the property of projection}} \\ &\ge \|w_{t+1} - u\|_2^2 \end{aligned} \] As a result, \[ \begin{aligned} &\frac{1}{2} \|w_{t+1} - u\|_2^2 - \frac{1}{2} \|w_t - u\|_2^2 \\ \le &\frac{1}{2} \|w_{t+\frac{1}{2}} - u\|_2^2 - \frac{1}{2} \|w_t - u\|_2^2 \\ = &\frac{1}{2} \|w_t - \alpha z_t - u\|_2^2 - \frac{1}{2} \|w_t - u\|_2^2 \\ = &\frac{1}{2} \|\alpha z_t\|_2^2 - \alpha z_t^\top (w_t - u) \\ = &\frac{1}{2} \alpha^2 \|z_t\|_2^2 - \alpha \nabla f_t(w_t)^\top (w_t - u) \\ \le &\frac{1}{2} \alpha^2 \|z_t\|_2^2 - \alpha [f_t(w) - f_t(u)] \\ \end{aligned} \] Adding up the above inequality for \(t=1,\dots,T\) gives \[ \begin{aligned} \alpha \sum_{t=1}^T (f_t(w_t) - f_t(u)) &\le \frac{1}{2} \sum_{t=1}^T \alpha^2 \|z_t\|_2^2 + \frac{1}{2} \|w_1 - u\|_2^2 - \frac{1}{2} \|w_{T+1} - u\|_2^2 \\ \sum_{t=1}^T (f_t(w_t) - f_t(u)) &\le \frac{\alpha}{2} \sum_{t=1}^T \|z_t\|_2^2 + \frac{1}{2\alpha} \|w_1 - u\|_2^2 - \frac{1}{2\alpha} \|w_{T+1} - u\|_2^2 \\ &\le \frac{\alpha}{2} \sum_{t=1}^T \|z_t\|_2^2 + \frac{1}{2\alpha} \|w_1 - u\|_2^2 \\ &\Downarrow_{w_1=0} \\ &= \frac{\alpha}{2} \sum_{t=1}^T \|z_t\|_2^2 + \frac{1}{2\alpha} \|u\|_2^2 \end{aligned} \]
Online Mirror Descent (General Regularizer)
Fenchel Conjugate
Definition: Fenchel (convex) conjugate. For a function \(\psi: \R^d \to \R\), we define its Fenchel conjugate as \[ \psi^*(\theta) = \sup_{\omega \in \R^d} \omega^\top \theta - \psi(\omega) \]
Proposition: convexity of Fenchel conjugate. For every function \(\psi: \R^d \to \R\), its Fenchel conjugate \(\psi^*\) is a convex function.
Proof. \(\psi^*\) is the supremum over affine functions.
Proposition: gradient of Fenchel conjugate. Consider the Fenchel conjugate of a differentiable function \(\psi\). Then by Danskin’s theorem, the gradient of the conjugate function will be \[ \nabla \psi^*(\theta) = \arg \max_{\omega \in \R^d}\ \omega^\top \theta - \psi(\omega) \] If \(\psi\) is convex, then \[ \nabla \psi^*(\theta) = \arg \max_{\omega \in \R^d}\ \omega^\top \theta - \psi(\omega) = (\nabla \psi)^{-1}(\theta) \]
Proposition: Fenchel conjugate of convex functions. Consider a convex function \(\psi: \R^d \to \R\). Then \(\psi\)’s double conjugate \(\psi^{**} = \psi\). On the other hand, if \(\psi\) is not convex, \(\psi^{**}\) is \(\psi\)’s convex envelope.
Proposition: Fenchel-Young inequality. Consider \(\psi: \R^d \to \R\) and its Fenchel conjugate \(\psi^*\). Then for every \(\omega, \theta \in \R^d\), \[ \omega^\top \theta \le \psi(\omega) + \psi^*(\theta) \] Proof. The proof is simply an interpretation of the definition: \[ \psi^*(\theta) = \max_{\omega} [\omega^\top \theta - \psi(\omega)] \ge \omega^\top \theta - \psi(\omega) \]
Some examples of Fenchel conjugate: \[ \begin{gathered} \psi(\omega) = \frac{\lambda}{2} \|\omega\|_2^2 \to \psi^*(\theta) = \frac{1}{2\lambda} \|\theta\|_2^2 \\ \psi(\omega) = \frac{1}{2} \omega^\top A \omega \to \psi^*(\theta) = \frac{1}{2} \theta^\top A^{-1} \theta \\ \psi(\omega) = \sum_{i=1}^d \omega_i \log \omega_i \to \psi^*(\theta) = \sum_{i=1}^d \exp(\theta_i - 1) \end{gathered} \] Now refer to the optimization problem \[ \begin{aligned} &w_{t+1} = \arg \min_{w \in S} \psi(w) + \sum_{i=1}^{t} f_i(w) \\ &\approx \arg \min_{w \in S} \psi(w) + \sum_{i=1}^{t} \nabla f_{i}(w_{i})^\top w \\ &= \arg \max_{w \in S} \underbrace{\left[-\sum_{i=1}^{t} \nabla f_{i}(w_{i})\right]^\top}_{v_{t+1}^\top} w - \psi(w) \\ &= \nabla \psi^*(v_t) \end{aligned} \] This motivates the online mirror descent.
Algorithm: online mirror descent.
- Pick an initial point \(w_1 \in S\).
- For \(t=1\) to \(T\) do:
- Output \(w_t\) (in fact, this \(w_t\) is computed in last timestep and formally should have been \(w_{t-1}\) in the previous context) and receive \(f_t\)
- Compute \(z_t = -\nabla f_t(w_t)\)
- Update \(v_t \leftarrow v_{t-1} + z_t\)
- Update \(w_{t+1} \leftarrow \arg \min_{w} \psi(w) - w^\top \theta = \nabla \psi^*(v_t)\)
Definition: strongly-convex and smooth function. Given a convex differentiable function \(\psi : \R^d \to \R\), consider its Bregman divergence \(D_\psi\). Then,
- We call \(\psi\) \(\mu\)-strongly-convex with respect to norm function \(\| \cdot \|\) if for every \(w, u\) we have \(D_\psi(w \| u) \ge \mu \|w − u\|_2^2\), which is equivalent to that for every \(x\), \(H_\psi(x) \succeq \mu I\).
- We call \(\psi\) \(\lambda\)-smooth with respect to norm function \(\| \cdot \|\) if for every \(w, u\) we have \(D_\psi(w \| u) \le \frac{\lambda}{2} \|w − u\|_2^2\), which is equivalent to \(\| \nabla \psi(w) - \nabla \psi(u)\|_2 \le \lambda \|w-u\|\).
Theorem: strong convexity and smoothness. \(\psi\) is \(\mu\)-strongly-convex w.r.t to norm \(\| \cdot \|\) if and only if its Fenchel conjugate \(\psi^*\) is \(\frac{1}{\mu}\)-smooth w.r.t. dual norm \(\| \cdot \|_*\) where \[ \| w \|_* = \max_{\|u\| \le 1} w^\top u \]