Rademacher Complexity
[!note]
Sometimes people use hypothesis set and function set interchangeably. To differentiate them here, a hypothesis \(h\) is typically a prediction function \(f\) composed with a loss function \(\ell\).
In this section, we deal with infinite hypothesis sets but still adopt the zero-one loss.
Definition: Rademacher complexity. For a function set \(\mathcal{H}\), given an i.i.d. sample of points \(X^n = \{X_1, \dots, X_n\}\) from a probability distribution \(\mathcal{P}\), and a vector of Rademacher variables \(\sigma = [\sigma_1, \dots, \sigma_n]\) (i.i.d. components, uniform on \(\{-1,+1\}\), independent of the sample), \(\mathcal{H}\)’s Rademacher complexity is defined as \[ \newcommand{\rade}{\mathsf{R}} \rade_{\mathcal{P},n}(\mathcal{H}) \triangleq \E_{X^n, \sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \right] \]
Rademacher complexity reflects the generalization capability of the function class \(\mathcal{H}\) w.r.t. the probability distribution \(\mathcal{P}\) at sample size \(n\). A large Rademacher complexity indicates that the class contains functions that can correlate well with random sign patterns \(\sigma\) that are statistically independent of the data. Or simply, a large Rademacher complexity indicates better (over)fitting capability.
In learning theory, given a function class \(\mathcal{F}\), we usually define \[ \mathcal{H} \triangleq \{ h(x,y) \triangleq \ell(f(x), y): f \in \mathcal{F} \} \] In the Rademacher complexity discussion, \(\ell\) is still chosen to be the zero-one loss.
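To make the definition concrete, here is a minimal Monte Carlo sketch (our own, not from the lecture) that estimates \(\rade_{\mathcal{P},n}(\mathcal{H})\) for an illustrative finite class of 1-D threshold classifiers under the zero-one loss; the class, the distribution, and every name in the code are assumptions made only for illustration.

```python
# A minimal Monte Carlo sketch (not from the lecture): estimate R_{P,n}(H)
# for a finite class of 1-D threshold classifiers under the zero-one loss,
# i.e. h_t(x, y) = 1[f_t(x) != y] with f_t(x) = 1[x >= t].
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 21)   # finite class F (assumed for illustration)
n, n_trials = 50, 2000                   # sample size and Monte Carlo repetitions

vals = np.empty(n_trials)
for trial in range(n_trials):
    x = rng.uniform(0.0, 1.0, size=n)               # X ~ P (assumed uniform)
    y = (x >= 0.5).astype(int)                      # noiseless labels, for concreteness
    sigma = rng.choice([-1.0, 1.0], size=n)         # i.i.d. Rademacher signs
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    losses = (preds != y[None, :]).astype(float)    # h(x_i, y_i), one row per f_t
    vals[trial] = np.max(losses @ sigma) / n        # sup_h (1/n) sum_i sigma_i h(x_i, y_i)

print(f"estimated Rademacher complexity: {vals.mean():.3f}")
```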
Bounding via McDiarmid’s Inequality
Recall that in the uniform convergence theorem, we have \[ \begin{equation} \label{uniconv} \Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le \Pr(\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \ge \frac{\epsilon}{2}) \end{equation} \] The supremum on the right-hand side is not easy to deal with. McDiarmid’s inequality comes in handy.
Theorem: McDiarmid’s inequality. Let \(f: \mathcal{X}^n \mapsto \R\) be a function such that for every \(x_1,\dots,x_n,x_1',\dots,x_n'\) the following holds: \[ \forall 1 \le i \le n, \quad |f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n)| \le c_i \] Then, for independent random variables \(X_1,\dots,X_n\), we have \[ \Pr(f(X_1,\dots,X_n) - \E[f(X_1,\dots,X_n)] \ge \epsilon) \le \exp(\frac{-2 \epsilon^2}{\sum_{i=1}^n c_i^2}) \]
[!note]
McDiarmid’s inequality is a generalization of Hoeffding’s inequality: simply choosing \(f\) in McDiarmid’s inequality to be the empirical mean recovers Hoeffding’s inequality.
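Concretely (our own one-line check): take \(f\) to be the empirical mean of variables bounded in \([a,b]\), \[ f(x_1,\dots,x_n) \triangleq \frac{1}{n} \sum_{i=1}^n x_i, \qquad |f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n)| = \frac{|x_i - x_i'|}{n} \le \frac{b-a}{n} \triangleq c_i \] so \(\sum_{i=1}^n c_i^2 = \frac{(b-a)^2}{n}\), and McDiarmid’s inequality gives \[ \Pr\left(\frac{1}{n}\sum_{i=1}^n X_i - \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \ge \epsilon\right) \le \exp\left(\frac{-2 n \epsilon^2}{(b-a)^2}\right), \] which is exactly Hoeffding’s inequality.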
Now back to \(\eqref{uniconv}\). For convenience, we define the worst-case generalization error: \[ \begin{aligned} G_n &\triangleq \sup_{f \in \mathcal{F}} [L(f) - \hat L(f)] \\ &= \sup_{f \in \mathcal{F}} [\E[\ell(f(X), Y)] - \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i)] \end{aligned} \]
Treat \(G_{n}\) as a supremum over \(\mathcal{F}\) of functions of the \(n\)-sized sample: \(G_{n}(x_1,\dots,x_n) = \sup_{f \in \mathcal{F}} g_f(x_1, \dots, x_n)\), where \(g_f(x_1,\dots,x_n) = L(f) - \hat L(f)\). Note that \[ \begin{aligned} &\underbrace{g_f(x_1,\dots,x_i,\dots,x_n)}_{A(f)} - \underbrace{g_f(x_1,\dots,x_i',\dots,x_n)}_{B(f)} \\ &= \frac{1}{n} [\ell(f(x_i'), y_i') - \ell(f(x_i), y_i)] \\ &\Downarrow_\text{$\ell$ is a zero/one-loss} \\ &\le \frac{1}{n} \end{aligned} \]
> Lemma. If for every \(f \in \mathcal{F}\), \(|A(f) - B(f)| \le \epsilon\), then \(|\sup_{f} A(f) - \sup_{f} B(f)| \le \epsilon\).
>
> Proof.
> \[
> \begin{aligned}
> \sup_{f} A(f) - \sup_{f} B(f) \le \sup_f [A(f) - B(f)] \le \epsilon \\
> \sup_{f} B(f) - \sup_{f} A(f) \le \sup_f [B(f) - A(f)] \le \epsilon
> \end{aligned}
> \]
As a result, \[ \left| \sup_{f \in \mathcal{F}} g_f(x_1,\dots,x_i,\dots,x_n) - \sup_{f \in \mathcal{F}} g_f(x_1,\dots,x_i',\dots,x_n) \right| \le \frac{1}{n} \] Applying McDiarmid’s inequality gives \[ \Pr[G_n - \E[G_n] \ge \epsilon] \le \exp(-2n \epsilon^2) \] The remaining step is to bound \(\E[G_n]\) (it has to be small for the above to be meaningful). \(\E[G_n]\) is an expectation over a sample \(\mathrm{X}\) of size \(n\): \[ \E[G_n] = \E_\mathrm{X} \left[ \sup_{f \in \mathcal{F}} [L(f) - \hat L_\mathrm{X}(f)] \right] \] We again apply the symmetrization trick. Let \(\mathrm{X}'\) be a ghost sample of size \(n\), independent of \(\mathrm{X}\) but with the same distribution. Then \(L(f) = \E[\hat L_{\mathrm{X}'}(f)]\). As a result, \[ \begin{aligned} \E[G_n] &= \E_\mathrm{X} \left[ \sup_{f \in \mathcal{F}} [\E_{\mathrm{X}'}[\hat L_{\mathrm{X}'}(f)] - \hat L_\mathrm{X}(f)] \right] \\ &= \E_\mathrm{X} \left[ \sup_{f \in \mathcal{F}} [\E_{\mathrm{X}'}[\hat L_{\mathrm{X}'}(f) - \hat L_\mathrm{X}(f)]] \right] \\ &\le \E_\mathrm{X} \left[ \E_{\mathrm{X}'} [\sup_{f \in \mathcal{F}}[\hat L_{\mathrm{X}'}(f) - \hat L_\mathrm{X}(f)]] \right] \\ &= \E_{\mathrm{X}, \mathrm{X}'} \sup_{f \in \mathcal{F}} [\hat L_{\mathrm{X}'}(f) - \hat L_\mathrm{X}(f)] \end{aligned} \] We now introduce a Rademacher random vector \(\sigma\) of length \(n\) such that \(\sigma, \mathrm{X}, \mathrm{X}'\) are mutually independent. Let \(Z_i \triangleq \ell(f(X_i), Y_i)\) and \(Z_i' \triangleq \ell(f(X_i'), Y_i')\). Since swapping \(X_i\) and \(X_i'\) does not change the joint distribution, \[ \begin{aligned} Z_i' - Z_i &\stackrel{d}{=} \sigma_i [Z_i' - Z_i] \\ \hat L_{\mathrm{X}'}(f) - \hat L_\mathrm{X}(f) &\stackrel{d}{=} \frac{1}{n} \sum_{i=1}^n \sigma_i [Z_i' - Z_i] \end{aligned} \] and \[ \begin{aligned} \E[G_n] &\le \E_{\mathrm{X}, \mathrm{X}'} \sup_{f \in \mathcal{F}} [\hat L_{\mathrm{X}'}(f) - \hat L_\mathrm{X}(f)] \\ &= \E_{\mathrm{X}, \mathrm{X}', \sigma} \sup_{f \in \mathcal{F}} [\frac{1}{n} \sum_{i=1}^n \sigma_i [Z_i' - Z_i]] \\ &\le \E_{\mathrm{X}', \sigma} \sup_{f \in \mathcal{F}} [\frac{1}{n} \sum_{i=1}^n \sigma_i Z_i'] + \E_{\mathrm{X}, \sigma} \sup_{f \in \mathcal{F}} [\frac{1}{n} \sum_{i=1}^n -\sigma_i Z_i] \\ &\Downarrow_{\mathrm{X}' \stackrel{d}{=} \mathrm{X},\ -\sigma \stackrel{d}{=} \sigma} \\ &= 2 \E_{\mathrm{X}, \sigma} \sup_{f \in \mathcal{F}} [\frac{1}{n} \sum_{i=1}^n \sigma_i Z_i] \\ &= 2 \rade_{n}(\mathcal{H}) \end{aligned} \] Next note that the excess error is bounded via twice the worst-case generalization error: \[ \begin{aligned} &\Pr[L(\hat f) - L(f^*) \ge \epsilon] \le \Pr[\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \ge \frac{\epsilon}{2}] \\ &\le \Pr[\underbrace{\sup_{f \in \mathcal{F}} [L(f) - \hat L(f)]}_{G_n} \ge \frac{\epsilon}{2}] + \Pr[\underbrace{\sup_{f \in \mathcal{F}} [\hat L(f) - L(f)]}_{G_n'} \ge \frac{\epsilon}{2}] \\ &= \Pr[G_n - \E[G_n] \ge \frac{\epsilon}{2} - \E[G_n]] + \Pr[G_n' - \E[G_n'] \ge \frac{\epsilon}{2} - \E[G_n']] \\ &\le \exp[-2n(\frac{\epsilon}{2} - \E[G_n])^2] + \exp[-2n(\frac{\epsilon}{2} - \E[G_n'])^2] \\ &\Downarrow_{-\mathcal{H} \triangleq \{ h(x,y) \triangleq -\ell(f(x), y): f \in \mathcal{F} \},\ \E[G_n'] \le 2\rade_n(-\mathcal{H})} \\ &\le \exp[-2n(\frac{\epsilon}{2} - 2\rade_n(\mathcal{H}))^2] + \exp[-2n(\frac{\epsilon}{2} - 2\rade_n(-\mathcal{H}))^2] \\ &= 2 \exp[-2n(\frac{\epsilon}{2} - 2\rade_n(\mathcal{H}))^2] \end{aligned} \] Here the second line is a union bound, the McDiarmid steps assume \(\frac{\epsilon}{2} \ge 2\rade_n(\mathcal{H})\) (the regime where the bound is non-trivial), and the last equality uses \(\rade_n(-\mathcal{H}) = \rade_n(\mathcal{H})\), which follows from the symmetry of \(\sigma\). Putting things together, we have
Theorem: excess error bound via Rademacher complexity. For a function class \(\mathcal{F}\), define \[ \mathcal{H} = \{ h(x,y) \triangleq \mathbb{1}[f(x) \ne y]: f \in \mathcal{F} \} \] Then, with probability at least \(1-\delta\), \[ L(\hat f) - L(f^*) \le 4 \rade_n(\mathcal{H}) + \sqrt{\frac{2 \log(2/\delta)}{n}} \]
Here we emphasize that a small Rademacher complexity only implies a small gap between the risk of the ERM model and that of the population risk minimizer (PRM). They may both still be bad; it is just that we usually assume a very low population risk is achievable.
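Below is a small simulation sketch (our own; the threshold class, the label-noise level, and the constants are illustrative assumptions) that numerically checks the key intermediate inequality \(\E[G_n] \le 2 \rade_n(\mathcal{H})\) used in the derivation above.

```python
# A small simulation (our own, illustrative setup): check E[G_n] <= 2 R_n(H)
# for 1-D threshold classifiers f_t(x) = 1[x >= t] with label noise.
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 41)
p_flip, n, n_trials = 0.1, 100, 2000

# population risk of f_t under the assumed distribution: X ~ U(0,1),
# Y = 1[X >= 0.5] flipped with probability p_flip
L_pop = p_flip + np.abs(thresholds - 0.5) * (1 - 2 * p_flip)

G_vals, R_vals = np.empty(n_trials), np.empty(n_trials)
for trial in range(n_trials):
    x = rng.uniform(0.0, 1.0, size=n)
    y = ((x >= 0.5).astype(int) ^ (rng.random(n) < p_flip)).astype(int)
    losses = ((x[None, :] >= thresholds[:, None]).astype(int) != y[None, :]).astype(float)
    L_emp = losses.mean(axis=1)                      # \hat L(f_t) for each t
    G_vals[trial] = np.max(L_pop - L_emp)            # sup_f [L(f) - \hat L(f)]
    sigma = rng.choice([-1.0, 1.0], size=n)
    R_vals[trial] = np.max(losses @ sigma) / n       # sup_h (1/n) sum_i sigma_i h(x_i, y_i)

print(f"E[G_n]   ~ {G_vals.mean():.3f}")
print(f"2 R_n(H) ~ {2 * R_vals.mean():.3f}   (should be an upper bound)")
```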
Comparison of Convergence Bounds
The realizable case corresponds to the noiseless setting; the non-realizable case corresponds to the noisy setting.
Method | 0/1-loss | Realizability | Finite Hypothesis | Assumption on \(Z_i - \mu\) | Excess Risk Bound |
---|---|---|---|---|---|
Introductory case | ✔ | ✔ | ✔ | N/A | \(\frac{\log (t/\delta)}{n}\) |
Via Markov’s inequality | ❌ | ❌ | ✔ | Expectation and variance exist | \(2 \sigma\sqrt{\frac{t}{n \delta}}\) (derived on my own) |
Via Chernoff’s inequality | ✔ | ❌ | ✔ | Sub-Gaussian | \(\sqrt{\frac{8\log (2t/\delta)}{n}}\) (derived on my own) |
Via Hoeffding’s inequality | ✔ | ❌ | ✔ | Bounded | \(\sqrt{\frac{2\log (2t/\delta)}{n}}\) |
Via McDiarmid’s inequality | ✔ | ❌ | ❌ | None | \(4 \mathsf{R}_n(\mathcal{H}) + \sqrt{\frac{2 \log(2/\delta)}{n}}\) |
Empirical Rademacher Complexity
Definition: empirical Rademacher complexity. For a hypothesis set \(\mathcal{H}\) and fixed dataset \(\mathrm{X} = \{ x_1,\dots,x_n \}\), we define \(\mathcal{H}\)’s empirical Rademacher complexity as \[ \hat \rade_{\mathrm{X}}(\mathcal{H}) \triangleq \E_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i) \right] \]
Can we approximate the Rademacher complexity via sampling? That is, given a fixed dataset, can we use stochastic optimization (gradient ascent) to approximate the supremum over the function class for each sampled \(\sigma\)? The answer is yes. Refer to Question 6 of Homework 1.
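As a rough sketch of that idea (our own, not the homework solution): for the \(l_2\)-norm-bounded linear class the inner supremum can be approximated by projected gradient ascent over \(w\) for each sampled \(\sigma\), and for this particular class the supremum also has a closed form, \(\frac{M}{n}\|\sum_i \sigma_i x_i\|_2\), which lets us verify the approximation. All names and hyperparameters below are illustrative assumptions.

```python
# A minimal sketch (our own): approximate the empirical Rademacher complexity
# of {x -> w^T x : ||w||_2 <= M} by projected gradient ascent over w, averaged
# over sampled sign vectors sigma, and compare with the closed form.
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 10, 1.0
X = rng.normal(size=(n, d))                      # fixed dataset (illustrative)

def sup_by_gradient_ascent(sigma, steps=200, lr=0.5):
    w = np.zeros(d)
    grad = (sigma @ X) / n                       # gradient of (1/n) sum_i sigma_i w^T x_i
    for _ in range(steps):
        w = w + lr * grad                        # ascent step (objective is linear in w)
        norm = np.linalg.norm(w)
        if norm > M:                             # project back onto the l2 ball
            w = w * (M / norm)
    return (sigma @ X) @ w / n

estimates, exact = [], []
for _ in range(200):                             # Monte Carlo over sigma
    sigma = rng.choice([-1.0, 1.0], size=n)
    estimates.append(sup_by_gradient_ascent(sigma))
    exact.append(M * np.linalg.norm(sigma @ X) / n)

print(f"gradient-ascent estimate ~ {np.mean(estimates):.4f}")
print(f"closed-form value        ~ {np.mean(exact):.4f}")
```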
Rademacher Complexity Algebra
Theorem: basic properties of Rademacher complexity.
Monotonicity: If \(\mathcal{H}_1 \subseteq \mathcal{H}_2\), then \(\rade(\mathcal{H}_1) \le \rade(\mathcal{H}_2)\).
Singleton set: If \(\mathcal{H} = \{ h \}\) contains only one function, then \(\rade(\mathcal{H}) = 0\).
Proof. See below: \[ \begin{aligned} \rade_{\mathcal{P},n}(\mathcal{H}) &= \E_{X^n, \sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \right] \\ &= \E_{X^n, \sigma} \left[\frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i) \right] \\ &= \frac{1}{n} \sum_{i=1}^n \E[\sigma_i] \E[h(X_i)] = 0 \end{aligned} \]
Non-negativity: \(\rade_n(\mathcal{H}) \ge 0\) for any non-empty \(\mathcal{H}\).
Proof. This follows immediately from the monotonicity and singleton-set properties: for any \(h \in \mathcal{H}\), \(\{h\} \subseteq \mathcal{H}\), so \(\rade_n(\mathcal{H}) \ge \rade_n(\{h\}) = 0\).
Scalar product: For constant \(c\) and \(c \mathcal{H} \triangleq \{ c h: h \in \mathcal{H} \}\), \(\rade_n(c \mathcal{H}) = |c| \rade_n(\mathcal{H})\).
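A quick proof sketch of the scalar-product property (our own): \[ \rade_n(c \mathcal{H}) = \E_{X^n, \sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, c\, h(X_i) \right] = \E_{X^n, \sigma} \left[ \sup_{h \in \mathcal{H}} \frac{|c|}{n} \sum_{i=1}^n (\sign(c)\, \sigma_i)\, h(X_i) \right] = |c|\, \rade_n(\mathcal{H}), \] where the last step uses the fact that \(\sign(c)\,\sigma\) has the same distribution as \(\sigma\) (the case \(c = 0\) reduces to the singleton-set property).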
Lipschitz composition
If \(g: \R \mapsto \R\) is a \(\rho\)-Lipschitz function, i.e. \[ \forall z, z' \in \R, |g(z) - g(z')| \le \rho |z - z'| \] then \(\rade_n(g \circ \mathcal{H}) \le \rho \rade_n(\mathcal{H})\).
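For instance (our own examples, not from the lecture), the ramp loss \(g(z) = \min(1, \max(0, 1-z))\) and \(\tanh\) are \(1\)-Lipschitz, and the logistic sigmoid is \(\frac{1}{4}\)-Lipschitz; composing \(\mathcal{H}\) with any of them therefore does not increase its Rademacher complexity (and shrinks the bound by a factor of \(4\) for the sigmoid). This composition result is a form of the Ledoux-Talagrand contraction lemma.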
Convex hull
For a hypothesis set \(\mathcal{H} = \{ h_1,\dots,h_t \}\), we define its convex hull as \[ \textrm{convex-hull}(\mathcal{H}) \triangleq \{ \sum_{i=1}^t \alpha_i h_i: \alpha_1,\dots,\alpha_t \ge 0, \sum_{i=1}^t \alpha_i = 1 \} \] Then \(\rade_n(\textrm{convex-hull}(\mathcal{H})) = \rade_n(\mathcal{H})\).
Interestingly, a convex hull contains an infinite number of hypotheses. But its Rademacher complexity does not increase with this infinite cardinality.
Applications of Rademacher Complexity
In this section, we study how to apply the Rademacher complexity to various machine learning models.
\(l_2\)-norm-bounded Linear Functions
Theorem: empirical Rademacher complexity of \(l_2\)-norm-bounded linear functions. Consider the following set of \(l_2\)-norm bounded linear functions: \[ \mathcal{H} = \{ h_w(x) \triangleq w^T x: \|w\|_2 \le M \} \] Then for a dataset \(\mathrm{X} = \{ x_1,\dots,x_n \}\), we have the following bound on the empirical Rademacher complexity: \[ \newcommand{\rade}{\mathsf{R}} \hat \rade_{\mathrm{X}}(\mathcal{H}) \le \frac{M \max_i \|x_i\|_2}{\sqrt{n}} \] Proof. \[ \begin{aligned}[t] &\hat \rade_{\mathrm{X}}(\mathcal{H}) = \E_{\sigma} \left[ \sup_{\|w\|_2 \le M} \frac{1}{n} \sum_{i=1}^n \sigma_i w^T x_i \right] \\ &= \frac{1}{n} \E_{\sigma} \left[ \sup_{\|w\|_2 \le M} w^T (\sum_{i=1}^n \sigma_i x_i) \right] \\ &\Downarrow_\text{optimization over a compact set} \\ &= \frac{1}{n} \E_{\sigma} \left[ \max_{\|w\|_2 \le M} w^T (\sum_{i=1}^n \sigma_i x_i) \right] \\ &\Downarrow_\text{Cauchy-Schwarz inequality} \\ &= \frac{1}{n} \E_{\sigma} \left[ M \|\sum_{i=1}^n \sigma_i x_i\|_2 \right] \\ &= \frac{M}{n} \E_{\sigma} \left[ \|\sum_{i=1}^n \sigma_i x_i\|_2 \right] \\ \end{aligned} \quad \text{cont'd} \begin{aligned}[t] &\Downarrow_\text{Jensen's inequality} \\ &\le \frac{M}{n} \sqrt{\E_{\sigma} \left[ \|\sum_{i=1}^n \sigma_i x_i\|_2^2 \right]} \\ &= \frac{M}{n} \sqrt{\E_{\sigma} \left[ \sum_{i=1}^n \sum_{j=1}^n \sigma_i \sigma_j x_i^T x_j \right]} \\ &= \frac{M}{n} \sqrt{\left[ \sum_{i=1}^n \sum_{j=1}^n \E[\sigma_i \sigma_j] x_i^T x_j \right]} \\ &= \frac{M}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2} \\ &\le \frac{M}{n} \sqrt{n \max_i \|x_i\|_2^2} \\ &= \frac{M \max_i \|x_i\|_2}{\sqrt{n}} \\ \end{aligned} \]
The implication of the above theorem is that excess risk is of order \(\mathcal{O}(\frac{1}{\sqrt{n}})\).
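A quick numerical sanity check of this bound (our own sketch; the dataset and constants are illustrative assumptions), using the intermediate closed form \(\hat \rade_{\mathrm{X}}(\mathcal{H}) = \frac{M}{n} \E_\sigma [\|\sum_{i=1}^n \sigma_i x_i\|_2]\) reached midway through the proof:

```python
# A quick sanity check (our own): compare a Monte Carlo estimate of the
# empirical Rademacher complexity of {x -> w^T x : ||w||_2 <= M}, via the
# closed form (M/n) E_sigma ||sum_i sigma_i x_i||_2, against the bound
# M * max_i ||x_i||_2 / sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 500, 20, 2.0
X = rng.normal(size=(n, d))                       # illustrative dataset

sigmas = rng.choice([-1.0, 1.0], size=(5000, n))  # Monte Carlo over sigma
estimate = (M / n) * np.linalg.norm(sigmas @ X, axis=1).mean()
bound = M * np.linalg.norm(X, axis=1).max() / np.sqrt(n)

print(f"empirical Rademacher complexity ~ {estimate:.4f}")
print(f"theoretical upper bound           {bound:.4f}")
```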
Neural Network
Theorem: empirical Rademacher complexity of ReLU-based and Frobenius-norm-bounded feedforward neural nets. Consider the following set of \(L\)-layer neural nets with ReLU activation function: \[ \mathcal{H} = \{ h(x) = W_L \psi_\text{ReLU}(W_{L-1} \dots \psi_\text{ReLU}(W_1 x)): \forall i, \|W_i\|_F \le M \} \] Then for a dataset \(\mathrm{X} = \{ x_1,\dots,x_n \}\), we have the following bound on the empirical Rademacher complexity: \[ \hat \rade_{\mathrm{X}}(\mathcal{H}) \le \frac{(\sqrt{2L} + 1) M^L \max_i \|x_i\|_2}{\sqrt{n}} \]
[!note]
A similar result can be derived for other ReLU-like activation functions in the above theorem. When we say ReLU-like, we mean that the activation function needs to be Lipschitz and positively homogeneous, as ReLU is: \(\forall a > 0, \mathop{\mathrm{ReLU}}(az) = a\mathop{\mathrm{ReLU}}(z)\). These two properties are critical in the proof.
As an aside, linearity is homogeneity plus additivity.
Massart Lemma
Deriving bounds for different models can be tedious. It would be immensely helpful to have a general rule or framework that allows us to plug in different models and obtain bounds more easily. Massart’s lemma is one such rule.
Theorem: Massart lemma. Suppose that \(\mathcal{H} = \{ h_1,\dots,h_t \}\) is a finite set of \(t\) functions. Also, suppose that for every \(h \in \mathcal{H}\) and the dataset \(\mathrm{X} = \{ x_1,\dots,x_n \}\) the following holds: \[ \frac{1}{n} \sum_{i=1}^n h(x_i)^2 \le M \] Then, the following bound on the empirical Rademacher complexity holds: \[ \hat \rade_{\mathrm{X}}(\mathcal{H}) \le \sqrt{\frac{2M \log t}{n}} \] Proof. For any \(\lambda > 0\), \[ \begin{aligned}[t] &\hat \rade_\mathrm{X}(\mathcal{H}) = \frac{1}{n} \E_\sigma \left[ \sup_{h \in \mathcal{H}} \sum_{i=1}^n \sigma_i h(x_i) \right] \\ &\Downarrow_{\mathrm{h} = {[h(x_1),\dots,h(x_n)]}^\intercal} \\ &= \frac{1}{n} \E_\sigma \left[ \max_{\mathrm{h} \in \mathrm{H}} \sigma^T \mathrm{h} \right] \\ &= \frac{1}{n} \E_\sigma \left[ \frac{1}{\lambda} \log \max_{\mathrm{h} \in \mathrm{H}} \exp(\lambda \sigma^T \mathrm{h}) \right] \\ &\le \frac{1}{n} \E_\sigma \left[ \frac{1}{\lambda} \log \sum_{j=1}^t \exp(\lambda \sigma^T \mathrm{h}_j) \right] \\ &\Downarrow_\text{Jensen's inequality} \\ &\le \frac{1}{n \lambda} \log \E_\sigma \left[ \sum_{j=1}^t \exp(\lambda \sigma^T \mathrm{h}_j) \right] \\ \end{aligned} \quad \text{cont'd} \begin{aligned}[t] &= \frac{1}{n \lambda} \log \sum_{j=1}^t \E_\sigma \left[ \exp(\lambda \sigma^T \mathrm{h}_j) \right] \\ &= \frac{1}{n \lambda} \log \sum_{j=1}^t \prod_{i=1}^n \underbrace{\E \left[ \exp(\lambda \sigma_i h_j(x_i)) \right]}_{M_\sigma(\lambda h_j(x_i))} \\ &\le \frac{1}{n \lambda} \log \sum_{j=1}^t \prod_{i=1}^n \exp(\frac{1}{2} h_j^2(x_i) \lambda^2) \\ &\le \frac{1}{n \lambda} \log \sum_{j=1}^t \exp(\frac{nM \lambda^2}{2} ) \\ &= \frac{\log t}{n \lambda} + \frac{M \lambda}{2} \\ &\Downarrow_{\text{choose } \lambda = \sqrt{2 \log t / (nM)}} \\ &= \sqrt{\frac{2M \log t}{n}} \end{aligned} \]
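A small numerical check of Massart's lemma (our own sketch; the random finite class below is an illustrative assumption), enumerating all \(2^n\) sign vectors exactly for a tiny \(n\):

```python
# A small check of Massart's lemma (our own sketch): for a tiny n we can
# enumerate all 2^n sign vectors, compute the empirical Rademacher complexity
# of a random finite class exactly, and compare with sqrt(2 M log t / n).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, t = 12, 30
H = rng.uniform(-1.0, 1.0, size=(t, n))          # h_j(x_i) values, one row per function
M = (H ** 2).mean(axis=1).max()                  # M >= (1/n) sum_i h_j(x_i)^2 for all j

total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    sigma = np.array(signs)
    total += np.max(H @ sigma) / n               # sup_h (1/n) sum_i sigma_i h(x_i)
exact = total / 2 ** n                           # exact expectation over sigma

print(f"exact empirical Rademacher complexity = {exact:.4f}")
print(f"Massart bound sqrt(2 M log t / n)     = {np.sqrt(2 * M * np.log(t) / n):.4f}")
```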
Massart’s lemma is especially helpful in deriving the Rademacher complexity of various norm-bounded linear function classes.
\(l_1\)-norm-bounded Linear Functions
Massart’s lemma can be applied to \(l_1\)-norm-bounded linear functions.
Corollary: empirical Rademacher complexity of \(l_1\)-norm-bounded linear functions. Consider the following set of \(l_1\)-norm bounded linear functions: \[ \mathcal{H} = \{ h_w: \R^d \to \R,\ h_w(x) \triangleq w^T x,\ \|w\|_1 \le M \} \] Then for a dataset \(\mathrm{X} = \{ x_1,\dots,x_n \}\), we have the following bound on the empirical Rademacher complexity: \[ \hat \rade_{\mathrm{X}}(\mathcal{H}) \le M \max_i \|x_i\|_\infty \sqrt{\frac{2 \log (2d)}{{n}}} \]
Proof. We highlight that \(\mathcal{H}\) is the convex hull of the following set of \(2d\) functions: \[ \tilde{\mathcal{H}} = \{ h_1(x), h_2(x),\dots,h_d(x),-h_1(x),\dots,-h_d(x) \} \] where \(h_i(x) = M x_i\) and \(x_i\) denotes the \(i\)-th coordinate of \(x\). This is because, for any \(h_{w} \in \mathcal{H}\), we can rewrite it as \[ h_w(x) = w^T x = \sum_{i=1}^d w_i x_i = \underbrace{\sum_{i=1}^d \frac{|w_i|}{\|w\|_1}}_{\text{sum to 1}} \underbrace{\sign(w_i) \|w\|_1 x_i}_{\in \mathrm{convex-hull}(\tilde{\mathcal{H}})} \] By the convex-hull property of Rademacher complexity, we know that \(\hat \rade_\mathrm{X}(\tilde{\mathcal{H}}) = \hat \rade_\mathrm{X}(\mathcal{H})\). Hence, we may use Massart’s lemma on \(\tilde{\mathcal{H}}\). To bound \(\frac{1}{n} \sum_{i=1}^n h_w^2(x_i)\), note that \[ \begin{aligned} &\frac{1}{n} \sum_{i=1}^n h_w^2(x_i) = \frac{1}{n} \sum_{i=1}^n (w^T x_i)^2 \\ &\Downarrow_\text{Hölder's inequality} \\ &\le \frac{1}{n} \sum_{i=1}^n \|w\|_1^2 \|x_i\|_\infty^2 \\ &\le M^2 \max_i \|x_i\|_\infty^2 \end{aligned} \] As a result, \[ \begin{equation} \begin{aligned} \hat \rade_\mathrm{X}(\mathcal{H}) &\le \sqrt{\frac{2(M^2 \max_i \|x_i\|_\infty^2) \log (2d)}{n}} \\ &= M \max_i \|x_i\|_\infty \sqrt{\frac{2 \log (2d)}{{n}}} \end{aligned} \tag{By Hölder's Inequality} \end{equation} \] The derivation via the Cauchy-Schwarz inequality is not as tight as the one via Hölder's inequality, as shown below. Note that \[ \begin{aligned} &\frac{1}{n} \sum_{i=1}^n h_w^2(x_i) = \frac{1}{n} \sum_{i=1}^n (w^T x_i)^2 \\ &\Downarrow_\text{Cauchy-Schwarz inequality} \\ &\le \frac{1}{n} \sum_{i=1}^n \|w\|_2^2 \|x_i\|_2^2 \\ &\le \frac{1}{n} \sum_{i=1}^n \|w\|_1^2 \|x_i\|_2^2 \\ &\le M^2 \max_i \|x_i\|_2^2 \end{aligned} \] From another perspective, since \(\|w\|_2 \le \|w\|_1\), every \(w\) with \(\|w\|_1 \le M\) also satisfies \(\|w\|_2 \le M\). Thus, \[ \mathcal{H} \subseteq \mathcal{H}' \triangleq \{ h_w(x) = w^T x: \|w\|_2 \le M \} \] By the monotonicity property and the previous conclusion on \(l_2\)-norm-bounded linear functions, we have \[ \begin{equation} \hat \rade_\mathrm{X}(\mathcal{H}) \le \hat \rade_{\mathrm{X}}(\mathcal{H'}) \le \frac{M \max_i \|x_i\|_2}{\sqrt{n}} \tag{By Cauchy-Schwarz Inequality} \end{equation} \] The bound derived with Massart’s lemma is typically tighter for \(l_1\)-norm-bounded linear functions: it depends on \(\max_i \|x_i\|_\infty \sqrt{\log d}\) rather than \(\max_i \|x_i\|_2\), and \(\|x\|_2\) can be as large as \(\sqrt{d}\, \|x\|_\infty\), i.e. larger by a factor of order \(\sqrt{d / \log d}\).
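A numerical comparison of the two bounds (our own sketch, on illustrative Gaussian data):

```python
# Numerical comparison (our own sketch) of the two bounds derived above for
# l1-bounded linear functions: the Massart/Hölder bound scales with
# ||x||_inf * sqrt(log d), the l2-containment bound with ||x||_2, which on
# Gaussian data is larger by a factor of order sqrt(d / log d).
import numpy as np

rng = np.random.default_rng(0)
n, M = 1000, 1.0
for d in (10, 100, 1000):
    X = rng.normal(size=(n, d))                  # illustrative dataset
    massart_bound = M * np.abs(X).max(axis=1).max() * np.sqrt(2 * np.log(2 * d) / n)
    l2_bound = M * np.linalg.norm(X, axis=1).max() / np.sqrt(n)
    print(f"d={d:5d}  Massart/Hölder: {massart_bound:.4f}   l2-containment: {l2_bound:.4f}")
```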
\(l_\infty\)-norm-bounded Linear Functions
A similar bound can be derived for \(l_\infty\)-norm-bounded linear functions, whose weight set is the convex hull of \(2^d\) vertices.
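As a quick sketch of this (our own, not from the lecture): the class \(\mathcal{H}_\infty \triangleq \{ h_w(x) = w^T x: \|w\|_\infty \le M \}\) is the convex hull of the \(2^d\) vertex functions \(\{ x \mapsto M v^T x: v \in \{-1,+1\}^d \}\). For each vertex, Hölder's inequality gives \(\frac{1}{n} \sum_{i=1}^n (M v^T x_i)^2 \le M^2 \max_i \|x_i\|_1^2\), so the convex-hull property plus Massart’s lemma with \(t = 2^d\) yield \[ \hat \rade_{\mathrm{X}}(\mathcal{H}_\infty) \le M \max_i \|x_i\|_1 \sqrt{\frac{2 d \log 2}{n}}. \] Unlike the \(l_1\) case, the number of vertices is now exponential in \(d\), so the dimension dependence is \(\sqrt{d}\) rather than \(\sqrt{\log d}\).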
Another application of Massart’s lemma will be shown when connecting the VC dimension and Rademacher complexity.