2-uniform-convergence
In a typical supervised learning problem, the goal is to find a function \(\mathcal{F} \ni f: \mathcal{X} \to \mathcal{Y}\) that minimizes the loss, measured by a loss function \(\ell: \mathcal{Y} \times \mathcal{Y} \to \R^+\), over the distribution \(P_{X,Y}\). However, since only a finite number of samples \((x_1,y_1), \dots, (x_n,y_n)\) is available, we can only minimize the loss over these samples, which yields empirical risk minimization (ERM).
- Empirical risk minimization vs. population risk minimization \[ \begin{gather} \hat f \triangleq \arg\min_{f \in \mathcal{F}} \hat L(f), \quad \hat L(f) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) \tag{ERM} \\ f^* \triangleq \arg\min_{f \in \mathcal{F}} L(f), \quad L(f) \triangleq \E_{P_{X,Y}} [\ell(f(X), Y)] \tag{PRM/Supervised Learning} \end{gather} \]
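As a concrete illustration of the two objectives, here is a minimal Python sketch that runs ERM over a small finite hypothesis set of threshold classifiers under the zero-one loss and approximates the population risk with a large held-out Monte Carlo sample. The data distribution, the hypothesis set, and every name in the snippet are assumptions made only for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data distribution P_{X,Y}: X ~ Uniform[0, 1], Y = 1[X >= 0.3] with 10% label noise.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x >= 0.3).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# Finite hypothesis set: threshold classifiers f_theta(x) = 1[x >= theta].
thresholds = np.linspace(0, 1, 21)

def zero_one_risk(theta, x, y):
    return np.mean((x >= theta).astype(int) != y)

# ERM: minimize the empirical risk over n = 100 training samples.
x_train, y_train = sample(100)
emp_risks = np.array([zero_one_risk(th, x_train, y_train) for th in thresholds])
theta_hat = thresholds[np.argmin(emp_risks)]

# PRM is not computable exactly; approximate L(f) with a large held-out sample.
x_test, y_test = sample(200_000)
pop_risks = np.array([zero_one_risk(th, x_test, y_test) for th in thresholds])
theta_star = thresholds[np.argmin(pop_risks)]

print(f"ERM: theta_hat  = {theta_hat:.2f}, empirical risk  = {emp_risks.min():.3f}")
print(f"PRM: theta_star = {theta_star:.2f}, population risk = {pop_risks.min():.3f}")
print(f"population risk of the ERM solution ~ {zero_one_risk(theta_hat, x_test, y_test):.3f}")
```

With a small training set, the ERM threshold typically lands near but not exactly at the population-optimal one; quantifying this gap is the goal of the rest of these notes.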
In machine learning, training is essentially an optimization process. But machine learning additionally cares about how well the trained model carries over from the training data to unseen test data. Thus, we introduce the concept of generalization error to reflect the model’s generalization capability.
The generalization error alone cannot be the only index: it only measures the performance gap between the training set and the test set. A model that performs equally poorly on the training data and the test data still “generalizes” well. Thus, we introduce the concept of excess error to reflect the model’s overall capability.
- Generalization error vs. excess error \[ \begin{gather} \epsilon_\text{gen}(\hat f) \triangleq L(\hat f) - \hat L(\hat f) \tag{Generalization Error} \\ \epsilon_\text{excess}(\hat f) \triangleq L(\hat f) - L(f^*) \tag{Excess Error} \end{gather} \]
Note that the generalization error may be positive, zero or negative, while the excess error cannot be negative. Also note that even a small excess error does not imply an effective learning algorithm: the ERM model may be as bad as the PRM model. It is just that we usually assume the best-in-class population risk \(L(f^*)\) to be very low.
The overall purpose is to build a bound on the excess error for the ERM method with respect to the sample size, which is the main topic of uniform convergence analysis.
Introductory Case and Initial Assumptions
As a kickstart, we make the following assumptions:
- We consider the zero-one loss.
- We assume a realizable scenario where \(L(f^*) = 0\) and \(f^*\) belongs to the hypothesis set.
- We consider a finite hypothesis set \(\mathcal{F} = \{ f_1,\dots,f_t \}\) with \(t\) functions.
Note that due to the realizability assumption, the excess risk is exactly the population risk.
Theorem: excess error bound for realizable finite hypothesis set. The following population risk bound holds for the ERM solution \(\hat f\) with probability at least \(1-\delta\): \[ \epsilon_\text{excess}(\hat f) = L(\hat f) \le \frac{\log t + \log \frac{1}{\delta}}{n} \] Proof. Let \(B \triangleq \{ f \in \mathcal{F} : L(f) > \epsilon \}\). Since the problem is realizable in that \(L(f^*) = 0\) and \(f^* \in \mathcal{F}\), \(f^*\) classifies every sample correctly almost surely, so \[ \begin{gathered} 0 \le \hat L(\hat f) \le \hat L(f^*) = L(f^*) = 0 \\ \Rightarrow \hat L(\hat f) = 0 \end{gathered} \]
\[ \begin{aligned} &\Pr(L(\hat f) > \epsilon) = \Pr(\hat f \in B) \\ &\Downarrow_{\hat L(\hat f) = 0} \\ &\le \Pr(\exists h \in B,\ \hat L(h) = 0) \\ &\Downarrow_{\text{union bound}} \\ &\le \sum_{h \in B} \Pr(\hat L(h) = 0) \end{aligned} \] For any fixed \(h \in B\), since \(\E[\mathbb{1}[h(x_i) \ne y_i]] = L(h)\) implies \(\Pr(h(x_i) = y_i) = 1 - L(h)\), we have \[ \begin{aligned} &\Pr(\hat L(h) = 0) = \Pr(\forall i,\ h(x_i) = y_i) = \prod_{i=1}^n \Pr(h(x_i) = y_i) \\ &= (1-L(h))^n \\ &\Downarrow_{(1-z)^n \le \exp(-zn)} \\ &\le \exp(-L(h)\, n) \le \exp(-\epsilon n) \end{aligned} \] Combining the two chains gives \(\Pr(L(\hat f) > \epsilon) \le t \exp(-\epsilon n)\). Setting \(\delta \triangleq t \exp(-\epsilon n)\) and solving for \(\epsilon\) yields the claimed bound with probability at least \(1-\delta\).
Interestingly, \(\log t\) reflects the capacity of the function class.
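To see the realizable bound in action, the following Monte Carlo sketch (whose distribution and hypothesis set are assumed purely for illustration) estimates \(\Pr(L(\hat f) > \epsilon)\) for ERM over threshold classifiers and compares it against \(t e^{-\epsilon n}\). The bound should hold, typically with a large amount of slack.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative realizable setup: X ~ Uniform[0, 1], Y = 1[X >= 0.5] (no noise),
# hypothesis set = t threshold classifiers 1[x >= theta], with theta* = 0.5 included.
thresholds = np.linspace(0, 1, 21)
t, n, eps, trials = len(thresholds), 50, 0.1, 10_000

def population_risk(theta):
    # f_theta and f_{0.5} disagree exactly when x falls between the two thresholds.
    return abs(theta - 0.5)

exceed = 0
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = (x >= 0.5).astype(int)
    emp = np.array([np.mean((x >= th).astype(int) != y) for th in thresholds])
    # Adversarial tie-breaking: among all empirical risk minimizers, report the worst one.
    worst_minimizer_risk = max(population_risk(th) for th in thresholds[emp == emp.min()])
    exceed += worst_minimizer_risk > eps

print(f"empirical Pr(L(f_hat) > {eps}) ~ {exceed / trials:.4f}")
print(f"theoretical bound t * exp(-eps * n) = {t * np.exp(-eps * n):.4f}")
```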
Uniform Convergence Analysis
The three assumptions are restrictive. In this section, we first try to wriggle out of them and draw some general conclusions. Then, we bring some of them back to show some more meaningful results.
First note that \[ \begin{gathered} 0 \le \epsilon_\text{excess}(\hat f) = L(\hat f) - L(f^*) \\ = \underbrace{L(\hat f) - \hat L(\hat f)}_{\epsilon_\text{gen}(\hat f)} + \underbrace{\hat L(\hat f) - \hat L(f^*)}_{\le 0} + \underbrace{\hat L(f^*) - L(f^*)}_{-\epsilon_\text{gen}(f^*)} \end{gathered} \] That is (this is the key inequality relating the excess error and the generalization error, from which the later inequalities are derived), \[ \begin{aligned} \epsilon_\text{excess}(\hat f) &\le L(\hat f) - \hat L(\hat f) + \hat L(f^*) - L(f^*) \\ &\le |L(\hat f) - \hat L(\hat f)| + |\hat L(f^*) - L(f^*)| \\ &\le 2 \sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \end{aligned} \] In other words, a sufficient condition for the excess risk to be upper-bounded by \(\epsilon\) is that the deviation \(|L(f) - \hat L(f)|\) is uniformly at most \(\epsilon/2\) over all \(f \in \mathcal{F}\). \[ \begin{gather} \Pr(\epsilon_\text{excess}(\hat f) \le \epsilon) \ge \Pr(\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \le \frac{\epsilon}{2}) \label{uniconv-1} \\ \Downarrow \notag \\ \Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le \Pr(\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \ge \frac{\epsilon}{2}) \label{uniconv-2} \end{gather} \] The supremum operation is still ambiguous. To demystify it, we can instead study the distribution of \(L(f) - \hat L(f)\) for an arbitrary \(f \in \mathcal{F}\), which is the deviation of the empirical mean from the population mean. That is, we remove the \(\sup\) with the following inequality: \[ \begin{equation} \begin{aligned} &\Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le \Pr(\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \ge \frac{\epsilon}{2}) \\ &= \Pr(\cup_{1 \le i \le t} \{ |L(f_i) - \hat L(f_i)| \ge \frac{\epsilon}{2} \}) \\ &\le \sum_{i=1}^t \Pr(|L(f_i) - \hat L(f_i)| \ge \frac{\epsilon}{2}) \\ \end{aligned} \label{uniconv-3} \end{equation} \] The union bound on the last line is the reason why we can only deal with a finite hypothesis set in this section. It is also reminiscent of the law of large numbers and the central limit theorem in probability and statistics. But remember that both of them provide guarantees in an asymptotic fashion, which is not suitable here because the sample size is finite. To prove a non-asymptotic generalization bound, we can use some readily available tail bounds from the probability literature.
Bounding via Markov’s Inequality
Concretely, fix an arbitrary \(f' \in \mathcal{F}\) and treat each \(\ell(f'(x_i), y_i)\) as an i.i.d. sample \(z_i\), assuming that both the expectation and the variance of \(z_i\) exist. That is, let \[ \begin{gathered} \mu \triangleq \E[z_i], \quad \sigma^2 \triangleq \Var[z_i] \end{gathered} \] Then \[ \begin{gathered} \hat L(f') = \hat \mu_n \triangleq \frac{1}{n} \sum_{i=1}^n z_i, \quad L(f') = \mu \end{gathered} \] and by Chebyshev’s inequality (derived from Markov’s inequality), \[ \begin{gathered} \left. \begin{gathered} \E[\frac{1}{n} \sum_{i=1}^n z_i] = \mu \\ \Var[\frac{1}{n} \sum_{i=1}^n z_i] = \frac{\sigma^2}{n} \\ \end{gathered} \right\} \Rightarrow \Pr(|\frac{1}{n} \sum_{i=1}^n z_i - \mu | \ge \frac{\epsilon}{2}) \le \frac{4\sigma^2}{n \epsilon^2} \end{gathered} \]
Applying \(\eqref{uniconv-3}\), we have \[ \begin{aligned} \Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le \sum_{i=1}^t \Pr(|\frac{1}{n} \sum_{i=1}^n z_i - \mu | \ge \frac{\epsilon}{2}) \le \frac{4t \sigma^2}{n \epsilon^2} \end{aligned} \] (here \(\sigma^2\) should be read as a variance bound that holds for every \(f \in \mathcal{F}\)). This indicates that the probability of the excess risk exceeding \(\epsilon\) decays at an \(\mathcal{O}(\frac{1}{n})\) rate. Is this a good decay rate? Not yet: even a simple statistical inference of a binomial distribution’s parameter \(p\) can enjoy an exponential decay rate.
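As a quick numerical sanity check of this Chebyshev-based step, the sketch below uses i.i.d. Bernoulli(0.5) losses (an arbitrary illustrative choice, so \(\mu = 0.5\) and \(\sigma^2 = 0.25\)) and compares the empirical deviation probability with \(\frac{4\sigma^2}{n\epsilon^2}\), which indeed shrinks only at the \(1/n\) rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative check of the Chebyshev step: i.i.d. Bernoulli(0.5) "losses" z_i,
# so mu = 0.5 and sigma^2 = 0.25.  Compare Pr(|mean - mu| >= eps/2) with 4*sigma^2/(n*eps^2).
mu, sigma2, eps, trials = 0.5, 0.25, 0.2, 50_000

for n in [25, 50, 100, 200]:
    means = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) >= eps / 2)
    chebyshev = min(4 * sigma2 / (n * eps**2), 1.0)
    print(f"n={n:4d}  empirical={empirical:.4f}  Chebyshev bound={chebyshev:.4f}")
```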
Remember that sitting above the polynomial order is the exponential order. So far, we have only made very limited assumptions on \(z_i\). For Gaussian and bounded random variables, we show below that we can improve the decay rate from a polynomial one to an exponential one.
Bounding via Hoeffding’s Inequality
Moment Generating Function
Definition: moment generating function. For a random variable \(Z\), the moment generating function (MGF) \(M_Z: \R \to \R\) is defined as \[ M_Z(t) \triangleq \E[e^{t Z}] = \int_{-\infty}^{+\infty} p_Z(z) e^{t z} \d z \]
Proposition: MGF of sum of independent random variables. If \(Z_1\) and \(Z_2\) are independent random variables, then the MGF of their sum is the product of their MGFs: \[ M_{Z_1 + Z_2}(t) = M_{Z_1}(t) M_{Z_2}(t) \] Proof. \[ \begin{aligned} &M_{Z_1+Z_2}(t) = \E_{Z_1, Z_2}[e^{t(Z_1 + Z_2)}] \\ &= \E_{Z_1} \E_{Z_2} [e^{t(Z_1 + Z_2)}] \\ &= \E_{Z_1} [e^{t Z_1}] \E_{Z_2} [e^{t Z_2}] \\ &= M_{Z_1}(t) M_{Z_2}(t) \end{aligned} \]
Proposition: MGF of linear transformation of random variables. For a random variable \(Z\) and constants \(\alpha \ne 0, \beta\), the MGFs of the random variable \(Z' \triangleq \alpha Z\) and the random variable \(Z'' \triangleq Z + \beta\) are \[ \begin{aligned}[t] &M_{Z'}(t) = \E[e^{t (\alpha Z)}] \\ &= \E[e^{(\alpha t) Z}] \\ &= M_Z(\alpha t) \end{aligned} \quad \begin{aligned}[t] &M_{Z''}(t) = \E[e^{t (Z + \beta)}] \\ &= e^{\beta t} \E[e^{t Z}] \\ &= e^{\beta t} M_Z(t) \end{aligned} \] In short, \[ M_{\alpha Z + \beta}(t) = \E[e^{(\alpha Z + \beta) t}] = e^{\beta t} \E[e^{\alpha Z t}] = e^{\beta t} M_Z(\alpha t) \]
Chernoff’s Inequality
Theorem: Chernoff’s inequality. Consider a random variable \(Z\) with moment generating function \(M_Z\). Then, \[ \forall t > 0: \Pr(Z \ge \epsilon) \le \frac{M_Z(t)}{e^{t \epsilon}} \] Proof. Define the random variable \(V = e^{t Z}\) for \(t > 0\). Then by Markov’s inequality, \[ \Pr(Z \ge \epsilon) = \Pr(V \ge e^{t \epsilon}) \le \frac{\E[V]}{e^{t \epsilon}} = \frac{M_Z(t)}{e^{t \epsilon}} \]
Note that \(\hat \mu_n - \mu\) can be rewritten as a sum of independent terms, \(\sum_{i=1}^n \frac{z_i - \mu}{n}\). Therefore, for all \(t > 0\), \[ \begin{aligned} &\Pr(\hat \mu_n - \mu \ge \epsilon) \le \frac{M_{\hat \mu_n - \mu}(t)}{e^{t \epsilon}} \\ &= \frac{(M_{(Z - \mu)/n}(t))^n}{e^{t \epsilon}} \\ &= \frac{(M_{(Z - \mu)}(t/n))^n}{e^{t \epsilon}} \\ \end{aligned} \] Substituting \(s \triangleq t/n\) (which ranges over all positive values as \(t\) does) and then optimizing over \(s\), we have \[ \begin{gathered} \forall s > 0, \Pr(\hat \mu_n - \mu \ge \epsilon) \le \left( M_{Z - \mu}(s)\, e^{-s \epsilon} \right)^n \\ \Downarrow \\ \Pr(\hat \mu_n - \mu \ge \epsilon) \le \left( \inf_{s > 0}\ M_{Z - \mu}(s)\, e^{-s \epsilon} \right)^n \\ \end{gathered} \]
If only \(\inf_{s > 0}\ M_{Z - \mu}(s)\, e^{-s \epsilon} < 1\), the bound would decay exponentially in \(n\)! We divert to the applications of Chernoff’s inequality for some well-studied distributions. And then we show that this scenario is not rare.
Zero-mean Gaussians
Proposition: MGF of zero-mean Gaussians. Suppose \(Z \sim \mathcal{N}(0, \sigma^2)\) is a zero-mean Gaussian random variable. Then \[ \begin{aligned} &M_Z(t) = \int_{-\infty}^{+\infty} p_Z(z) e^{t z} \d z \\ &= \int_{-\infty}^{+\infty} \frac{e^{-\frac{z^2}{2 \sigma^2} + tz}} {\sqrt{2 \pi \sigma^2}} \d z \\ &= \int_{-\infty}^{+\infty} \frac{e^{-\frac{(z - \sigma^2 t)^2}{2 \sigma^2} + \frac{\sigma^2 t^2}{2}}} {\sqrt{2 \pi \sigma^2}} \d z \\ &= e^{\frac{\sigma^2 t^2}{2}} \int_{-\infty}^{+\infty} \frac{e^{-\frac{(z - \sigma^2 t)^2}{2 \sigma^2}}} {\sqrt{2 \pi \sigma^2}} \d (z-\sigma^2 t) \\ &= e^{\frac{\sigma^2 t^2}{2}} \end{aligned} \]
Note that it is this nice property of the zero-mean Gaussian’s MGF that confines our discussion to zero-mean variables. A direct application of Chernoff’s inequality gives
Theorem: Chernoff tail bound for zero-mean Gaussians. The optimized Chernoff tail bound for \(Z \sim \mathcal{N}(0, \sigma^2)\) is \[ \begin{aligned} \Pr(Z \ge \epsilon) &\le \inf_{t>0}\quad M_Z(t) e^{-t \epsilon} \\ &= \inf_{t>0}\quad e^{\frac{\sigma^2 t^2}{2} - \epsilon t} \\ &= e^{-\frac{\epsilon^2}{2\sigma^2}} \end{aligned} \] where the infimum is attained at \(t = \frac{\epsilon}{\sigma^2}\).
A direct application of the above theorem on the sum of i.i.d. zero-mean Gaussians gives the following:
Corollary: Chernoff-based concentration inequality for zero-mean Gaussians. Given i.i.d. sample \(z_1, \dots, z_n \sim \mathcal{N}(0, \sigma^2)\), we have the following error bound for empirical mean \(\hat \mu_n = \frac{1}{n} \sum_{i=1}^{n} z_i\): \[ \Pr(\hat \mu_n - \mu \ge \epsilon) \le e^{-\frac{n \epsilon^2}{2 \sigma^2}} \]
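A small simulation (with arbitrarily chosen parameter values) compares the empirical tail probability of the Gaussian sample mean with the Chernoff bound \(e^{-n\epsilon^2/(2\sigma^2)}\); the bound holds and decays exponentially in \(n\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative check: empirical tail probability of the mean of n i.i.d. N(0, sigma^2)
# samples vs. the Chernoff bound exp(-n * eps^2 / (2 * sigma^2)).
sigma, eps, trials = 1.0, 0.3, 100_000

for n in [20, 50, 100]:
    means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
    empirical = np.mean(means >= eps)
    chernoff = np.exp(-n * eps**2 / (2 * sigma**2))
    print(f"n={n:3d}  empirical={empirical:.5f}  Chernoff bound={chernoff:.5f}")
```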
Sub-Gaussians
The key step in applying Chernoff’s inequality to the Gaussian case is the derivation of the MGF. On the wide spectrum of non-Gaussian distributions, we focus on those whose centered MGF is no larger than that of some zero-mean Gaussian.
Definition: sub-Gaussian random variables. We call a random variable \(Z\) with mean \(\mu\) sub-Gaussian with parameter \(\sigma^2\) if the MGF of \(Z-\mu\) satisfies the following inequality for all \(t \in \R\): \[ M_{Z-\mu}(t) \le \exp(\frac{\sigma^2 t^2}{2}) \]
By the way, it is interesting to know that if a random variable is sub-Gaussian with parameter \(\sigma^2\), then its variance is upper-bounded by \(\sigma^2\).
Proposition: sum of independent sub-Gaussians. If \(Z_1\) and \(Z_2\) are two independent sub-Gaussian random variables with parameters \(\sigma_1^2\) and \(\sigma_2^2\), then \(Z_1 + Z_2\) will be sub-Gaussian with parameter \(\sigma_1^2 + \sigma_2^2\).
Proposition: MGF of scalar product of sub-Gaussians. If \(Z\) is a sub-Gaussian random variable with parameter \(\sigma^2\), then \(c Z\) for scalar \(c \in \R\) will be sub-Gaussian with parameter \(c^2 \sigma^2\).
We can draw similar conclusions for sub-Gaussians.
Theorem: Chernoff tail bound for sub-Gaussians. The optimized Chernoff tail bound for sub-Gaussian \(Z\) with parameter \(\sigma^2\) and with mean \(\mu\) is \[ \begin{aligned} \Pr(Z-\mu \ge \epsilon) \le \exp(-\frac{\epsilon^2}{2\sigma^2}) \end{aligned} \]
Corollary: Chernoff-based concentration inequality for sub-Gaussians. Given i.i.d. sub-Gaussian samples \(z_1, \dots, z_n\) with parameter \(\sigma^2\) and mean \(\mu\), we have the following error bound for empirical mean \(\hat \mu_n = \frac{1}{n} \sum_{i=1}^{n} z_i\): \[ \Pr(\hat \mu_n - \mu \ge \epsilon) \le \exp(-\frac{n \epsilon^2}{2 \sigma^2}) \]
It now remains to ask what kinds of random variables are sub-Gaussian.
Rademacher Distribution
We first introduce the Rademacher random variable \(X_\mathsf{R}\) whose PMF is \[ X_\mathsf{R} = \begin{cases} +1 & \text{w.p. $1/2$} \\ -1 & \text{w.p. $1/2$} \end{cases} \]
We can obtain its MGF as \[ \begin{aligned} M_{X_\mathsf{R}}(t) &= \E[e^{t X}] = \frac{1}{2} (e^t + e^{-t}) \\ &= \frac{1}{2} \sum_{k=0}^\infty (\frac{t^k}{k!} + \frac{(-t)^k}{k!}) \\ &= \sum_{s=0}^\infty \frac{t^{2s}}{(2s)!} \\ &\le \sum_{s=0}^\infty \frac{t^{2s}}{2^s(s)!} \\ &= \sum_{s=0}^\infty \frac{(t^2/2)^s}{s!} \\ &= e^{t^2/2} \end{aligned} \] This indicates that \(X_\mathsf{R}\) is sub-Gaussian with parameter \(1\).
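A tiny numerical check of the inequality above, \(\cosh(t) \le e^{t^2/2}\), for a few arbitrarily chosen values of \(t\):

```python
import numpy as np

# Tiny numerical check that the Rademacher MGF cosh(t) is bounded by exp(t^2 / 2),
# i.e. X_R is sub-Gaussian with parameter 1 (t values chosen arbitrarily).
for t in [0.1, 0.5, 1.0, 2.0, 5.0]:
    mgf = np.cosh(t)                 # (e^t + e^{-t}) / 2
    bound = np.exp(t**2 / 2)
    print(f"t={t:>3}: cosh(t)={mgf:12.4f}  exp(t^2/2)={bound:12.4f}  holds={bool(mgf <= bound)}")
```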
Bounded Random Variable
Next, we show that a random variable bounded as \(a \le Z \le b\) for scalars \(a,b\) is sub-Gaussian with parameter \((b-a)^2\). To show it, we apply the symmetrization trick, introducing another random variable \(Z'\) that is an independent copy of \(Z\): \[ \begin{aligned} &M_{Z-\E[Z]}(t) = \E_Z[e^{t(Z-\E[Z])}] \\ &= \E_Z[e^{t(Z-\E[Z'])}] \\ &\Downarrow_{\text{Jensen's inequality}} \\ &\le \E_Z \E_{Z'}[e^{t(Z-Z')}] \\ &= \E_{Z, Z'} [e^{t(Z-Z')}] \end{aligned} \] Directly concluding that \(M_{Z-\E[Z]}(t) \le e^{t(b-a)}\) is neither interesting nor helpful in yielding a sub-Gaussian bound. We introduce a Rademacher random variable \(\sigma\) such that \(\sigma, Z, Z'\) are independent. Since \(Z - Z'\) is symmetric about zero, \[ Z-Z' \stackrel{d}{=} \sigma(Z-Z') \] Also note that \((Z-Z')^2 \le (b-a)^2\). Hence, \[ \begin{aligned} &M_{Z-\E[Z]}(t) \le \E_{Z, Z'} [e^{t(Z-Z')}] \\ &= \E_{Z, Z', \sigma} [e^{[t(Z-Z')]\sigma}] \\ &= \E_{Z, Z'} \E_{\sigma} [e^{[t(Z-Z')]\sigma}] \\ &\Downarrow_{\text{sub-Gaussianity of } \sigma} \\ &\le \E_{Z, Z'} [e^{\frac{t^2(Z-Z')^2}{2}}] \\ &\le e^{\frac{t^2(b-a)^2}{2}} \\ \end{aligned} \]
This indicates that \(Z\) is sub-Gaussian with parameter \((b-a)^2\). In fact, \(Z\)’s sub-Gaussian parameter can be further tightened.
Hoeffding’s Inequality
Lemma: Hoeffding’s lemma. Suppose that random variable \(Z\) is bounded and satisfies \(a \le Z \le b\) for scalars \(a,b \in \R\). Then, \(Z\) is sub-Gaussian with parameter \(\frac{(b-a)^2}{4}\). That is, we have \[ M_{Z-\mu}(t) \le \exp(\frac{t^2 (b-a)^2}{8}) \]
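As a numerical sanity check of Hoeffding’s lemma, the sketch below evaluates the centered MGF of a Bernoulli(\(p\)) variable, which is bounded in \([0,1]\), and compares it with \(e^{t^2/8}\); the grid of \(p\) and \(t\) values is an arbitrary illustrative choice.

```python
import numpy as np

# Numerical sanity check of Hoeffding's lemma for Bernoulli(p) variables, which are
# bounded in [0, 1]: E[exp(t * (Z - p))] should be at most exp(t^2 * (1 - 0)^2 / 8).
for p in [0.1, 0.3, 0.5, 0.9]:
    for t in [0.5, 1.0, 2.0, 4.0]:
        mgf = p * np.exp(t * (1 - p)) + (1 - p) * np.exp(t * (0 - p))
        bound = np.exp(t**2 / 8)
        print(f"p={p:.1f} t={t:.1f}  centered MGF={mgf:8.4f}  bound={bound:8.4f}  holds={bool(mgf <= bound)}")
```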
A direct application of Hoeffding’s lemma and Chernoff tail bound for sub-Gaussians gives the following:
Theorem: Hoeffding’s inequality. Suppose that random variables \(Z_1, \dots, Z_n\) are independent and bounded as \(a_i \le Z_i \le b_i\). Then defining the empirical mean \(\hat \mu_n \triangleq \frac{1}{n} \sum_{i=1}^n Z_i\) and the underlying mean \(\mu \triangleq \frac{1}{n} \sum_{i=1}^n \E[Z_i]\) results in the following concentration inequality \[ \Pr(\hat \mu_n - \mu \ge \epsilon) \le \exp\left( -\frac{2n^2 \epsilon^2}{\sum_{i=1}^n (b_i-a_i)^2} \right) \]
Corollary: Hoeffding’s concentration inequality for bounded random variables. Given i.i.d. bounded samples \(a \le z_1, \dots, z_n \le b\) with mean \(\mu\), we have the following error bound for empirical mean \(\hat \mu_n = \frac{1}{n} \sum_{i=1}^{n} z_i\): \[ \Pr(\hat \mu_n - \mu \ge \epsilon) \le \exp\left( -\frac{2 n \epsilon^2}{(b-a)^2} \right) \]
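The following sketch (again with arbitrary parameter choices) compares the empirical one-sided deviation probability of the mean of i.i.d. Bernoulli samples with Hoeffding’s bound \(e^{-2n\epsilon^2}\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative check: mean of i.i.d. Bernoulli(0.5) samples (bounded in [0, 1]),
# one-sided deviation probability vs. Hoeffding's bound exp(-2 * n * eps^2).
p, eps, trials = 0.5, 0.1, 50_000

for n in [25, 50, 100, 200]:
    means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
    empirical = np.mean(means - p >= eps)
    hoeffding = np.exp(-2 * n * eps**2)
    print(f"n={n:4d}  empirical={empirical:.5f}  Hoeffding bound={hoeffding:.5f}")
```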
Revisiting Excess Error
For quite a few paragraphs, we have been working with probability tail bounds. Now let’s return to the uniform convergence analysis. We assume
- the zero-one loss (so each loss value is bounded in \([0,1]\)),
- and a finite hypothesis set \(\mathcal{F} = \{ f_1,\dots,f_t \}\).
Theorem: excess error bound for finite hypothesis sets. Under the above assumptions, the following excess risk bound holds for the ERM solution \(\hat f\) with probability at least \(1-\delta\): \[ \epsilon_{\text{excess}}(\hat f) \le \sqrt{\frac{2(\log t + \log\frac{2}{\delta})}{n}} = \mathcal{O}(\sqrt{\frac{\log(t/\delta)}{n}}) \] Proof. According to \(\eqref{uniconv-3}\), we have \[ \Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le \Pr(\cup_{1 \le i \le t} \{ |L(f_i) - \hat L(f_i)| \ge \frac{\epsilon}{2} \}) \] We can apply the union bound and Hoeffding’s concentration inequality for bounded random variables (the zero-one loss is bounded in \([0,1]\)) to further show that \[ \begin{aligned} &\Pr(\cup_{1 \le i \le t} \{ |L(f_i) - \hat L(f_i)| \ge \frac{\epsilon}{2} \}) \le \sum_{i=1}^t \Pr(|L(f_i) - \hat L(f_i)| \ge \frac{\epsilon}{2}) \\ &= \sum_{i=1}^t [\Pr(\hat L(f_i) - L(f_i) \ge \frac{\epsilon}{2}) + \Pr(\underbrace{(-\hat L(f_i))}_{\hat \mu'} - \underbrace{(-L(f_i))}_{\mu'} \ge \frac{\epsilon}{2})] \\ &\le \sum_{i=1}^t [\exp\left( -\frac{2n (\epsilon/2)^2}{(1-0)^2} \right) + \exp\left( -\frac{2n (\epsilon/2)^2}{(1-0)^2} \right)] \\ &= 2t \exp(-\frac{n\epsilon^2}{2}) \end{aligned} \] As a result, \[ \Pr(\epsilon_\text{excess}(\hat f) \ge \epsilon) \le 2t \exp(-\frac{n \epsilon^2}{2}) \] Let \(\delta = 2t \exp(-\frac{n \epsilon^2}{2})\). Then, with probability at least \(1-\delta\), \[ \epsilon_\text{excess}(\hat f) \le \sqrt{\frac{2\log (2t/\delta)}{n}} \]
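Finally, here is a Monte Carlo sketch of the theorem; the data distribution, the hypothesis set, and all parameter values are assumptions made only for this illustration. It runs ERM over a finite set of threshold classifiers on noisy labels and checks that the \((1-\delta)\)-quantile of the excess error stays below \(\sqrt{\frac{2(\log t + \log\frac{2}{\delta})}{n}}\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative setup: X ~ Uniform[0, 1], Y = 1[X >= 0.4] with 15% label noise (non-realizable),
# hypothesis set = 41 threshold classifiers 1[x >= theta] under the zero-one loss.
thresholds = np.linspace(0, 1, 41)
t, delta, trials = len(thresholds), 0.05, 1_000

def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x >= 0.4).astype(int)
    flip = rng.uniform(0, 1, n) < 0.15
    return x, np.where(flip, 1 - y, y)

def risks(x, y):
    # Zero-one risk of every threshold classifier on the sample (x, y).
    return np.array([np.mean((x >= th).astype(int) != y) for th in thresholds])

# Approximate the population risks L(f_i) with one large Monte Carlo sample.
pop = risks(*sample(500_000))
L_star = pop.min()

for n in [100, 400, 1600]:
    excess = [pop[np.argmin(risks(*sample(n)))] - L_star for _ in range(trials)]
    bound = np.sqrt(2 * (np.log(t) + np.log(2 / delta)) / n)
    # The theorem says the excess error exceeds the bound with probability at most delta.
    print(f"n={n:5d}  (1-delta)-quantile of excess error={np.quantile(excess, 1 - delta):.4f}  bound={bound:.4f}")
```

As expected, the observed excess errors sit well below the bound, reflecting that uniform convergence bounds for finite hypothesis sets are typically conservative.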