VC Dimension
Recall that in Massart's lemma, we shift our focus from the hypothesis set to \(\{ [h(x_1),\dots,h(x_n)]: h \in \mathcal{H} \}\) to obtain \(M\). Another relevant quantity is this set's cardinality \(t\), in particular how it grows with \(n\) and \(d\). In a classification task, \(t\) counts the distinct label assignments over \(n\) points, which in general can grow exponentially in \(n\); we hope to show that for reasonable function sets this is not the case.
We restrict ourselves to binary classification and the zero-one loss in this section.
Definition: shattering coefficient. Given a function set \(\mathcal{F}\) whose members map a feature vector \(x \in \mathcal{X}\) to \(\mathcal{Y} = \{ 0,1 \}\), we define the shattering coefficient \(s_\mathcal{F}(n)\) as the maximum number of distinct label assignments over datasets of size \(n\), i.e. \((x_1,\dots,x_n) \in \mathcal{X}^n\): \[ s_\mathcal{F}(n) \triangleq \max_{x_1,\dots,x_n \in \mathcal{X}} \left| \{ [f(x_1),\dots,f(x_n)]: f \in \mathcal{F} \} \right| \] A trivial upper bound for the shattering coefficient is \(2^n\).
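For a finite function set and a fixed dataset, the quantity inside the \(\max\) can be computed by brute force. Below is a minimal sketch; the threshold family \(\one[x \ge t]\) and the grid of thresholds are illustrative assumptions, not part of the definition above.

```python
def shattering_count(functions, points):
    """Number of distinct label vectors [f(x_1), ..., f(x_n)]
    that the function set realizes on the given points."""
    return len({tuple(f(x) for x in points) for f in functions})

# Hypothetical finite family for illustration: threshold classifiers
# 1[x >= t] with t on a coarse grid.
thresholds = [t / 10 for t in range(-20, 21)]
functions = [lambda x, t=t: int(x >= t) for t in thresholds]

points = [0.1, 0.5, 0.9]
print(shattering_count(functions, points))  # 4 realizable labelings, vs 2^3 = 8
```

For thresholds, only labelings of the form \((0,\dots,0,1,\dots,1)\) are realizable, so the count is \(n+1 = 4\), far below the trivial bound \(2^n = 8\).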
Corollary: Massart lemma applied to shattering coefficient. Given function set \(\mathcal{F}\) whose members map the input to \(\mathcal{Y} = \{ 0,1 \}\), a direct application of Massart lemma (\(M=1, t = s_\mathcal{F}(n)\)) gives the following bound on its empirical Rademacher complexity over every dataset \(\mathrm{X}\) of size \(n\): \[ \newcommand{\r}{\mathsf{R}} \hat \r_\mathrm{X}(\mathcal{F}) \le \sqrt{\frac{2 \log s_\mathcal{F}(n)}{n}} \] As a result, \[ \r_n(\mathcal{F}) \le \sqrt{\frac{2 \log s_\mathcal{F}(n)}{n}} \]
Definition: VC dimension. Given a function set \(\mathcal{F}\) with Boolean output in \(\mathcal{Y}=\{ 0,1 \}\), we define its VC dimension \(\newcommand{\VC}{\mathrm{VC}} \VC(\mathcal{F})\) as the size of the largest dataset that can be shattered by \(\mathcal{F}\), i.e. on which \(\mathcal{F}\) realizes all \(2^n\) label assignments: \[ \VC(\mathcal{F}) = \sup\{ n: s_\mathcal{F}(n) = 2^n \} \]
The VC dimension reflects, in a sense, the richness of the function set for classification tasks.
Examples
Here are some examples of function sets:
Interval functions: \(\mathcal{F}_\text{int} = \{ f_{a,b}(x) = \one[a \le x \le b]: a,b \in \R \}\)
The VC dimension is \(2\). In fact, the shattering coefficient \(s_{\mathcal{F}_\text{int}}(n)\) of the set of interval functions is \(\binom{n+1}{2} + 1\): each realizable labeling marks a consecutive run of the sorted points (there are \(n(n+1)/2\) nonempty runs), plus the all-zero labeling.
The takeaway from this example is that the VC dimension often equals the number of parameters characterizing the function set.
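Both claims about intervals can be checked by brute force: since any realizable labeling is a consecutive run of sorted points, it suffices to try interval endpoints at the points themselves. A minimal sketch:

```python
def interval_patterns(points):
    """All label vectors realizable on `points` by f_{a,b}(x) = 1[a <= x <= b].
    Endpoints at the points themselves (plus one empty interval) suffice,
    since any realizable labeling marks a consecutive run of sorted points."""
    pats = {tuple(0 for _ in points)}  # empty interval (a > b)
    for a in points:
        for b in points:
            pats.add(tuple(int(a <= x <= b) for x in points))
    return pats

def shattered(points):
    return len(interval_patterns(points)) == 2 ** len(points)

print(shattered([0.0, 1.0]))       # True: 2 points are shattered
print(shattered([0.0, 1.0, 2.0]))  # False: (1, 0, 1) is not realizable
```

On 3 points the pattern count is \(7 = \binom{4}{2} + 1\), matching the shattering coefficient formula, and the labeling \((1,0,1)\) is the one no interval can realize.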
Binarized sine functions: \(\{ f(x) = \one[\sin(\omega x) \le 0]: \omega \in \R \}\)
The VC dimension is \(\infty\). We show this by construction. Consider the data points \(x_1,\dots,x_n\) with \(x_i = 2^i\), and let \(y_1,\dots,y_n \in \{0,1\}\) be an arbitrary labeling. We claim that \(\omega = (0.y_1 \cdots y_n 1)_2\, \pi\) realizes these labels (the trailing \(1\) bit keeps \(\sin(\omega x_i)\) from vanishing exactly). Indeed, \[ \omega x_i = (y_1 \cdots y_i . y_{i+1} \cdots y_n 1)_2\, \pi \] The integer part \((y_1 \cdots y_i)_2\) has parity \(y_i\), and the fractional part \(f_i = (0.y_{i+1} \cdots y_n 1)_2\) lies strictly in \((0,1)\), so \[ \sin(\omega x_i) = (-1)^{y_i} \sin(f_i \pi), \qquad \sin(f_i \pi) > 0 \] Hence \(\sin(\omega x_i) \le 0\) exactly when \(y_i = 1\), i.e. \(\one[\sin(\omega x_i) \le 0] = y_i\). This holds for any \(y_1,\dots,y_n\). As a result, the shattering coefficient for the set of binarized sine functions is \(2^n\).
The takeaway from this example is that even a function set with a single parameter can have infinite VC dimension.
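The binary-fraction construction can be verified numerically. A minimal sketch: the trailing \(1\) bit appended to \((0.y_1 \cdots y_n)_2\) is an assumption added here to keep \(\sin(\omega x_i)\) strictly away from zero, so the sign test is unambiguous in floating point.

```python
import math
from itertools import product

def omega_for(labels):
    """omega = (0.y_1 ... y_n 1)_2 * pi; the trailing 1 bit (an assumption
    added for this check) keeps sin(omega * x_i) strictly nonzero."""
    bits = list(labels) + [1]
    frac = sum(b * 2.0 ** -(k + 1) for k, b in enumerate(bits))
    return frac * math.pi

def realized(labels):
    """Labels produced by f(x) = 1[sin(omega x) <= 0] at x_i = 2^i."""
    omega = omega_for(labels)
    return [int(math.sin(omega * 2 ** (i + 1)) <= 0) for i in range(len(labels))]

# every labeling of n = 4 points is realized by some single omega
ok = all(list(labels) == realized(labels) for labels in product([0, 1], repeat=4))
print(ok)  # True
```

Since all \(2^4\) labelings of \(x_1,\dots,x_4\) are realized, the four points are shattered; the same check passes for any \(n\) small enough for floating point.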
Binarized convex functions: \(\mathcal{F}_\text{convex} = \{ f(x) = \one[g(x) \le 0]: \text{$g$ is convex} \}\)
The VC dimension is \(\infty\). We show this by construction. Consider data points \(x_1,\dots,x_n \in \R^d\) lying on a sphere (a circle in \(\R^2\) already suffices), and let \(y_1,\dots,y_n\) be an arbitrary labeling. Let \(I = \{ i: y_i = 1 \}\). Since points on a sphere are in convex position, \(\textrm{convex-hull}(\{ x_i: i \in I \})\) contains none of the \(x_j\) whose label is \(0\). Note that the set-indicator \(\one[x \in C]\) is not itself a convex function; instead, let \(g\) be the convex indicator function of the hull, \[ g(x) = \begin{cases} 0 & x \in \textrm{convex-hull}(\{ x_i: i \in I \}) \\ +\infty & \text{otherwise} \end{cases} \] which is convex. Then \[ f(x_j) = \one[g(x_j) \le 0] = \one[x_j \in \textrm{convex-hull}(\{ x_i: i \in I \})] = y_j \] As a result, the shattering coefficient for the set of binarized convex functions is \(2^n\).
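The convex-position argument can be checked numerically in 2-D. A minimal sketch, assuming points on the unit circle and a cross-product containment test (both choices are illustrative, not from the construction itself):

```python
import math
from itertools import product

def on_circle(n):
    """n points in convex position on the unit circle (2-D already suffices)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

def in_hull(p, verts):
    """Is p inside the convex hull of verts (listed in angular order)?
    Cross-product test: p must lie left of (or on) every directed edge."""
    if len(verts) == 1:
        return max(abs(p[0] - verts[0][0]), abs(p[1] - verts[0][1])) < 1e-9
    for i in range(len(verts)):
        (x1, y1), (x2, y2) = verts[i], verts[(i + 1) % len(verts)]
        if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) < -1e-9:
            return False
    return True

pts = on_circle(6)
ok = True
for labels in product([0, 1], repeat=6):
    pos = [p for p, y in zip(pts, labels) if y == 1]
    neg = [p for p, y in zip(pts, labels) if y == 0]
    if pos and any(in_hull(q, pos) for q in neg):
        ok = False
print(ok)  # True: no 0-labeled point ever falls in the hull of the 1-labeled points
```

Every labeling of the 6 circle points gives a hull of positives that excludes all negatives, so all \(2^6\) labelings are realized by binarized convex functions.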
Theorem: VC dimension of \(d\)-dimensional hyperplanes. Consider the set of linear (homogeneous, i.e. through the origin) classification rules in \(\R^d\): \[ \mathcal{F} = \{ \one[w^\intercal x \ge 0]: w \in \R^d \} \] Then, the VC dimension \(\VC(\mathcal{F})\) of this function set is \(d\). (For affine rules \(\one[w^\intercal x + b \ge 0]\), the VC dimension is \(d+1\).)
Proof. We first show that \(\VC(\mathcal{F}) \le d\). Suppose on the contrary that \(\VC(\mathcal{F}) > d\). Then there exist \(x_1,\dots,x_{d+1}\) that can be shattered by \(\mathcal{F}\). Since any \(d+1\) vectors in \(\R^d\) are linearly dependent, there exist \(c_1, \dots, c_{d+1} \in \R\), not all zero, such that \[ c_1 x_1 + \dots + c_{d+1} x_{d+1} = 0 \] Without loss of generality (negating all \(c_i\) and reindexing if necessary), suppose \(c_1 < 0\). Consider the labeling \(y_i = \one[c_i \ge 0]\). Since the points are shattered, there exists \(w\) such that \[ \forall i,\ \one[w^\intercal x_i \ge 0] = y_i = \one[c_i \ge 0] \] Then \[ c_1 w^\intercal x_1 + \dots + c_{d+1} w^\intercal x_{d+1} = w^\intercal(c_1 x_1 + \dots + c_{d+1} x_{d+1}) = 0 \] For any \(i\), if \(c_i \ge 0\), then \(w^\intercal x_i \ge 0\); if \(c_i < 0\), then \(w^\intercal x_i < 0\). Hence \(c_i w^\intercal x_i \ge 0\) in either case, and in particular \(c_1 w^\intercal x_1 > 0\). Therefore \[ c_1 w^\intercal x_1 + \dots + c_{d+1} w^\intercal x_{d+1} > 0 \] which is a contradiction.
It remains to show that \(\VC(\mathcal{F}) \ge d\), i.e. that there exist \(x_1, \dots, x_d\) shattered by \(\mathcal{F}\). Choose \(x_i = e_i\), the \(i\)-th standard basis vector of \(\R^d\). For any labeling \(y_1,\dots,y_d\), the weight vector \(w = [2y_1 - 1,\dots,2y_d - 1]^\intercal\) suffices, since \(\one[w^\intercal e_i \ge 0] = \one[2y_i - 1 \ge 0] = y_i\). Q.E.D.
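The lower-bound half of the proof is directly checkable: enumerate all labelings of the standard basis and verify that the sign-flipped weight vector \(w_i = 2y_i - 1\) realizes each one. A minimal sketch with \(d = 4\):

```python
from itertools import product

d = 4
basis = [[1 if j == i else 0 for j in range(d)] for i in range(d)]  # e_1..e_d

def classify(w, x):
    """Linear rule through the origin: 1[w^T x >= 0]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= 0)

# w_i = 2*y_i - 1 realizes every labeling of the standard basis
ok = all(
    [classify([2 * y - 1 for y in labels], e) for e in basis] == list(labels)
    for labels in product([0, 1], repeat=d)
)
print(ok)  # True: e_1, ..., e_d are shattered
```

Note that the naive choice \(w = [y_1,\dots,y_d]^\intercal\) would fail: when \(y_i = 0\) it gives \(w^\intercal e_i = 0 \ge 0\), hence label \(1\); the \(2y_i - 1\) shift avoids this boundary case.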
Connection with Rademacher Complexity
Lemma: Sauer’s lemma. Consider a function set \(\mathcal{F}\) with VC dimension \(\VC(\mathcal{F}) = d\). Then, for every integer \(n \in \N\): \[ s_\mathcal{F}(n) \le \sum_{i=0}^d {n \choose i} \]
Sauer’s lemma trivially holds if \(n \le d\), because then \(\sum_{i=0}^d \binom{n}{i} = \sum_{i=0}^n \binom{n}{i} = 2^n\) (with the convention \(\binom{n}{i} = 0\) for \(i > n\)). The interesting case is \(d < n\), where one can further bound \(s_\mathcal{F}(n) \le \sum_{i=0}^d \binom{n}{i} \le \left(\frac{e n}{d}\right)^d\).
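The gap between the Sauer bound, the trivial \(2^n\) bound, and the \((en/d)^d\) relaxation is easy to see numerically. A minimal sketch; the values \(n = 20\), \(d = 3\) are arbitrary:

```python
from math import comb, e

def sauer_bound(n, d):
    """sum_{i=0}^{d} C(n, i): the Sauer bound on s_F(n) when VC(F) = d."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 20, 3
print(sauer_bound(n, d))                      # 1351: polynomial in n
print(2 ** n)                                 # 1048576: the trivial bound
print(sauer_bound(n, d) <= (e * n / d) ** d)  # True: (en/d)^d relaxation holds
```

So for a fixed VC dimension the shattering coefficient grows only polynomially in \(n\), which is exactly what makes the Rademacher bound below nontrivial.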
Theorem: Rademacher complexity bound with VC dimension. Consider function set \(\mathcal{F}\) with Boolean output in \(\mathcal{Y} = \{ 0,1 \}\), whose VC dimension is \(\VC(\mathcal{F}) = d\). Then, we have the following bound on the empirical Rademacher complexity over every dataset \(S\) of size \(n\): \[ \hat \r_S(\mathcal{F}) \le \sqrt{ \frac{2d (\log (n/d) + 1)}{n} } \] As a result, \[ \r_n(\mathcal{F}) \le \sqrt{\frac{2d (\log(n/d) + 1)}{n}} \] Proof. A direct application of Sauer’s lemma together with Massart lemma applied to shattering coefficient gives the above.
Theorem: excess risk bound via VC dimension. Consider a hypothesis set \(\mathcal{F}\) with Boolean output which has \(\VC(\mathcal{F}) = d\). Suppose the loss function is the zero-one loss. Then, with probability at least \(1-\delta\), \[ L(\hat f) - L(f^*) \le \sqrt{\frac{32 d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \] Proof. A direct application of the excess risk bound via Rademacher complexity gives \[ \begin{aligned} L(\hat f) - L(f^*) &\le 4 \r_n(\mathcal{F}) + \sqrt{\frac{2 \log(2/\delta)}{n}} \\ &\le 4 \sqrt{\frac{2d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \\ &= \sqrt{\frac{32d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \end{aligned} \]