
VC Dimension

Recall that in Massart's lemma, we shifted focus from the hypothesis set itself to the set of output vectors \(\{ [h(x_1),\dots,h(x_n)]: h \in \mathcal{H} \}\), to which the lemma applies with norm bound \(M\) and cardinality \(t\). Another relevant quantity is this set's cardinality \(t\), in particular how it scales with \(n\) and \(d\). In a classification task, the dependence of \(t\) on \(n\) counts the possible label assignments, which can easily be exponential in \(n\); we hope to show that, for reasonable function classes, this is not the case.

In this section we restrict ourselves to the binary classification task and the zero/one loss.

Definition: shattering coefficient. Given a function set \(\mathcal{F}\) whose members map a feature vector \(x \in \mathcal{X}\) to \(\mathcal{Y} = \{ 0,1 \}\), we define the shattering coefficient \(s_\mathcal{F}(n)\) as the maximum number of distinct label assignments over datasets \((x_1,\dots,x_n) \in \mathcal{X}^n\) of size \(n\): \[ s_\mathcal{F}(n) \triangleq \max_{x_1,\dots,x_n \in \mathcal{X}} \left| \{ [f(x_1),\dots,f(x_n)]: f \in \mathcal{F} \} \right| \] A trivial upper bound for the shattering coefficient is \(2^n\).

Corollary: Massart lemma applied to shattering coefficient. Given a function set \(\mathcal{F}\) whose members map the input to \(\mathcal{Y} = \{ 0,1 \}\), a direct application of Massart's lemma (\(M=1, t = s_\mathcal{F}(n)\)) gives the following bound on its empirical Rademacher complexity over every dataset \(\mathrm{X}\) of size \(n\): \[ \newcommand{\r}{\mathsf{R}} \hat \r_\mathrm{X}(\mathcal{F}) \le \sqrt{\frac{2 \log s_\mathcal{F}(n)}{n}} \] As a result, \[ \r_n(\mathcal{F}) \le \sqrt{\frac{2 \log s_\mathcal{F}(n)}{n}} \]

Definition: VC dimension. Given a function set \(\mathcal{F}\) with Boolean output in \(\mathcal{Y}=\{ 0,1 \}\), we define its VC dimension \(\newcommand{VC}{\mathrm{VC}} \VC(\mathcal{F})\) as the size of the largest dataset that can be shattered by \(\mathcal{F}\), i.e. labeled by \(\mathcal{F}\) in all \(2^n\) possible ways: \[ \VC(\mathcal{F}) = \sup\{ n: s_\mathcal{F}(n) = 2^n \} \] Equivalently, a dataset \(S = \{ x_1,\dots,x_n \}\) is shattered by \(\mathcal{F}\) when \[ \left| \{ [f(x_1),\dots,f(x_n)]: f \in \mathcal{F} \} \right| = 2^n \]

The VC dimension reflects the richness (capacity) of the function set in classification tasks.

Examples

Here are some examples of function set:

  • Interval functions: \(\mathcal{F}_\text{int} = \{ f_{a,b}(x) = \one[a \le x \le b]: a,b \in \R \}\)

    The VC dimension is \(2\): any two distinct points can be shattered, but no three points \(x_1 < x_2 < x_3\) can, since an interval containing \(x_1\) and \(x_3\) must also contain \(x_2\), so the labeling \((1,0,1)\) is unreachable. In fact, the shattering coefficient for the set of interval functions is exactly \(s_{\mathcal{F}_\text{int}}(n) = \binom{n+1}{2} + 1\): the realizable labelings on \(n\) distinct points are the \(\binom{n+1}{2}\) contiguous blocks of \(1\)s plus the all-zero labeling.

    The takeaway from this example is that the VC dimension often equals the number of parameters characterizing the function set (here two, \(a\) and \(b\)).

  • Binarized sine functions: \(\{ f(x) = \one[\sin(\omega x) \le 0]: \omega \in \R \}\)

    The VC dimension is \(\infty\). We show this by construction. Consider the data points \(x_1,\dots,x_n\) where \(x_i = 2^i\), and let \(y_1,\dots,y_n \in \{0,1\}\) be any desired labels. We claim that \(\omega = (0.y_1 \cdots y_n 1)_2\, \pi\) realizes them (the trailing \(1\) bit keeps the fractional part nonzero). Indeed, \[ \omega x_i = (y_1 \cdots y_i . y_{i+1} \cdots y_n 1)_2\, \pi = (N_i + r_i)\pi \] where the integer part satisfies \(N_i \equiv y_i \pmod 2\) and the fractional part satisfies \(r_i \in (0,1)\). Hence \[ \sin(\omega x_i) = (-1)^{N_i} \sin(r_i \pi), \] which is negative iff \(y_i = 1\), so \(f(x_i) = \one[\sin(\omega x_i) \le 0] = y_i\). This holds for any \(y_1,\dots,y_n\). As a result, the shattering coefficient for the set of binarized sine functions is \(2^n\).

    The takeaway from this example is that even a single-parameter function set can have infinite VC dimension.

  • Binarized convex functions: \(\mathcal{F}_\text{convex} = \{ f(x) = \one[g(x) \le 0]: \text{$g$ is convex} \}\)

    The VC dimension is \(\infty\). We show this by construction. Consider data points \(x_1,\dots,x_n \in \R^d\) lying on a sphere, and let \(y_1,\dots,y_n\) be arbitrary labels. Let \(I = \{ i: y_i = 1 \}\) and \(C = \textrm{convex-hull}(\{ x_i: i \in I \})\). No point on a sphere lies in the convex hull of other points on the sphere, so \(C\) cannot contain any \(x_j\) whose label is \(0\). Now take the convex function \[ g(x) = \mathrm{dist}(x, C) = \min_{z \in C} \lVert x - z \rVert \] (the distance to a convex set is convex, and vanishes exactly on the set).

    Then we have \[ f(x_j) = \one[g(x_j) \le 0] = \one[x_j \in C] = y_j \] As a result, the shattering coefficient for the set of binarized convex functions is \(2^n\).
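The interval example lends itself to a brute-force check. The sketch below (plain Python; the function names are our own) enumerates the label vectors realizable by interval functions over \(n\) distinct points. Restricting the endpoints \(a, b\) to the data points loses no labelings, since an interval's effect on a finite set is determined by the smallest and largest points it contains.

```python
from math import comb

def interval_labelings(points):
    """Distinct label vectors [f(x_1), ..., f(x_n)] over f_{a,b}(x) = 1[a <= x <= b]."""
    labelings = {tuple(0 for _ in points)}          # empty interval (a > b)
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings

n = 6
points = list(range(n))                             # any n distinct reals attain the max
assert len(interval_labelings(points)) == comb(n + 1, 2) + 1

# VC dimension 2: a pair is shattered, but no triple is (labeling (1,0,1) is unreachable)
print(len(interval_labelings([0, 1])) == 4)         # True
print(len(interval_labelings([0, 1, 2])) == 8)      # False
```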

Theorem: VC dimension of \(d\)-dimensional hyperplanes. Consider the set of \(d\)-dimensional linear (homogeneous) classification rules: \[ \mathcal{F} = \{ \one[w^\intercal x \ge 0]: w \in \R^d \} \] Then the VC dimension of this function set is \(\VC(\mathcal{F}) = d\). (For affine rules \(\one[w^\intercal x + b \ge 0]\), the VC dimension is \(d+1\).)

Proof. We first show that \(\VC(\mathcal{F}) \le d\). Suppose on the contrary that \(\VC(\mathcal{F}) > d\). Then there exist \(x_1,\dots,x_{d+1}\) that can be shattered by \(\mathcal{F}\). Since any \(d+1\) vectors in \(\R^d\) are linearly dependent, there exist \(c_1, \dots, c_{d+1} \in \R\), not all zero, such that \[ c_1 x_1 + \dots + c_{d+1} x_{d+1} = 0 \] Without loss of generality (after reindexing and possibly negating all \(c_i\)), suppose \(c_1 < 0\). Consider the labels \(y_i = \one[c_i \ge 0]\); by shattering, there exists \(w\) such that \[ \forall i,\ y_i = \one[w^\intercal x_i \ge 0] = \one[c_i \ge 0] \] Then \[ c_1 w^\intercal x_1 + \dots + c_{d+1} w^\intercal x_{d+1} = w^\intercal(c_1 x_1 + \dots + c_{d+1} x_{d+1}) = 0 \] On the other hand, for any \(i\), if \(c_i \ge 0\), then \(w^\intercal x_i \ge 0\); if \(c_i < 0\), then \(w^\intercal x_i < 0\). Either way \(c_i w^\intercal x_i \ge 0\), and in particular \(c_1 w^\intercal x_1 > 0\). Therefore, \[ c_1 w^\intercal x_1 + \dots + c_{d+1} w^\intercal x_{d+1} > 0 \] which is a contradiction.

It remains to show that \(\VC(\mathcal{F}) \ge d\), i.e. to exhibit \(x_1, \dots, x_d\) that can be shattered by \(\mathcal{F}\). Choose \(x_i = e_i\), the \(i\)-th standard basis vector of \(\R^d\). Given any labels \(y_1,\dots,y_d \in \{0,1\}\), the weight vector \(w = [2y_1 - 1,\dots,2y_d - 1]^\intercal\) satisfies \(w^\intercal e_i = 2y_i - 1\), which is \(\ge 0\) iff \(y_i = 1\). Q.E.D.
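The lower-bound construction in the proof is easy to verify numerically. The sketch below (our own illustration, assuming NumPy is available) checks that the standard basis of \(\R^d\) is shattered: for every labeling \(y\), the weight vector \(w = 2y - 1\) reproduces it exactly.

```python
import itertools
import numpy as np

d = 4
basis = np.eye(d)                                  # x_i = e_i
realized = set()
for y in itertools.product([0, 1], repeat=d):
    w = 2 * np.array(y) - 1                        # w_i = +1 if y_i = 1, else -1
    labels = tuple(int(w @ x >= 0) for x in basis)
    assert labels == y                             # the construction hits the target labeling
    realized.add(labels)

print(len(realized) == 2 ** d)                     # True: e_1, ..., e_d are shattered
```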

Connection with Rademacher Complexity

Lemma: Sauer’s lemma. Consider a function set \(\mathcal{F}\) with VC dimension \(\VC(\mathcal{F}) = d\). Then, for every integer \(n \in \N\): \[ s_\mathcal{F}(n) \le \sum_{i=0}^d {n \choose i} \]

Sauer’s lemma trivially holds if \(n \le d\), because \(\sum_{i=0}^d \binom{n}{i} = \sum_{i=0}^n \binom{n}{i} = 2^n\) (the terms with \(i > n\) vanish). The interesting case is \(d < n\), where the bound is polynomial in \(n\): \(s_\mathcal{F}(n) \le \sum_{i=0}^d \binom{n}{i} \le \left(\frac{e n}{d}\right)^d\).
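To see how much Sauer's lemma buys, the following sketch (plain Python; the values of \(n\) and \(d\) are arbitrary choices of ours) compares the exact sum \(\sum_{i=0}^d \binom{n}{i}\), the polynomial upper bound \((en/d)^d\), and the trivial bound \(2^n\).

```python
from math import comb, e

def sauer(n, d):
    """Sauer's bound on the shattering coefficient: sum_{i=0}^d C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

for n, d in [(10, 3), (50, 5), (100, 10)]:
    print(f"n={n:4d} d={d:2d}  sum={sauer(n, d):.3e}  "
          f"(en/d)^d={(e * n / d) ** d:.3e}  2^n={2.0 ** n:.3e}")
```

For \(d\) fixed and \(n\) growing, the gap between the polynomial \((en/d)^d\) and the exponential \(2^n\) widens rapidly.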

Theorem: Rademacher complexity bound with VC dimension. Consider a function set \(\mathcal{F}\) with Boolean output in \(\mathcal{Y} = \{ 0,1 \}\) whose VC dimension is \(\VC(\mathcal{F}) = d\). Then, for \(n \ge d\), we have the following bound on the empirical Rademacher complexity over every dataset \(S\) of size \(n\): \[ \hat \r_S(\mathcal{F}) \le \sqrt{ \frac{2d (\log (n/d) + 1)}{n} } \] As a result, \[ \r_n(\mathcal{F}) \le \sqrt{\frac{2d (\log(n/d) + 1)}{n}} \] Proof. Combine Massart's lemma applied to the shattering coefficient with Sauer's lemma: \(\log s_\mathcal{F}(n) \le d \log(en/d) = d(\log(n/d) + 1)\).

Theorem: excess risk bound via VC dimension. Consider a hypothesis set \(\mathcal{F}\) with Boolean output and \(\VC(\mathcal{F}) = d\), and suppose the loss function is the zero/one loss. Then, with probability at least \(1-\delta\), \[ L(\hat f) - L(f^*) \le \sqrt{\frac{32 d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \] Proof. A direct application of the excess risk bound via Rademacher complexity gives \[ \begin{aligned} L(\hat f) - L(f^*) &\le 4 \r_n(\mathcal{F}) + \sqrt{\frac{2 \log(2/\delta)}{n}} \\ &\le 4 \sqrt{\frac{2d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \\ &= \sqrt{\frac{32d(\log(n/d) + 1)}{n}} + \sqrt{\frac{2 \log(2/\delta)}{n}} \\ \end{aligned} \]
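As a quick sanity check on the rate, the sketch below (plain Python; the choices \(d = 10\) and \(\delta = 0.05\) are our own) evaluates the excess risk bound for increasing \(n\); it decays roughly like \(\sqrt{\log n / n}\).

```python
from math import log, sqrt

def excess_risk_bound(n, d, delta):
    """Excess risk bound via VC dimension; valid for n >= d."""
    return sqrt(32 * d * (log(n / d) + 1) / n) + sqrt(2 * log(2 / delta) / n)

for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n={n:>8d}  bound={excess_risk_bound(n, d=10, delta=0.05):.4f}")
```

Note that for small \(n\) the bound exceeds \(1\) and is vacuous for the zero/one loss; it only becomes informative once \(n\) is large relative to \(d\).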
