Support Vector Machine
Support vector machine is used in binary classification task. It aims to find a linear hyperplane that separates the data with different labels with the maximum margin. By symmetry, there should be at least one margin point on both side of the decision boundary. These points are called support vectors. Further, not all points on the margin can be called support vectors. If we move support vectors away from the margin, the decision hyperplane should change.
Hard-margin SVM
Suppose the data \(\mathcal D = \{(x^{(i)}, y^{(i)}), i=1, \dots, M\}\), and is linearly separable. The separation hyperplane will be in the form of \(w^Tx + b = 0\). Let \(y^{(i)} = 1\) when \(x^{(i)}\) is above the hyperplane (\(w^Tx^{(i)} + b > 0\)), \(y^{(i)} = -1\) when \(x^{(i)}\) is below the hyperplane (\(w^Tx^{(i)} + b < 0\)).
The distance (margin) from a data point to the separation hyperplane will be \[ d^{(i)} = \frac{|w^Tx^{(i)} + b|}{||w||_2} = \frac{y^{(i)}(w^Tx^{(i)} + b)}{||w||_2}\\ \] \(d^{(i)}\) is called the geometric distance, while \(|w^Tx^{(i)} + b|\) is called the functional distance.
The margin of the hyperplane \(w^Tx + b = 0\) will be \[ d = \min\limits_{i \in \{1,\dots,M\}} d^{(i)} = \min\limits_{i \in \{1,\dots,M\}} \frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||_2} \] The maximum margin solution is found by solving \[ \max\limits_{w,b} \min\limits_{i \in \{1,\dots,M\}} \frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||_2} \] Suppose \((w^\prime, b^\prime)\) is one solution to the above. Then \((\lambda w^\prime, \lambda b^\prime)\) is also a solution, because \[ \frac{y^{(i)}((\lambda w^{\prime})^Tx^{(i)} + \lambda b^\prime)}{||\lambda w^\prime||_2} = \frac{\lambda y^{(i)}((w^{\prime})^Tx^{(i)} + b^\prime)}{\lambda ||w^\prime||_2} = \frac{y^{(i)}((w^{\prime})^Tx^{(i)} + b^\prime)}{||w^\prime||_2} \] There is an extra degree of freedom in this problem. Therefore, we may impose \[ \min\limits_{i \in \{1,\dots,M\}} y^{(i)}(w^Tx^{(i)}+b) = 1 \] to consume this freedom. Consequently, the problem becomes \[ \begin{gather} \max \frac{1}{||w||_2} \iff \min \frac{1}{2}||w||^2_2 \\ s.t.\quad y^{(i)}(w^Tx^{(i)} + b) \ge 1, i = 1,\dots,M \end{gather} \]
Soft-margin SVM
It may not happen that the real data are linearly-separable, or there may exist noisy samples that disrupt this linear separability. In such case, we may allow some samples to violate the margin. We define some slack variables \(\xi_i \ge 0\). \(\xi_i > 0\) means that the sample is inside the margin (\(\xi_i < 1\)), on the boundary (\(\xi_i = 1\)), or even worse, misclassified (\(\xi_i > 1\)). \(\xi_i = 0\) means that the sample is outside the margin.
Of course, these slack variables should be as small as possible. Then the problem becomes \[ \begin{gather} \min \frac{1}{2}||w||^2_2 + C\sum_{i=1}^M\xi_i \\ s.t.\quad y^{(i)}(w^Tx^{(i)} + b) \ge 1 - \xi_i, \xi_i \ge 0, i=1,\dots,M \end{gather} \] \(C\) is a penalty factor on violating the margin. To some extent, it avoid overfitting.
To solve this, first convert the problem into the standard form: \[ \begin{aligned} \min \frac{1}{2}||w||^2_2 &+ C\sum_{i=1}^M\xi_i \\ s.t.\quad 1 - \xi_i - y^{(i)}(w^Tx^{(i)} + b) &\le 0, i=1,\dots,M \\ -\xi_i &\le 0, i=1,\dots,M \end{aligned} \] This problem gives a strong duality. The solution will satisfy the KKT conditions. We first write down the Lagrangian: \[ L(w,b,\xi,\lambda,\mu) = \frac{1}{2}||w||^2_2 + C\sum_{i=1}^M\xi_i + \sum_{i=1}^M\lambda_i(1 - \xi_i - y^{(i)}(w^Tx^{(i)} + b)) - \sum_{i=1}^M\mu_i\xi_i \] Then we form the dual problem: \[ \max_{\lambda,\mu;\lambda_i,\mu_i \ge 0} \min_{w,b,\xi}L(w,b,\xi,\lambda,\mu) \] Take derivative w.r.t. \(w\), \(b\), and \(\xi\) to give \[ \begin{gather} \frac{\partial L}{\partial w} = w - \sum_{i=1}^M\lambda_iy^{(i)}x^{(i)}, \frac{\partial^2 L}{\partial w\partial w} = I \succeq 0 \\ \frac{\partial L}{\partial b} = -\sum_{i=1}^M\lambda_iy^{(i)}, \frac{\partial^2 L}{\partial b\partial b} = 0 \succeq 0 \\ \frac{\partial L}{\partial \xi} = C - \lambda - \mu, \frac{\partial^2 L}{\partial \xi\partial \xi} = 0 \succeq 0 \end{gather} \] The Hessian matrices of \(L\) w.r.t. \(w\), \(b\), and \(\xi\) are positive semi-definite. Therefore, \(\min\limits_{w,b,\xi}L(w,\xi,\lambda,\mu)\) is obtained at its local minimum, i.e. where its first-order derivative meets \(0\): \[ w = \sum_{i=1}^M\lambda_iy^{(i)}x^{(i)} \\ \sum_{i=1}^M\lambda_iy^{(i)} = 0 \\ \lambda + \mu = C \] Substitute above back to \(L\) to transform the dual problem into \[ \begin{gather} \max_{\lambda,\mu} \frac{1}{2}\sum_{i=1}^M\sum_{j=1}^M\lambda_i\lambda_jy^{(i)}y^{(j)}x^{(i)}\cdot x^{(j)} + C\sum_{i=1}^M\xi_i - \sum_{i=1}^M\sum_{j=1}^M\lambda_i\lambda_jy^{(i)}y^{(j)}x^{(i)}\cdot x^{(j)} + \sum_{i=1}^M\lambda_i(1 - \xi_i - y^{(i)}b) + \sum_{i=1}^M(\lambda_i - C)\xi_i \\ s.t.\quad \sum_{i=1}^M\lambda_iy^{(i)} = 0 \\ \lambda_i,\mu_i \ge 0, i=1,\dots,M \\ \Downarrow \\ \max_{\lambda,\mu}D(\lambda,\mu) = -\frac{1}{2}\sum_{i=1}^M\sum_{j=1}^M\lambda_i\lambda_jy^{(i)}y^{(j)}x^{(i)}\cdot x^{(j)} + \sum_{i=1}^M\lambda_i \\ s.t.\quad \sum_{i=1}^M\lambda_iy^{(i)} = 0 \\ 0 < \lambda_i < C, i=1,\dots,M \\ \end{gather} \] Since we know the optimal solution exists, instead of taking derivative of \(D\) w.r.t. \(\lambda\) to solve it, we assume the solution to above problem is \(\lambda^\star\). Then, \[ \begin{gathered} w^\star = \sum_{i=1}^M\lambda^\star_iy^{(i)}x^{(i)} \\ \mu^\star = C - \lambda^\star \end{gathered} \] Complementary constraints will give \[ \lambda^\star_i(1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star)) = 0 \\ \mu_i\xi_i = 0 \] There is at least one \(\lambda^\star_j > 0\) or else \(w^\star = \sum_{i=1}^M\lambda^\star_iy^{(i)}x^{(i)} = 0\), which means the primal constraints \[ 1 - \xi_i - y^{(i)}((w^\star)^Tx^{(i)} + b^\star) = 1 - \xi_i - y^{(i)}b^\star\le 0 \] won’t hold no matter what value \(b^\star\) takes. This specific \(\lambda^\star_j\)’s slackness complementary will give
\[ \begin{aligned} 0 &= (1 - \xi^\star_j - y^{(j)}((w^\star)^Tx^{(j)} + b^\star)) \\ b^\star &= \frac{1 - \xi^\star_j}{y^{(j)}} - (w^\star)^Tx^{(j)} \\ &= \frac{1 - \xi^\star_j}{y^{(j)}} - \sum_{i=1}^M\lambda^\star_iy^{(i)}x^{(i)}\cdot x^{(j)} \end{aligned} \]
According to the value of \(\lambda^\star_i\) (not known yet), added with primal constraints and the complementary constraints, \[ \begin{gather} 1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star) \le 0 \\ \xi_i \ge 0 \\ \lambda^\star_i(1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star)) = 0 \\ \mu^\star_i\xi^\star_i = 0 \\ \lambda^\star_i + \mu^\star_i = C \end{gather} \] we can make interpretations on the value of \(\lambda^\star_i\): $$ \[\begin{aligned} \lambda^\star_i = 0 &\Rightarrow \begin{cases} \mu^\star_i = C \Rightarrow \xi^\star_i = 0 \\ 1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star) \le 0 \end{cases} \Rightarrow y^{(i)}((w^{\star T})x^{(i)} + b^\star) \ge 1 \\ 0 < \lambda^\star_i < C &\Rightarrow \begin{cases} \mu^\star_i > 0 \Rightarrow \xi^\star_i = 0 \\ 1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star) = 0 \end{cases} \Rightarrow y^{(i)}((w^{\star T})x^{(i)} + b^\star) = 1 \\ \lambda^\star_i = C &\Rightarrow \begin{cases} \mu^\star_i = 0 \Rightarrow \xi^\star_i \ge 0 \\ 1 - \xi^\star_i - y^{(i)}((w^{\star T})x^{(i)} + b^\star) = 0 \end{cases} \Rightarrow y^{(i)}((w^{\star T})x^{(i)} + b^\star) \le 1 \\ \end{aligned}\]$$ They corresponds to cases where sample \(x^{(i)}\) is beyond the margin, is on the margin, within the margin, respectively. The key to the above derivation is \(\lambda^\star\), which is to be solved in Solving SVM section.
Hinge Loss Perspective
As an aside, soft-margin SVM is equivalent to \[ \min \frac{1}{C}||w||^2_2 + \sum_{i=1}^M\max(0, 1 - y^{(i)}(w^Tx^{(i)} + b)) \] where \(l_h(\hat y) = \max(0, 1 - y \hat y)\) is called the hinge loss and \(\frac{1}{C} \|w\|_2^2\) plays as the regularization term.
Kernel SVM
The term \(x^{(i)} \cdot x^{(j)}\) in \(D(\lambda,\mu)\) is the inner product between two samples. We can make a table storing these inner products. This naturally introduces the kernel trick, which means we can manipulate the entry in this table by remapping the \((x^{(i)}, x^{(j)})\) so that the entry represents the inner product of some higher-dimensional features. \[ \mathcal K(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)}) \] This saves us from explicitly defining the mapping \(\phi\) of \(x\) to a higher-dimensional feature. This also saves the time of the computation of the inner products of these higher-dimensional features, than that of transform-then-inner-product method.
SVM is a linear classifier, which means its decision boundary is a linear hyperplane in input feature space. By applying kernel trick, we implicitly map the input feature to a higher dimensional one. Therefore the decision boundary would become a linear hyperplane in this higher-dimensional space: \[ w_\phi\phi(x) + b_\phi = 0 \] And thus this hyperplane is not linear in original feature space.
Polynomial Kernel
Input feature \(x = [x_1,\dots,x_N] \in \R^N\)
Kernel between two features is a \(p\)-th order polynomial: \[ \mathcal K(x, x^\prime) = (x^Tx^\prime)^p = (\sum_{i=1}^Nx_ix^\prime_i)^p \]
For example, when \(p = 2\), \[ K(x,x^\prime) = (\sum_{i=1}^Nx_ix^\prime_i)^2 = (\sum_{i=1}^N\sum_{j=1}^Nx_ix_jx^\prime_ix^\prime_j)^2 = \phi(x)\phi(x^\prime) \] The transformation applied on original input feature will be \[ \phi(x) = [x_1x_1,x_1x_2,\dots,x_1x_N,x_2x_1,\dots x_2x_N,\dots,x_Nx_1,\dots,x_Nx_N] \in \R^{N^2} \]
Radial Basis Function Kernel
Kernel between two features is similar to a Gaussian \[ \mathcal K(x,x^\prime) = e^{-\gamma||x - x^\prime||^2_2} \] where \(\gamma\) is a hyper-parameter to be determined.
This kernel is the case where it is hard to define the transformation \(\phi\) explicitly.
Solving SVM
\(\lambda^\star\) is the key to the final solution and by far it remains unsolved: \[ \begin{gather} \max_{\lambda,\mu}D(\lambda,\mu) = -\frac{1}{2}\sum_{i=1}^M\sum_{j=1}^M\lambda_i\lambda_jy^{(i)}y^{(j)}x^{(i)}\cdot x^{(j)} + \sum_{i=1}^M\lambda_i \\ s.t.\quad \sum_{i=1}^M\lambda_iy^{(i)} = 0 \\ 0 < \lambda_i < C, i=1,\dots,M \\ \end{gather} \] We can attack it by sequential minimal optimization.
Multi-class SVM
1 v.s. 1
Train 1-vs-1 SVM for all pairs of classes. Then decide by majority voting.
1 v.s. rest
Train 1-vs-rest SVM for all classes (\(y = 1\) when \(x\) belongs to the “1”). Choose the class with the largest geometric margin.