
Kernel Methods

Approximation Error

While a bounded excess risk is necessary for good learning performance, satisfactory results also require a small value of \(L(f^*)\), commonly called the approximation error.

We can decompose the loss of a supervised learning model as the sum of the excess risk (also called estimation error, or variance, though not necessarily in the strict statistical sense) and the approximation error (also called bias): \[ L(\hat f) = \underbrace{L(\hat f) - L(f^*)}_{\text{excess risk}} + \underbrace{\color{red}{L(f^*)}}_\text{approximation error} \] Another key question in machine learning is how to reduce the approximation error.

Revisiting Linear Regression

Recall that in linear regression we want to learn a linear model \(f_w(x) = w^\top x\) to predict a continuous label \(Y \in \R\). When using squared-error loss, we obtain the following ERM task: \[ \begin{equation} \min_{w}\ \frac{1}{n} \sum_{i=1}^n \left( w^\top x_i - y_i \right)^2 \tag{Linear Regression} \label{linreg} \end{equation} \] One way to make this formulation more powerful is to replace the linear function class with a richer one. This is reminiscent of powerful models such as neural networks. Another powerful and well-established approach is the kernel method, which substitutes \(x \in \R^d\) with a fixed feature map \(\phi: \R^d \to \R^m\), mapping the input to a potentially high-dimensional space with \(m \gg d\).

\[ \begin{equation} \min_w\ \frac{1}{n} \sum_{i=1}^n (w^\top \phi(x_i) - y_i)^2 \tag{Kernel Method} \label{kernel-method} \end{equation} \]

The typical way to solve the above problem would be gradient descent: \[ w^{t+1} = w^{t} - \gamma \frac{2}{n} \sum_{i=1}^n ((w^t)^\top \phi(x_i) - y_i) \phi(x_i) \] One observation (underlying the representer theorem) is that the update to \(w\) is a linear combination of the \(\phi(x_i)\)’s. If \(w^0 = 0\), we have \(w^t \in \Col(\phi(x_1),\dots,\phi(x_n))\) for every \(t\). In this sense, the \(m\)-dimensional iterate \(w^t\) is confined to the subspace spanned by \(\phi(x_1),\dots,\phi(x_n)\), which has dimension at most \(n\). If \(m < n\), this is not interesting, because \(n\) vectors in \(\R^m\) can already span all of \(\R^m\). If \(m > n\), however, \(w^t\) resides in a subspace of dimension at most \(n\): we reduce an \(m\)-dimensional optimization problem to an \(n\)-dimensional one.
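
To make the span argument concrete, here is a minimal NumPy sketch (not from the lecture; the degree-2 monomial feature map and the synthetic data are assumptions for illustration) that runs gradient descent from \(w^0 = 0\) and checks numerically that the iterate stays in \(\Col(\phi(x_1),\dots,\phi(x_n))\).

```python
# Gradient descent on the feature-space least squares objective, starting from w^0 = 0,
# followed by a numerical check that w stays in the span of phi(x_1), ..., phi(x_n).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3

def phi(x):
    """Hypothetical degree-2 monomial feature map R^d -> R^m with m = 1 + d + d^2."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
Phi = np.stack([phi(x) for x in X])          # (n, m) design matrix in feature space
m = Phi.shape[1]

w = np.zeros(m)                              # w^0 = 0
gamma = 0.01
for _ in range(500):
    grad = (2 / n) * Phi.T @ (Phi @ w - y)   # gradient of (1/n) sum_i (w^T phi(x_i) - y_i)^2
    w -= gamma * grad

# Distance from w to Col(phi(x_1), ..., phi(x_n)), via a least squares projection.
alpha, *_ = np.linalg.lstsq(Phi.T, w, rcond=None)
print("residual of projecting w onto the span:", np.linalg.norm(Phi.T @ alpha - w))  # ~ 0
```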

Let’s rewrite \(w\) as a linear combination of \(\phi(x_1),\dots,\phi(x_n)\), i.e. \(w = \sum_{j=1}^n \alpha_j \phi(x_j)\). The problem becomes \[ \min_\alpha \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j \underbrace{\langle \phi(x_j), \phi(x_i) \rangle}_{\mathcal{K}(x_j, x_i)} - y_i \right)^2 \] It turns out that the problem can be written compactly as \[ \min_\alpha \frac{1}{n} \| y - K \alpha \|_2^2 \] where \[ y = [y_1,\dots,y_n]^\top, \alpha = [\alpha_1, \dots, \alpha_n]^\top \\ K = \begin{pmatrix} \mathcal{K}(x_1, x_1) & \cdots & \mathcal{K}(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \mathcal{K}(x_n, x_1) & \cdots & \mathcal{K}(x_n, x_n) \end{pmatrix} \] The remaining question is: for which functions \(\mathcal{K}: \R^d \times \R^d \to \R\) does there exist a \(\phi: \R^d \to \R^m\) such that \(\mathcal{K}(x,x') = \langle \phi(x), \phi(x') \rangle\)?
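
As a concrete illustration (my own sketch; the degree-2 polynomial kernel and the synthetic data are arbitrary choices), the following builds the Gram matrix \(K\) and solves the \(\alpha\)-formulation by least squares.

```python
# Kernelized least squares: min_alpha (1/n) ||y - K alpha||^2, solved via lstsq.
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def poly_kernel(a, b, r=2):
    return (1.0 + a @ b) ** r

K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])   # (n, n) Gram matrix

# K may be rank deficient, so use least squares (minimum-norm solution).
alpha, *_ = np.linalg.lstsq(K, y, rcond=None)

def predict(x_new):
    # f(x) = sum_j alpha_j K(x_j, x)
    return sum(a * poly_kernel(xj, x_new) for a, xj in zip(alpha, X))

print("training residual:", np.linalg.norm(K @ alpha - y))
print("prediction at a new point:", predict(rng.normal(size=d)))
```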

Kernel Function

Definition: kernel function. We call \(\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \R\) a kernel function if, for every integer \(t \in \N\) and vectors \(x_1, \dots, x_t \in \mathcal{X}\), the matrix \(K \in \R^{t \times t}\) with \((i,j)\)-th entry \(\mathcal{K}(x_i, x_j)\) is symmetric and positive semi-definite: \[ K = \begin{pmatrix} \mathcal{K}(x_1, x_1) & \cdots & \mathcal{K}(x_1, x_t) \\ \vdots & \ddots & \vdots \\ \mathcal{K}(x_t, x_1) & \cdots & \mathcal{K}(x_t, x_t) \end{pmatrix} \succeq 0 \]

Theorem: \(\mathcal{K}: \R^d \times \R^d \to \R\) is a kernel function if and only if there exists a feature map \(\phi: \R^d \to \R^m\) such that for every \(x, x'\), \[ \mathcal{K}(x, x') = \langle \phi(x), \phi(x') \rangle \] Note the subtlety that \(m\) may be infinite.

Examples

  1. Linear kernel: \(\mathcal{K}_\text{linear}(x, x') = \langle x, x' \rangle\)

  2. Degree-\(r\) polynomial kernel: \(\mathcal{K}_\text{poly-$r$}(x, x') = (1 + \langle x, x' \rangle)^r\)

    The interesting thing about the polynomial kernel is that, if we used the degree-\(r\) feature map \(\phi_r\) directly, the number of optimization variables would be \(m = O(d^r)\); with the kernel formulation, the number of variables is just \(n\).

    As for the proof, it follows from the facts (shown below) that the sum (here of \(1\) and \(x^\top x'\)) and the product (here of \(r\) copies of \(1 + x^\top x'\)) of kernel functions are again kernel functions.

  3. Gaussian kernel with bandwidth \(\sigma^2\): \(\mathcal{K}_\text{Gaussian($\sigma^2$)}(x, x') = \exp(-\frac{\|x-x'\|_2^2}{2\sigma^2})\)

    The reason we do not use \(f(x, x') = -\|x-x'\|_2^2\) as a similarity measure is that the resulting matrix is, in general, not PSD.

    As for the proof, we can separate the terms: \[ \begin{aligned} &\exp(-\frac{\|x-x'\|_2^2}{2\sigma^2}) = \exp(-\frac{\|x\|_2^2}{2\sigma^2} - \frac{\|x'\|_2^2}{2\sigma^2} + \frac{x^\top x'}{\sigma^2}) \\ &= \underbrace{\exp(-\frac{\|x\|_2^2}{2\sigma^2})}_{\varphi(x)} \underbrace{\exp(-\frac{\|x'\|_2^2}{2\sigma^2})}_{\varphi(x')} \exp(\frac{x^\top x'}{\sigma^2}) \\ \end{aligned} \] Now it remains to show that \(\exp(\frac{x^\top x'}{\sigma^2})\) is a kernel function. To see this, recall the Taylor series \[ \exp(\frac{x^\top x'}{\sigma^2}) = \sum_{m=0}^\infty \frac{(x^\top x')^m}{\sigma^{2m}\, m!} \] which is an infinite sum of kernel functions (each term is a power of the linear kernel scaled by a nonnegative constant). A quick numerical sanity check of these kernels appears right after this list.
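
The following is a quick numerical sanity check, not a proof: it builds Gram matrices on random points for the three kernels above (with arbitrary parameter choices) and for the negative squared distance mentioned in item 3, and reports the smallest eigenvalue of each.

```python
# Gram matrices of valid kernels are PSD (up to floating-point error), while
# K(x, x') = -||x - x'||^2 is not.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))

def gram(k):
    return np.array([[k(a, b) for b in X] for a in X])

kernels = {
    "linear":                     lambda a, b: a @ b,
    "poly-3":                     lambda a, b: (1.0 + a @ b) ** 3,
    "gaussian (sigma^2 = 1)":     lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0),
    "neg-sqdist (not a kernel)":  lambda a, b: -np.sum((a - b) ** 2),
}
for name, k in kernels.items():
    lam_min = np.linalg.eigvalsh(gram(k)).min()
    print(f"{name:28s} smallest eigenvalue = {lam_min:+.3e}")
```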

Properties

Proposition: sum of kernel functions. If \(\mathcal{K}_1\) and \(\mathcal{K}_2\) are kernel functions, then their sum, i.e. \[ \mathcal{K}(x,x') = \mathcal{K}_1(x,x') + \mathcal{K}_2(x,x') \] will also be a kernel function.

Proof. Consider the concatenated feature map \(\phi(x) = [\phi_1(x), \phi_2(x)]\); then \(\langle \phi(x), \phi(x') \rangle = \mathcal{K}_1(x,x') + \mathcal{K}_2(x,x')\).

Theorem: product of kernel functions. If \(\mathcal{K}_1\) and \(\mathcal{K}_2\) are kernel functions, then their product, i.e. \[ \mathcal{K}(x,x') = \mathcal{K}_1(x,x') \times \mathcal{K}_2(x,x') \] will also be a kernel function.

Proof. Let \(A, B\) be two PSD matrices (e.g. the Gram matrices of \(\mathcal{K}_1\) and \(\mathcal{K}_2\) on the same \(t\) points, so that the Gram matrix of \(\mathcal{K}\) is the elementwise product \(A \odot B\)) with spectral decompositions \(A = \sum_{i=1}^t \lambda_i u_i u_i^\top\) and \(B = \sum_{j=1}^{t} \gamma_j v_j v_j^\top\), where \(\lambda_i, \gamma_j \ge 0\). Then \[ \begin{aligned} A \odot B &= \left( \sum_{i=1}^t \lambda_i u_i u_i^\top \right) \odot \left( \sum_{j=1}^{t} \gamma_j v_j v_j^\top \right) \\ &= \sum_{i=1}^t \sum_{j=1}^{t} \lambda_i \gamma_j (u_i u_i^\top) \odot (v_j v_j^\top) \\ &= \sum_{i=1}^t \sum_{j=1}^{t} \lambda_i \gamma_j (u_i \odot v_j) (u_i \odot v_j)^\top \\ \end{aligned} \] a nonnegative combination of rank-one PSD matrices. As a result, \(A \odot B\) is PSD.
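
A quick numerical illustration of the Schur product argument (the random PSD matrices are generated only for this check; this is not a proof):

```python
# The elementwise (Schur) product of two PSD matrices is PSD.
import numpy as np

rng = np.random.default_rng(3)
t = 40
M1, M2 = rng.normal(size=(t, t)), rng.normal(size=(t, t))
A = M1 @ M1.T          # PSD by construction
B = M2 @ M2.T          # PSD by construction

hadamard = A * B       # elementwise (Schur) product A ⊙ B
print("min eigenvalue of A ⊙ B:", np.linalg.eigvalsh(hadamard).min())  # >= 0 up to float error
```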

Defining Function Space via Kernel

The motivation for the RKHS is that we want to define a function space based on the kernel function \(\mathcal{K}\) (a space of linear combinations of the feature functions associated with the kernel). The reason we need such a function space is that we can never directly optimize over \(\set{w^\top \phi: w \in \R^m}\), since in the kernel method \(\phi\) is not guaranteed to have a closed-form (or even finite-dimensional) formula. We need a surrogate function space over which to do the optimization.

Definition: reproducing kernel Hilbert space (RKHS). For kernel function \(\mathcal{K}\), we define the reproducing kernel Hilbert space \(\mathcal{H}\) as the following set of functions: \[ \mathcal{H} = \{ f(x) \triangleq \sum_{i=1}^t \alpha_i \mathcal{K}(x_i, x): t \in \N, \alpha_1,\dots,\alpha_t \in \R, x_1,\dots,x_t \in \mathcal{X} \} \]

This looks worrying, because it seems that we have converted an \(m\)-dimensional optimization problem into an infinite-dimensional one. But the representer theorem below will show that, for our learning problems, the optimization effectively involves only \(n\) coefficients.

Example: RKHS of linear kernel

\[ \begin{aligned} \mathcal{H} &= \{ \sum_{i=1}^t \alpha_i x_i^T x: t \in \N, \alpha_1,\dots,\alpha_t \in \R, x_1,\dots,x_t \in \mathcal{X} \} \\ \end{aligned} \] Note that \[ \sum_{i=1}^t \alpha_i x_i^T x = (\sum_{i=1}^t \alpha_i x_i^T) x \] and \((\sum_{i=1}^t \alpha_i x_i^T)\) can realize any row vector \(w^\top\) with \(w \in \R^d\). Therefore, \(\mathcal{H}\) is the set of all linear functions.

Example: RKHS of polynomial kernel

Take \(r=2\) as an example. \[ \begin{aligned} \mathcal{H} &= \{ \sum_{i=1}^t \alpha_i (1 + x_i^T x)^2: t \in \N, \alpha_1,\dots,\alpha_t \in \R, x_1,\dots,x_t \in \mathcal{X} \} \\ \end{aligned} \] Note that \[ \begin{aligned} &\sum_{i=1}^t \alpha_i (1 + x_i^T x)^2 = \sum_{i=1}^t \alpha_i (1 + 2 x_i^T x + (x_i^T x)^2) \\ &= \sum_{i=1}^t \alpha_i + (\sum_{i=1}^t 2\alpha_i x_i^T) x + \sum_{i=1}^t \alpha_i (x^T x_i x_i^T x) \\ &= \sum_{i=1}^t \alpha_i + (\sum_{i=1}^t 2\alpha_i x_i^T) x + x^T (\sum_{i=1}^t \alpha_i x_i x_i^T) x \\ \end{aligned} \] Conversely, consider an arbitrary quadratic function \(x^T A x + b^T x + c\) with symmetric matrix \(A\), vector \(b\) and constant \(c\). Apply the (full) spectral decomposition to \(A\) to get \[ A = \sum_{i=1}^d \lambda_i v_i v_i^T \] where \(\{v_1,\dots,v_d\}\) is an orthonormal basis (some \(\lambda_i\) may be zero), and write \(b\) in this basis as \(b = \sum_{i=1}^d \gamma_i v_i\). For \(i=1,\dots,d\), we choose \[ x_i = v_i, \quad x_{d+i} = -v_{i}, \\ \alpha_i + \alpha_{d+i} = \lambda_{i}, \\ \alpha_i - \alpha_{d+i} = \gamma_{i}/2 \\ \] so that the quadratic and linear terms match. Finally, we choose one extra point \(x_{2d+1} = 0\) with \(\alpha_{2d+1} = c - \sum_{i=1}^{2d} \alpha_i\) to match the constant term.

In this way, we can rewrite any quadratic function as an element of \(\mathcal{H}\). As a result, \(\mathcal{H}\) is the set of all quadratic functions.
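
The construction above can be checked numerically. The following sketch (with a randomly generated quadratic, an assumption for illustration) builds the points and coefficients as prescribed and verifies that the resulting RKHS element equals \(x^\top A x + b^\top x + c\).

```python
# Verify that the prescribed alpha_i and x_i reproduce an arbitrary quadratic
# as an element of the RKHS of the degree-2 polynomial kernel.
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.normal(size=(d, d))
A = (A + A.T) / 2                                       # arbitrary symmetric A
b = rng.normal(size=d)
c = rng.normal()

lam, V = np.linalg.eigh(A)                              # A = sum_i lam_i v_i v_i^T
gamma = V.T @ b                                         # b = sum_i gamma_i v_i

# Points and coefficients from the construction in the text.
points = np.concatenate([V.T, -V.T, np.zeros((1, d))])  # x_i = v_i, x_{d+i} = -v_i, x_{2d+1} = 0
alpha_plus  = (lam + gamma / 2) / 2                     # alpha_i
alpha_minus = (lam - gamma / 2) / 2                     # alpha_{d+i}
alphas = np.concatenate([alpha_plus, alpha_minus, [c - lam.sum()]])

def f_rkhs(x):
    return sum(a * (1.0 + p @ x) ** 2 for a, p in zip(alphas, points))

x = rng.normal(size=d)
print(f_rkhs(x), x @ A @ x + b @ x + c)                 # the two values agree
```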

Hilbert Space

Note that above we define the RKHS via the kernel rather than via the feature map, for the same reason as before: computation with the feature map lives in the (potentially much) higher-dimensional space, which is more costly.

Definition: inner product in an RKHS. Given two functions \(f(x) = \sum_{i=1}^t \alpha_i \mathcal{K}(x_i, x)\) and \(g(x) = \sum_{j=1}^{t'} \beta_j \mathcal{K}(x_j', x)\) in the RKHS \(\mathcal{H}\) of kernel \(\mathcal{K}\), we define their inner product as \[ \langle f,g \rangle = \sum_{i=1}^t \sum_{j=1}^{t'} \alpha_i \beta_j \mathcal{K}(x_i, x_j') \]

Proposition: RKHS with the defined inner product is a Hilbert space. The RKHS defined for a kernel \(\mathcal{K}\) coupled with the above inner product will result in a Hilbert space, where the following holds for every \(f, f_1, f_2, g \in \mathcal{H}\):

  1. Symmetry: \(\langle f, g \rangle = \langle g, f \rangle\)
  2. Linearity and homogeneity: for all \(\gamma \in \R\): \(\langle f_1 + \gamma f_2, g \rangle = \langle f_1, g \rangle + \gamma \langle f_2, g \rangle\)
  3. Positive definiteness: \(\langle f, f \rangle \ge 0\), with equality if and only if \(f = 0\)

With the inner product in place, we can define the induced norm:

Definition: norm in an RKHS. For a function \(f(x) = \sum_{i=1}^t \alpha_i \mathcal{K}(x_i, x)\) in the RKHS \(\mathcal{H}\) for kernel \(\mathcal{K}\), we define its norm as \[ \|f\|_\mathcal{H}^2 \triangleq \langle f,f \rangle = \sum_{i=1}^t \sum_{j=1}^t \alpha_i \alpha_j \mathcal{K}(x_i, x_j) = \alpha^\top K \alpha \] where \(K \in \R^{t \times t}\) and \(K_{ij} = \mathcal{K}(x_i, x_j)\).
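
A tiny illustration of the norm formula (the Gaussian kernel and the random coefficients here are arbitrary choices):

```python
# Evaluate ||f||_H^2 = alpha^T K alpha for f = sum_i alpha_i K(x_i, .).
import numpy as np

rng = np.random.default_rng(5)
t, d, sigma2 = 10, 3, 1.0
X = rng.normal(size=(t, d))
alpha = rng.normal(size=t)

def k(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma2))

K = np.array([[k(a, b) for b in X] for a in X])
print("||f||_H^2 =", alpha @ K @ alpha)   # always >= 0 since K is PSD
```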

Learning with Kernel Functions

For \(f(x) = w^\top x\) and \(g(x) = (w')^\top x\) in the RKHS of the linear kernel \(\mathcal{H}_\text{linear} = \{ f_w(x) = w^\top x: w \in \R^d \}\), write \(w = \sum_i \alpha_i x_i\) and \(w' = \sum_j \beta_j x_j\). Then \[ \begin{aligned} \langle f,g \rangle &= \sum_i \sum_j \alpha_i \beta_j x_i^\top x_j \\ &= \sum_i \alpha_i x_i^\top \sum_j \beta_j x_j \\ &= \sum_i \alpha_i x_i^\top w' \\ &= w^\top w' \\ \end{aligned} \] As a result, \(\|f\|_{\mathcal{H}_\text{linear}} = \|w\|\). We generalize from \(\eqref{linreg}\) to \[ \begin{equation} \min_{f \in \mathcal{H}}\ \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + Q(\|f\|_\mathcal{H}) \tag{Regularized Kernel Method} \label{regularized-kernel-method} \end{equation} \] where \(\mathcal{H}\) is a general RKHS and \(Q: \R^+ \to \R\) is an increasing function acting as a regularization term.

Theorem: representer theorem. Given samples \(x_1,\dots,x_n\), consider the \(\eqref{regularized-kernel-method}\) problem. Every optimal solution \(f^* \in \mathcal{H}\) satisfies the following for some real coefficients \(\alpha_1,\dots,\alpha_n \in \R\): \[ f^*(x) = \sum_{j=1}^n \alpha_j \mathcal{K}(x_j, x) \] Proof. To show it, first note that \(f^*\) comes from \(\mathcal{H}\). Let \(\phi\) be the feature function associated with the kernel function \(\mathcal{K}\). Thus, \(f^*(x)\) can be written as \((w^*)^\top \phi(x)\) where \(w^* = \sum_{j=1}^n \alpha_j \phi(x_j) + v\) and \(v\) is the component that is perpendicular to \(\Col(\phi(x_1), \dots, \phi(x_n))\)​.

Let \(\tilde w = \sum_{j=1}^n \alpha_j \phi(x_j)\). We claim that \(v = 0\); otherwise \(\tilde f(x) = \tilde w^\top \phi(x)\) attains the same loss on every sample but has a smaller norm, contradicting the optimality of \(f^*\).

Representer theorem implies that the kernel method solves the following optimization problem: \[ \min_{\alpha \in \R^n} \frac{1}{n} \sum_{i=1}^n \ell \left( \sum_{j=1}^n \alpha_j \mathcal{K}(x_j, x_i), y_i \right) + Q(\sqrt{\alpha^\top K \alpha}) \]

Note how we drop the intractable \(\phi\) in \(\eqref{kernel-method}\) in the above formulation.

Example: Kernel-based Ridge Regression

We choose \(\ell(\hat y, y) = (\hat y-y)^2\) and \(Q(z) = \lambda z^2\), so the penalty is \(\lambda \|f\|_\mathcal{H}^2 = \lambda \|w\|_2^2\) for the linear kernel. The kernel function is the linear kernel. \[ \begin{gathered} \min_{w \in \R^d} \sum_{i=1}^n (w^\top x_i -y_i)^2 + \lambda \|w\|_2^2 \\ \Downarrow \\ \min_{\alpha \in \R^n} \|y - K \alpha\|_2^2 + \lambda \alpha^\top K \alpha \end{gathered} \] The above is a convex optimization problem, so setting the gradient to zero gives \[ \begin{aligned} 2 K (K \alpha^* - y) + 2\lambda K \alpha^* &= 0 \\ K \left( (K + \lambda I) \alpha^* - y \right) &= 0 \\ \end{aligned} \]

\(\alpha^* = (K + \lambda I)^{-1} y\) satisfies this condition and is therefore a solution.
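
A minimal NumPy sketch of kernel ridge regression using this closed form; the Gaussian kernel and the synthetic one-dimensional data are my own choices for illustration.

```python
# Kernel ridge regression: alpha* = (K + lambda I)^{-1} y, prediction f(x) = sum_j alpha_j K(x_j, x).
import numpy as np

rng = np.random.default_rng(6)
n, lam, sigma2 = 40, 0.1, 0.5
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def k(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma2))

K = np.array([[k(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K + lam * np.eye(n), y)

def predict(x_new):
    return sum(a_j * k(x_j, x_new) for a_j, x_j in zip(alpha, X))

x_test = np.array([1.0])
print("prediction:", predict(x_test), " true value:", np.sin(x_test[0]))
```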

Example: Kernel-based SVM

We choose \(\ell(\hat y, y) = \max(0, 1 - \hat y y)\), the hinge loss, and \(Q(z) = \lambda z^2\). The kernel function is the linear kernel. \[ \min_{w \in \R^d} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda \|w\|_2^2 \\ \] Note that \(\max(0, 1-z) = \max_{\alpha \in [0,1]} \alpha(1-z)\). The problem can be reformulated as \[ \min_{w \in \R^d} \max_{\alpha \in [0,1]^n} \sum_{i=1}^n \alpha_i (1 - y_i w^\top x_i) + \lambda \|w\|_2^2 \\ \] The minimax theorem implies that we can swap the \(\min\) and \(\max\) in the above formula (the reason for choosing \(\alpha \in [0,1]^n\) instead of \(\alpha \in \{ 0,1 \}^n\) is that a convex, compact domain is needed to satisfy the conditions of the minimax theorem) to obtain \[ \max_{\alpha \in [0,1]^n} \min_{w \in \R^d} \sum_{i=1}^n \alpha_i (1 - y_i w^\top x_i) + \lambda \|w\|_2^2 \\ \] The inner minimization is attained at \(w = \frac{1}{2\lambda} \sum_{i=1}^n \alpha_i y_i x_i\), which gives \[ \begin{aligned} &\min_{w \in \R^d} \sum_{i=1}^n \alpha_i (1 - y_i w^\top x_i) + \lambda \|w\|_2^2 \\ =\ &\sum_{i=1}^n \alpha_i + \min_{w \in \R^d} - \sum_{i=1}^n \alpha_i y_i x_i^\top w + \lambda \|w\|_2^2 \\ =\ &\sum_{i=1}^n \alpha_i - \frac{1}{4\lambda} \big\| \sum_{i=1}^n \alpha_i y_i x_i \big\|_2^2 \\ =\ &\sum_{i=1}^n \alpha_i - \frac{1}{4\lambda} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\ \end{aligned} \] Therefore, the dual optimization problem to SVM is \[ \max_{\alpha \in [0,1]^n} \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\langle x_i, x_j \rangle}_{\mathcal{K}(x_i, x_j)} \] If we define \(\tilde K = [y_i y_j \mathcal{K}(x_i, x_j)]\), the problem becomes \[ \max_{\alpha \in [0,1]^n} 1^\top \alpha - \frac{1}{4\lambda} \alpha^\top \tilde K \alpha \] A standard method for solving such a box-constrained problem is coordinate ascent (or projected gradient ascent; see the sketch below).
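
Below is a sketch of solving this dual with projected gradient ascent rather than the coordinate method (the Gaussian kernel, the synthetic labels, and the step-size rule are assumptions for illustration); the decision function is recovered as \(f(x) = \frac{1}{2\lambda}\sum_i \alpha_i y_i \mathcal{K}(x_i, x)\) from the inner minimization.

```python
# Dual kernel SVM: max_{alpha in [0,1]^n} 1^T alpha - alpha^T Ktilde alpha / (4 lambda).
import numpy as np

rng = np.random.default_rng(7)
n, d, lam, sigma2 = 60, 2, 0.5, 1.0
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + X[:, 1])                       # synthetic labels in {-1, +1}

def k(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma2))

K = np.array([[k(a, b) for b in X] for a in X])
Ktilde = (y[:, None] * y[None, :]) * K               # Ktilde_ij = y_i y_j K(x_i, x_j)

alpha = np.zeros(n)
eta = 1.0 / (np.linalg.eigvalsh(Ktilde).max() / (2 * lam))   # 1 / Lipschitz constant of the gradient
for _ in range(2000):
    grad = 1.0 - Ktilde @ alpha / (2 * lam)          # gradient of the dual objective
    alpha = np.clip(alpha + eta * grad, 0.0, 1.0)    # project back onto the box [0, 1]^n

def decision(x_new):
    # f(x) = (1 / (2 lambda)) * sum_i alpha_i y_i K(x_i, x), from the inner minimization over w
    return sum(a_i * y_i * k(x_i, x_new) for a_i, y_i, x_i in zip(alpha, y, X)) / (2 * lam)

print("training accuracy:", np.mean(np.sign([decision(x) for x in X]) == y))
```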

Shift-invariant Kernel Functions

Definition: shift-invariant kernel. We call a kernel function \(\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \R\) (with \(\mathcal{X} \subseteq \R^d\)) shift-invariant if there exists a function \(\kappa: \R^d \to \R\) such that for every \(x,x' \in \mathcal{X}\): \[ \mathcal{K}(x,x') = \kappa(x-x') \] Note that \(\kappa\) must be even (since \(K\) is symmetric).

One example is the Gaussian kernel. On the other hand, the linear kernel and the polynomial kernel are not shift-invariant (easy to verify). An important question to ask is how to determine whether a shift-invariant similarity function \(\mathcal{K}(x,x') = \kappa(x - x')\) is a valid kernel or not.

Definition: Fourier transform. Given a function \(f: \R^d \to \R\), we define its Fourier transform \(\hat f: \R^d \to \C\) (real-valued whenever \(f\) is even) as \[ \hat f(\omega) \triangleq \int f(x) \exp(-i\omega^\top x) \d x \] The inverse Fourier transform recovers \(f\): \[ f(x) = (\frac{1}{2\pi})^d \int \hat f(\omega) \exp(i \omega^\top x) \d \omega \\ \Rightarrow (\frac{1}{2\pi})^d \int \hat f(\omega) \d \omega = f(0) \] The Fourier transform of \(\hat f\) itself is \[ \hat{\hat f}(x) = \int \hat f(\omega) \exp(-i\omega^\top x) \d \omega = (2 \pi)^d f(-x) \]

Theorem: Bochner’s theorem. A function \(\kappa: \R^d \to \R\) results in a valid shift-invariant kernel \(\mathcal{K}(x,x') = \kappa(x-x')\) if and only if its Fourier transform \(\hat \kappa\) is real and non-negative everywhere, i.e., \[ \forall \omega \in \R^d, \hat \kappa(\omega) \ge 0 \]

Example: Gaussian Kernel

The Gaussian kernel \(\mathcal{K}(x,x') = \kappa_\text{Gaussian}(z \triangleq x-x') = \exp(-\frac{\|z\|_2^2}{2\sigma^2})\) is a valid and shift-invariant kernel function because \[ \hat \kappa_\text{Gaussian}(\omega) = ({\sqrt{2\pi}}{\sigma})^d \exp(-\frac{\sigma^2 \|\omega\|_2^2}{2}) \ge 0 \] There are two interesting observations:

  1. A Gaussian-shaped function is (up to scaling) a fixed point of the Fourier transform, in that its Fourier transform is still Gaussian-shaped.

  2. Note that \((\frac{1}{2 \pi})^d \hat \kappa_\text{Gaussian}\) is in essence the PDF of normal distribution \(\mathcal{N}(0, \frac{1}{\sigma^2} I)\).

    If \(\sigma^2\) is small, \(\hat \kappa_\text{Gaussian}(\omega)\) spreads out; if \(\sigma^2\) is large, \(\hat \kappa_\text{Gaussian}(\omega)\) concentrates around \(0\). (A short numerical check of the Fourier transform formula above appears right after this list.)
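
As a sanity check of the transform pair above (a one-dimensional sketch with an arbitrary \(\sigma\); not part of the lecture), we can compare a numerical approximation of \(\int \kappa(z) e^{-i\omega z} \d z\) with the closed form \(\sqrt{2\pi}\,\sigma \exp(-\sigma^2 \omega^2 / 2)\).

```python
# Numerical Fourier transform of the 1-D Gaussian vs. its closed form.
import numpy as np

sigma = 0.7
z = np.linspace(-30, 30, 100_001)                   # wide, fine grid; the Gaussian decays fast
dz = z[1] - z[0]
kappa = np.exp(-z**2 / (2 * sigma**2))

for w in [0.0, 0.5, 1.3, 3.0]:
    numeric = ((kappa * np.exp(-1j * w * z)).sum() * dz).real   # ~ int kappa(z) e^{-i w z} dz
    closed_form = np.sqrt(2 * np.pi) * sigma * np.exp(-sigma**2 * w**2 / 2)
    print(f"w = {w:3.1f}:  numeric = {numeric:.6f}   closed form = {closed_form:.6f}")
```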

Non-Example: Box Similarity Score

A non-example of a shift-invariant kernel is the box similarity score \[ s(x,x') = \begin{cases} 1 & \text{if $\|x-x'\|_2 \le \epsilon$} \\ 0 & \text{otherwise} \end{cases} \] In this example, we can easily tell that \(s\) is shift-invariant. But we still have to check whether the box similarity score is a valid kernel function. We form the following function \[ \Pi_\epsilon(z) = \begin{cases} 1 & \text{if $\|z\| \le \epsilon$} \\ 0 & \text{otherwise} \end{cases} \] We attempt to check both the validity and the shift-invariance in one go via Bochner’s theorem. Let’s try \(x \in \R\) first: \[ \begin{aligned} &\hat \Pi_\epsilon(\omega) = \int_{-\infty}^{\infty} \Pi_\epsilon(x) \exp(-i \omega x) \d x \\ &= \int_{-\epsilon}^{+\epsilon} [\cos (\omega x) - i \sin(\omega x)] \d x \\ &= \int_{-\epsilon}^{+\epsilon} \cos (\omega x) \d x \\ &= \frac{2 \sin (\omega \epsilon)}{\omega} \end{aligned} \] The Fourier transform \(\hat \Pi_\epsilon\) is not non-negative everywhere. By Bochner’s theorem, the box similarity score is therefore not both valid and shift-invariant; since it clearly is shift-invariant, it must fail to be a valid kernel function.

Example: Sinc Kernel

The kernel function associated with the \(\DeclareMathOperator{\sinc}{sinc} \kappa_\text{sinc}(z) = \sinc z \triangleq \frac{\sin(\pi z)}{\pi z}\) function (note that here \(z\) is univariate) is a valid and shift-invariant kernel function. This holds because \[ \begin{gathered} \hat \Pi_\epsilon(\omega) = \frac{2 \sin (\omega \epsilon)}{\omega} = 2 \epsilon \sinc(\frac{\omega \epsilon}{\pi}) \\ \end{gathered} \]

And thus, by the double-transform identity and the evenness of \(\Pi_\epsilon\), \[ \begin{aligned} \hat{\hat \Pi}_\epsilon(z) &= 2\pi\, \Pi_\epsilon(-z) = 2\pi\, \Pi_\epsilon(z) \\ \int 2 \epsilon \sinc(\tfrac{\omega \epsilon}{\pi}) \exp(-i\omega z)\d \omega &= 2\pi\, \Pi_\epsilon(z) \\ C_1 \cdot \widehat{\sinc}(C_2 z) &= \Pi_\epsilon(z) \ge 0 \end{aligned} \] for some positive constants \(C_1, C_2\), so \(\widehat{\sinc}\) is non-negative everywhere. In fact, under our convention, \[ \widehat{\sinc}(\omega) = \mathbb{1}[-\pi \le \omega \le \pi] \ge 0 \]

The above suggests how to construct a shift-invariant kernel: begin with a non-negative even function \(f\) and take its Fourier transform \(\hat f\) as the kernel \(\kappa\); then \(\hat \kappa = \hat{\hat f} \propto f \ge 0\), so Bochner’s theorem applies. \(\hat f\) in this case is also called the synthesis function.

Random Fourier Features

Recall that in the synthesis process above, \(\kappa\) is chosen so that \(\hat \kappa\) is non-negative, making \(\mathcal{K}(x,x') = \kappa(x-x')\) a valid shift-invariant kernel. However, an explicit feature map for \(\kappa\) is not always easy to derive (it may be infinite-dimensional). To tackle this, note that since \(\kappa\) is even and a valid shift-invariant kernel, \(\hat \kappa\) is real, even and non-negative. Recall the inverse Fourier transform: \[ \begin{gathered} (\frac{1}{2\pi})^d \int \hat \kappa(\omega) \d \omega = \kappa(0) \\ \\ \begin{aligned} \kappa(\underbrace{x-x'}_z) &= (\frac{1}{2 \pi})^d \int \hat \kappa(\omega) \exp(i \omega^\top z) \d \omega \\ &= (\frac{1}{2 \pi})^d \int \hat \kappa(\omega) \exp(i \omega^\top x - i \omega^\top x') \d \omega \\ \end{aligned} \end{gathered} \] We can rescale so that \(\kappa(0) = 1\). Thus \((\frac{1}{2\pi})^d \hat \kappa\) is a PDF and \(\kappa(z)\) can be written as an expectation: \[ \begin{aligned} &\kappa(z) = \E_{\omega \sim (\frac{1}{2 \pi})^d \hat \kappa}[e^{i \omega^\top (x-x')}] = \E[\underbrace{\exp(i\omega^\top x)}_{\phi_\omega(x)} \underbrace{\exp(-i\omega^\top x')}_{\overline{\phi_\omega(x')}}] \\ &= \E_{\omega \sim (\frac{1}{2 \pi})^d \hat \kappa}[\cos\omega^\top (x-x') + i\sin\omega^\top (x-x')] \\ &\Downarrow_\text{evenness of $\hat \kappa$} \\ &= \E_{\omega \sim (\frac{1}{2 \pi})^d \hat \kappa}[\cos\omega^\top (x-x')] \\ &= \E_{\omega \sim (\frac{1}{2 \pi})^d \hat \kappa}[\cos (\omega^\top x) \cos (\omega^\top x') + \sin (\omega^\top x) \sin (\omega^\top x')] \\ &\Downarrow_\text{using $r$ samples} \\ &\approx \frac{1}{r} \sum_{i=1}^r \left( \cos (\omega_i^\top x) \cos (\omega_i^\top x') + \sin (\omega_i^\top x) \sin (\omega_i^\top x') \right) \end{aligned} \] Let \(\tilde \phi: \R^d \to \R^{2r}\) with \(\tilde \phi(x) = \frac{1}{\sqrt{r}}[\cos(\omega_1^\top x), \dots, \cos(\omega_r^\top x), \sin(\omega_1^\top x), \dots, \sin(\omega_r^\top x)]\). We have \[ \mathcal{K}(x,x') = \kappa(z) \approx \tilde \phi(x)^\top \tilde \phi(x') \] \(\tilde \phi\) is called the random Fourier features map. With this approximation, we obtain a proxy problem to \(\eqref{kernel-method}\): \[ \begin{gather*} \min_{w \in \R^m}\ \frac{1}{n} \sum_{i=1}^n (w^\top \phi(x_i) - y_i)^2 \\ \approx \\ \min_{\tilde w \in \R^{2r}}\ \frac{1}{n} \sum_{i=1}^n (\tilde w^\top \tilde\phi(x_i) - y_i)^2 \\ \end{gather*} \] This trick is especially helpful in the Gaussian kernel case, where \(m\) is infinite.
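
Here is a minimal random Fourier features sketch for the Gaussian kernel (the dimension, bandwidth, and number of features are arbitrary choices): sample \(\omega_i \sim \mathcal{N}(0, \sigma^{-2} I)\), which is exactly \((\frac{1}{2\pi})^d \hat\kappa\) for this kernel, build \(\tilde\phi\), and compare \(\tilde\phi(x)^\top \tilde\phi(x')\) with the exact kernel value.

```python
# Random Fourier features approximation of the Gaussian kernel.
import numpy as np

rng = np.random.default_rng(8)
d, sigma2, r = 5, 1.0, 2000

Omega = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(r, d))   # omega_i ~ N(0, I / sigma^2)

def rff(x):
    proj = Omega @ x                                           # omega_i^T x, shape (r,)
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(r)

def gaussian_kernel(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma2))

x, xp = rng.normal(size=d), rng.normal(size=d)
print("RFF approximation:", rff(x) @ rff(xp))
print("exact kernel:     ", gaussian_kernel(x, xp))
```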

Using the kernel function (instead of the feature map) notation, we have the following theorem.

Theorem: approximation error of random Fourier features. Suppose \(\mathcal{K}\) is a shift-invariant kernel function. We consider a subset of the resulting RKHS \(\mathcal{H}\) in which the Fourier coefficients are bounded by \(C\): \[ \mathcal{H}_C \triangleq \{ (\frac{1}{2\pi})^d \int \alpha(\omega) \hat \kappa(\omega) \underbrace{\exp(i\omega^\top z)}_{\phi_{\omega}(z)} \d \omega: |\alpha(\omega)| \le C \} \] Consider the norm \(\| \cdot \|\) induced by the inner product \(\langle f,g \rangle = \E_q[f(x) g(x)]\) for a distribution \(q\) over inputs. Then for every \(f^* \in \mathcal{H}_C\) and samples \(\omega_1,\dots,\omega_m\) drawn i.i.d. from \((\frac{1}{2\pi})^d \hat \kappa\), with probability at least \(1-\delta\), there exist coefficients \(\alpha_1,\dots,\alpha_m\) such that \[ \| \frac{1}{m} \sum_{i=1}^m \alpha_i \phi_{\omega_i} - f^* \| \le \sqrt{\frac{C^2}{m}} + \sqrt{\frac{2 \log(1/\delta)}{m}} \] Proof. See Homework 2.
