Introduction

Introduction#

Kernels are used like mathematical magic that is able to fit complicated curve with a collection simple functions called a kernel.

The motivation comes from the observation that many learning algorithms:

have weights that can be written as a linear combination of sample points. $$ \theta = X^\top a $$
have the property that inner products of the design matrix $\Phi(X)$ can be written without computing $\Phi(X)$ directly.

The change is that instead of solving for the parameters $\theta$, it is now substituted with $a$ (the linear combination coefficients) which is called the dual parameters.

The kernel function is defined as the calculation involving the in-sample vector $X_i$ and out of sample vector $z$

\[ k(X_i, z) \]

The kernel matrix is defined as all possible kernel functions for the feature matrix

\[ K = XX^\top; \qquad K_{ij} = k(X_i,X_j) \]