Support Vector Machines

Why SVM?

The Support Vector Machine (SVM) is one of the most modern technique used for regression-and classification problems. The SVM is a supervised learning model. In this approach we are given a set of input vectors ($x_n$) paired with corresponding target values ($t_n$). The goal is to "learn" a model made out of these training set to make correct predictions of t for newly presented input data $x$. In classification, $t$ is discrete and represented as class labels. In regression $t$ is a continuous variable. The prediction $y(x,w)$ is expressed as the following linear model: $$ \begin{equation} y(x,w) = \sum_{m=0}^{M} w_m \phi_m(x) = w^T\phi \label{ref1} \end{equation} $$ Here, $\phi_{m}(x)$ are basis functions and the parameters ${w_m}$ are the associated weights.

How SVM works

There are different types of the SVM: in the classification case, there is the original one the Maximal Margin Classifier, the kernelized version, the soft-margin version and the soft-margin kernelized version which combines the first 3 versions and is used most frequently. In the regression case there is the Support Vector Regression (SVR). However, all SVMs are basically a specialization of (\ref{ref1}) and make predictions based on the following function: $$ \begin{equation} y(x,w)=\sum_{i=1}^{N}w_{i}K(x,x_i)+w_0 \label{ref2} \end{equation} $$ where K stands for kernel functions.

The SVM searches for the optimal hyperplane that best separates the data by maximizing the margin. Maximizing the margin is an optimization problem that is equivalent to minimizing the norm of the weight vector: $$ \begin{equation} minimize_{w,b} \space \frac{1} {2} ||w||^2 \end{equation} $$ subject to $$ \begin{equation} y_{i}(w * x_{i} + b)-1 \geq 0, \space i=1,...,m \label{ref3} \end{equation} $$ The convex quadratic optimization problem from (\ref{ref3}) can be solved by lagrange multipliers: $$ \begin{equation} \mathcal{L}(w,b,\alpha) = \mathcal{f}(w)-\sum_{i=1}^{m}\alpha_{i}g_{i}(w,b) \end{equation} $$ $$ \begin{equation} \mathcal{L}(w,b,\alpha) = \frac{1}{2}||w||^{2}-\sum_{i=1}^{m}\alpha_{i}[y_{i}(w*x_{i}+b)-1] \end{equation} $$ $$ \begin{equation} \min_{w,b} \space \max_{a} \mathcal{L}(w,b,\alpha) \end{equation} $$ subject to $$ \begin{equation} \alpha_{i}\geq0, \space i=1,...,m \label{ref4} \end{equation} $$ The problem from (\ref{ref4}) can then be rewritten in dual form and becomes a Wolfe Dual Problem. In this way the parameters $w$ and $b$ can be found: $$ \begin{equation} w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i} \label{ref5} \end{equation} $$ $$ \begin{equation} b=y_{i}-w*x_{i} \label{ref6} \end{equation} $$

Because (\ref{ref4}) contains inequality constraints, the solution must fulfill certain conditions, the Karush-Kuhn-Tucker (KKT) conditions. These conditions are: stationarity-, primal feasibility-, dual feasibility- and complementary slackness condition. $$ \begin{equation} \alpha_{i}[y_{i}(w*x_{i}+b)-1]=0 \space for \space for \space all \space i=1,...,m \label{ref7} \end{equation} $$ The fourth one, the complementary slackness condition from (\ref{ref7}), says either $\alpha_{i}$ or the original condition should be zero. The support vectors are the ones that have a positive $\alpha_{i}$ which means the second condition is active. If a solution meets the four KKT conditions, this solution is optimal. However, unclean data, for instance outliers, are not uncommon and impair linear separability. Vapnik and Cortes modified the original SVM to the Soft Margin SVM which means, in the classification case, outliers are allowed. The original optimization problem from (\ref{ref3}) has been changed to (\label{ref8}): the slack variable $\zeta$ and the hyperparameter $C$ have been added. Regarding to this, the outliers can now be on the "wrong side" but a hyperplane can still be found. $$ \begin{equation} minimize_w,b,\zeta \space \frac{1}{2}||w||^{2} + C \sum_{i=1}^{m} \zeta_{i} \label{ref8} \end{equation} $$ subject to $$ \begin{equation} y_{i}(w*x_{i}+b) \geq 1-\zeta_{i} \end{equation} $$ $$ \begin{equation} \zeta_{i} \geq 0 \space for \space any \space i=1,...,m \end{equation} $$ The hyperparameter $C$ can then be used to control how the SVM should deal with outliers.

Kernels

Normally, the linear kernel (scalar product) is used to calculate the distance of two points $\mathbf a, \mathbf b$: $$ \begin{equation} K_{\mathrm{linear}}(\mathbf a, \mathbf b) = \mathbf a \cdot \mathbf b = \sum_i a_i \cdot b_i. \end{equation} $$

Alternatively, a polynomial kernel can be used to penalize large distances: $$ \begin{equation} K_{\mathrm{polynomial}}(\mathbf a, \mathbf b) = (\mathbf a \cdot \mathbf b + c)^d, \end{equation} $$ where $c=0, d=2$ are the pre-set constants in our implementation.

Another kernel is a radial basis function (RBF) which uses an exponential function to penalize large distances: $$ \begin{equation} K_{\mathrm{RBF}}(\mathbf a, \mathbf b) = \exp\left( - \gamma \cdot \sqrt[d]{ \sum_i (a_i - b_i)^d } \right), \end{equation} $$ where our implementation chooses $d=2$. Note that the inner term of this function describes the mean squared distance between the vectors $\mathbf a, \mathbf b$. Of course, other RBF formulations are possible as well.

Parameters

When changing the parameters, see that when you want to apply a larger C, often a smaller ε is required and vice versa.In many cases, a bad combination of the two parameters will result in not seeing the decision boundary, as they move outside the plotting domain.

Epsilon ε

Intuitively, the ε parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The ε parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors. When ε is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of high density of any pair of two classes.
Keep ε between 0.001 and 1

C

The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy. In other words: C behaves as a regularization parameter in the SVM. As you can see when applying larger values for C, the support vectors move farther away from the decision boundary. Support vectors are displayed as X's.

Small C makes the constraints easy to ignore which leads to a large margin.
Large C allows the constraints hard to be ignored which leads to a small margin.
For C = inf, all the constraints are enforced.

Keep C between 1 and 10000

Support Vector Machines

Introduction

SVM Settings

Values

Why SVM?

How SVM works

Kernels

Parameters

Epsilon ε

C

References