Support Vector Machines

A Visualization

Introduction

This website was created as part of the digitalization project "Webtools for teaching" at Frankfurt University of Applied Sciences (FRA UAS). "Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively." External representation replaces cognition with perception. We hope that our interactive visualization makes the SVM easier for you to understand.

Why SVM?

The Support Vector Machine (SVM) is one of the most modern techniques used for regression and classification problems. The SVM is a supervised learning model: we are given a set of input vectors $x_n$ paired with corresponding target values $t_n$. The goal is to "learn" a model from this training set that makes correct predictions of $t$ for newly presented input data $x$. In classification, $t$ is discrete and represents class labels; in regression, $t$ is a continuous variable. The prediction $y(x,w)$ is expressed as the following linear model: $$y(x,w) = \sum_{m=0}^{M} w_m \phi_m(x) = w^T\phi(x) \label{ref1}$$ Here, the $\phi_{m}(x)$ are basis functions and the parameters $w_m$ are the associated weights.
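To make the linear model in (\ref{ref1}) concrete, the following minimal Python sketch evaluates $y(x,w)$ for a scalar input. The polynomial basis functions and the weights are assumptions chosen purely for illustration; any set of basis functions could be used.

```python
import numpy as np

# Minimal sketch of the linear model y(x, w) = sum_m w_m * phi_m(x).
# The polynomial basis phi_m(x) = x^m and the weights are illustrative
# assumptions, not prescribed by the text.

def polynomial_basis(x, M):
    """Return phi(x) = [1, x, x^2, ..., x^M] for a scalar input x."""
    return np.array([x**m for m in range(M + 1)])

def predict(x, w):
    """Evaluate y(x, w) = w^T phi(x)."""
    return w @ polynomial_basis(x, M=len(w) - 1)

w = np.array([0.5, -1.0, 2.0])   # weights w_0, w_1, w_2
print(predict(1.5, w))           # 0.5 - 1.0*1.5 + 2.0*1.5**2 = 3.5
```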

How SVM works

There are different types of SVM. In the classification case there is the original Maximal Margin Classifier, the kernelized version, the soft-margin version, and the soft-margin kernelized version, which combines the first three and is the one used most frequently. In the regression case there is Support Vector Regression (SVR). However, all SVMs are essentially a specialization of (\ref{ref1}) and make predictions based on the following function: $$y(x,w)=\sum_{i=1}^{N}w_{i}K(x,x_i)+w_0 \label{ref2}$$ where $K$ denotes a kernel function.
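The kernelized prediction in (\ref{ref2}) can be sketched in a few lines of Python. The Gaussian (RBF) kernel and the toy weights below are assumptions for illustration; in a trained SVM the weights $w_i$ result from the optimization described in the following paragraphs.

```python
import numpy as np

# Sketch of y(x) = sum_i w_i * K(x, x_i) + w_0 with an RBF kernel.
# Kernel choice, training points and weights are illustrative assumptions.

def rbf_kernel(x, x_i, gamma=1.0):
    """Gaussian kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def predict(x, X_train, w, w0, gamma=1.0):
    """Evaluate the kernel expansion from (ref2)."""
    return sum(w_i * rbf_kernel(x, x_i, gamma)
               for w_i, x_i in zip(w, X_train)) + w0

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])  # toy inputs x_i
w = np.array([0.7, -0.3, 0.1])                            # toy weights w_i
print(predict(np.array([1.0, 0.5]), X_train, w, w0=0.2))
```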

The SVM searches for the optimal hyperplane that best separates the data by maximizing the margin. Maximizing the margin is an optimization problem that is equivalent to minimizing the norm of the weight vector: $$\min_{w,b} \space \frac{1}{2} ||w||^2$$ subject to $$y_{i}(w \cdot x_{i} + b)-1 \geq 0, \space i=1,...,m \label{ref3}$$ The convex quadratic optimization problem from (\ref{ref3}) can be solved with Lagrange multipliers: $$\mathcal{L}(w,b,\alpha) = f(w)-\sum_{i=1}^{m}\alpha_{i}g_{i}(w,b)$$ $$\mathcal{L}(w,b,\alpha) = \frac{1}{2}||w||^{2}-\sum_{i=1}^{m}\alpha_{i}[y_{i}(w \cdot x_{i}+b)-1]$$ $$\min_{w,b} \space \max_{\alpha} \mathcal{L}(w,b,\alpha)$$ subject to $$\alpha_{i}\geq0, \space i=1,...,m \label{ref4}$$ The problem from (\ref{ref4}) can then be rewritten in dual form and becomes a Wolfe dual problem. In this way the parameters $w$ and $b$ can be found: $$w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i} \label{ref5}$$ $$b=y_{i}-w \cdot x_{i} \quad \text{for any support vector } x_i \label{ref6}$$
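Equations (\ref{ref5}) and (\ref{ref6}) can be illustrated with a tiny two-point example. For $x_1=(1,1)$ with $y_1=+1$ and $x_2=(-1,-1)$ with $y_2=-1$, the dual multipliers can be worked out by hand as $\alpha_1=\alpha_2=\tfrac{1}{4}$; in practice they would come from a quadratic-programming solver. The Python sketch below merely recovers $w$ and $b$ from them.

```python
import numpy as np

# Recover w and b from a known dual solution, following (ref5) and (ref6).
# The two-point toy problem and its multipliers alpha_1 = alpha_2 = 1/4
# are worked out by hand for this example; normally a QP solver supplies them.

X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # training inputs x_i
y = np.array([1.0, -1.0])                  # class labels y_i
alpha = np.array([0.25, 0.25])             # dual multipliers alpha_i

w = (alpha * y) @ X          # w = sum_i alpha_i * y_i * x_i
b = y[0] - w @ X[0]          # b = y_i - w . x_i for any support vector i

print(w, b)                  # [0.5 0.5] 0.0
```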

Because (\ref{ref4}) contains inequality constraints, the solution must fulfill certain conditions, the Karush-Kuhn-Tucker (KKT) conditions: stationarity, primal feasibility, dual feasibility, and complementary slackness. $$\alpha_{i}[y_{i}(w \cdot x_{i}+b)-1]=0 \quad \text{for all } i=1,...,m \label{ref7}$$ The fourth one, the complementary slackness condition from (\ref{ref7}), states that either $\alpha_{i}$ or the term in brackets must be zero. The support vectors are exactly the points with a positive $\alpha_{i}$, which means the corresponding constraint is active. If a solution meets the four KKT conditions, it is optimal. However, noisy data, for instance outliers, are not uncommon and impair linear separability. Cortes and Vapnik therefore modified the original SVM to the Soft Margin SVM, which means that, in the classification case, outliers are allowed. The original optimization problem from (\ref{ref3}) is changed to (\ref{ref8}): the slack variables $\zeta_i$ and the hyperparameter $C$ are added. As a result, outliers may now lie on the "wrong side" and a separating hyperplane can still be found. $$\min_{w,b,\zeta} \space \frac{1}{2}||w||^{2} + C \sum_{i=1}^{m} \zeta_{i} \label{ref8}$$ subject to $$y_{i}(w \cdot x_{i}+b) \geq 1-\zeta_{i}$$ $$\zeta_{i} \geq 0 \quad \text{for all } i=1,...,m$$ The hyperparameter $C$ can then be used to control how strongly the SVM penalizes such violations.
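The role of $C$ can be explored with a short sketch using scikit-learn's SVC, assuming that library is available; the toy data set with a single mislabeled outlier is an assumption for illustration. A small $C$ tolerates large slack values and keeps a wide margin, while a large $C$ penalizes every violation heavily.

```python
import numpy as np
from sklearn.svm import SVC

# Effect of the hyperparameter C in the soft-margin SVM (scikit-learn's SVC).
# The toy data with one outlier are an illustrative assumption.

X = np.array([[0, 0], [1, 1], [1, 0],      # class -1 cluster
              [4, 4], [5, 5], [5, 4],      # class +1 cluster
              [1, 2]])                     # outlier labeled +1
y = np.array([-1, -1, -1, 1, 1, 1, 1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C allows more slack (points inside or beyond the margin),
    # large C tries hard to classify every training point correctly.
    print(f"C={C}: {len(clf.support_)} support vectors, w={clf.coef_[0]}")
```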

References

...