A Visualization
This website was created as part of the digitalization project "Webtools for teaching" at Frankfurt University of Applied Sciences (FRA UAS). "Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively." External representation replaces cognition with perception. We hope that our interactive visualization will make the SVM easier for you to understand.
The Support Vector Machine (SVM) is one of the most modern techniques used for regression and classification problems. The SVM is a supervised learning model: we are given a set of input vectors ($x_n$) paired with corresponding target values ($t_n$). The goal is to "learn" a model from this training set that makes correct predictions of $t$ for newly presented input data $x$. In classification, $t$ is discrete and represents class labels; in regression, $t$ is a continuous variable. The prediction $y(x,w)$ is expressed as the following linear model: $$ \begin{equation} y(x,w) = \sum_{m=0}^{M} w_m \phi_m(x) = w^T\phi(x) \label{ref1} \end{equation} $$ Here, $\phi_{m}(x)$ are basis functions and the parameters $w_m$ are the associated weights.
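For concreteness, here is a minimal Python sketch of (\ref{ref1}); the polynomial basis $\phi_m(x)=x^m$ and the weight values are illustrative assumptions, not part of this website's implementation.

```python
import numpy as np

def phi(x, M=3):
    """Assumed polynomial basis: phi_m(x) = x**m for m = 0, ..., M."""
    return np.array([x ** m for m in range(M + 1)])

w = np.array([0.5, -1.0, 0.25, 0.1])   # illustrative weights w_0, ..., w_M

x = 2.0
y = w @ phi(x)                          # y(x, w) = w^T phi(x)
print(y)
```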
There are different types of SVM. In the classification case there is the original one, the Maximal Margin Classifier, the kernelized version, the soft-margin version, and the soft-margin kernelized version, which combines the first three and is used most frequently. In the regression case there is Support Vector Regression (SVR). However, all SVMs are basically specializations of (\ref{ref1}) and make predictions based on the following function: $$ \begin{equation} y(x,w)=\sum_{i=1}^{N}w_{i}K(x,x_i)+w_0 \label{ref2} \end{equation} $$ where $K$ denotes a kernel function.
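As a small sketch of how the prediction in (\ref{ref2}) is evaluated, the snippet below uses the scalar product as kernel; the training points, weights and bias are assumed values here, since they would normally come from training.

```python
import numpy as np

def predict(x, X_train, w, w0, kernel=np.dot):
    """y(x, w) = sum_i w_i K(x, x_i) + w_0, here with the scalar product as K."""
    return sum(w_i * kernel(x, x_i) for w_i, x_i in zip(w, X_train)) + w0

# illustrative training points, weights and bias (normally learned, assumed here)
X_train = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
w = np.array([0.7, -0.4, 0.9])
w0 = 0.1

print(predict(np.array([1.0, 0.5]), X_train, w, w0))
```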
The SVM searches for the optimal hyperplane that best separates the data by maximizing the margin. Maximizing the margin is an optimization problem that is equivalent to minimizing the norm of the weight vector: $$ \begin{equation} \min_{w,b} \space \frac{1}{2} ||w||^2 \end{equation} $$ subject to $$ \begin{equation} y_{i}(w \cdot x_{i} + b)-1 \geq 0, \space i=1,...,m \label{ref3} \end{equation} $$ The convex quadratic optimization problem from (\ref{ref3}) can be solved using Lagrange multipliers: $$ \begin{equation} \mathcal{L}(w,b,\alpha) = f(w)-\sum_{i=1}^{m}\alpha_{i}g_{i}(w,b) \end{equation} $$ $$ \begin{equation} \mathcal{L}(w,b,\alpha) = \frac{1}{2}||w||^{2}-\sum_{i=1}^{m}\alpha_{i}[y_{i}(w \cdot x_{i}+b)-1] \end{equation} $$ $$ \begin{equation} \min_{w,b} \space \max_{\alpha} \mathcal{L}(w,b,\alpha) \end{equation} $$ subject to $$ \begin{equation} \alpha_{i}\geq0, \space i=1,...,m \label{ref4} \end{equation} $$ The problem from (\ref{ref4}) can then be rewritten in dual form and becomes a Wolfe dual problem. In this way the parameters $w$ and $b$ can be found: $$ \begin{equation} w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i} \label{ref5} \end{equation} $$ $$ \begin{equation} b=y_{i}-w \cdot x_{i} \label{ref6} \end{equation} $$ where $x_i$ in (\ref{ref6}) is any support vector.
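As a sanity check, the relations (\ref{ref5}) and (\ref{ref6}) can be verified numerically. The sketch below uses scikit-learn, which is an assumption about tooling and not necessarily what this website uses; a very large $C$ approximates the hard-margin SVM, and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# small, linearly separable toy data set (assumed for illustration)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# a very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# w = sum_i alpha_i y_i x_i; sklearn stores alpha_i * y_i in dual_coef_
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)                      # should (approximately) agree

# b = y_i - w . x_i, evaluated at any support vector x_i
i = clf.support_[0]                      # index of one support vector
b = y[i] - np.dot(clf.coef_[0], X[i])
print(b, clf.intercept_)                 # should (approximately) agree
```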
Because (\ref{ref4}) contains inequality constraints, the solution must fulfill certain conditions, the Karush-Kuhn-Tucker (KKT) conditions. These are the stationarity, primal feasibility, dual feasibility, and complementary slackness conditions. $$ \begin{equation} \alpha_{i}[y_{i}(w \cdot x_{i}+b)-1]=0 \space for \space all \space i=1,...,m \label{ref7} \end{equation} $$ The fourth one, the complementary slackness condition from (\ref{ref7}), states that either $\alpha_{i}$ or the term in brackets (the original constraint) must be zero. The support vectors are the points with a positive $\alpha_{i}$, which means the corresponding constraint is active. If a solution meets the four KKT conditions, it is optimal. However, noisy data, for instance outliers, are not uncommon and impair linear separability. Vapnik and Cortes therefore modified the original SVM into the Soft Margin SVM, which means that, in the classification case, outliers are allowed. The original optimization problem from (\ref{ref3}) has been changed to (\ref{ref8}): the slack variables $\zeta_{i}$ and the hyperparameter $C$ have been added. As a result, outliers may now lie on the "wrong side", but a hyperplane can still be found. $$ \begin{equation} \min_{w,b,\zeta} \space \frac{1}{2}||w||^{2} + C \sum_{i=1}^{m} \zeta_{i} \label{ref8} \end{equation} $$ subject to $$ \begin{equation} y_{i}(w \cdot x_{i}+b) \geq 1-\zeta_{i} \end{equation} $$ $$ \begin{equation} \zeta_{i} \geq 0 \space for \space all \space i=1,...,m \end{equation} $$ The hyperparameter $C$ can then be used to control how the SVM deals with outliers.
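To see the role of $C$ in (\ref{ref8}), here is a short sketch, again assuming scikit-learn and made-up, overlapping data: a small $C$ tolerates more margin violations (larger slacks $\zeta_i$, wider margin), while a large $C$ penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# two overlapping point clouds, so some slack zeta_i is unavoidable (toy data)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    # small C: wide margin and many support vectors;
    # large C: narrower margin and fewer support vectors
    print(f"C={C:>7}: margin={margin:.3f}, support vectors={clf.n_support_.sum()}")
```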
Normally, the linear kernel (scalar product) is used to compare two points $\mathbf a, \mathbf b$: $$ \begin{equation} K_{\mathrm{linear}}(\mathbf a, \mathbf b) = \mathbf a \cdot \mathbf b = \sum_i a_i \cdot b_i. \end{equation} $$
Alternatively, a polynomial kernel can be used to penalize large distances: $$ \begin{equation} K_{\mathrm{polynomial}}(\mathbf a, \mathbf b) = (\mathbf a \cdot \mathbf b + c)^d, \end{equation} $$ where $c=0, d=2$ are the pre-set constants in our implementation.
Another kernel is the radial basis function (RBF), which uses an exponential function to penalize large distances: $$ \begin{equation} K_{\mathrm{RBF}}(\mathbf a, \mathbf b) = \exp\left( - \gamma \cdot \sqrt[d]{ \sum_i (a_i - b_i)^d } \right), \end{equation} $$ where our implementation chooses $d=2$. Note that for $d=2$ the inner term is the Euclidean distance between the vectors $\mathbf a, \mathbf b$. Of course, other RBF formulations are possible as well.
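The three kernels above can be transcribed directly into a few lines of Python. The constants $c=0$ and $d=2$ follow the text; the value of $\gamma$ is an assumed default.

```python
import numpy as np

def linear_kernel(a, b):
    """K(a, b) = a . b (scalar product)."""
    return np.dot(a, b)

def polynomial_kernel(a, b, c=0.0, d=2):
    """K(a, b) = (a . b + c)^d with the constants c = 0, d = 2 from the text."""
    return (np.dot(a, b) + c) ** d

def rbf_kernel(a, b, gamma=1.0, d=2):
    """K(a, b) = exp(-gamma * (sum_i (a_i - b_i)^d)^(1/d));
    for d = 2 the inner term is the Euclidean distance (gamma is assumed)."""
    return np.exp(-gamma * np.sum((a - b) ** d) ** (1.0 / d))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```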
When changing the parameters, note that a larger C often requires a smaller ε, and vice versa. In many cases, a poor combination of the two parameters will result in the decision boundary not being visible, as it moves outside the plotting domain.
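This interplay of C and ε can also be reproduced offline with a small sketch; scikit-learn's SVR and the noisy toy data below are assumptions for illustration, not this website's code.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# noisy 1-D regression data (assumed for illustration)
X = np.linspace(0.0, 6.0, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=60)

# a larger C usually pairs with a smaller epsilon and vice versa;
# extreme combinations either flatten the fit or chase the noise
for C, eps in [(100.0, 0.01), (1.0, 0.1), (0.01, 1.0)]:
    svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X, y)
    print(f"C={C:>6}, eps={eps:>5}: support vectors={len(svr.support_)}")
```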