Support Vector Machine
Dr. Ashish Tendulkar
Machine Learning Techniques
IIT Madras
We will learn about support vector machines (SVMs) in this module.
- SVM is a supervised machine learning algorithm that can be used for both classification and regression problems.
- We will focus on the classification aspect of SVM in our course.
- SVM works in both binary and multiclass classification setups.
- SVM is a discriminative classifier like the perceptron and logistic regression.
Overview
Following our template, we will describe all five components of the ML setup for SVM one by one.
Training data is the first component of our ML framework.
Training data
- In binary classification set up, training data consists of
- Feature matrix \(\mathbf{X}\) with shape \((n, m)\). Note that each example is represented with \(m\) features and there are total \(n\) examples.
- Label vector \(\mathbf{y}\) containing labels for \(n\) examples and its shape is \((n,)\).
- In multiclass and multilabel setups, training data consists of a feature matrix with the same specification as the binary setup.
- Label matrix \(\mathbf{Y}\) containing one of \(k\) labels for each of \(n\) examples and its shape is \((n, k)\).
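As a minimal sketch of these shapes (a toy dataset of our own; numpy is assumed, and none of these numbers come from the lecture):

```python
import numpy as np

# Toy binary classification data: n = 4 examples, m = 2 features.
X = np.array([[2.0, 3.0],
              [1.0, 1.5],
              [6.0, 7.0],
              [7.0, 8.5]])       # feature matrix, shape (n, m) = (4, 2)
y = np.array([-1, -1, +1, +1])   # label vector, shape (n,) = (4,), labels in {+1, -1}

print(X.shape, y.shape)          # (4, 2) (4,)
```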
Model is the second component of our ML framework, which will be discussed in the context of a binary classification set up.
Model
- Note that we have written the bias unit separately over here. This is a linear classifier, which is familiar to us:
\[ \hat{y} = \text{sign}\left(\mathbf{w}^T \mathbf{x} + b\right), \]
where \(\mathbf{x}\) is the feature vector, \(\mathbf{w}\) is the weight vector, \(\mathbf{w}^T \mathbf{x}\) is the linear combination of features, \(b\) is the bias, and \(\hat{y}\) is the predicted label.
- Labels are assumed to be +1 and -1.
Learning problem
Given the training data, find the best values for \(\mathbf{w}\) and \(b\) that result in the minimum loss.
The SVM learns a hyperplane separating two classes with parameters \(\mathbf{w}\) and \(b\).
SVM finds the hyperplane in a slightly different manner than the other classifiers we have studied in this course.
It selects the hyperplane that maximizes the distance to the closest data points from both classes.
In other words, it is the hyperplane with maximum margin between two classes.
How does it select such a hyperplane? It uses an appropriate loss function as the objective.
Loss function
Let's learn a few concepts before setting up the loss function.
There are three components here:
- Separating hyperplane
- Bounding planes
- Support vectors
The separating hyperplane is the classifier. It is equidistant from both classes and separates them such that there is maximum margin between the two classes.
Bounding planes are parallel to the separating hyperplane, one on either side, and pass through the support vectors.
Support vectors are the subset of training points closest to the separating hyperplane; they influence its position and orientation.
Bounding planes
The bounding planes are defined as follows:
- The bounding plane on the side of the positive class has the equation \(\mathbf{w}^T \mathbf{x} + b = 1\).
- The bounding plane on the side of the negative class has the equation \(\mathbf{w}^T \mathbf{x} + b = -1\).
Any point on or above the positive bounding plane belongs to the positive class: \(\mathbf{w}^T \mathbf{x}^{(i)} + b \geq 1\).
Any point on or below the negative bounding plane belongs to the negative class: \(\mathbf{w}^T \mathbf{x}^{(i)} + b \leq -1\).
Using the label of an example, the correctly classified points compactly satisfy the single equation
\[ y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1. \]
This constraint ensures that none of the points falls within the margin.
Support vectors
Support vectors are points on the bounding planes, i.e., points with \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) = 1\).
Margin
For the positive support vector \(\mathbf{x}_{+}\): \(\mathbf{w}^T \mathbf{x}_{+} + b = 1\).
For the negative support vector \(\mathbf{x}_{-}\): \(\mathbf{w}^T \mathbf{x}_{-} + b = -1\).
Subtracting the two equations gives \(\mathbf{w}^T \left(\mathbf{x}_{+} - \mathbf{x}_{-}\right) = 2\).
The width of the margin is the projection of \(\left(\mathbf{x}_{+} - \mathbf{x}_{-}\right)\) on the unit normal vector \(\frac{\mathbf{w}}{||\mathbf{w}||}\) (using the dot product between the two vectors). Mathematically,
\[ \text{margin} = \frac{\mathbf{w}^T \left(\mathbf{x}_{+} - \mathbf{x}_{-}\right)}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||}. \]
Our objective is to maximize the margin \(\frac{2}{||\mathbf{w}||}\), which is equivalent to minimizing \(\frac{1}{2}||\mathbf{w}||^2\).
Therefore, the optimization problem of linear SVM is written as follows:
\[ \min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 \]
such that \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1 \ \ \forall i = 1, \ldots, n\).
This is called the primal problem and is guaranteed to have a global minimum.
Optimizing SVM Primal Problem
Let's solve this problem with an optimization procedure. The optimization problem of linear SVM is, as above,
\[ \min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 \quad \text{such that} \quad y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1 \ \ \forall i = 1, \ldots, n. \]
- This is a quadratic optimization problem: a quadratic objective with linear constraints.
- It can be efficiently solved with standard quadratic programming (QP) solvers.
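As an illustrative sketch (not part of the lecture), the hard-margin primal can be handed directly to such a solver; here we use the cvxpy library and a toy linearly separable dataset of our own:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data with labels in {+1, -1}
X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 7.0], [7.0, 8.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# Quadratic objective: (1/2) ||w||^2
objective = cp.Minimize(0.5 * cp.sum_squares(w))
# Linear constraints: y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]

cp.Problem(objective, constraints).solve()
print("w =", w.value, "b =", b.value)
```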
Dual of SVM primal problem
- Alternatively, we can optimize the dual of the primal problem.
- For that we will make use of Lagrange multipliers \(\alpha^{(i)}\), one per training example.
\[ L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} ||\mathbf{w}||^2 - \sum_{i=1}^{n} \alpha^{(i)} \left[ y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) - 1 \right] \]
subject to \(\alpha^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n\).
This is called the Lagrangian function of the SVM, which is differentiable with respect to \(\mathbf{w}\) and \(b\).
Setting the partial derivative with respect to \(\mathbf{w}\) to zero, we have
\[ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)} = 0, \quad \text{which implies} \quad \mathbf{w} = \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)}. \]
Setting the partial derivative with respect to \(b\) to zero, we have
\[ \frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0, \quad \text{which implies} \quad \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0. \]
Substituting these in the Lagrangian function of the SVM, we get the dual problem:
\[ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)} \]
such that \(\alpha^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n\) and \(\sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0\).
This is a concave problem that is maximized using a solver.
- The dual problem is easier to solve as it is expressed in terms of the Lagrange multipliers alone.
- The dual problem depends on the training data only through the inner products \(\left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)}\).
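Continuing the cvxpy sketch (again our own illustration, not the lecture's code), the dual can be solved in terms of \(\boldsymbol{\alpha}\) alone; the quadratic term touches the data only through inner products, which we exploit by writing it as a squared norm:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data with labels in {+1, -1}
X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 7.0], [7.0, 8.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n, m = X.shape

alpha = cp.Variable(n)
# sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j  ==  || X^T (alpha * y) ||^2
quad = cp.sum_squares(X.T @ cp.multiply(alpha, y))
objective = cp.Maximize(cp.sum(alpha) - 0.5 * quad)
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

# Recover w and b from the stationarity and support-vector conditions
w = X.T @ (alpha.value * y)      # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha.value)      # index of one support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w            # y_sv (w^T x_sv + b) = 1  =>  b = y_sv - w^T x_sv
print("alpha =", alpha.value, "w =", w, "b =", b)
```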
Strong duality requires the Karush–Kuhn–Tucker (KKT) condition:
\[ \alpha^{(i)} \left[ y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) - 1 \right] = 0 \ \ \forall i = 1, \ldots, n. \]
This implies:
- If \(\alpha^{(i)} > 0\), then \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) = 1\). The data point is a support vector; in other words, it is located on one of the bounding planes.
- If \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) > 1\), the distance between the data point and the separating hyperplane is more than the margin, and then \(\alpha^{(i)} = 0\), so the point does not influence the solution.
Both primal and dual problems can be solved with optimization solvers to obtain the separating hyperplane for SVMs.
This flavour of SVM, where the classes are linearly separable, is called the hard margin SVM.
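In practice we rarely call a QP solver by hand; a library such as scikit-learn solves this optimization for us. A minimal sketch (our own illustration; a very large C approximates the hard-margin setting):

```python
from sklearn.svm import SVC
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 7.0], [7.0, 8.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
print(clf.support_vectors_)          # the points lying on the bounding planes
print(clf.dual_coef_)                # alpha_i * y_i for the support vectors
```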
Soft margin SVMs
- The classes are largely linearly separable, but a few points are misclassified or lie within the margin.
- We are unable to find a perfect hyperplane that maximizes the margin.
- We would like to make some adjustments to the loss function so as to learn a hyperplane with tolerance to a small number of misclassified examples.
Slack variable
We introduce a slack variable \(\xi^{(i)} \geq 0\) for each training point in the constraint as follows:
\[ y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1 - \xi^{(i)} \ \ \forall i = 1, \ldots, n. \]
The constraints are now less strict because each training point \(\mathbf{x}^{(i)}\) need only be at a distance of \(1 - \xi^{(i)}\) from the separating hyperplane instead of a hard distance of 1.
In order to prevent the slack variables from becoming too large, we penalize them in the objective function:
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi^{(i)} \]
such that \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1 - \xi^{(i)}\) and \(\xi^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n\).
- Slack allows an input to be closer to the hyperplane or even to be on the wrong side of it.
- When C is large, the SVM becomes strict and tries to get all points on the correct side of the hyperplane.
- When C is small, the SVM is lenient and allows many points to be misclassified or to lie within the margin.
Let's derive an unconstrained formulation.
For \(C \neq 0\), our objective is to make each \(\xi^{(i)}\) as small as possible, which is achieved at \(y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) = 1 - \xi^{(i)}\).
This is equivalent to
\[ \xi^{(i)} = \max\left(0, \ 1 - y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)\right). \]
Let's plug this into the soft margin SVM objective function:
\[ \min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} \max\left(0, \ 1 - y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)\right). \]
We add a non-zero slack \(1 - y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)\) for misclassified points or points inside the margin.
- The second term is the hinge loss.
In the hinge loss plot, the x-axis is \(t = y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)\) and the y-axis is the misclassification cost \(\max(0, 1 - t)\).
We need to minimize the soft margin loss to find the max-margin classifier.
Partial derivatives
The partial derivative of the second term depends on whether the point incurs a misclassification penalty:
\[ \frac{\partial}{\partial \mathbf{w}} \max\left(0, \ 1 - y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)\right) = \begin{cases} \mathbf{0} & \text{if } y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \geq 1 \\ -y^{(i)} \mathbf{x}^{(i)} & \text{otherwise.} \end{cases} \]
Writing the partial derivatives of the full objective \(J(\mathbf{w}, b)\) compactly:
\[ \frac{\partial J}{\partial \mathbf{w}} = \mathbf{w} - C \sum_{i:\, y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) < 1} y^{(i)} \mathbf{x}^{(i)}, \qquad \frac{\partial J}{\partial b} = -C \sum_{i:\, y^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) < 1} y^{(i)}. \]
Gradient descent update rule (with learning rate \(\eta\)):
\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta \frac{\partial J}{\partial b}. \]
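A minimal numpy sketch of this update rule (our own illustration; the learning rate, number of epochs, and value of C are arbitrary choices):

```python
import numpy as np

def svm_gradient_descent(X, y, C=1.0, lr=0.01, epochs=1000):
    """Minimize J(w, b) = (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points with non-zero hinge loss
        grad_w = w - C * (y[viol] @ X[viol])    # dJ/dw
        grad_b = -C * np.sum(y[viol])           # dJ/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data with labels in {+1, -1}
X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 7.0], [7.0, 8.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = svm_gradient_descent(X, y, C=10.0)
print("w =", w, "b =", b)
```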
Evaluation measures
- The usual classification evaluation measures, such as precision, recall, F1-score, and accuracy, are used.
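For example, with scikit-learn (a sketch; the synthetic dataset and the train/test split are our own assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical dataset for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
```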
Kernel SVMs
Basic Idea
- Kernel SVM is used for non-linearly separable data.
- Remember that so far we were performing a non-linear transformation of the input feature space (e.g., a polynomial transformation) and then training the model in the transformed space to learn non-linear decision boundaries.
- Kernel SVM computes the dot product in the transformed feature space, but without explicitly calculating the transformation.
Recall the SVM dual objective function:
\[ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)} \]
such that \(\alpha^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n\) and \(\sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0\).
Writing this with a kernel function \(K\), the inner product \(\left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)}\) is replaced by \(K\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right)\):
\[ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} K\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right). \]
A kernel computes the dot product between input feature vectors in a high-dimensional space without actually projecting or transforming the input features into that space.
Commonly used kernels include:
- Linear kernel: \(K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}\)
- Polynomial kernel: \(K(\mathbf{x}, \mathbf{z}) = \left(\mathbf{x}^T \mathbf{z} + c\right)^d\) for degree \(d\) and constant \(c \geq 0\)
- Radial basis function (RBF) kernel: \(K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma ||\mathbf{x} - \mathbf{z}||^2\right)\) with \(\gamma > 0\)
Demonstrating kernel trick with polynomial
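A standard version of this demonstration (our worked example, using the degree-2 polynomial kernel on two-dimensional inputs \(\mathbf{x} = (x_1, x_2)\) and \(\mathbf{z} = (z_1, z_2)\)):
\[ K(\mathbf{x}, \mathbf{z}) = \left(\mathbf{x}^T \mathbf{z}\right)^2 = \left(x_1 z_1 + x_2 z_2\right)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \phi(\mathbf{x})^T \phi(\mathbf{z}), \]
where \(\phi(\mathbf{x}) = \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right)\). The kernel returns the dot product in the 3-dimensional transformed space while only ever computing a dot product of the original 2-dimensional vectors.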
Model with kernel SVM
The prediction for a new input \(\mathbf{x}\) uses the support vectors through the kernel:
\[ \hat{y} = \text{sign}\left( \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} K\left(\mathbf{x}^{(i)}, \mathbf{x}\right) + b \right). \]
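As a final sketch (our illustration; the dataset, kernel choice, and hyperparameters are arbitrary), a kernel SVM can be trained with scikit-learn on data that is not linearly separable in the original feature space:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
y = 2 * y - 1                                # relabel {0, 1} -> {-1, +1}

clf = SVC(kernel="rbf", gamma=2.0, C=1.0)    # K(x, z) = exp(-gamma ||x - z||^2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```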