Machine Learning Techniques
We will learn support vector machines (SVM) in this module.
Following our template, we will describe all five components of the ML setup for SVM one by one.
Training data is the first component of our ML framework: a set of \(n\) labelled examples \(\left\{ \left( \mathbf{x}^{(i)}, y^{(i)} \right) \right\}_{i=1}^{n}\).
The model is the second component of our ML framework; we discuss it in the context of a binary classification setup.
Feature vector \(\mathbf{x} \in \mathbb{R}^{d}\)
Label \(y \in \{-1, +1\}\)
Linear combination of features \(\mathbf{w}^T \mathbf{x} + b\)
Weight vector \(\mathbf{w}\) (and bias \(b\))
Given the training data, find the values of \(\mathbf{w}\) and \(b\) that result in the minimum loss.
The SVM learns a hyperplane separating two classes with parameters \(\mathbf{w}\) and \(b\).
SVM finds the hyperplane in a slightly different manner than the other classifiers we have studied in this course.
It selects the hyperplane that maximizes the distance to the closest data points from both classes.
In other words, it is the hyperplane with the maximum margin between the two classes.
How does it select such a hyperplane? It uses an appropriate loss function as the objective.
Let's learn a few concepts before setting up the loss function.
There are three components here:
The separating hyperplane is the classifier. It is equidistant from both classes and separates them such that the margin between the two classes is maximized.
Bounding planes are parallel to the separating hyperplane, one on either side, and pass through the support vectors.
Support vectors are the subset of training points closest to the separating hyperplane; they influence its position and orientation.
The bounding planes are defined as follows:
The bounding plane on the side of the positive class has the following equation:
\[ \mathbf{w}^T \mathbf{x} + b = +1 \]
The bounding plane on the side of the negative class has the following equation:
\[ \mathbf{w}^T \mathbf{x} + b = -1 \]
We can write these as one equation using the label \(y \in \{-1, +1\}\) of an example:
\[ y \left( \mathbf{w}^T \mathbf{x} + b \right) = 1 \]
Any point on or above the positive bounding plane belongs to the positive class:
\[ \mathbf{w}^T \mathbf{x} + b \geq 1 \]
Any point on or below the negative bounding plane belongs to the negative class:
\[ \mathbf{w}^T \mathbf{x} + b \leq -1 \]
Compactly, the correctly classified points satisfy the following inequality:
\[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \geq 1 \quad \forall i = 1, \ldots, n \]
This constraint ensures that none of the points falls within the margin.
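As a quick sanity check, the sketch below evaluates \(y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right)\) for a toy 2-D dataset and flags any point that violates the constraint. (A minimal NumPy sketch; the weights, bias, and points are made-up values for illustration.)

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only)
w = np.array([1.0, -1.0])
b = 0.0

# Toy 2-D points with labels in {-1, +1}
X = np.array([[2.0, 0.5], [1.5, -1.0], [0.2, 0.0], [-1.0, 0.5], [-0.5, 2.0]])
y = np.array([1, 1, 1, -1, -1])

# Functional margin y * (w^T x + b) for every point
scores = y * (X @ w + b)

# The constraint y(w^T x + b) >= 1 must hold for every point outside the margin
for score in scores:
    status = "outside margin" if score >= 1 else "inside margin (constraint violated)"
    print(f"y(w^T x + b) = {score:+.2f} -> {status}")
```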
Support vectors are points on the bounding planes.
The width of the margin is the projection of \(\left(\mathbf{x}_{+} - \mathbf{x}_{-}\right)\) on the unit normal vector \(\frac{\mathbf{w}}{||\mathbf{w}||}\), where \(\mathbf{x}_{+}\) and \(\mathbf{x}_{-}\) are support vectors on the positive and negative bounding planes. Mathematically,
\[ \text{width} = \left( \mathbf{x}_{+} - \mathbf{x}_{-} \right) \cdot \frac{\mathbf{w}}{||\mathbf{w}||}. \]
For the positive support vector \(\mathbf{x}_{+}\) (using the dot product between the two vectors here):
\[ \mathbf{w} \cdot \mathbf{x}_{+} + b = 1 \implies \mathbf{w} \cdot \mathbf{x}_{+} = 1 - b. \]
For the negative support vector \(\mathbf{x}_{-}\):
\[ \mathbf{w} \cdot \mathbf{x}_{-} + b = -1 \implies \mathbf{w} \cdot \mathbf{x}_{-} = -1 - b. \]
Subtracting the two equations gives \(\mathbf{w} \cdot \left( \mathbf{x}_{+} - \mathbf{x}_{-} \right) = 2\), so the width of the margin is \(\frac{2}{||\mathbf{w}||}\).
Our objective is to maximize the margin, \(\frac{2}{||\mathbf{w}||}\), which is equivalent to minimizing
\[ \frac{1}{2} ||\mathbf{w}||^2. \]
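For example, with made-up numbers: if \(\mathbf{w} = (3, 4)^T\), then
\[ ||\mathbf{w}|| = \sqrt{3^2 + 4^2} = 5, \qquad \text{margin} = \frac{2}{||\mathbf{w}||} = \frac{2}{5} = 0.4, \]
so a smaller \(||\mathbf{w}||\) gives a wider margin.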
Therefore the optimization problem of linear SVM is written as follows:
\[ \min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 \]
such that
\[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \geq 1 \quad \forall i = 1, \ldots, n \]
This is called the primal problem; since it is a convex quadratic program, it is guaranteed to have a global minimum.
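Because the primal is a small convex QP, we can hand it directly to a generic solver. Below is a minimal sketch using CVXPY (my choice of library, not part of the original notes) on made-up, linearly separable toy data.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
```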
Let's solve this problem with an optimization procedure. We introduce a Lagrange multiplier \(\alpha^{(i)}\) for each constraint and rewrite the optimization problem of linear SVM as follows:
\[ \min_{\mathbf{w}, b} \ \max_{\mathbf{\alpha}} \ L(\mathbf{w}, b, \mathbf{\alpha}) = \frac{1}{2} ||\mathbf{w}||^2 - \sum_{i=1}^{n} \alpha^{(i)} \left[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) - 1 \right] \]
subject to \(\alpha^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n\).
\(L(\mathbf{w}, b, \mathbf{\alpha})\) is called the Lagrangian function of the SVM, and it is differentiable with respect to \(\mathbf{w}\) and \(b\).
Setting the partial derivative of \(L\) with respect to \(\mathbf{w}\) to zero,
\[ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)} = 0, \]
which implies
\[ \mathbf{w} = \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)}. \]
Setting the partial derivative with respect to \(b\) to zero, we have
\[ \frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0 \implies \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0. \]
Substituting these into the Lagrangian function of the SVM, we get the dual problem:
\[ \max_{\mathbf{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( \mathbf{x}^{(i)} \right)^T \mathbf{x}^{(j)} \]
such that
\[ \alpha^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n \quad \text{and} \quad \sum_{i=1}^{n} \alpha^{(i)} y^{(i)} = 0. \]
This is a concave (quadratic) problem that can be maximized using a standard solver.
The dual problem is easier to solve because it is expressed only in terms of the Lagrange multipliers \(\alpha^{(i)}\).
The dual problem depends on the training data only through the inner products \(\left( \mathbf{x}^{(i)} \right)^T \mathbf{x}^{(j)}\).
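Here is a minimal sketch of solving the dual, again with CVXPY (an assumption; any QP solver would do) and the same made-up toy data as before. Note that the data enters only through the Gram matrix of inner products.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[0]

# Gram matrix of inner products: K[i, j] = x_i^T x_j
K = X @ X.T
# Q[i, j] = y_i y_j x_i^T x_j; a tiny ridge keeps the matrix numerically PSD
Q = np.outer(y, y) * K + 1e-8 * np.eye(n)

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", alpha.value)
```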
Strong duality requires the Karush–Kuhn–Tucker (KKT) complementary slackness condition:
\[ \alpha^{(i)} \left[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) - 1 \right] = 0 \quad \forall i = 1, \ldots, n. \]
This implies that, for every training point, either \(\alpha^{(i)} = 0\) or \(y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) = 1\).
If \(\alpha^{(i)} > 0\), then \(y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) = 1\): the data point is a support vector. In other words, it is located on one of the bounding planes.
If \(y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) > 1\), then \(\alpha^{(i)} = 0\): the distance between the data point and the separating hyperplane is more than the margin, and the point does not influence the solution.
Both primal and dual problems can be solved with optimization solvers to obtain the separating hyperplane for SVMs.
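Continuing the dual sketch above (a continuation, so `alpha`, `X`, and `y` come from the previous block), the KKT conditions tell us how to read the hyperplane off the dual solution: \(\mathbf{w} = \sum_i \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)}\), and \(b\) follows from any support vector since \(y^{(s)} (\mathbf{w}^T \mathbf{x}^{(s)} + b) = 1\).

```python
import numpy as np

# alpha, X, y are taken from the dual sketch above
alpha_val = alpha.value

# Support vectors are the points with (numerically) positive multipliers
sv = alpha_val > 1e-6
print("support vector indices:", np.where(sv)[0])

# Stationarity: w = sum_i alpha_i y_i x_i
w = (alpha_val * y) @ X

# Complementary slackness on a support vector s: y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s
s = np.where(sv)[0][0]
b = y[s] - X[s] @ w
print("w =", w, " b =", b)
```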
This flavour of SVM, where the classes are linearly separable, is called hard-margin SVM. When the classes are not perfectly separable, we relax the constraints, which leads to the soft-margin SVM.
We introduce a slack variable \(\xi^{(i)} \geq 0\) for each training point and relax the constraint as follows:
\[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \geq 1 - \xi^{(i)} \]
The constraints are now less strict because each training point \(\mathbf{x}^{(i)}\) need only be at a distance of \(1 - \xi^{(i)}\) from the separating hyperplane instead of a hard distance of 1.
In order to prevent the slack variables from becoming too large, we penalize them in the objective function:
\[ \min_{\mathbf{w}, b, \mathbf{\xi}} \ \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi^{(i)} \]
such that
\[ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \geq 1 - \xi^{(i)}, \quad \xi^{(i)} \geq 0 \ \ \forall i = 1, \ldots, n, \]
where \(C > 0\) controls the trade-off between a wide margin and the penalty on margin violations.
Let's derive an unconstrained formulation.
For \(C > 0\), our objective pushes each \(\xi^{(i)}\) to be as small as possible while satisfying its constraint, so at the optimum \(\xi^{(i)} = 1 - y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)\) whenever this quantity is positive, and \(\xi^{(i)} = 0\) otherwise.
This is equivalent to
\[ \xi^{(i)} = \max\left( 0, \ 1 - y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \right). \]
Let's plug this into the soft-margin SVM objective function:
\[ \min_{\mathbf{w}, b} \ \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} \max\left( 0, \ 1 - y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \right). \]
We incur a non-zero slack \(1 - y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)\) for misclassified points and for points inside the margin; this penalty term is commonly called the hinge loss.
(Figure: the hinge loss, with \(t = y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)\) on the x-axis and the misclassification cost \(\max(0, 1 - t)\) on the y-axis.)
We need to minimize this soft-margin loss to find the maximum-margin classifier.
The partial derivative (more precisely, a subgradient) of the second term with respect to \(\mathbf{w}\) depends on whether the point incurs a misclassification penalty:
\[ \frac{\partial}{\partial \mathbf{w}} \max\left( 0, \ 1 - y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) \right) = \begin{cases} -y^{(i)} \mathbf{x}^{(i)} & \text{if } y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) < 1 \\ 0 & \text{otherwise} \end{cases} \]
Writing the full subgradient compactly:
\[ \nabla_{\mathbf{w}} = \mathbf{w} - C \sum_{i : \ y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + b \right) < 1} y^{(i)} \mathbf{x}^{(i)}, \]
so the unconstrained objective can be minimized by (sub)gradient descent.
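A minimal sketch of this procedure in NumPy (the data, learning rate, number of epochs, and \(C\) are illustrative choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: two slightly overlapping blobs (illustrative only)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C, lr, epochs = 1.0, 0.01, 200
w, b = np.zeros(2), 0.0

for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                      # misclassified or inside the margin
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    grad_w = w - C * (y[viol] @ X[viol])
    grad_b = -C * np.sum(y[viol])
    w -= lr * grad_w
    b -= lr * grad_b

objective = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
print("w =", w, " b =", b, " objective =", objective)
```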
Recall the SVM dual objective function:
\[ \max_{\mathbf{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( \mathbf{x}^{(i)} \right)^T \mathbf{x}^{(j)} \]
Writing this with a kernel function \(K\):
\[ \max_{\mathbf{\alpha}} \ \sum_{i=1}^{n} \alpha^{(i)} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \right) \]
A kernel computes the dot product between input feature vectors in a high-dimensional space without explicitly projecting or transforming the input features into that space.
Linear kernel:
\[ K\left( \mathbf{x}, \mathbf{x}' \right) = \mathbf{x}^T \mathbf{x}' \]
Polynomial kernel (of degree \(p\), with constant \(r\)):
\[ K\left( \mathbf{x}, \mathbf{x}' \right) = \left( \mathbf{x}^T \mathbf{x}' + r \right)^{p} \]
Radial basis function (RBF) kernel (with width parameter \(\gamma > 0\)):
\[ K\left( \mathbf{x}, \mathbf{x}' \right) = \exp\left( -\gamma \, ||\mathbf{x} - \mathbf{x}'||^2 \right) \]
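A sketch of these three kernels in NumPy, together with the Gram matrix \(K[i, j] = K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})\) that the kernelized dual consumes (the hyperparameter names and default values are illustrative, mirroring common conventions):

```python
import numpy as np

def linear_kernel(X1, X2):
    # K(x, x') = x^T x'
    return X1 @ X2.T

def polynomial_kernel(X1, X2, degree=3, coef0=1.0):
    # K(x, x') = (x^T x' + coef0)^degree
    return (X1 @ X2.T + coef0) ** degree

def rbf_kernel(X1, X2, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2), computed pairwise
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

# Gram matrix over a toy dataset
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
print(rbf_kernel(X, X))
```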