Support Vector Machine

Dr. Ashish Tendulkar

Machine Learning Techniques

IIT Madras

We will learn support vector machines (SVM) in this module.

  • SVM is a supervised machine learning algorithm that can be used for both classification and regression problems.
  • We will focus on the classification aspect of SVM in our course.
  • SVM works in both binary and multiclass classification setups.
  • SVM is a discriminative classifier, like the perceptron and logistic regression.

Overview

Following our template, we will describe all five components of the ML setup for SVM one by one.

Training data is the first component of our ML framework.

Training data

  • In the binary classification setup, the training data consists of
    • Feature matrix \(\mathbf{X}\) with shape \((n, m)\). Note that each example is represented with \(m\) features and there are \(n\) examples in total.
    • Label vector \(\mathbf{y}\) containing labels for the \(n\) examples, with shape \((n,)\).
  • In multiclass and multilabel setups, the training data consists of a feature matrix with the same specification as in the binary setup, together with
    • Label matrix \(\mathbf{Y}\) with shape \((n, k)\), encoding which of the \(k\) labels apply to each of the \(n\) examples. (The binary shapes are illustrated in the sketch below.)
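As a concrete illustration of these shapes, here is a minimal NumPy sketch; the array values are made up and serve only to show the \((n, m)\) feature matrix and the \((n,)\) label vector.

```python
import numpy as np

# A tiny made-up binary classification dataset: n = 4 examples, m = 2 features
X = np.array([[2.0, 3.0],
              [1.0, 1.5],
              [-1.0, -2.0],
              [-2.0, -1.0]])   # feature matrix, shape (n, m) = (4, 2)
y = np.array([1, 1, -1, -1])   # label vector, shape (n,) = (4,)

print(X.shape, y.shape)        # (4, 2) (4,)
```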

Model is the second component of our ML framework; we discuss it in the context of a binary classification setup.

Model

y = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)
  • Note that we have written the bias term separately here.  This is a linear classifier, which is familiar to us.
  • Labels are assumed to be +1 and -1.

In the model equation, \(y\) is the label, \(\mathbf{x}\) is the feature vector, \(\mathbf{w}\) is the weight vector, and \(\mathbf{w}^T \mathbf{x} + b\) is the linear combination of features plus the bias.
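A minimal NumPy sketch of this decision rule; the weight vector, bias, and input below are made-up values used only for illustration.

```python
import numpy as np

def predict(w, b, x):
    """Linear SVM decision rule: y = sign(w^T x + b), with labels in {+1, -1}."""
    score = np.dot(w, x) + b          # linear combination of features plus bias
    return 1 if score >= 0 else -1    # sign of the score gives the predicted label

# Illustrative (made-up) parameters and input
w = np.array([0.5, -1.0])   # weight vector
b = 0.25                    # bias term
x = np.array([2.0, 1.0])    # feature vector
print(predict(w, b, x))     # 1, since 0.5*2 - 1.0*1 + 0.25 = 0.25 >= 0
```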

Learning problem

Given the training data, find the values of \(\mathbf{w}\) and \(b\) that result in the minimum loss.

The SVM learns a hyperplane separating two classes with parameters \(\mathbf{w}\) and \(b\).

SVM finds this hyperplane in a slightly different manner than the other classifiers we have studied in this course.

It selects the hyperplane that maximizes the distance to the closest data points from both classes. 

In other words, it is the hyperplane with maximum margin between two classes.

How does it select such a hyperplane? It uses an appropriate loss function for this objective.

Loss function

Let's learn a few concepts before setting up the loss function.

There are three components here:

  • Separating hyperplane
  • Bounding planes
  • Support vectors

The separating hyperplane is the classifier.  It is at an equal distance from both classes and separates them such that there is maximum margin between the two classes.

Bounding planes are parallel to the separating hyperplane, one on either side, and pass through the support vectors.

Support vectors are the subset of training points closest to the separating hyperplane; they influence its position and orientation.

Bounding planes

The bounding planes are defined as follows:

The bounding plane on the side of the positive class has the following equation:

\mathbf{w}^T \mathbf{x} + b = 1

The bounding plane on the side of the negative class has the following equation:

\mathbf{w}^T \mathbf{x} + b = -1

Using the label of an example, we can write both in one equation as follows:

y(\mathbf{w}^T \mathbf{x} + b) = 1

Any point on or beyond the positive bounding plane belongs to the positive class:

\mathbf{w}^T \mathbf{x} + b \geq 1

Any point on or beyond the negative bounding plane belongs to the negative class:

\mathbf{w}^T \mathbf{x} + b \leq -1

Compactly, the correctly classified points satisfy the following equation:

y(\mathbf{w}^T \mathbf{x} + b) \geq 1

This constraint ensures that none of the points falls within the margin.
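A small NumPy check of this combined constraint; the parameters and points below are made up for illustration only.

```python
import numpy as np

# Made-up parameters of a separating hyperplane and two training points
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[3.0, 2.0],    # labelled +1
              [1.0, 0.5]])   # labelled -1
y = np.array([1, -1])

# A point is correctly classified and outside the margin iff y * (w.x + b) >= 1
margins = y * (X @ w + b)
print(margins)               # [2.  1.5]
print(np.all(margins >= 1))  # True: no point falls within the margin
```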

Support vectors

Support vectors are points on the bounding planes.

y(\mathbf{w}^T \mathbf{x} + b) = 1

Margin

The width of the margin is the projection of \(\left(\mathbf{x}_{+} - \mathbf{x}_{-}\right)\) onto the unit normal vector \(\frac{\mathbf{w}}{||\mathbf{w} ||}\).  Mathematically,

\text{width} = \left(\mathbf{x}_{+} - \mathbf{x}_{-} \right). \frac{\mathbf{w}}{||\mathbf{w}||}

For the positive support vector \(\mathbf{x}_{+}\) (using the dot product between the two vectors):

\begin{aligned} \mathbf{w}.\mathbf{x}_{+} + b &= +1 \\ \mathbf{w}.\mathbf{x}_{+} &= 1 - b \end{aligned}

For the negative support vector \(\mathbf{x}_{-}\):

\begin{aligned} \mathbf{w}.\mathbf{x}_{-} + b &= -1 \\ \mathbf{w}.\mathbf{x}_{-} &= - 1 - b \end{aligned}
\begin{aligned} \text{width} &= (\mathbf{x}_{+}-\mathbf{x}_{-}).\frac{\mathbf{w}}{||\mathbf{w}||} \\ &= \frac{\mathbf{w}.\mathbf{x}_{+} - \mathbf{w}.\mathbf{x}_{-}}{||\mathbf w||} \\ &= \frac{1-b-(-1-b)}{||\mathbf w||} \\ &= \frac{1-b+1+b}{||\mathbf w||} \\ &= \frac{2}{||\mathbf w||}\\ \end{aligned}
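As a quick numeric check of the width formula (the weight vector below is arbitrary):

```python
import numpy as np

w = np.array([3.0, 4.0])          # arbitrary weight vector with ||w|| = 5
width = 2 / np.linalg.norm(w)     # margin width = 2 / ||w||
print(width)                      # 0.4
```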

Our objective is to maximize the margin \( \frac{2}{||\mathbf{w}||}\), which is equivalent to minimizing \(||\mathbf{w}||\), or equivalently minimizing

\frac{1}{2} ||\mathbf{w}||^2

where the square and the factor \(\frac{1}{2}\) are used for mathematical convenience.

Therefore the optimization problem of linear SVM is written as follows:

\min_{\mathbf{w}, b} \frac 12 \Vert \mathbf{w} \Vert^2

such that

y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \ge 1, i = 1, \ldots,n

This is called the primal problem and it is guaranteed to have a global minimum.

Optimizing SVM Primal Problem

Let's solve this problem with an optimization procedure.  Recall the optimization problem of the linear SVM:

\min_{\mathbf{w}, b} \frac 12 \Vert \mathbf{w} \Vert^2

such that

y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \ge 1, i = 1, \ldots,n

  • This is a quadratic optimization problem: a quadratic objective with linear constraints.
  • It can be efficiently solved with quadratic programming (QP) solvers (see the sketch below).
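In practice we rarely call a QP solver by hand; a library implementation of the linear SVM does this for us. Below is a minimal, illustrative sketch using scikit-learn's SVC on made-up, linearly separable data; a very large C is used here to approximate the hard-margin formulation.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A linear-kernel SVC with a very large C behaves approximately like a hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # learned w and b
print(clf.support_vectors_)        # training points lying on the bounding planes
```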

Dual of SVM primal problem

  • Alternatively we can optimize the dual of the primal problem.  
  • For that we will make use of Lagrange multipliers.
J_p(\mathbf{w}, {b}, \mathbf{\alpha}) =\frac{1}{2}||\mathbf{w}||^2 - \sum_{i=1}^{n} \alpha^{(i)} \left(y^{(i)}(\mathbf w.\mathbf{x}^{(i)}+b)-1 \right)

subject to \(\alpha^{(i)}  \geq 0 \ \ \forall i = 1, \ldots, n\) 

This is called the Lagrangian function of the SVM, and it is differentiable with respect to \(\mathbf{w}\) and \(b\).  Setting the partial derivatives to zero:

\begin{aligned} \frac{\partial}{\partial \mathbf{w}} J(\mathbf{w}, {b}, \mathbf{\alpha}) = \mathbf{w} - \sum_{i=1}^n \alpha^{(i)} y^{(i)}\mathbf{x}^{(i)} &= 0 \end{aligned}
\begin{aligned} \frac{\partial}{\partial b} J(\mathbf{w}, {b}, \mathbf{\alpha}) = \sum_{i=1}^n \alpha^{(i)} y^{(i)} = 0 \\ \end{aligned}

which implies

\mathbf{w} =\sum_{i=1}^n \alpha^{(i)} y^{(i)}\mathbf{x}^{(i)}
\sum_{i=1}^n \alpha^{(i)} y^{(i)} = 0

Substituting these into the Lagrangian function of the SVM, we get the dual problem.

J(\mathbf{w}, {b}, \mathbf{\alpha}) =\frac{1}{2}||\mathbf{w}||^2 - \sum_{i=1}^{n} \alpha^{(i)} \left(y^{(i)}(\mathbf w.\mathbf{x}^{(i)}+b)-1 \right)
\begin{aligned} J_d (\mathbf{\alpha}) &= \sum_{i=1}^n \alpha^{(i)} - \frac 12 \sum_{i=1}^n \sum_{k=1}^n \alpha^{(i)} \alpha^{(k)} y^{(i)} y^{(k)} {x^{(i)}}^T x^{(k)} \\ &= \sum_{i=1}^n \alpha^{(i)} - \frac 12 \sum_{i=1}^n \sum_{k=1}^n \color{blue}\langle \alpha^{(i)} y^{(i)} x^{(i)}, \alpha^{(k)} y^{(k)} x^{(k)} \rangle \\ \end{aligned}

such that

\alpha^{(i)} \ge 0, i\in 1, \ldots, n \\ \sum_{i=1}^n \alpha^{(i)} y^{(i)} = 0

This is a concave problem that is maximized using a solver.

The dual problem is easier to solve because it is expressed only in terms of the Lagrange multipliers.

The dual problem depends on the training data only through the inner products of the examples.
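Concretely, the only data-dependent quantity in the dual objective is the Gram matrix of pairwise inner products. A small NumPy sketch with made-up data:

```python
import numpy as np

# Made-up feature matrix with n = 3 examples and m = 2 features
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [-2.0, 1.0]])

# Gram matrix G[i, k] = <x^(i), x^(k)>: all the dual objective needs from the data
G = X @ X.T
print(G)          # a symmetric (3, 3) matrix of inner products
```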

Strong duality requires the Karush–Kuhn–Tucker (KKT) conditions; in particular, complementary slackness:

\mathbf{\alpha}^{(i)} \left(\mathbf{y}^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) - 1 \right) = 0, \forall i \in 1, \ldots, n

This implies

If \(\mathbf{\alpha}^{(i)} > 0 \) then \(\mathbf{y}^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) = 1 \).  The data point is a support vector.  In other words, it is located on one of the bounding planes.

If \(\mathbf{y}^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) > 1 \), then \(\mathbf{\alpha}^{(i)} = 0\).  The distance between the data point and the separating hyperplane is more than the margin, and the point is not a support vector.

Both primal and dual problems can be solved with optimization solvers to obtain the separating hyperplane for SVMs.
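For instance, after fitting a linear SVC as in the earlier sketch, we can inspect the dual solution; in scikit-learn, dual_coef_ stores \(\alpha^{(i)} y^{(i)}\) for the support vectors only. The data and parameter values below are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Points with alpha^(i) = 0 do not appear here: only support vectors are stored
print(clf.support_)        # indices of the support vectors in the training set
print(clf.dual_coef_)      # signed multipliers alpha^(i) * y^(i)

# Recover w = sum_i alpha^(i) y^(i) x^(i) from the dual solution
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True for a linear kernel
```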

This flavour of SVM, where the classes are linearly separable, is called the hard margin SVM.

Soft margin SVMs

  • The classes are largely linearly separable, but a few points are misclassified or lie within the margin.
  • We are unable to find a perfect hyperplane that maximizes the margin.
  • We would like to adjust the loss function so as to learn a hyperplane that tolerates a small number of misclassified examples.


Slack variable

We introduce a slack variable \(\mathbf{\xi}^{(i)}\) for each training point in the constraint as follows: 

\begin{aligned} y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) &\geq 1 - \mathbf{\xi}^{(i)}; \\ \mathbf{\xi}^{(i)} &\geq 0 \end{aligned}

The constraints are now less strict because each training point \(\mathbf{x}^{(i)}\) need only be at a distance of \(1 - \mathbf{\xi}^{(i)}\) from the separating hyperplane instead of a hard distance of 1.


In order to prevent the slack variables from becoming too large, we penalize them in the objective function:

\min_{\mathbf{w}, b} \frac 12 \Vert \mathbf{w} \Vert^2 + \color{blue} C \sum_{i=1}^{n} \mathbf{\xi}^{(i)}

such that

\begin{aligned} y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) &\ge 1 - \color{blue} \mathbf{\xi}^{(i)} \\ \forall_i \mathbf{\xi}^{(i)} &\geq 0 \end{aligned}
  • Slack allows an input to be closer to the hyperplane, or even to be on the wrong side.
  • When C is large, the SVM becomes strict and tries to get all points on the correct side of the hyperplane.
  • When C is small, the SVM is lenient and allows more misclassifications or points lying within the margin.

Let's derive an unconstrained formulation.

For \(C > 0\), our objective is to make each \(\mathbf{\xi}^{(i)}\) as small as possible; at the optimum, the constraint is tight whenever the slack is non-zero, i.e., \(y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) = 1 - \mathbf{\xi}^{(i)}\).  Therefore,

\mathbf{\xi}^{(i)} = \begin{cases} 1 - \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) &\text{if } \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) < 1\\ 0 & \text{if } \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) \geq 1 \end{cases}

This is equivalent to

\mathbf{\xi}^{(i)} = \text{max}\left(1 - \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b), 0 \right)

Let's plug this in the soft margin SVM objective function.

We add a non-zero slack \(1 - \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b)\) for misclassified points or points inside the margin.

\min_{\mathbf{w}, b} \frac 12 \Vert \mathbf{w} \Vert^2 + \color{blue} C \sum_{i=1}^{n} \mathbf{\xi}^{(i)}
\min_{\mathbf{w}, b} \frac 12 \Vert \mathbf{w} \Vert^2 + C \sum_{i=1}^{n} \text{max}\left(1 - \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b), 0 \right)
  • The second term is the hinge loss.

In the plot of the hinge loss, the x-axis is \(t = y^{(i)} (\mathbf w.\mathbf{x}^{(i)}+b) \) and the y-axis is the misclassification cost.
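A minimal NumPy sketch of the hinge loss as a function of \(t\); the values of \(t\) below are made up.

```python
import numpy as np

def hinge_loss(t):
    """Hinge loss as a function of t = y * (w.x + b)."""
    return np.maximum(0.0, 1.0 - t)

t = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(t))   # [2.  1.  0.5 0.  0. ] -- zero once the point clears the margin
```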

We need to minimize the soft margin loss to find the max-margin classifier.

\begin{aligned} \frac{\partial}{\partial \mathbf{w}} J(\mathbf{w}, {b}) = \frac{\partial}{\partial \mathbf{w}} \left(\frac 12 \Vert \mathbf{w} \Vert^2 + C \sum_{i=1}^{n} \text{max}\left(1 - \mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b), 0 \right) \right) \end{aligned}

The partial derivative of the second term depends on whether the point incurs a misclassification penalty:

\frac{\partial}{\partial \mathbf{w}} \max \left( 0, [1 - \mathbf{y}^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)]\right) = \begin{cases} \mathbf 0 &\text{if max } (0,[1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b)]) = 0 \\ -\mathbf{y}^{(i)} \mathbf{x}^{(i)} &\text{otherwise} \end{cases}
\frac{\partial}{\partial b} \max \left( 0, [1 - \mathbf{y}^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b)]\right) = \begin{cases} 0 &\text{if max } (0,[1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b)]) = 0 \\ -\mathbf{y}^{(i)} &\text{otherwise} \end{cases}

Writing compactly:

\frac{\partial}{\partial \mathbf{w}} J(\mathbf{w}, {b}) = \mathbf{w} - C \sum_{i=1}^{n} \mathbf{1} \left(1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) > 0 \right) \mathbf{y}^{(i)} \mathbf{x}^{(i)}
\frac{\partial}{\partial b} J(\mathbf{w}, {b}) = - C \sum_{i=1}^{n} \mathbf{1} \left(1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) > 0 \right) \mathbf{y}^{(i)}


Gradient descent update rule

\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \text{learning rate } \times \left( \mathbf{w} - C \sum_{i=1}^{n} {\color{blue} \mathbf{1} \left(1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) > 0 \right)} \mathbf{y}^{(i)} \mathbf{x}^{(i)} \right)
b^{(\text{new})} = b^{(\text{old})} + \text{learning rate } \times C \sum_{i=1}^{n} {\color{blue} \mathbf{1} \left(1-\mathbf{y}^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) > 0 \right)} \mathbf{y}^{(i)}
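A hedged sketch of these update rules in NumPy on made-up data; the learning rate, C, and the number of epochs are arbitrary choices, and this is a plain batch subgradient loop rather than a production implementation.

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Batch subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        # Indicator of margin-violating points: 1 - y_i (w.x_i + b) > 0
        violating = y * (X @ w + b) < 1
        grad_w = w - C * (y[violating][:, None] * X[violating]).sum(axis=0)
        grad_b = -C * y[violating].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_soft_margin_svm(X, y, C=10.0)
print(w, b)
print(np.sign(X @ w + b))   # should match y on this separable toy set
```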

Evaluation measures

  • The usual classification evaluation measures apply: precision, recall, F1-score and accuracy (see the sketch below).
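For example, with scikit-learn's metric functions; the true labels and predictions below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up true labels and SVM predictions
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, -1]

print(accuracy_score(y_true, y_pred))                 # 0.833...
print(precision_score(y_true, y_pred, pos_label=1))   # 1.0
print(recall_score(y_true, y_pred, pos_label=1))      # 0.666...
print(f1_score(y_true, y_pred, pos_label=1))          # 0.8
```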

Kernel SVMs

Basic Idea

  • Kernel SVM is used for non-linearly separable data.
  • Remember that, so far, we performed a non-linear transformation of the input feature space (e.g. a polynomial transformation) and then trained the model in the transformed space to learn non-linear decision boundaries.
  • Kernel SVM computes the dot product in the transformed feature space without explicitly calculating the transformation.
Recall the SVM dual objective function:

\begin{aligned} J_d (\mathbf{\alpha}) &= \sum_{i=1}^n \alpha^{(i)} - \frac 12 \sum_{i=1}^n \sum_{k=1}^n \alpha^{(i)} \alpha^{(k)} \mathbf{y}^{(i)} \mathbf{y}^{(k)} \color{blue}{\mathbf{x}^{(i)}}^T \mathbf{x}^{(k)} \\ \end{aligned}

such that

\alpha^{(i)} \ge 0, i\in 1, \ldots, n \\ \sum_{i=1}^n \alpha^{(i)} \mathbf{y}^{(i)} = 0

Writing this with a kernel function:

\begin{aligned} J_d (\mathbf{\alpha}) &= \sum_{i=1}^n \alpha^{(i)} - \frac 12 \sum_{i=1}^n \sum_{k=1}^n \alpha^{(i)} \alpha^{(k)} \mathbf{y}^{(i)} \mathbf{y}^{(k)} \color{blue}K({\mathbf{x}^{(i)}}, \mathbf{x}^{(k)}) \\ \end{aligned}

A kernel computes the dot product between input feature vectors in a high-dimensional space without actually projecting or transforming the input features into that space.

Linear kernel:

K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = {\mathbf{x}^{(i)}}^T \mathbf{x}^{(j)}

Polynomial kernel:

K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \left( 1 + {\mathbf{x}^{(i)}}^T \mathbf{x}^{(j)} \right)^d

Radial basis function (RBF) kernel:

K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \exp \left( - \frac{||\mathbf{x}^{(i)}-\mathbf{x}^{(j)}||^2}{\sigma^2} \right)
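Below is a small NumPy sketch that evaluates these three kernels directly; the function names and the input vectors are illustrative choices, and the RBF bandwidth follows the \(\sigma^2\) convention used above.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, d=2):
    return (1 + a @ b) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

# Made-up feature vectors for illustration
x1 = np.array([1.0, 2.0])
x2 = np.array([0.5, -1.0])
print(linear_kernel(x1, x2))        # -1.5
print(polynomial_kernel(x1, x2))    # 0.25
print(rbf_kernel(x1, x2))           # exp(-9.25), roughly 9.6e-05
```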

Demonstrating the kernel trick with a degree-2 polynomial kernel

Consider the explicit feature map \(\phi(\mathbf{a}) = \begin{pmatrix} a_1^2 & \sqrt 2 \, a_1 a_2 & a_2^2 \end{pmatrix}^T\) for a two-dimensional input.  Then

\begin{aligned} \phi(\textbf a)^T \phi(\textbf b) &= \begin{pmatrix} a_1^2 \\ \sqrt 2 \space a_1 a_2 \\ a_2^2 \end{pmatrix} ^T \begin{pmatrix} b_1^2 \\ \sqrt 2 \space b_1 b_2 \\ b_2^2 \end{pmatrix} \\ &= a_1^2b_1^2 + 2 a_1b_1a_2b_2 + a_2^2b_2^2 \\ &= (a_1b_1 + a_2b_2)^2 \\ &= \Bigg ( \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}^T \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \Bigg )^2 \\ &= (\textbf a^T \textbf b)^2 \end{aligned}
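A numerical check of this identity; the vectors are made up, and the feature map above is implemented explicitly only to verify that the kernel gives the same value.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-d input: (v1^2, sqrt(2) v1 v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# Made-up 2-d vectors for illustration
a = np.array([1.0, 3.0])
b = np.array([2.0, -1.0])

lhs = phi(a) @ phi(b)        # dot product in the transformed 3-d space
rhs = (a @ b) ** 2           # kernel evaluated directly in the original 2-d space
print(lhs, rhs)              # both equal 1.0
print(np.isclose(lhs, rhs))  # True
```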

Model with kernel SVM

h(\mathbf{x})= \text{sign} \left(\sum_{i=1}^{n} \mathbf{\alpha}^{(i)} \mathbf{y}^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}) + b \right)
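A brief, illustrative scikit-learn sketch of a kernel SVM on made-up data with an XOR-like labelling, which no linear classifier can separate; the gamma and C values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with an XOR-like labelling: not linearly separable
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# An RBF-kernel SVM separates this data without any explicit feature transformation
clf = SVC(kernel="rbf", gamma=1.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))          # [ 1  1 -1 -1]
print(clf.support_vectors_)    # here all four points end up as support vectors
```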
