Dr. Ashish Tendulkar

Machine Learning Techniques

IIT Madras

Models of Classification

What topics will be discussed?

  1. Discriminant functions: Learn direct mapping between feature vector \(\mathbf{x}\) and label \(y\).
  2. Generative and discriminative models:
    • Generative classifiers model the class-conditional densities \( p(\mathbf{x}|y)\) of the features and the prior probabilities \(p(y)\) of the classes, and then use Bayes' theorem to calculate \(p(y|\mathbf{x})\).
    • Discriminative classifiers learn the conditional probability distribution \(p(y|\mathbf{x})\) through parametric models.
  3. Instance based models - Compare the test examples with the training examples and assign class labels based on a certain measure of similarity.

Part I: Classification setup

Classification setup

  • Predict class label \(y\) of an example based on the feature vector \(\mathbf{x}\).
  • Class label \(y\) is a discrete quantity, unlike the real-valued target in the regression setup.

Nature of class label

  • Label is a discrete quantity - precisely an element in some finite set of class labels.
  • Depending on the nature of the problem, we have one or more labels assigned to each example.

Types of classification

  1. Single label classification - where each example has exactly one label.
    • e.g. is the applicant eligible for a loan?
    • Label set: {yes, no}.
    • Label: either yes or no.
  2. Multi-label classification - where each example can have more than one label.
    • e.g. identifying different types of fruits in a picture.

Label representation: Single example

  1. Single label classification: Label is a scalar quantity and is represented by \(y\).
  2. Multi-label classification: More than one label hence vector representation \(\mathbf{y}\).
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_\color{blue}{k} \end{bmatrix}

Label set: \(\{y_1, y_2, \ldots, y_k\}\) has \(k\) elements/labels.

If a label is present for the example, the corresponding component of \(\mathbf{y}\) is set to 1; otherwise it is 0.

Example: Single label classification (Binary)

  • Is the applicant eligible for a loan?
  • Label set: \(\{yes, no\}\), usually converted to \(\{1, 0\}\).
    • Label: either \(yes\ (1)\) or \(no\ (0)\).
  • Training example:
    • Feature vector: \(\mathbf{x}\) - features for loan application like age of applicant, income, number of dependents etc.
    • Label: \(y\)

Example: Single label classification (Multiclass)

  • Types of iris flower
  • Label set: \(C = \{versicolor, setosa, virginica\}\)
  • Label: exactly one label from set \(C\).

Example: Single label classification (Multiclass)

  • Types of iris flower

[Figure: example images of versicolor, setosa, and virginica flowers. Image source: Wikipedia.org]

Label encoding in multiclass setup

Use one-hot encoding scheme for label encoding.

 

  • Make use of a vector \(\mathbf{y}\) with as many components as there are labels in the label set.
  • In iris example, this would become:
\mathbf{y} = \begin{bmatrix} y_{versicolor} \\ y_{setosa} \\ y_{virginica} \\ \end{bmatrix}

Example: Label encoding (single label)

Let's assume that the flower has the label versicolor; we will encode it as follows:

\mathbf{y} = \begin{bmatrix} y_{versicolor} = 1\\ y_{setosa} = 0\\ y_{virginica} = 0\\ \end{bmatrix}

Note that the component of \(\mathbf{y}\) corresponding to the label versicolor is 1, every other component is 0.

\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}
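A minimal sketch of this encoding in Python with NumPy (the label ordering and the helper name one_hot are our own choices for illustration):

import numpy as np

labels = ["versicolor", "setosa", "virginica"]  # fixed ordering of the label set

def one_hot(label, labels=labels):
  # Vector with one component per label; the component of the given label is 1.
  y = np.zeros(len(labels))
  y[labels.index(label)] = 1
  return y

print(one_hot("versicolor"))  # [1. 0. 0.]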

Example: Multi-label Classification

  • Label all fruits from an image.
  • Label set: list of fruits, e.g. \(\{apple, guava, mango, orange, banana, strawberry\}\)
  • Label: One or more fruits as they are present in the image.
\mathbf{y} = \begin{bmatrix} y_{apple} \\ y_{guava} \\ y_{mango} \\ y_{orange} \\ y_{banana} \\ y_{strawberry} \\ \end{bmatrix}

Example: Multi-label Classification

[Figure: sample image containing several fruits. Image source: Wikipedia.org]

Example: Multi-label Classification

Different fruits in the image are: apple, banana, and orange.

Example: Multi-label Classification

  • Let's assume that the labels are \(\mathbf{apple}\), \(\mathbf{orange}\) and \(\mathbf{banana}\).
\mathbf{y} = \begin{bmatrix} y_{apple} = 1\\ y_{guava} = 0\\ y_{mango} = 0\\ y_{orange} = 1\\ y_{banana} = 1\\ y_{strawberry} = 0\\ \end{bmatrix}

which becomes

\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0\\ \end{bmatrix}

Training Data: Binary Classification

  • Let's denote by \(D\) a set of \(n\) pairs of a feature vector \(\mathbf{x}_{m \times 1}\) and a label \(y\), representing the examples.
D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n}
  • \(\mathbf{X}\) is a feature matrix corresponding to all the training examples and has shape \(n \times m\). In this matrix, each feature vector is transposed and represented as a row.

Training Data: Binary Classification

D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n}
\mathbf{X}_{n \times m} = \begin{bmatrix} - \left(\mathbf{x}^{(1)}\right)^T - \\ - \left(\mathbf{x}^{(2)}\right)^T - \\ \vdots \\ - \left(\mathbf{x}^{(n)}\right)^T - \\ \end{bmatrix}

Concretely, the feature vector of the \(i\)-th training example, \(\mathbf{x}^{(i)}\), is the \(i\)-th row of \(\mathbf{X}\) (transposed) and can be obtained as \(\mathbf{X}[i]\).

Training Data: Binary Classification

D = \left\{ (\mathbf{X}, \mathbf{y})\right\} = \left\{ (\mathbf{x}^{(i)}, y^{(i)})\right\}_{i=1}^{n}
\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \\ \end{bmatrix}

\(\mathbf{y}\) is a label vector of shape \(n \times 1\). The \(i\)-th entry in this vector gives the label of the \(i\)-th example, denoted by \(y^{(i)}\).
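As a small illustration of these shapes (the numbers below are made up, not from the lecture), the training data can be stored in NumPy arrays as follows:

import numpy as np

# Feature matrix X of shape (n, m): each row is one transposed feature vector,
# e.g. [age, income] of a loan applicant.
X = np.array([[35, 50000],
              [42, 72000],
              [29, 31000]])

# Label vector y of shape (n,): the i-th entry is the label of the i-th example.
y = np.array([1, 1, 0])

print(X.shape, y.shape)  # (3, 2) (3,)
print(X[0])              # feature vector of the first example (0-indexed in code)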

Training Data: Multi-class classification

A set of \(n\) pairs of a feature vector \(\mathbf{x}\) and a label vector \(\mathbf{y}\) representing examples. We denote it by \(D\):

D = \left\{ (\mathbf{X}, \mathbf{Y})\right\} = \left\{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\right\}_{i=1}^{n}

where \(\mathbf{X}\) is an \(n \times m\) feature matrix:

\mathbf{X}_{n \times m} = \begin{bmatrix} - \left(\mathbf{x}^{(1)}\right)^T - \\ - \left(\mathbf{x}^{(2)}\right)^T - \\ \vdots \\ - \left(\mathbf{x}^{(n)}\right)^T - \\ \end{bmatrix}

Training Data: Multi-class classification

D = \left\{ (\mathbf{X}, \mathbf{Y})\right\} = \left\{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\right\}_{i=1}^{n}

\(\mathbf{Y}\) is a label matrix of shape \(n \times k\), where \(k\) is the total number of classes in the label set.

\mathbf{Y} = \begin{bmatrix} - \left(\mathbf{y}^{(1)}\right)^T - \\ - \left(\mathbf{y}^{(2)}\right)^T - \\ \vdots \\ - \left(\mathbf{y}^{(n)}\right)^T - \\ \end{bmatrix}

Multi-class and multi-label classification label vector

  • Multi-class classification: For \(\left(\mathbf{y}^{(i)}\right)^T\), exactly one entry, corresponding to the class label, is 1.
  • Multi-label classification: For \(\left(\mathbf{y}^{(i)}\right)^T\), one or more entries, corresponding to the class labels present, can be 1.
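A small sketch contrasting the two cases in NumPy (the label sets and the helper encode_rows are illustrative, not part of the lecture):

import numpy as np

def encode_rows(examples, labels):
  # Each row of Y is the encoding of the label set of one example.
  Y = np.zeros((len(examples), len(labels)))
  for i, present in enumerate(examples):
    for lab in present:
      Y[i, labels.index(lab)] = 1
  return Y

classes = ["versicolor", "setosa", "virginica"]
fruits = ["apple", "guava", "mango", "orange", "banana", "strawberry"]

# Multi-class: exactly one entry per row is 1.
Y_multiclass = encode_rows([["setosa"], ["versicolor"]], classes)

# Multi-label: one or more entries per row can be 1.
Y_multilabel = encode_rows([["apple", "orange", "banana"], ["mango"]], fruits)

print(Y_multiclass)        # [[0. 1. 0.] [1. 0. 0.]]
print(Y_multilabel.shape)  # (2, 6)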

Part II: Discriminant Functions

Overview

A discriminant function maps the feature vector \(\mathbf{x}\) directly to the label \(y\): \(\mathbf{x} \;\longrightarrow\; \text{Discriminant Function} \;\longrightarrow\; y\)

Example: Two classes

The simplest discriminant function is very similar to linear regression:

\begin{aligned} y &= w_0 + w_1 x_1 + \ldots + w_m x_m \\ &= w_0 + \mathbf{w}^T \mathbf{x} \end{aligned}

where,

  • \(w_0\): Bias [Keeping this separately for a reason]
  • \(\mathbf{w}\): Weight vector
  • \(\mathbf{x}\): Feature vector
  • \(y\): label

The decision boundary of the simplest discriminant function \(y = w_0 + \mathbf{w}^T \mathbf{x}\), given by \(y = 0\), is an \((m-1)\)-dimensional hyperplane in the \(m\)-dimensional feature space, where \(m\) is the number of features.

Geometric Interpretation

Here we have \(m = 2\) features, so the decision boundary is an \((m-1) = 1\)-dimensional hyperplane, i.e. a line in the 2-D feature space.

Classification with discriminant functions 

Classification is performed as follows:

y = \begin{cases} 1, & \text{if}\ w_0 + \mathbf{w}^T \mathbf{x} > 0 \\ 0, & \text{otherwise} \end{cases}

The decision boundary is defined by

w_0 + \mathbf{w}^T \mathbf{x} = 0

What does \(\mathbf{w}\) represent?

Consider two points \(\mathbf{x}^{(A)}\) and \(\mathbf{x}^{(B)}\) on the decision surface. We will have

y^{(A)} = w_0 + \mathbf{w}^T \mathbf{x}^{(A)} = 0 \\ y^{(B)} = w_0 + \mathbf{w}^T \mathbf{x}^{(B)} = 0

Since \(y^{(A)} = y^{(B)} = 0\), subtracting \(y^{(B)}\) from \(y^{(A)}\) gives:

\mathbf{w}^T (\mathbf{x}^{(A)} - \mathbf{x}^{(B)}) = 0

What does \(\mathbf{w}\) represent?

The vector \(\mathbf{w}\) is orthogonal to every vector lying within the decision surface, hence it determines the orientation of the decision surface.
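A quick numerical check of this property, using an arbitrary weight vector and two points constructed to lie on the decision surface (the numbers are illustrative only):

import numpy as np

w0, w = 1.0, np.array([2.0, -1.0])

# Two points on the decision surface w_0 + w^T x = 0, i.e. x_2 = 2 x_1 + 1.
x_A = np.array([0.0, 1.0])
x_B = np.array([1.0, 3.0])

print(w0 + w @ x_A, w0 + w @ x_B)  # 0.0 0.0: both points lie on the surface
print(w @ (x_A - x_B))             # 0.0: w is orthogonal to vectors within the surface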

What does \(w_0\) represent?

For points on the decision surface, we have

\begin{aligned} w_0 + \mathbf{w}^T \mathbf{x} &= 0 \\ \mathbf{w}^T \mathbf{x} &= - w_0 \\ \end{aligned}

Dividing both sides by the length \(||\mathbf{w}||\) of the weight vector, we get the normal distance from the origin to the decision surface:

\dfrac{\mathbf{w}^T \mathbf{x}}{||\mathbf{w}||} = - \dfrac{w_0}{||\mathbf{w}||}

Hence \(w_0\) determines the location of the decision surface.

What does \(y\) represent?

\(y\) gives signed measure of perpendicular distance of the point \(\mathbf{x}\) from the decision surface.

  • \(w_0\) determines the location of the decision surface.
  • \(\mathbf{w}\) is orthogonal to every vector lying within the decision surface, hence it determines the orientation of the decision surface.


Alternate interpretation

By using a dummy feature \(x_0\) and setting it to 1, we get the following equation:

\begin{aligned} y &= w_0 x_0+ w_1 x_1 + \ldots + w_m x_m \\ &= \mathbf{w}^T \mathbf{x} \end{aligned}

This represents a decision surface that is an \(m\)-dimensional hyperplane passing through the origin of the \((m+1)\)-dimensional augmented feature space.
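In code, this amounts to prepending a column of ones to the feature matrix so that the bias \(w_0\) is absorbed into the weight vector; a minimal NumPy sketch (feature values are illustrative):

import numpy as np

X = np.array([[35, 50000],
              [42, 72000],
              [29, 31000]], dtype=float)

# Dummy feature x_0 = 1 as the first column.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

print(X_aug.shape)  # (3, 3): each row is now [1, x_1, ..., x_m]
# With w = [w_0, w_1, ..., w_m], the discriminant becomes y = X_aug @ w.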

Multiple classes

Assuming the number of classes to be \(k > 2\), we can build discriminant functions in two ways:

 

  • One-vs-rest: Build \(k-1\) discriminant functions. Each discriminant function solves a two-class classification problem: class \(C_k\) vs. not \(C_k\).
  • One-vs-one: One discriminant function per pair of classes. Total functions = \({k \choose 2} = \frac{k(k-1)}{2}\).
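The one-vs-one pairings can be enumerated directly; a small sketch (class names are placeholders) confirming the count \({k \choose 2}\):

from itertools import combinations

classes = ["C1", "C2", "C3", "C4"]  # k = 4 classes, for illustration
k = len(classes)

pairs = list(combinations(classes, 2))  # one discriminant function per pair
print(pairs)
print(len(pairs), k * (k - 1) // 2)     # 6 6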

Issues with one-vs-rest

  • Two discriminant functions, one for each of the classes \(C_1\) and \(C_2\).
  • Each discriminant function separates \(C_k\) from not \(C_k\).
  • The region of ambiguity is shown in green.

Issues with one-vs-one

  • \(k(k-1)/2\) discriminant functions, one for each class pair \(C_i\) and \(C_j\).
  • Each discriminant function separates \(C_i\) and \(C_j\).
  • Each point is classified by majority vote.
  • Region of ambiguity is in green.

How do we fix it?

A single \(k\)-class discriminant comprising \(k\) linear functions as follows:

\begin{aligned} y_k &= w_{k0} + w_{k1} x_1 + \ldots + w_{km} x_m \\ &= w_{k0} + \mathbf{w_k}^T \mathbf{x} \end{aligned}

How do we fix it?

Concretely:

\begin{aligned} y_\color{blue}{1} &=& w_{\color{blue}{1}\color{black}{0}} + \mathbf{w_\color{blue}{1}}^T \mathbf{x} \\ y_\color{blue}{2} &=& w_{\color{blue}{2}\color{black}{0}} + \mathbf{w_\color{blue}{2}}^T \mathbf{x} \\ \vdots \\ y_\color{blue}{k} &=& w_{\color{blue}{k}\color{black}{0}} + \mathbf{w_\color{blue}{k}}^T \mathbf{x} \end{aligned}

Classification in \(k\)-discriminant functions

Assign class \(C_k\) to example \(\mathbf{x}\) if \(y_k > y_j, \ \forall j \ne k\).

The decision boundary between classes \(C_k\) and \(C_j\) corresponds to an \((m-1)\)-dimensional hyperplane:

(w_{k0} - w_{j0}) + (\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} = 0

This has the same form as the decision boundary for the two-class case:

\( w_0 + \mathbf{w}^T \mathbf{x} = 0 \)
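A minimal sketch of this decision rule in NumPy, assuming the biases \(w_{k0}\) are folded into a weight matrix W (one column per class) via the dummy feature \(x_0 = 1\); all numbers are illustrative:

import numpy as np

# m = 2 features, k = 3 classes; column j of W holds [w_{j0}, w_{j1}, w_{j2}].
W = np.array([[ 0.5, -1.0,  0.2],
              [ 1.0,  0.3, -0.7],
              [-0.4,  0.8,  0.6]])

x = np.array([1.0, 2.0, -1.0])  # augmented feature vector [x_0 = 1, x_1, x_2]

scores = x @ W            # discriminant values y_1, ..., y_k
print(np.argmax(scores))  # index of the class with the largest y_k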

Now that we have a model of linear discriminant functions, we will study two approaches for learning the parameters of the model:

  • Least squares
  • Perceptron

Least squares classification


Sample Training Data

Let's implement the model inference function:

import numpy as np

def predict(X, w):
  # X is the feature matrix with a leading column of ones (dummy feature) for the bias w_0.
  z = X @ w  # score w_0 + w^T x for each example
  return (z >= 0).astype(int)  # 1 on the positive side of the hyperplane, 0 otherwise

Decision Boundary Visualization

[Figure: sample training data with a random decision boundary.]
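A possible sketch of such a visualization, with synthetic two-feature data and an arbitrary (random-looking) weight vector; neither the data nor the weights come from the lecture:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic two-class training data in 2-D feature space.
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))

# An arbitrary decision boundary w_0 + w_1 x_1 + w_2 x_2 = 0.
w0, w1, w2 = 0.5, 1.0, 1.5
x1_grid = np.linspace(-5, 5, 100)
x2_boundary = -(w0 + w1 * x1_grid) / w2  # solve the boundary equation for x_2

plt.scatter(X0[:, 0], X0[:, 1], marker="o", label="class 0")
plt.scatter(X1[:, 0], X1[:, 1], marker="x", label="class 1")
plt.plot(x1_grid, x2_boundary, "k--", label="decision boundary")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.legend()
plt.show()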

Loss function: Least Square Error

The total loss is the sum of squared errors between the actual and predicted labels at each training point.

The error at the \(i\)-th training point is calculated as follows:

\begin{aligned} e^{(i)} &= (\mathrm{\color{blue}{actual\ label} - \color{red}{predicted\ label}})^2 \\ &= \left (\color{blue}{y^{(i)}} - \color{red}{h_{\mathbf{w}}(\mathbf{x}^{(i)})} \color{black}\right)^2 \\ &= \left (\color{blue}{y^{(i)}} - \color{red}{\mathbf{w}^T \mathbf{x}^{(i)}} \color{black}\right)^2 \end{aligned}

Loss function: Least Square Error

The total loss \(J(\mathbf{w})\) is the sum of the errors over all training points:

J(\mathbf{w}) = \sum_{i=1}^{n} e^{(i)} = \mathbf{e}^T \mathbf{e}

Note that the loss depends on the value of \(\mathbf{w}\): as these values change, we get a new model, which results in different predictions and hence a different error at each training point.
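A direct translation of this loss into NumPy (a sketch assuming X already contains the dummy ones column and w includes \(w_0\)):

import numpy as np

def squared_error_loss(X, y, w):
  # J(w) = sum_i (y_i - w^T x_i)^2 = e^T e
  e = y - X @ w  # error vector over all training points
  return e @ e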

Optimization: Normal equation

Calculate the derivative of the loss function \(J(\mathbf{W})\) w.r.t. the weight matrix \(\mathbf{W}\) (here written in matrix form, with \(\mathbf{X}\) the feature matrix and \(\mathbf{Y}\) the label matrix):

\dfrac{\partial J(\mathbf{W})}{\partial \mathbf{W}} = 2(\mathbf{X}^T \mathbf{X} \mathbf{W} - \mathbf{X}^T \mathbf{Y})

Set \(\dfrac{\partial J(\mathbf{W})}{\partial \mathbf{W}}\) to 0 and solve for \(\mathbf{W}\):

\begin{aligned} 0 &= 2(\mathbf{X}^T \mathbf{X} \mathbf{W} - \mathbf{X}^T \mathbf{Y}) \\ \mathbf{W} &= \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{Y} \end{aligned}

The quantity \(\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T\) is the pseudo-inverse of \(\mathbf{X}\). Whenever \(\mathbf{X}^T \mathbf{X}\) is not full rank (and hence not invertible), we use the Moore-Penrose pseudo-inverse of \(\mathbf{X}\) instead.
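A sketch of this solution in NumPy; np.linalg.pinv computes the Moore-Penrose pseudo-inverse, which also handles the rank-deficient case (variable names are our own):

import numpy as np

def fit_least_squares(X, Y):
  # W = (X^T X)^(-1) X^T Y, computed via the pseudo-inverse of X.
  return np.linalg.pinv(X) @ Y

# For a full-rank X this coincides with np.linalg.inv(X.T @ X) @ X.T @ Y,
# but the pseudo-inverse is numerically safer.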


Evaluation metrics

  • Confusion matrix
  • Precision/Recall/F1 score

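A brief sketch of how these metrics can be computed with scikit-learn, on made-up predictions:

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall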