Dr. Ashish Tendulkar
Machine Learning Techniques
IIT Madras
Models of Classification
What topics will be discussed?
- Discriminant functions: Learn direct mapping between feature vector \(\mathbf{x}\) and label \(y\).
- Generative and discriminative models:
- Generative classifiers model the class conditional densities \( p(\mathbf{x}|y)\) of the features and the prior probabilities of the classes \(p(y)\), and then calculate \(p(y|\mathbf{x})\) through Bayes theorem.
- Discriminative classifiers learn the conditional probability distribution \(p(y|\mathbf{x})\) directly through parametric models.
- Instance based models: Compare the test examples with the training examples and assign class labels based on some measure of similarity.
Part I: Classification setup
Classification setup
- Predict class label \(y\) of an example based on the feature vector \(\mathbf{x}\).
- The class label \(y\) is a discrete quantity, unlike the real-valued target in the regression setup.
Nature of class label
- Label is a discrete quantity - precisely an element in some finite set of class labels.
- Depending on the nature of the problem, we have one or more labels assigned to each example.
Types of classification
- Single label classification - where each example has exactly one label.
- e.g. is the applicant eligible for a loan?
- Label set: {yes, no}.
- Label: either yes or no.
- Multi-label classification - where each example can have more than one label.
- e.g. identifying different types of fruits in a picture.
Label representation: Single example
- Single label classification: Label is a scalar quantity and is represented by \(y\).
- Multi-label classification: More than one label hence vector representation \(\mathbf{y}\).
Label set: \(\{y_1, y_2, \ldots, y_k\}\) has \(k\) elements/labels.
If a label is present for an example, the corresponding component of \(\mathbf{y}\) is set to 1.
Example: Single label classification (Binary)
- Is the applicant eligible for a loan?
- Label set: \(\{yes, no\}\), usually converted to \(\{1, 0\}\)
- Label: either \(yes\) (1) or \(no\) (0).
- Training example:
- Feature vector: \(\mathbf{x}\) - features for loan application like age of applicant, income, number of dependents etc.
- Label: \(y\)
Example: Single label classification (Multiclass)
- Types of iris flower
- Label set: \(C = \{versicolor, setosa, virginica\}\)
- Label: exactly one label from set \(C\).
Example: Single label classification (Multiclass)
- Types of iris flower: versicolor, setosa, virginica (images; source: Wikipedia.org).
Label encoding in multiclass setup
Use one-hot encoding scheme for label encoding.
- Make use of a vector \(\mathbf{y}\) with as many components as there are labels in the label set.
- In the iris example, \(\mathbf{y}\) has three components, one each for versicolor, setosa and virginica.
Example: Label encoding (single label)
Let's assume that the flower has the label versicolor; we will encode it as \(\mathbf{y} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}^T\).
Note that the component of \(\mathbf{y}\) corresponding to the label versicolor is 1, every other component is 0.
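A minimal numpy sketch of this encoding; the label ordering and the helper name one_hot are assumptions for illustration:
import numpy as np

# Assumed ordering of the label set C = {versicolor, setosa, virginica}.
classes = ["versicolor", "setosa", "virginica"]

def one_hot(label, classes):
    # One component per class; the component of the given label is set to 1,
    # every other component stays 0.
    y = np.zeros(len(classes))
    y[classes.index(label)] = 1
    return y

print(one_hot("versicolor", classes))  # [1. 0. 0.]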
Example: Multi-label Classification
- Label all fruits from an image.
- Label set: list of fruits, e.g. \(\{apple, guava, mango, orange, banana, strawberry\}\)
- Label: One or more fruits as they are present in the image.
Example: Multi-label Classification
- Sample image of fruits (image source: Wikipedia.org).
Example: Multi-label Classification
- Different fruits in the image are: apple, banana, orange.
Example: Multi-label Classification
- Let's assume that the labels are \(\mathbf{apple}\), \(\mathbf{orange}\) and \(\mathbf{banana}\). The label vector then becomes \(\mathbf{y} = \begin{bmatrix} 1 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}^T\), with 1 in the components corresponding to apple, orange and banana, and 0 elsewhere.
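A small sketch of the corresponding multi-label (multi-hot) encoding, assuming the label-set ordering given above; the helper name multi_hot is illustrative:
import numpy as np

fruits = ["apple", "guava", "mango", "orange", "banana", "strawberry"]

def multi_hot(labels, classes):
    # One component per class; every label present in the example is set to 1.
    y = np.zeros(len(classes))
    for label in labels:
        y[classes.index(label)] = 1
    return y

print(multi_hot(["apple", "orange", "banana"], fruits))  # [1. 0. 0. 1. 1. 0.]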
Training Data: Binary Classification
- Let \(D\) denote a set of \(n\) pairs of a feature vector \(\mathbf{x}_{m \times 1}\) and a label \(y\) representing the examples: \(D = \left\{ \left( \mathbf{x}^{(i)}, y^{(i)} \right) \right\}_{i=1}^{n}\).
- \(\mathbf{X}\) is the feature matrix corresponding to all the training examples and has shape \(n \times m\). Each feature vector is transposed and stored as a row of this matrix.
Training Data: Binary Classification
Concretely, the feature vector of the \(i\)-th training example, \(\mathbf{x}^{(i)}\), can be obtained from the \(i\)-th row of the feature matrix: \(\left( \mathbf{x}^{(i)} \right)^T = \mathbf{X}[i]\).
Training Data: Binary Classification
\(\mathbf{y}\) is a label vector of shape \(n \times 1\). The \(i\)-th entry of this vector gives the label of the \(i\)-th example, denoted by \(y^{(i)}\).
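A brief sketch of how such training data could be stored as numpy arrays; the feature values below are made up purely for illustration:
import numpy as np

# n = 4 loan applications, m = 3 features: age, income, number of dependents.
X = np.array([[25, 4.5, 0],
              [40, 12.0, 2],
              [33, 8.0, 1],
              [51, 6.5, 3]])   # shape (n, m): each row is a transposed feature vector
y = np.array([0, 1, 1, 0])     # shape (n,): y[i] is the label of the i-th example

print(X.shape, y.shape)        # (4, 3) (4,)
print(X[2], y[2])              # feature vector and label of the third example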
Training Data: Multi-class classification
A set of \(n\) pairs of a feature vector \(\mathbf{x}\) and a label vector \(\mathbf{y}\) represents the examples.
We denote it by \(D = \left\{ \left( \mathbf{x}^{(i)}, \mathbf{y}^{(i)} \right) \right\}_{i=1}^{n}\), where \(\mathbf{X}\) is an \(n \times m\) feature matrix.
Training Data: Multi-class classification
\(\mathbf{Y}\) is a label matrix of shape \(n \times k\), where \(k\) is the total number of classes in the label set.
Multi-class and multi-label classification label vector
- Multi-class classification: For \(\left(\mathbf{y}^{(i)}\right)^T\), exactly one entry corresponding to the class label is 1.
- Multi-label classification: For \(\left(\mathbf{y}^{(i)}\right)^T\), more than one entry corresponding to the class labels can be 1.
Part II: Discriminant Functions
Overview
\(\mathbf{x} \longrightarrow\) Discriminant Function \(\longrightarrow y\)
Example: Two classes
The simplest discriminant function is very similar to linear regression: \(y = w_0 + \mathbf{w}^T \mathbf{x}\)
where,
- \(w_0\): Bias [Keeping this separately for a reason]
- \(\mathbf{w}\): Weight vector
- \(\mathbf{x}\): Feature vector
- \(y\): label
The simplest discriminant function \(y = w_0 + \mathbf{w}^T \mathbf{x}\) defines a decision surface that is an \((m-1)\)-dimensional hyperplane in the \(m\)-dimensional feature space, where \(m\) is the number of features.
Geometric Interpretation
With \(m = 2\) features, the decision surface is a hyperplane of dimension \(m - 1 = 2 - 1 = 1\), i.e. a line in the 2-D feature space.
Classification with discriminant functions
Classification is performed as follows: assign label 1 to \(\mathbf{x}\) if \(y = w_0 + \mathbf{w}^T \mathbf{x} \ge 0\), and label 0 otherwise.
The decision boundary is defined by \(w_0 + \mathbf{w}^T \mathbf{x} = 0\).
What does \(\mathbf{w}\) represent?
Consider two points \(\mathbf{x}^{(A)}\) and \(\mathbf{x}^{(B)}\) on the decision surface. We will have \(y^{(A)} = w_0 + \mathbf{w}^T \mathbf{x}^{(A)} = 0\) and \(y^{(B)} = w_0 + \mathbf{w}^T \mathbf{x}^{(B)} = 0\).
Since \(y^{(A)} = y^{(B)} = 0\), taking \(y^{(A)} - y^{(B)}\) results in the following equation: \(\mathbf{w}^T \left( \mathbf{x}^{(A)} - \mathbf{x}^{(B)} \right) = 0\).
What does \(\mathbf{w}\) represent?
The vector \(\mathbf{w}\) is orthogonal to every vector lying within the decision surface, hence it determines the orientation of the decision surface.
What does \(w_0\) represent?
For points on the decision surface, we have \(w_0 + \mathbf{w}^T \mathbf{x} = 0\), i.e. \(\mathbf{w}^T \mathbf{x} = -w_0\).
Normalizing both sides by the length of the weight vector, \(||\mathbf{w}||\), we get the normal distance from the origin to the decision surface: \(\dfrac{\mathbf{w}^T \mathbf{x}}{||\mathbf{w}||} = \dfrac{-w_0}{||\mathbf{w}||}\).
\(w_0\) determines the location of the decision surface.
What does \(y\) represent?
\(y\) gives a signed measure of the perpendicular distance of the point \(\mathbf{x}\) from the decision surface; the signed distance is \(\dfrac{y}{||\mathbf{w}||}\).
- \(w_0\) determines the location of the decision surface.
- \(\mathbf{w}\) is orthogonal to every vector lying within the decision surface, hence it determines the orientation of the decision surface.
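A small numeric sketch of these quantities; the values of \(w_0\), \(\mathbf{w}\) and \(\mathbf{x}\) below are made up for illustration:
import numpy as np

w0 = -4.0
w = np.array([3.0, 4.0])     # ||w|| = 5
x = np.array([2.0, 1.0])     # an arbitrary point

y = w0 + w @ x                                 # discriminant value: 6.0
signed_distance = y / np.linalg.norm(w)        # signed distance of x from the surface: 1.2
origin_distance = -w0 / np.linalg.norm(w)      # distance of the surface from the origin: 0.8

print(y, signed_distance, origin_distance)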
Alternate interpretation
By introducing a dummy feature \(x_0\) and setting it to 1, we get the following equation: \(y = \mathbf{w}^T \mathbf{x}\), where \(\mathbf{w} = \begin{bmatrix} w_0 & w_1 & \ldots & w_m \end{bmatrix}^T\) and \(\mathbf{x} = \begin{bmatrix} 1 & x_1 & \ldots & x_m \end{bmatrix}^T\).
The decision surface is then an \(m\)-dimensional hyperplane passing through the origin of the \((m+1)\)-dimensional augmented feature space.
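A minimal sketch of this augmentation in numpy, prepending a dummy feature \(x_0 = 1\) to every example (the numbers are made up):
import numpy as np

X = np.array([[2.0, 1.0],
              [0.5, 3.0]])                 # shape (n, m)
ones = np.ones((X.shape[0], 1))            # dummy feature x0 = 1 for each example
X_aug = np.hstack([ones, X])               # shape (n, m + 1)

w = np.array([-4.0, 3.0, 4.0])             # w[0] plays the role of w0
print(X_aug @ w)                           # y = w^T x for every example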
Multiple classes
Assuming the number of classes to be \(k > 2\), we can build discriminant functions in two ways:
- One-vs-rest: Build \(k-1\) discriminant functions. Each discriminant function solves a two-class classification problem: class \(C_k\) vs. not \(C_k\).
- One-vs-one: One discriminant function per pair of classes. Total functions = \({k \choose 2} = \frac{k (k-1)}{2}\)
Issues with one-vs-rest
- Two discriminant functions, one for each of the classes \(C_1\) and \(C_2\).
- Each discriminant function separates \(C_k\) from not \(C_k\).
- The region of ambiguity is shown in green in the figure.
Issues with one-vs-one
- \(k(k-1)/2\) discriminant functions, one for each class pair \(C_i\) and \(C_j\).
- Each discriminant function separates \(C_i\) and \(C_j\).
- Each point is classified by majority vote.
- The region of ambiguity is shown in green in the figure.
How do we fix it?
A single \(k\)-class discriminant comprising \(k\) linear functions, one per class: \(y_j = w_{j0} + \mathbf{w}_j^T \mathbf{x}\) for \(j = 1, 2, \ldots, k\).
Concretely, stacking the \(k\) linear functions gives \(\mathbf{y} = \mathbf{W}^T \mathbf{x}\), where the \(j\)-th column of \(\mathbf{W}\) holds the parameters \(w_{j0}, \mathbf{w}_j\) of the \(j\)-th function and \(\mathbf{x}\) is augmented with the dummy feature \(x_0 = 1\).
Classification with \(k\) discriminant functions
Assign the \(k\)-th class label to example \(\mathbf{x}\) if \(y_k > y_j \ \forall j \ne k\).
The decision boundary between classes \(k\) and \(j\) corresponds to an \((m-1)\)-dimensional hyperplane: \(\left( w_{k0} - w_{j0} \right) + \left( \mathbf{w}_k - \mathbf{w}_j \right)^T \mathbf{x} = 0\)
This has the same form as the decision boundary for the two-class case:
\( w_0 + \mathbf{w}^T \mathbf{x} = 0 \)
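A sketch of this classification rule with a parameter matrix \(\mathbf{W}\) whose \(j\)-th column holds the parameters of the \(j\)-th linear function; all values below are made up:
import numpy as np

# k = 3 classes, m = 2 features plus the dummy feature x0 = 1.
W = np.array([[ 0.1, -0.2,  0.3],    # biases w_{j0}
              [ 1.0,  0.5, -0.4],
              [-0.3,  0.8,  0.6]])   # shape (m + 1, k)

x = np.array([1.0, 2.0, 1.5])        # augmented feature vector [1, x1, x2]
scores = W.T @ x                     # y_j = w_{j0} + w_j^T x for j = 1..k
predicted_class = np.argmax(scores)  # assign the class with the largest y_j

print(scores, predicted_class)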
Now that we have a model of linear discriminant functions, we will study two approaches for learning the parameters of the model:
- Least squares
- Perceptron
Least squares classification
Train-test split
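A minimal sketch of one possible train-test split, assuming a simple 80/20 random split; the function name, ratio and seed are illustrative:
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    # Shuffle the example indices and hold out a fraction for testing.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]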
Sample Training Data
Let's implement the model inference function:
import numpy as np

def predict(x, w):
    z = x @ w  # discriminant scores, one per example (row of x)
    return np.where(z >= 0, 1, 0)  # non-negative score -> label 1, else label 0
Sample Training Data
Decision Boundary Visualization: a random decision boundary plotted over the training data.
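A hedged sketch of how such sample data and a random decision boundary could be generated; the Gaussian blobs and random weights are assumptions for illustration, not the course's actual dataset:
import numpy as np

rng = np.random.default_rng(0)

# Two made-up Gaussian blobs as a two-class dataset.
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))   # class 0
X1 = rng.normal(loc=[ 2.0,  2.0], scale=1.0, size=(50, 2))   # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Augment with the dummy feature and pick random weights; the boundary
# w0 + w1*x1 + w2*x2 = 0 they define is almost surely a poor classifier.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w_random = rng.normal(size=3)
y_hat = np.where(X_aug @ w_random >= 0, 1, 0)   # equivalently, predict(X_aug, w_random)
print("accuracy of a random boundary:", np.mean(y_hat == y))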
Loss function: Least Square Error
The total loss is the sum of squared errors between the actual and predicted labels at the training points.
The error at the \(i\)-th training point is calculated as follows: \(e^{(i)} = \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right)^2\)
Loss function: Least Square Error
The total loss \(J(\mathbf{w})\) is the sum of the errors over all training points: \(J(\mathbf{w}) = \sum_{i=1}^{n} e^{(i)} = \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right)^2\)
Note that the loss depends on the value of \(\mathbf{w}\): as these values change, we get a new model, which results in different predictions and hence a different error at each training point.
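A short sketch of evaluating this loss for a given weight vector; the tiny dataset below is made up for illustration:
import numpy as np

def squared_error_loss(X, y, w):
    # J(w): sum over the training points of the squared error (w^T x - y)^2.
    errors = X @ w - y
    return np.sum(errors ** 2)

X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])     # first column is the dummy feature x0 = 1
y = np.array([1.0, 0.0, 1.0])
print(squared_error_loss(X, y, np.array([0.5, 0.25])))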
Optimization: Normal equation
Calculate the derivative of the loss function \(J(\mathbf{W})\) w.r.t. the weights \(\mathbf{W}\) (the weight vector \(\mathbf{w}\) in the binary case).
Set \(\dfrac{\partial J(\mathbf{W})}{\partial \mathbf{W}}\) to 0 and solve for \(\mathbf{W}\):
The quantity \(\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T\) is the pseudo-inverse of \(\mathbf{X}\); whenever \(\mathbf{X}^T \mathbf{X}\) is not invertible (not full rank), we use the Moore-Penrose pseudo-inverse \(\mathbf{X}^{+}\) in its place.
\( \mathbf{W} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{Y} \)
Optimization: Normal equation
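A minimal sketch of solving the normal equation in numpy, reusing the tiny made-up dataset from the loss example; np.linalg.pinv(X) coincides with \(\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T\) when \(\mathbf{X}^T \mathbf{X}\) is invertible and also handles the rank-deficient case:
import numpy as np

def fit_least_squares(X, Y):
    # W = (X^T X)^{-1} X^T Y, computed via the pseudo-inverse of X.
    return np.linalg.pinv(X) @ Y

X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])     # first column is the dummy feature x0 = 1
y = np.array([1.0, 0.0, 1.0])
w = fit_least_squares(X, y)
print(w)                       # learned weights [w0, w1]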
Evaluation metrics
- Confusion matrix
- Precision/Recall/F1 score
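A hedged sketch of computing these metrics with scikit-learn; the y_true and y_pred arrays below are made up for illustration and would come from the test labels and the model's predictions:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # made-up test labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # made-up model predictions

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))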