Machine Learning Techniques
e.g. identifying different types of fruits in a picture.
Label set: \(y = \{y_1, y_2, \ldots, y_k\}\) has \(k\) elements/labels.
If a label is present in the example, the corresponding component of the label vector is set to 1 (and to 0 otherwise).
Figure: Sample images of the three iris species versicolor, setosa, and virginica (Image source: Wikipedia.org).
We use the one-hot encoding scheme for label encoding.
Assuming the flower has the label versicolor, we encode it as follows:
Note that the component of \(\mathbf{y}\) corresponding to the label versicolor is 1, while every other component is 0.
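As a quick illustration, here is a minimal NumPy sketch of one-hot encoding; the label ordering (versicolor, setosa, virginica) is an assumption for this example:

import numpy as np

# Assumed label ordering for this sketch (following the figure above)
labels = ["versicolor", "setosa", "virginica"]

def one_hot(label, labels):
    # k-dimensional vector with a 1 at the position of the given label, 0 elsewhere
    y = np.zeros(len(labels))
    y[labels.index(label)] = 1
    return y

print(one_hot("versicolor", labels))   # [1. 0. 0.]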
Figure: Sample image containing multiple fruits (Image source: Wikipedia.org).
The different fruits in the image are Apple, Banana, and Orange, so each of the corresponding components of the label vector becomes 1. A sketch of this encoding follows below.
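A minimal sketch of the resulting multi-hot label vector, assuming a label set consisting of exactly these three fruits in this order:

import numpy as np

# Assumed label set and ordering for this sketch
label_set = ["Apple", "Banana", "Orange"]
present = {"Apple", "Banana", "Orange"}   # fruits present in the sample image

# Each component is 1 if the corresponding fruit is present in the image, else 0
y = np.array([1 if fruit in present else 0 for fruit in label_set])
print(y)   # [1 1 1]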
Concretely, the feature vector for the \(i\)-th training example, \(\mathbf{x}^{(i)}\), can be obtained as \(\mathbf{X}[i]\):
\(\mathbf{y}\) is a label vector of shape \(n \times 1\). The \(i\)-th entry in this vector gives the label for the \(i\)-th example, which is denoted by \(y^{(i)}\).
The training data is a set of \(n\) pairs, each consisting of a feature vector \(\mathbf{x}\) and its label \(\mathbf{y}\), one pair per example.
We denote it by \(D\):
where
\(\mathbf{X}\) is an \(n \times m\) feature matrix:
\(\mathbf{Y}\) is a label matrix of shape \(n \times k\), where \(k\) is the total number of classes in label set.
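As a quick sanity check, here is a minimal NumPy sketch of these shapes with synthetic data; \(n = 4\), \(m = 2\), and \(k = 3\) are arbitrary values assumed for this example:

import numpy as np

n, m, k = 4, 2, 3                 # number of examples, features, classes (assumed values)
X = np.random.rand(n, m)          # feature matrix, shape (n, m)
y = np.array([[0], [2], [1], [0]])   # label vector, shape (n, 1)
Y = np.eye(k)[y.ravel()]          # one-hot label matrix, shape (n, k)

print(X[1])                       # feature vector of a single training example, cf. X[i]
print(X.shape, y.shape, Y.shape)  # (4, 2) (4, 1) (4, 3)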
Diagram: input \(\mathbf{x}\) \(\rightarrow\) Discriminant Function \(\rightarrow\) output \(y\).
The simplest discriminant function is very similar to linear regression: \(y = w_0 + \mathbf{w}^T \mathbf{x}\),
where \(\mathbf{w}\) is the weight vector and \(w_0\) is the bias.
The simplest discriminant function \(y = w_0 + \mathbf{w}^T \mathbf{x}\) defines a decision surface that is an \((m-1)\)-dimensional hyperplane in the \(m\)-dimensional feature space, where \(m\) is the number of features.
With \(m = 2\) features, the decision surface is an \((m-1) = (2-1) = 1\)-dimensional hyperplane, i.e. a line.
Classification is performed as follows: assign \(\mathbf{x}\) the label 1 if \(y(\mathbf{x}) \ge 0\), and the label 0 otherwise.
The decision boundary is defined by \(y(\mathbf{x}) = w_0 + \mathbf{w}^T \mathbf{x} = 0\).
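For a concrete illustration with assumed values \(m = 2\), \(\mathbf{w} = (1, 2)^T\), and \(w_0 = -3\) (hypothetical numbers, not taken from any dataset above): the decision boundary is the line \(-3 + x_1 + 2x_2 = 0\), i.e. \(x_2 = (3 - x_1)/2\). The point \(\mathbf{x} = (2, 2)^T\) gives \(y = -3 + 2 + 4 = 3 \ge 0\) and is assigned label 1, while \(\mathbf{x} = (0, 0)^T\) gives \(y = -3 < 0\) and is assigned label 0.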
Consider two points \(\mathbf{x}^{(A)}\) and \(\mathbf{x}^{(B)}\) on the decision surface; for both of them we have \(y^{(A)} = y^{(B)} = 0\).
Since \(y^{(A)} = y^{(B)} = 0\), taking the difference \(y^{(A)} - y^{(B)}\) results in the following equation: \(\mathbf{w}^T \left( \mathbf{x}^{(A)} - \mathbf{x}^{(B)} \right) = 0\).
The vector \(\mathbf{w}\) is therefore orthogonal to every vector lying within the decision surface; hence it determines the orientation of the decision surface.
For points on the decision surface, we have \(w_0 + \mathbf{w}^T \mathbf{x} = 0\), i.e. \(\mathbf{w}^T \mathbf{x} = -w_0\).
Normalizing both sides by the length of the weight vector \(||\mathbf{w}||\), we get the normal distance from the origin to the decision surface: \(\dfrac{\mathbf{w}^T \mathbf{x}}{||\mathbf{w}||} = -\dfrac{w_0}{||\mathbf{w}||}\).
\(w_0\) determines the location of the decision surface
\(y(\mathbf{x})\) gives a signed measure of the perpendicular distance of the point \(\mathbf{x}\) from the decision surface; the signed distance itself is \(y(\mathbf{x}) / ||\mathbf{w}||\).
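A minimal NumPy sketch of this signed distance, reusing the hypothetical values \(\mathbf{w} = (1, 2)^T\) and \(w_0 = -3\) from the example above:

import numpy as np

w = np.array([1.0, 2.0])    # hypothetical weight vector
w0 = -3.0                   # hypothetical bias

def signed_distance(x, w, w0):
    # y(x) / ||w|| is the signed perpendicular distance of x from the decision surface
    return (w0 + w @ x) / np.linalg.norm(w)

print(signed_distance(np.array([2.0, 2.0]), w, w0))   # positive: point on the positive side
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # negative: the origin is on the other side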
By using a dummy feature \(x_0\) and setting it to 1, we get the compact form \(y = \mathbf{w}^T \mathbf{x}\), where now \(\mathbf{w} = (w_0, w_1, \ldots, w_m)^T\) and \(\mathbf{x} = (x_0, x_1, \ldots, x_m)^T\).
This represents a decision surface that is an \(m\)-dimensional hyperplane passing through the origin of the \((m+1)\)-dimensional augmented feature space.
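A minimal sketch of this augmentation in NumPy; the feature values below are placeholders:

import numpy as np

X = np.array([[2.0, 2.0],
              [0.0, 1.0]])   # original feature matrix, shape (n, m)

# Prepend a dummy feature x_0 = 1 to every example
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (n, m + 1)

w = np.array([-3.0, 1.0, 2.0])   # (w_0, w_1, w_2) folded into a single weight vector
print(X_aug @ w)                 # same values as w_0 + w^T x for each example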
Assuming the number of classes \(k > 2\), we can build discriminant functions in two ways:
A single \(k\)-class discriminant comprising \(k\) linear functions, one per class: \(y_j(\mathbf{x}) = w_{j0} + \mathbf{w}_j^T \mathbf{x}\) for \(j = 1, \ldots, k\).
Concretely:
Assign label \(y_k\) to example \(\mathbf{x}\) if \(y_k(\mathbf{x}) > y_j(\mathbf{x}), \forall j \ne k\).
The decision boundary between classes \(y_k\) and \(y_j\) corresponds to an \((m-1)\)-dimensional hyperplane: \((w_{k0} - w_{j0}) + (\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} = 0\).
This has the same form as the decision boundary for the two-class case:
\( w_0 + \mathbf{w}^T \mathbf{x} = 0 \)
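A minimal sketch of this \(k\)-class decision rule, assuming the bias of each class is absorbed via the dummy feature and the per-class weight vectors are stacked as columns of a matrix W (all values below are synthetic):

import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                                           # examples and classes (assumed)
X = np.hstack([np.ones((n, 1)), rng.random((n, 2))])  # features with dummy x_0 = 1
W = rng.standard_normal((X.shape[1], k))              # one weight column per class

scores = X @ W                      # y_j(x) for every example and class, shape (n, k)
pred = np.argmax(scores, axis=1)    # pick the class with the largest discriminant value
print(pred)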
Now that we have a model of linear discriminant functions, we will study two approaches for learning the parameters of the model:
Let's implement the model inference function:
import numpy as np

def predict(x, w):
    # Linear score for each example: z = w^T x (x includes the dummy feature x_0 = 1)
    z = x @ w
    # Assign label 1 if the score is non-negative, 0 otherwise
    return np.where(z >= 0, 1, 0)
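For instance, a hypothetical call with randomly initialized weights (which is what produces the random decision boundary shown below) might look like this:

import numpy as np

rng = np.random.default_rng(42)
X = np.hstack([np.ones((5, 1)), rng.random((5, 2))])   # 5 examples with dummy feature x_0 = 1
w = rng.standard_normal(3)                             # randomly initialized weights

print(predict(X, w))   # a 0/1 label for each of the 5 examples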
Figure: A random decision boundary.
The total loss is the sum of squared errors between the actual and predicted labels at each training point.
The error at the \(i\)-th training point is calculated as follows: \(e^{(i)} = \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right)^2\).
The total loss \(J(\mathbf{w})\) is the sum of errors over all training points: \(J(\mathbf{w}) = \sum_{i=1}^{n} e^{(i)} = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right)^2\).
Note that the loss depends on the value of \(\mathbf{w}\): as these values change, we get a new model, which results in different predictions and hence a different error at each training point.
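A minimal NumPy sketch of this loss, assuming the linear output \(\mathbf{w}^T \mathbf{x}^{(i)}\) is used directly as the prediction:

import numpy as np

def sse_loss(X, y, w):
    # Error at each training point: squared difference between label and prediction
    e = (y - X @ w) ** 2
    # Total loss J(w): sum of the per-example errors; changes as w changes
    return np.sum(e)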
Calculate the derivative of the loss function \(J(\mathbf{W})\) with respect to the weight matrix \(\mathbf{W}\) (written in matrix form for the general case with label matrix \(\mathbf{Y}\)): \(\dfrac{\partial J(\mathbf{W})}{\partial \mathbf{W}} = 2\,\mathbf{X}^T \left( \mathbf{X}\mathbf{W} - \mathbf{Y} \right)\).
Set \(\dfrac{\partial J(\mathbf{W})}{\partial \mathbf{W}}\) to 0 and solve for \(\mathbf{W}\):
When \(\mathbf{X}^T \mathbf{X}\) is full rank, the matrix \(\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T\) is the pseudo-inverse \(\mathbf{X}^{+}\) of \(\mathbf{X}\); whenever \(\mathbf{X}^T \mathbf{X}\) is not full rank, this inverse does not exist and we use the pseudo-inverse \(\mathbf{X}^{+}\) (computed, for example, via SVD) in its place.
\( \mathbf{W} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{Y} \)
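A minimal NumPy sketch of this closed-form solution; np.linalg.pinv computes the pseudo-inverse of X, which also covers the rank-deficient case discussed above:

import numpy as np

def fit_least_squares(X, Y):
    # W = (X^T X)^{-1} X^T Y, computed via the pseudo-inverse X^+ of X
    return np.linalg.pinv(X) @ Y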