Machine Learning Techniques
We will cover the perceptron algorithm in terms of its components: the training data, the model, the loss function, the optimization procedure, and the evaluation metrics.
Let's look at the first component, the training data, which is very similar to that of other classification algorithms.
Note that the perceptron can solve only binary classification problems. Hence \(y^{(i)} \in \{-1, +1\}\).
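As a quick illustration, here is a minimal sketch of what such a training set could look like in NumPy; the feature values below are made up purely for illustration:

import numpy as np

# Hypothetical training data: four examples with two features each
X = np.array([[ 2.0,  1.0],
              [ 1.5, -0.5],
              [-1.0, -2.0],
              [-2.5,  0.5]])

# Binary labels for the perceptron must lie in {-1, +1}
y = np.array([+1, +1, -1, -1])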
Let's look at the second component, the model, which is inspired by biological neurons.
The model combines the following components:
\(\phi(\mathbf{x})\): feature transformation
\(\mathbf{w}\): weight vector
\(\mathbf{x}\): feature vector
\(y\): label
\(f(\cdot)\): non-linear activation function
\(\mathbf{w}^T \phi(\mathbf{x})\): linear combination of features
Putting these together, the model is
\[ h_{\mathbf{w}}(\mathbf{x}) = f\left(\mathbf{w}^T \phi(\mathbf{x})\right) \]
where \(f(\cdot)\) is a non-linear activation function. Here we use the sign (threshold) function as \(f(\cdot)\):
\[ \text{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x \lt 0 \end{cases} \]
The model can therefore also be written as
\[ \widehat{y} = \text{sign}\left(\mathbf{w}^T \phi(\mathbf{x})\right) \]
Remember:
For values of \( x \lt 0 \), we have \(y = -1\).
For values of \( x \ge 0 \), we have \(y = +1\).
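To make this concrete, here is a minimal sketch of the prediction step; for simplicity it assumes the feature transformation \(\phi\) is the identity, which is an illustrative choice rather than part of the definition above:

import numpy as np

def predict(w, x):
    # Sign of the linear combination w^T phi(x), with phi(x) = x
    return 1 if np.dot(w, x) >= 0 else -1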
Now that we have seen the model of the perceptron, let's look at the third component, the loss function.
Let \(\widehat{y}^{(i)} \in \{-1, +1\}\) be the prediction from the perceptron and \(y^{(i)}\) be the actual label for the \(i\)-th example. The error \(e^{(i)}\) is calculated as follows:
For correctly classified examples, the error is 0.
For misclassified examples, the error is \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\, y^{(i)}\).
The error can be compactly written as:
\[ e^{(i)} = \max\left(0,\; -\mathbf{w}^T \phi(\mathbf{x}^{(i)})\, y^{(i)}\right) \]
To see why, note that \(h_\mathbf{w}(\mathbf{x}^{(i)}) = \text{sign}\left(\mathbf{w}^T \phi(\mathbf{x}^{(i)}) \right)\) is either +1 or -1.
If the decision is correct, then either
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = +1\) and \(y^{(i)} = +1\), or
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = -1\) and \(y^{(i)} = -1\).
Note that in both cases \(\mathbf{w}^T \phi(\mathbf{x}^{(i)})\) and \(y^{(i)}\) have the same sign, so \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\, y^{(i)} \le 0\) and the error is 0.
If the decision is wrong, then either
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = +1\) and \(y^{(i)} = -1\), or
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = -1\) and \(y^{(i)} = +1\).
Note that in both cases \(\mathbf{w}^T \phi(\mathbf{x}^{(i)})\) and \(y^{(i)}\) have opposite signs, so the error \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\, y^{(i)}\) is positive.
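As a small worked example with made-up numbers: suppose \(\mathbf{w}^T \phi(\mathbf{x}^{(i)}) = 2\). If \(y^{(i)} = +1\), the example is correctly classified and the error is \(\max(0, -2 \times 1) = 0\); if \(y^{(i)} = -1\), the example is misclassified and the error is \(\max(0, -2 \times (-1)) = 2\).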
The error is therefore a piecewise linear function of \(\mathbf{w}\): it is zero in the correctly classified region and a linear function of \(\mathbf{w}\) in the misclassified region.
As a consequence, the total loss \(J(\mathbf{w}) = \sum_{i=1}^{n} e^{(i)}\) is not differentiable in \(\mathbf{w}\).
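Here is a minimal sketch of this loss in code, again assuming \(\phi\) is the identity so that each row of X is already the transformed feature vector:

import numpy as np

def perceptron_loss(w, X, y):
    # Per-example error: max(0, -w^T phi(x) * y), with phi(x) = x
    scores = X @ w
    errors = np.maximum(0.0, -scores * y)
    # The total loss J(w) is the sum of the per-example errors
    return errors.sum()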
We can control the loss by adjusting the value of \(\mathbf{w}\).
For a misclassified example, the loss \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\, y^{(i)}\) varies linearly with \(\mathbf{w}\).
Thus, for misclassified examples, we can reduce the loss by updating \(\mathbf{w}\).
And for correctly classified examples, we leave \(\mathbf{w}\) unchanged.
The next task is to obtain the weight vector that minimizes the loss.
We will now look at the optimization procedure used in the perceptron. This procedure, known as the perceptron update rule, changes the weight vector in proportion to \(\left( y^{(i)} - \widehat{y}^{(i)} \right) \phi(\mathbf{x}^{(i)})\).
If the training examples are linearly separable, the algorithm converges with zero training loss; otherwise it oscillates.
Note that for correctly classified examples, \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = 0\) and hence there is no change in the weight vector.
Let's understand the perceptron update rule for various values of \(\widehat{y}^{(i)}\) and \(y^{(i)}\):
(Case 1: Correct classification) \(\widehat{y}^{(i)} = y^{(i)}\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = 0\), so \(\mathbf{w}\) is left unchanged.
(Case 2: Negative class misclassification) \(y^{(i)} = -1\) and \(\widehat{y}^{(i)} = +1\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = -2\), so \(\mathbf{w}\) is moved away from \(\phi(\mathbf{x}^{(i)})\).
(Case 3: Positive class misclassification) \(y^{(i)} = +1\) and \(\widehat{y}^{(i)} = -1\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = +2\), so \(\mathbf{w}\) is moved towards \(\phi(\mathbf{x}^{(i)})\).
Image source: Wikipedia.org
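The following is a minimal sketch of a training loop based on this rule. It assumes a unit learning rate, \(\phi\) as the identity, and a fixed number of passes over the data; all of these are illustrative choices rather than part of the description above.

import numpy as np

def train_perceptron(X, y, num_epochs=10):
    # Start from a zero weight vector
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):
            # Prediction with the sign activation (phi(x) = x)
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1
            # Perceptron update: zero change when y_hat == y_i,
            # otherwise w moves by (y_i - y_hat) * x_i
            w = w + (y_i - y_hat) * x_i
    return w

With the toy data X and y sketched earlier, train_perceptron(X, y) returns a weight vector that classifies all four points correctly, since they are linearly separable.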
Finally, let's look at the evaluation metrics. They are the same as those used for other classification algorithms.
For example, with TP = 25, FP = 0 and FN = 0:
Precision (P) = \(\dfrac{TP}{TP+FP} = \dfrac{25}{25+0} = 1.0\)
Recall (R) = \(\dfrac{TP}{TP+FN} = \dfrac{25}{25+0} = 1.0\)
F1 Score = \( \dfrac{2 \times P \times R}{P+R} = \dfrac{2 \times 1 \times 1}{1+1} =1.0 \)
import numpy as np

def compute_confusion_matrix(y, y_predicted):
    # Map each distinct label (here -1 and +1) to a row/column index
    labels = np.unique(y)
    label_to_index = {label: index for index, label in enumerate(labels)}
    n = len(labels)
    # Create a 2D matrix of size n by n (rows: predicted, columns: actual)
    confusion_matrix = np.zeros((n, n), dtype=int)
    # Populate entries of confusion matrix
    for i_x, i_y in zip(y_predicted, y):
        confusion_matrix[label_to_index[i_x], label_to_index[i_y]] += 1
    return confusion_matrix
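As a usage sketch with the hypothetical arrays y and y_predicted (labels in {-1, +1}), precision, recall and F1 can then be read off the confusion matrix; with the mapping above, the positive class ends up at index 1:

cm = compute_confusion_matrix(y, y_predicted)
TP = cm[1, 1]   # predicted +1, actual +1
FP = cm[1, 0]   # predicted +1, actual -1
FN = cm[0, 1]   # predicted -1, actual +1
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * precision * recall / (precision + recall)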