Dr. Ashish Tendulkar
Machine Learning Techniques
IIT Madras
Perceptron
We will cover
- The five components of this algorithm (the same components as in any other ML algorithm), along with their mathematical details.
- Implementation from scratch in Python with NumPy.
Perceptron is a binary classification algorithm.
- Invented in 1958 by Frank Rosenblatt, it was intended to be a machine rather than a program.
- Perceptron was meant to be a rough model of how individual neurons work in the brain.
The perceptron was motivated by biological neurons
Let's look at the first component, the training data, which is very similar to that of other classification algorithms.
Training Data
- Feature matrix: \(\mathbf{X}_{n \times m}\)
- Label vector: \(\mathbf{y}_{n \times 1}\)
Note that perceptron can solve only binary classification problems. Hence \(y^{(i)} \in \{-1, +1\}\)
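As a concrete illustration, here is a minimal NumPy sketch of how such training data could be represented; the feature values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy dataset: n = 4 examples, m = 2 features
X = np.array([[2.0, 1.0],
              [1.5, 2.5],
              [-1.0, -2.0],
              [-2.5, -0.5]])      # feature matrix X, shape (n, m)
y = np.array([+1, +1, -1, -1])    # label vector y, shape (n,), labels in {-1, +1}
```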
Let's look at the second component, the model, which is inspired by neurons in the brain.
Model \(h_\mathbf{w}(\mathbf{x})\)
The model diagram: the feature vector \(\mathbf{x}\) goes through a feature transformation \(\phi\), a linear combination of features with the weight vector \(\mathbf{w}\), and a non-linear activation function that produces the label.
Model \(h_\mathbf{w}(\mathbf{x})\)
\(h_\mathbf{w}(\mathbf{x}) = f\left(\mathbf{w}^T \phi(\mathbf{x})\right)\)
where \(f(.)\) is a non-linear activation function. Here we use the sign (threshold) function as \(f(.)\): \(f(z) = +1\) if \(z \ge 0\), and \(f(z) = -1\) if \(z \lt 0\).
It can also be written as
\(h_\mathbf{w}(\mathbf{x}) = \text{sign}\left(\mathbf{w}^T \phi(\mathbf{x})\right)\)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959752/images/8907214/MLClassificationStepFunction.png)
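A rough NumPy sketch of this model, assuming the identity feature transformation \(\phi(\mathbf{x}) = \mathbf{x}\) (the function names `sign` and `predict` are illustrative, not part of the original slides):

```python
import numpy as np

def sign(z):
    # Threshold activation: +1 when z >= 0, -1 otherwise
    return np.where(z >= 0, 1, -1)

def predict(w, x):
    # h_w(x) = sign(w^T phi(x)); phi is taken to be the identity here
    return sign(np.dot(w, x))
```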
Model visualization
Remember
Note that for values of \( x \lt 0 \), we have \(y = -1\)
And for values of \( x \ge 0 \), we have \(y = +1\)
Now that we have seen the perceptron model, let's look at the third component, the loss function.
Loss function
Let \(\widehat{y}^{(i)} \in \{-1, +1\}\) be the prediction from the perceptron and \(y^{(i)}\) be the actual label for the \(i\)-th example. The error \(e^{(i)}\) is calculated as follows:
For correctly classified examples, the error is 0.
For misclassified examples, the error is \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\).
The error can be compactly written as:
\(e^{(i)} = \max\left(0,\; -\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\right)\)
Loss function
Summing the errors over all training examples gives the loss:
\(J(\mathbf{w}) = \sum_{i=1}^{n} \max\left(0,\; -\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\right)\)
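A minimal sketch of this loss in NumPy, again assuming \(\phi(\mathbf{x}) = \mathbf{x}\) and the `X`, `y` layout from the earlier sketch (the function name is illustrative):

```python
import numpy as np

def perceptron_loss(w, X, y):
    # Per-example error: e_i = max(0, -w^T x_i * y_i); zero for correctly classified examples
    margins = y * (X @ w)
    errors = np.maximum(0.0, -margins)
    # The loss J(w) is the sum of per-example errors
    return errors.sum()
```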
Loss function Illustration
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = \text{sign}\left(\mathbf{w}^T \phi(\mathbf{x}^{(i)}) \right)\) is either +1 or -1.
If the decision is correct, then either
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = +1\) and \(y^{(i)} = +1\), or
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = -1\) and \(y^{(i)} = -1\).
In both cases \(\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)} \gt 0\), and hence the error is 0.
Loss function Illustration
If the decision is wrong, then either
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = +1\) and \(y^{(i)} = -1\), or
\(h_\mathbf{w}(\mathbf{x}^{(i)}) = -1\) and \(y^{(i)} = +1\).
In both cases \(\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)} \lt 0\), and hence the error \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\) is positive.
Loss function
The error is a piecewise linear function: it is zero in the correctly classified region and a linear function of \(\mathbf{w}\) in the misclassified region.
Hence \(J(\mathbf{w})\) is not differentiable in \(\mathbf{w}\).
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959005/images/9144490/pasted-from-clipboard.png)
We can control the loss by adjusting the value of \(\mathbf{w}\).
For a misclassified example, the loss \(-\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\) depends linearly on \(\mathbf{w}\).
Thus, for a misclassified example, we can reduce the loss by adjusting \(\mathbf{w}\) so that \(\mathbf{w}^T \phi(\mathbf{x}^{(i)})\,y^{(i)}\) increases, as illustrated in the sketch below.
And for correctly classified examples, we leave \(\mathbf{w}\) unchanged.
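As a quick numerical check (with made-up numbers), moving \(\mathbf{w}\) in the direction of \(\phi(\mathbf{x})\,y\) reduces the loss on a misclassified example:

```python
import numpy as np

x, y = np.array([2.0, 1.0]), -1         # a made-up example with label -1
w = np.array([1.0, 0.5])                # current weights misclassify it: w^T x = 2.5 > 0
print(max(0.0, -y * np.dot(w, x)))      # loss before the adjustment: 2.5
w_new = w + y * x                       # move w in the direction of phi(x) * y
print(max(0.0, -y * np.dot(w_new, x)))  # loss after the adjustment: 0.0
```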
The next task is to obtain the weight vector that minimizes the loss.
We will look at the optimization procedure used in the perceptron. This procedure is known as the perceptron update rule.
Optimization procedure
- Initialize \(\mathbf{w}^{(0)} = \mathbf{0}\)
- For each training example \(\left(\mathbf{x}^{(i)}, y^{(i)} \right)\), update the weights as
\(\mathbf{w} := \mathbf{w} + \left( y^{(i)} - \widehat{y}^{(i)} \right) \phi(\mathbf{x}^{(i)})\)
Linearly separable examples lead to convergence of the algorithm with zero training loss; otherwise, the weights keep oscillating.
Note that for correctly classified examples, \(\left( y^{(i)}-\widehat{y}^{(i)} \right) = 0\) and hence there is no change in the weight vector.
Let's understand the perceptron update rule for various values of \(\widehat{y}^{(i)}\) and \(y^{(i)}\):
(Case 1: Correct classification) \(\widehat{y}^{(i)} = y^{(i)}\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = 0\), so \(\mathbf{w}\) is left unchanged.
(Case 2: Negative class misclassification) \(y^{(i)} = -1\) and \(\widehat{y}^{(i)} = +1\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = -2\), so \(\mathbf{w} := \mathbf{w} - 2\,\phi(\mathbf{x}^{(i)})\).
(Case 3: Positive class misclassification) \(y^{(i)} = +1\) and \(\widehat{y}^{(i)} = -1\): here \(\left( y^{(i)} - \widehat{y}^{(i)} \right) = +2\), so \(\mathbf{w} := \mathbf{w} + 2\,\phi(\mathbf{x}^{(i)})\).
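Putting these pieces together, here is a minimal from-scratch sketch of the training loop, assuming \(\phi(\mathbf{x}) = \mathbf{x}\) and a fixed number of passes over the data (the `epochs` parameter and the function name are illustrative):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    n, m = X.shape
    w = np.zeros(m)                                   # initialize w^(0) = 0
    for _ in range(epochs):                           # repeated passes over the training data
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1  # prediction with the sign activation
            w = w + (y_i - y_hat) * x_i               # update rule; no change when y_hat == y_i
    return w
```

On linearly separable data such as the toy dataset above, this loop should reach zero training loss within a few passes.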
Optimization procedure: Convergence
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959752/images/9007586/MLClassificationPerceptronTrainingOnSeparableDat.gif)
- On linearly separable data, the optimization procedure eventually converges.
Image source: Wikipedia.org
Optimization procedure: Convergence
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959752/images/9136892/perceptronConvergence.gif)
- On linearly separable data, the optimization procedure converges.
Optimization procedure: Oscillations
- On non-linearly separable data, the optimization procedure never converges.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959752/images/9136895/PerceptronOscillatingLoss.gif)
Finally, let's look at the evaluation metrics. They are the same as for other classification algorithms.
Evaluation metrics
- Calculate confusion matrix based on predicted and actual labels.
- Calculate classification metrics like precision, recall, accuracy, F1-score from confusion matrix.
Confusion matrix and evaluation metrics: Example
![](https://s3.amazonaws.com/media-p.slid.es/uploads/1959752/images/9052720/MLPerceptronConfusionMatrix.png)
- TP = 25
- FP = 0
- TN = 25
- FN = 0
Precision (P) = \(\dfrac{TP}{TP+FP} = \dfrac{25}{25+0} = 1.0\)
Recall (R) = \(\dfrac{TP}{TP+FN} = \dfrac{25}{25+0} = 1.0\)
F1 Score = \( \dfrac{2 \times P \times R}{P+R} = \dfrac{2 \times 1 \times 1}{1+1} =1.0 \)
Confusion matrix implementation

import numpy as np

def compute_confusion_matrix(y, y_predicted):
    # Perceptron labels are in {-1, +1}; map them to row/column indices {0, 1}
    n = 2
    # Create a 2D matrix of size n by n
    confusion_matrix = np.zeros((n, n), dtype=int)
    # Populate entries of the confusion matrix: rows = predicted label, columns = actual label
    for i_x, i_y in zip(y_predicted, y):
        confusion_matrix[(i_x + 1) // 2, (i_y + 1) // 2] += 1
    return confusion_matrix
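Continuing the sketch, the metrics from the worked example can be recomputed from this confusion matrix; the indexing convention (rows = predicted label, columns = actual label, class +1 at index 1) follows the implementation above, and the data below is made up to match the example.

```python
import numpy as np

# 25 examples of each class, all predicted correctly (as in the example above)
y_true = np.array([+1] * 25 + [-1] * 25)
cm = compute_confusion_matrix(y_true, y_true)       # predictions equal the actual labels

TP, FP, FN = cm[1, 1], cm[1, 0], cm[0, 1]
precision = TP / (TP + FP)                          # 25 / 25 = 1.0
recall = TP / (TP + FN)                             # 25 / 25 = 1.0
f1 = 2 * precision * recall / (precision + recall)  # 1.0
print(precision, recall, f1)
```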