Machine Learning Techniques
IIT Madras
K-nearest neighbours (KNN) is a supervised learning algorithm that can be used for both regression and classification tasks.
An example is labeled by the company it keeps.
This is an example of a binary classification problem with a 3-NN classifier, where the nearest neighbours are calculated based on Euclidean distance.
The dataset has examples from two classes: red and green.
Now there are two new points, \(P_1\) and \(P_2\), whose labels are unknown and yet to be predicted.
Each point looks at its 3 nearest neighbours and is assigned the class represented by at least 2 of the 3, that is, the majority class in its neighbourhood.
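A minimal sketch of this voting step, with hypothetical 2-D coordinates standing in for the red and green examples and for \(P_1\):

```python
import numpy as np
from collections import Counter

# Hypothetical labeled points from two classes (coordinates are made up
# purely for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],   # red examples
              [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])  # green examples
y = np.array(["red", "red", "red", "green", "green", "green"])

p1 = np.array([1.8, 1.7])  # new point whose label is unknown

# Euclidean distance from p1 to every labeled example.
dists = np.sqrt(((X - p1) ** 2).sum(axis=1))

# Indices of the 3 nearest neighbours, then a majority vote.
nearest = np.argsort(dists)[:3]
label = Counter(y[nearest]).most_common(1)[0][0]
print(label)  # "red": at least 2 of the 3 neighbours are red
```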
The following two metrics are used quite often. The distance between two points \(\mathbf{x}_1\) and \(\mathbf{x}_2\), each represented with \(m\) features, is calculated as follows.

Manhattan distance:

\[
d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{j=1}^{m} \left| x_{1j} - x_{2j} \right|
\]

Writing this compactly in vectorized form:

\[
d(\mathbf{x}_1, \mathbf{x}_2) = \left\lVert \mathbf{x}_1 - \mathbf{x}_2 \right\rVert_1
\]

Euclidean distance:

\[
d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\sum_{j=1}^{m} \left( x_{1j} - x_{2j} \right)^2}
\]

Writing this compactly, the vectorized form is as follows:

\[
d(\mathbf{x}_1, \mathbf{x}_2) = \left\lVert \mathbf{x}_1 - \mathbf{x}_2 \right\rVert_2 = \sqrt{(\mathbf{x}_1 - \mathbf{x}_2)^\top (\mathbf{x}_1 - \mathbf{x}_2)}
\]
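A minimal NumPy sketch of both metrics, checking the element-wise sums above against their compact norm forms (the sample vectors are illustrative):

```python
import numpy as np

def manhattan(x1, x2):
    """Sum of absolute feature-wise differences (L1 norm)."""
    return np.abs(x1 - x2).sum()

def euclidean(x1, x2):
    """Square root of the sum of squared differences (L2 norm)."""
    return np.sqrt(((x1 - x2) ** 2).sum())

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

print(manhattan(x1, x2))               # 5.0
print(np.linalg.norm(x1 - x2, ord=1))  # same result via the L1 norm

print(euclidean(x1, x2))               # ~3.606
print(np.linalg.norm(x1 - x2))         # same result via the L2 norm
```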
For a classification task, the \(k\) nearest neighbours take part in voting, and the class that receives the highest number of votes is the predicted class.
For a regression task, the prediction is calculated as the average of the labels of the \(k\) nearest neighbours.
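A minimal sketch of both prediction rules, assuming scikit-learn is available (the tiny 1-D dataset is illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: the 3 nearest neighbours vote on the class label.
y_cls = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_cls)
print(clf.predict([[2.5]]))  # [0] -- majority among {1.0, 2.0, 3.0} is class 0

# Regression: the prediction is the average label of the 3 nearest neighbours.
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[2.5]]))  # [2.0] -- mean of {1.0, 2.0, 3.0}
```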
Let us apply the KNN technique and visualise how a new example is assigned a label. For this example, the value of \(k\) is set to 3. The steps below are put together in the code sketch that follows them.
Visualise the data and the new example.
Find the nearest 3 neighbours.
Label the new data point with the majority class among the 3.
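A sketch of these steps, assuming scikit-learn's make_blobs for the two-class data and matplotlib for the plot; the query point is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Two-class dataset and a new, unlabeled example.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
new_point = np.array([[1.0, 2.0]])

# Fit a 3-NN classifier and predict the label of the new example.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict(new_point)[0]

# Indices of the 3 nearest neighbours, so they can be highlighted.
_, idx = knn.kneighbors(new_point)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="RdYlGn")
plt.scatter(X[idx[0], 0], X[idx[0], 1], s=150,
            facecolors="none", edgecolors="k", label="3 nearest neighbours")
plt.scatter(*new_point[0], marker="*", s=200, c="blue",
            label=f"new example -> class {pred}")
plt.legend()
plt.show()
```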
Decision Boundary
Generate another dataset and observe the decision boundary for different values of \(k\).
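One way to sketch this, assuming a make_moons dataset and a grid of predictions over the plane; smaller \(k\) gives a more jagged boundary, larger \(k\) a smoother one:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Dense grid covering the feature space; each grid point gets a prediction.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, k in zip(axes, [1, 5, 25]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = knn.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)  # shaded regions show the boundary
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)
    ax.set_title(f"k = {k}")
plt.show()
```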
The value of \(k\) that yields the minimum test error is the most suitable.
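A sketch of this selection procedure, assuming a held-out test split and an illustrative range of candidate \(k\) values:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Test error (1 - accuracy) for each candidate value of k.
ks = range(1, 26)
errors = [1 - KNeighborsClassifier(n_neighbors=k)
              .fit(X_train, y_train)
              .score(X_test, y_test)
          for k in ks]

# The k with the minimum test error is the most suitable.
best_k = ks[int(np.argmin(errors))]
print(f"best k = {best_k}, test error = {errors[best_k - 1]:.3f}")
```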