K-Means Clustering
Dr. Ashish Tendulkar
Machine Learning Techniques
IIT Madras
In this course, so far we have studied supervised machine learning algorithms, where the training data consists of features and labels.
There is another class of ML models where the training data contains only features and no labels. These are called unsupervised ML algorithms.
Clustering is an example of an unsupervised ML algorithm.
Overview
Clustering is the process of grouping similar data points or examples in the training set into the same cluster.
You may wonder what we do with training examples that are represented with features but have no labels.
Clustering is widely used in many applications such as
Customer profiling
Anomaly detection
Image segmentation
Image compression
Geostatistics
Astronomy
Just like any ML algorithm, clustering also has five components:
- Training data
- Model
- Loss function
- Optimization
- Model selection/evaluation
Components of clustering
Training data
The training data consists of examples with only features:
\(D = \{\mathbf{x}^{(i)}\}_{i=1}^{n}\)
Each example is represented with \(m\) features.
Model
The model of clustering is as follows:
- We need to assign each point in the training set to one of the \(k\) clusters. This is called hard clustering.
- Alternatively, each point has a probability of membership in each of the \(k\) clusters such that the probabilities sum to 1. This is called soft clustering.
Examples/Clusters | C1 | C2 | ... | Ck
---|---|---|---|---
x1 | 1 | 0 | ... | 0
x2 | 0 | 1 | ... | 0
... | ... | ... | ... | ...
xn | 0 | 0 | ... | 1
In this course, we will focus on hard clustering.
Cluster \(c_r\), \(1 \leq r \leq k\), is represented by its centroid, which is computed as the average of the vectors of the points assigned to that cluster.
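In symbols (the notation \(|c_r|\) for the number of points in the cluster is ours):
\(\mathbf{\mu}^{(r)} = \dfrac{1}{|c_r|} \sum_{\mathbf{x}^{(i)} \in c_r} \mathbf{x}^{(i)}\)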
In this model, there are two unknowns:
- Cluster centroids
- Membership of points in the clusters
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9113211/pasted-from-clipboard.png)
The data points are usually assigned to the nearest clusters based on a chosen distance measure.
Euclidean distance is one of the commonly used measures in this process.
Euclidean distance between a data point \(\mathbf{x}^{(i)}\) and a centroid \(\mathbf{\mu}^{(r)}\) in \(m\) dimensions is calculated as
\(d(\mathbf{x}^{(i)}, \mathbf{\mu}^{(r)}) = \sqrt{\sum_{j=1}^{m} \left(x_j^{(i)} - \mu_j^{(r)}\right)^2}\)
Loss function
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9113211/pasted-from-clipboard.png)
(Figure: two clusters \(C_1\) and \(C_2\) with centroids \(\mu_1\) and \(\mu_2\), plotted over features \(x_1\) and \(x_2\).)
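The loss is the sum of squared errors (SSE) between each point and the centroid of its assigned cluster:
\(J = \sum_{r=1}^{k} \sum_{\mathbf{x}^{(i)} \in c_r} \left\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(r)} \right\rVert^2\)
This is the quantity the k-means algorithm tries to minimize.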
Optimization
We use the k-means algorithm for optimization here, as sketched in the code below.
1. Start off with \(k\) initial cluster centers.
2. Assign each point to the closest cluster center.
3. For each cluster, recompute its center as the average of all its assigned points.
4. Repeat steps 2 and 3 until the centroids no longer move or a certain number of iterations has been performed.
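A minimal NumPy sketch of these four steps (an illustration under our own choices of function name, random initialization, and stopping check; not the course's reference implementation):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Cluster X (n x m) into k clusters; return centroids and assignments."""
    rng = np.random.default_rng(seed)
    # 1. Start with k initial centers chosen randomly from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == r].mean(axis=0) if np.any(labels == r) else centroids[r]
            for r in range(k)
        ])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage (assumes X is an (n, m) array of points):
# centroids, labels = k_means(X, k=3)
```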
Data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109157/pasted-from-clipboard.png)
Let's generate points for 3 clusters using the sklearn library.
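A possible sketch using sklearn's make_blobs (the exact parameter values used for the slides are not given, so these are illustrative):

```python
from sklearn.datasets import make_blobs

# Generate 300 two-dimensional points around 3 cluster centers.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
print(X.shape)  # (300, 2)
```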
Visualization
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109169/pasted-from-clipboard.png)
Visualization
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109174/pasted-from-clipboard.png)
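Plots like these can be reproduced with sklearn's KMeans and matplotlib; a sketch, assuming X from the make_blobs snippet above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means with k = 3 and obtain cluster assignments.
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the points colored by cluster, with the centroids marked.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100)
plt.show()
```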
Limitations
Local Optima
K-means does not always reach the optimal SSE, because random initialization can cause the algorithm to converge to a poor clustering.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109198/pasted-from-clipboard.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109199/pasted-from-clipboard.png)
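One way to observe this with sklearn (a sketch, assuming X from earlier): run KMeans with a single random initialization and different seeds and compare the resulting SSE, exposed as inertia_.

```python
from sklearn.cluster import KMeans

# One random initialization per run; different seeds can converge to
# different local optima with different SSE values.
for seed in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  SSE={km.inertia_:.2f}")
```

In practice, KMeans runs several initializations (the n_init parameter) and keeps the solution with the lowest SSE, which mitigates but does not eliminate the problem.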
Limitations
Data that is not spherical in shape
Let's look at the following example, where there are two moon-shaped clusters. The points in these clusters cannot be contained by a spherical shape.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109203/pasted-from-clipboard.png)
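A quick sketch with sklearn's make_moons illustrating the problem (parameter values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moon clusters; they are not globular.
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means carves the plane into spherical regions and mixes the two moons.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_moons)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=20)
plt.show()
```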
Limitations
Large datasets
For data sets with a large number of data points and features, K-means will be quite slow to converge.
K is unknown at the beginning
It is computationally intensive to find K by the elbow method, because it involves running the complete algorithm for many possible values of K.
We evaluate a clustering solution by the SSE measure that was defined as the loss function.
Model selection
How do we find a suitable value of \(k\)?
- Elbow method
- Silhouette method
Elbow Method
The intuition behind the elbow method: the cost function does not improve much by increasing K beyond the optimal value.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109190/pasted-from-clipboard.png)
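A plot like this can be produced by recording the SSE (inertia_) for a range of K values; a sketch, assuming X from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# SSE for K = 1..9; look for the "elbow" where the curve flattens out.
ks = range(1, 10)
sse = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()
```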
Silhouette Coefficient
\(a\) is the mean distance between the instance and the other instances in the same cluster.
\(b\) is the mean distance between the instance and the instances in the next closest cluster.
The silhouette coefficient of an instance is \((b - a) / \max(a, b)\), which ranges from -1 to +1.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109196/pasted-from-clipboard.png)
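sklearn's silhouette_score averages this coefficient over all instances; a sketch for comparing values of \(k\), assuming X from earlier:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Mean silhouette coefficient for K = 2..9; higher is better.
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))
```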
Example 1: Image Segmentation
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109214/pasted-from-clipboard.png)
Example 1: Image Segmentation
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109215/pasted-from-clipboard.png)
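A common recipe for k-means image segmentation (a sketch; the sample image and number of clusters are our own choices, not necessarily those used in the slides): treat each pixel's RGB color as a 3-dimensional point, cluster the colors, and replace each pixel with its cluster centroid.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load a sample RGB image and flatten it into an (n_pixels, 3) array of colors.
image = load_sample_image('china.jpg') / 255.0
pixels = image.reshape(-1, 3)

# Cluster the pixel colors into k groups.
k = 8
kmeans = KMeans(n_clusters=k, random_state=0).fit(pixels)

# Rebuild the image, replacing every pixel by the centroid of its cluster.
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
```

The same idea underlies image compression: we store only k colors plus one cluster index per pixel instead of a full RGB triple per pixel.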
Example 2: Digit Classification
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109218/pasted-from-clipboard.png)
Let us try to classify the digits dataset by K-means clustering.
Example 2: Digit Classification
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2010658/images/9109227/pasted-from-clipboard.png)
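A sketch of this experiment (the cluster-to-digit mapping step is our assumption about how such results are typically obtained): cluster the digits into 10 groups without using the labels, assign each cluster the most frequent true digit among its members, and measure accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

# 1797 handwritten digits, each an 8x8 grid flattened into 64 features.
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Cluster into 10 groups without using the labels.
clusters = KMeans(n_clusters=10, random_state=0).fit_predict(X_digits)

# Map each cluster to the most frequent true digit among its members.
predicted = np.zeros_like(clusters)
for c in range(10):
    mask = clusters == c
    predicted[mask] = np.bincount(y_digits[mask]).argmax()

print('accuracy:', accuracy_score(y_digits, predicted))
```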
Clustering is the task of grouping observations so that members of the same group, or cluster, are more similar to each other by some measure than they are to members of other clusters.
K-means is an unsupervised clustering algorithm.
Applications:
- Customer Profiling
- Dimensionality reduction
- Anomaly Detection
- Market segmentation
- Computer vision (Image segmentation, Image Compression)
- Geo-statistics
- Astronomy