Machine Learning Techniques
In this course so far, we have studied supervised machine learning algorithms, where the training data consist of features and labels.
There is another class of ML models where the training data contain only features and no labels are available. These are called unsupervised ML algorithms.
Clustering is an example of an unsupervised ML algorithm.
Clustering is the process of grouping similar data points, or examples, in the training set into the same cluster.
You may wonder what we can do with training examples that are represented with features but carry no labels.
Clustering is widely used in many applications, such as:
Customer profiling
Anomaly detection
Image segmentation
Image compression
Geostatistics
Astronomy
Just like any ML algorithm, clustering also has five components:
The training data consist of examples with only features:
\(D = \{\mathbf{x}^{(i)}\}_{i=1}^{n}\)
Each example is represented by \(m\) features.
The model of clustering is as follows:
| Examples/Clusters | C1 | C2 | ... | Ck |
|---|---|---|---|---|
| x1 | 1 | 0 | ... | 0 |
| x2 | 0 | 1 | ... | 0 |
| ... | ... | ... | ... | ... |
| xn | 0 | 0 | ... | 1 |
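Each row of this table is a one-hot membership vector: entry \((i, r)\) is 1 exactly when example \(\mathbf{x}^{(i)}\) belongs to cluster \(C_r\). A small NumPy illustration (the assignment values are made up):

```python
import numpy as np

# Cluster assignments for n = 4 examples and k = 3 clusters:
# example i belongs to cluster assignments[i].
assignments = np.array([0, 1, 2, 2])

# The equivalent one-hot membership matrix: row i has a single 1
# in column assignments[i], matching the table above.
membership = np.eye(3, dtype=int)[assignments]
print(membership)
```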
In this course, we will focus on hard clustering, where each example belongs to exactly one cluster.
Cluster \(c_r,\; 1 \leq r \leq k\), is represented by its centroid \(\mathbf{\mu}^{(r)}\), which is calculated as the average of the vectors of the points in that cluster.
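Concretely, the centroid is the component-wise mean of the cluster's members:

\(\mathbf{\mu}^{(r)} = \frac{1}{|c_r|} \sum_{\mathbf{x}^{(i)} \in c_r} \mathbf{x}^{(i)}\)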
In this model, there are two unknowns: the assignment of each example to a cluster, and the centroid \(\mathbf{\mu}^{(r)}\) of each cluster.
Data points are usually assigned to the nearest cluster based on a chosen distance measure.
Euclidean distance is one of the most commonly used measures for this purpose.
The Euclidean distance between a data point \(\mathbf{x}^{(i)}\) and a centroid \(\mathbf{\mu}^{(r)}\) in \(m\) dimensions is calculated as

\(d\left(\mathbf{x}^{(i)}, \mathbf{\mu}^{(r)}\right) = \sqrt{\sum_{j=1}^{m} \left(x_j^{(i)} - \mu_j^{(r)}\right)^2}\)

[Figure: two clusters \(C_1\) and \(C_2\) with centroids \(\mu_1\) and \(\mu_2\), plotted in the \((x_1, x_2)\) plane.]
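As a quick sketch of this formula in code (the array values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # a data point in m = 3 dimensions
mu = np.array([0.0, 2.0, 5.0])   # a cluster centroid

# Euclidean distance: square root of the sum of squared coordinate differences.
dist = np.sqrt(np.sum((x - mu) ** 2))
print(dist)                       # 2.236..., same as np.linalg.norm(x - mu)
```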
We use the k-means algorithm for optimization here:
1. Initialize \(k\) cluster centers (centroids), e.g., by picking \(k\) data points at random.
2. Assign each point to the closest cluster center.
3. For each cluster, recompute its center as the average of all its assigned points.
4. Repeat steps 2 and 3 until the centroids no longer move or a maximum number of iterations has been reached.
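A minimal from-scratch sketch of these four steps in NumPy; the function name `kmeans` and its defaults are ours, not from any library:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: X is an (n, m) array of examples, k the number of clusters."""
    rng = np.random.default_rng(seed)

    # Step 1: initialize centroids by picking k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == r].mean(axis=0) if np.any(labels == r) else centroids[r]
            for r in range(k)
        ])

        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```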
Let's generate points for 3 clusters using the sklearn library.
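The text doesn't name the function, but sklearn's make_blobs is the usual way to generate such points; the sample count and seeds below are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate 300 points around 3 cluster centers.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Fit k-means with k = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the 3 learned centroids
print(kmeans.inertia_)          # SSE of the final clustering
```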
K-means does not always reach the optimal SSE, because a poor random initialization can cause the algorithm to converge to a bad clustering.
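A common remedy is to run k-means several times from different random initializations and keep the run with the lowest SSE; in sklearn this is controlled by the n_init parameter (the data and seeds below are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# A single random initialization may land in a poor local optimum...
single = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)

# ...while keeping the best of 10 restarts gives an SSE at least as low.
multi = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)  # multi.inertia_ <= single.inertia_ here
```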
Let's look at the following example, where there are two moon-shaped clusters. The points in these clusters cannot be captured by a spherical shape.
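A sketch of this failure case using sklearn's make_moons (the noise level and seed are illustrative). Because k-means carves space into centroid-centered regions, its two clusters cut across the moons, and agreement with the true grouping is low:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles ("moons").
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons is poor (1.0 would be a perfect match).
print(adjusted_rand_score(y_true, labels))
```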
For data sets with a large number of data points and features, k-means can be quite slow to converge.
It is computationally intensive to choose \(K\) by the elbow method, because it involves running the complete algorithm for many candidate values of \(K\).
We evaluate a clustering solution by the SSE measure, which was defined as the loss function.
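In the notation above, the SSE (sum of squared errors, also called inertia) is the total squared distance from each point to its cluster's centroid:

\(SSE = \sum_{r=1}^{k} \sum_{\mathbf{x}^{(i)} \in c_r} \left\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(r)} \right\rVert^2\)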
How do we find a suitable value of \(k\)?
The intuition behind the elbow method: the cost function does not improve much when \(K\) is increased beyond the optimal value.
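A minimal sketch of the elbow method using sklearn's KMeans (the data and range of \(K\) values are illustrative): plot the SSE against \(K\) and look for the bend where the curve flattens.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Run full k-means once per candidate K and record the SSE (inertia).
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.show()  # the bend ("elbow") suggests a suitable K -- here around K = 3
```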
For each instance, two quantities are computed:
\(a\) is the mean distance between the instance and the other instances in the same cluster.
\(b\) is the mean distance between the instance and the instances in the next closest cluster.
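These are the two quantities in the standard silhouette coefficient, \(s = \frac{b - a}{\max(a, b)}\), which ranges from \(-1\) (likely assigned to the wrong cluster) to \(+1\) (deep inside its own cluster). A short sklearn sketch that averages the coefficient over all instances (data and seed are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all instances; closer to 1 is better.
print(silhouette_score(X, labels))
```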
Let us try to cluster the handwritten digits dataset with k-means clustering.
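A sketch of this experiment with sklearn's built-in digits dataset; scoring the agreement with the adjusted Rand index is our choice for illustration, not prescribed by the text:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()                 # 1797 images of 8x8 = 64 pixels
X, y_true = digits.data, digits.target

# One cluster per digit class (0-9).
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

# Agreement between clusters and true digit labels (1.0 = perfect).
print(adjusted_rand_score(y_true, labels))
```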
Clustering is the task of grouping observations so that members of the same group, or cluster, are more similar to each other by some measure than they are to members of other clusters.
K-means is an unsupervised clustering algorithm.
Applications: customer profiling, anomaly detection, image segmentation, image compression, geostatistics, and astronomy.