4.5 Hierarchical Clustering (HCA)
Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled data into clusters.
(or)
Hierarchical clustering is a technique that organizes data into a hierarchy of clusters, which is shown using a tree-like diagram called a dendrogram, without fixing the number of clusters in advance.
- It builds clusters step by step in a hierarchical structure.
- The structure looks like a tree, called a dendrogram.
- There is no need to predefine the number of clusters (unlike K-Means).
Dendrogram
- A tree-like diagram that shows how clusters are formed.
- Y-axis → distance between clusters.
- X-axis → data points.
It helps to decide where to cut the tree to obtain the final clusters.
Example: grouping car types using hierarchical clustering:
- Step 1: Start with individual clusters: each car type initially forms its own cluster.
- Step 2: Calculate distances: determine the similarity or distance between each pair of clusters (e.g., using Euclidean distance).
- Step 3: Merge the closest clusters: merge the two most similar clusters into a single cluster.
- Step 4: Repeat: continue merging the closest clusters until only one cluster remains (or the desired number of clusters is reached).
- Step 5: Visualize with a dendrogram: the dendrogram shows how the clusters were merged, revealing the hierarchy of car types.
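The five steps above can be run end-to-end with scikit-learn's AgglomerativeClustering. This is a minimal sketch; the car feature values are made-up numbers for illustration, not real data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical car features: [weight in tonnes, horsepower / 100].
cars = np.array([[1.0, 1.0],   # small hatchback
                 [1.1, 1.2],   # compact sedan
                 [2.4, 3.5],   # sports car
                 [2.5, 3.3],   # muscle car
                 [3.5, 2.0]])  # SUV

# Bottom-up merging with average linkage, stopped at 3 clusters.
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(cars)
```

On this toy data the two small cars merge into one cluster, the two performance cars into another, and the SUV is left as its own cluster.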
The hierarchical clustering technique has two approaches:
1. Agglomerative: a bottom-up approach, in which the algorithm starts by treating every data point as a single cluster and merges the closest clusters until only one cluster is left.
Steps to perform agglomerative hierarchical clustering:
- Step-1: Treat each data point as a single cluster. If there are N data points, there are initially N clusters.
- Step-2: Take the two closest data points or clusters and merge them into one cluster, leaving N-1 clusters.
- Step-3: Again take the two closest clusters and merge them into one cluster, leaving N-2 clusters.
- Step-4: Repeat Step-3 until only one cluster is left.
- Step-5: Once all the data points are combined into one big cluster, use the dendrogram to divide the tree into clusters as the problem requires.
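These steps correspond directly to SciPy's `linkage` routine, which records every merge, and `fcluster`, which cuts the tree (Step-5). A small sketch on toy 2-D points chosen to form two obvious groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy dataset: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# linkage() performs the N-1 merges of agglomerative clustering;
# each row of Z records one merge (clusters joined, distance, size).
Z = linkage(X, method="single")

# Cut the tree to obtain 2 flat clusters (Step-5).
labels = fcluster(Z, t=2, criterion="maxclust")
```

With N = 6 points, `Z` has exactly N-1 = 5 rows, one per merge.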
Measures for the distance between two clusters
There are various ways to calculate the distance between two clusters, and the choice decides the rule for merging. These measures are called linkage methods. Some of the popular linkage methods are:
1. Single Linkage: the shortest distance between the closest points of the two clusters.
2. Complete Linkage: the distance between the two farthest points of the two clusters. It is a popular linkage method because it tends to form tighter clusters than single linkage.
3. Average Linkage: the distance between every pair of points (one from each cluster) is summed and divided by the number of pairs, giving the average distance between the two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: the distance between the centroids of the two clusters.
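The four linkage measures can be checked by hand on two tiny clusters; the toy points below are chosen so every pairwise distance is easy to verify:

```python
import numpy as np

# Two small clusters of 2-D points lying on the x-axis.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between points of A and B:
# d = [[3, 5],
#      [2, 4]]
d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = d.min()     # single linkage: closest pair   -> 2.0
complete = d.max()   # complete linkage: farthest pair -> 5.0
average = d.mean()   # average linkage: mean of all 4 pairs -> 3.5
# centroid linkage: distance between cluster means (0.5, 0) and (4, 0)
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # -> 3.5
```

Note that average and centroid linkage happen to coincide here; in general they differ.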
The dendrogram is a tree-like structure that records each merge step the HC algorithm performs. In the dendrogram plot, the y-axis shows the (e.g., Euclidean) distances at which clusters merge, and the x-axis shows all the data points of the given dataset.
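SciPy's `dendrogram` function computes this plot from the linkage matrix; with `no_plot=True` it returns the tree layout instead of drawing it, which makes the x-axis/y-axis structure easy to inspect. A sketch on toy 1-D data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0], [1.5], [5.0], [5.2], [9.0]])
Z = linkage(X, method="single")

# no_plot=True returns the layout dict instead of drawing:
# 'ivl' is the left-to-right order of data points on the x-axis,
# 'dcoord' holds the merge heights shown on the y-axis (one per merge).
tree = dendrogram(Z, no_plot=True)
```

To actually draw the plot, call `dendrogram(Z)` inside a Matplotlib figure.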
2. Divisive: the reverse of the agglomerative algorithm, as it is a top-down approach.
- Divisive clustering is a hierarchical clustering method that starts with the whole dataset as a single cluster and recursively splits it into smaller clusters until each data point forms its own cluster.
- At each step it splits a cluster into two sub-clusters, based on an analysis of the possible bipartitions.
A "criterion" (standard rule) is a standard used to evaluate or optimize a model or algorithm, referring to the objective function being minimized or maximized, a stopping condition, or a measure of model quality.
Steps to perform divisive hierarchical clustering:
- Step 1: Consider the entire dataset as a single cluster.
- Step 2: Choose a criterion to determine how to split the current cluster. This could be based on distance, density, or other measures.
- Step 3: Apply the chosen criterion to the current cluster to find the best way to divide it into two sub-clusters. This can be done using methods like K-Means or other distance-based criteria.
- Step 4: Update the clustering to reflect the split; there are now two clusters instead of one.
- Step 5: Repeat this process on each of the resulting sub-clusters until the stopping criterion is met.
- Step 6: Use a stopping criterion to end the recursive splitting. This could be a minimum cluster size, a maximum number of iterations, or another condition.
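The steps above can be sketched as a recursive bisection. This sketch uses 2-means as the splitting criterion (one common choice for Step 3) and a minimum cluster size as the stopping criterion (Step 6); both choices and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, indices=None, min_size=2):
    """Top-down clustering: recursively split with 2-means (Step 3)
    until a cluster reaches min_size, the stopping criterion (Step 6)."""
    if indices is None:            # Step 1: start with the whole dataset
        indices = np.arange(len(X))
    if len(indices) <= min_size:   # Step 6: stop splitting small clusters
        return [indices]
    # Steps 2-3: bipartition the current cluster with 2-means.
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    # Steps 4-5: replace the cluster by its two halves and recurse.
    return divisive(X, left, min_size) + divisive(X, right, min_size)

X = np.array([[0.0], [0.1], [5.0], [5.1], [9.0], [9.1]])
clusters = divisive(X, min_size=2)
```

On this data the recursion first peels off {0.0, 0.1}, then splits the remainder into {5.0, 5.1} and {9.0, 9.1}, yielding three clusters of two points each.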