4.5 Hierarchical Clustering (HCA)

Hierarchical Clustering (HCA):

Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled data into clusters.

(or)

Hierarchical clustering is a technique that organizes data into a hierarchy of clusters, which is shown using a tree-like diagram called a dendrogram, without fixing the number of clusters in advance.

  • It builds clusters in a step-by-step hierarchical structure

  • The structure looks like a tree, called a dendrogram

  • No need to predefine number of clusters (unlike K-Means)

Dendrogram

  • A tree-like diagram that shows how clusters are formed

  • Y-axis → Distance between clusters

  • X-axis → Data points

 It helps decide where to cut the tree to obtain the desired number of clusters.
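For instance, cutting the dendrogram at a chosen height can be sketched with SciPy (assuming SciPy is available; the data points and the cut height of 3.0 are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Z encodes the merge tree (the dendrogram) built with single linkage
Z = linkage(X, method="single")

# Cutting the tree at height 3.0 (Y-axis = distance) yields flat clusters
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```

Points merged below the cut height end up in the same cluster, so the first three points receive one label and the last three another.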

Example: grouping car types using hierarchical clustering:

  • Step 1: Start with individual clusters: Each car type initially forms its own cluster.

  • Step 2: Calculate distances: Determine the similarity or distance between each pair of clusters (e.g., using Euclidean distance).

  • Step 3: Merge closest clusters: Merge the two most similar clusters into a single cluster.

  • Step 4: Repeat: Continue merging the closest clusters until you have a single cluster or the desired number of clusters.

  • Step 5: Visualize with a dendrogram: The dendrogram will show how clusters are merged, revealing the hierarchy of car types.

 

The hierarchical clustering technique has two approaches:


1.     Agglomerative: A bottom-up approach, in which the algorithm starts by treating every data point as its own cluster and keeps merging the closest clusters until only one cluster is left.


Steps to perform agglomerative hierarchical clustering:

  • Step-1: Treat each data point as a single cluster. If there are N data points, there will also be N clusters.

  • Step-2: Take the two closest data points or clusters and merge them to form one cluster, leaving N-1 clusters.

  • Step-3: Again, take the two closest clusters and merge them to form one cluster, leaving N-2 clusters.

  • Step-4: Repeat Step 3 until only one cluster is left.

  • Step-5: Once all the clusters are combined into one big cluster, use the dendrogram to divide them into clusters as the problem requires.
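The agglomerative steps above can be sketched in plain Python. This toy version uses single linkage on 1-D points; the function names and the data are illustrative, not a library API:

```python
def single_linkage_dist(a, b):
    """Shortest distance between any point of cluster a and any point of b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points):
    # Step 1: every point starts as its own cluster (N clusters)
    clusters = [[p] for p in points]
    history = []
    # Steps 2-4: repeatedly merge the two closest clusters until one is left
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage_dist(clusters[ij[0]],
                                                             clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append(sorted(merged))  # record each merge (the dendrogram's memory)
    return history

history = agglomerative([1.0, 1.1, 5.0, 5.2, 9.0])
print(history)
```

With N = 5 points there are exactly N-1 = 4 merges, and the recorded history is precisely the information a dendrogram draws.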

 

 Measures of the distance between two clusters

 

There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These measures are called Linkage methods. Some of the popular linkage methods are:

1.      Single Linkage: The shortest distance between the closest points of the two clusters.



2.      Complete Linkage: The farthest distance between two points in two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.



3.   Average Linkage: The distance between every pair of points, one from each cluster, is added up and divided by the total number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.

4.  Centroid Linkage: The distance between the centroids of the two clusters.
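The four linkage measures can be computed directly for two small clusters. The clusters A and B below are made-up values chosen so the differences are easy to see:

```python
import numpy as np
from itertools import product

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# Euclidean distance for every (point-in-A, point-in-B) pair
pair_dists = [np.linalg.norm(a - b) for a, b in product(A, B)]

single   = min(pair_dists)                    # closest pair of points
complete = max(pair_dists)                    # farthest pair of points
average  = sum(pair_dists) / len(pair_dists)  # mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid-to-centroid

print(single, complete, average, centroid)
```

Note that single linkage gives the smallest value and complete linkage the largest, which is why complete linkage tends to form tighter clusters.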


The dendrogram is a tree-like structure that records every merge step the HC algorithm performs.

 

In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points, and the X-axis shows all the data points of the given dataset.

 


2.     Divisive Clustering: 

Ø The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

Ø Divisive clustering is a hierarchical clustering method that starts with the whole dataset as one cluster and recursively splits clusters into smaller ones until each data point forms its own cluster, or a stopping criterion is met.

Ø It splits a cluster into two sub-clusters based on the analysis of all possible bipartitions.




A "criterion" (standard rule) is a standard used to evaluate or optimize a model or algorithm, referring to the objective function being minimized or maximized, a stopping condition, or a measure of model quality.

 

Steps to perform divisive hierarchical clustering:

  • Step 1: Consider the entire dataset as a single cluster.

  • Step 2: Choose a criterion to determine how to split the current cluster. This could be based on distance, density, or other measures.

  • Step 3: Apply the chosen criterion to divide the current cluster into two sub-clusters. This can be done using methods like K-Means or other distance-based criteria.

  • Step 4: Update the clustering to reflect the split; there are now two clusters instead of one.

  • Step 5: Repeat the process on each resulting cluster, splitting it into sub-clusters, until the stopping criterion is met.

  • Step 6: Use a stopping criterion to end the recursive splitting. This could be a minimum cluster size, a maximum number of iterations, or another criterion.
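The divisive steps above can be sketched in plain Python. The split rule used here (seed the bipartition with the two farthest points and assign every point to the nearer seed) and the min_size stopping rule are illustrative choices, not the only possible criteria:

```python
import math

def bisect(cluster):
    # Splitting criterion: use the two farthest points as seeds (Step 3)
    s1, s2 = max(((p, q) for p in cluster for q in cluster),
                 key=lambda pq: math.dist(*pq))
    left  = [p for p in cluster if math.dist(p, s1) <= math.dist(p, s2)]
    right = [p for p in cluster if math.dist(p, s1) >  math.dist(p, s2)]
    return left, right

def divisive(cluster, min_size=2):
    # Stopping criterion: clusters at or below min_size are not split (Step 6)
    if len(cluster) <= min_size:
        return [cluster]
    # Steps 3-5: split into two sub-clusters and recurse on each
    left, right = bisect(cluster)
    return divisive(left, min_size) + divisive(right, min_size)

points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.1), (9.0, 0.0), (9.1, 0.1)]
clusters = divisive(points)
print(clusters)
```

Starting from one cluster of six points, the recursion peels off the well-separated pairs and stops once every cluster reaches the minimum size.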

 
