4.2 Clustering in Unsupervised Learning
- Clustering is an unsupervised learning technique used to divide data into groups (clusters) so that similar data points end up in the same group.
- No labels are given; the algorithm finds the groups on its own.
- Groups are formed using similarity and distance measures such as Euclidean distance.
Example: Shopping
Online stores group customers like:
- Group 1 → People who buy electronics
- Group 2 → People who buy clothes
- Group 3 → People who buy groceries
This helps in giving recommendations.
- Same group → More similar
- Different groups → Less similar
Clustering helps to:
- Discover hidden patterns in data
- Organize large datasets
- Make better decisions (business, healthcare, etc.)
Common methods: K-means, Hierarchical, DBSCAN
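The K-means method mentioned above can be sketched in a few lines: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is a minimal illustration using made-up 2-D data and a simple "first k points" initialization (real implementations usually pick starting centroids more carefully).

```python
import math

def kmeans(points, k, iters=20):
    """Minimal K-means sketch: alternate assignment and update steps."""
    # Simple deterministic initialization: first k points as centroids
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(coord) / len(cluster)
                                     for coord in zip(*cluster))
    return centroids, clusters

# Two obvious groups of points (made-up data for illustration)
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
```

On this data the algorithm separates the three low points from the three high points, with centroids near (1.33, 1.33) and (8.33, 8.33).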
Similarity in Clustering
- Similarity means how much two data points are alike.
- If two items are very similar, they will be placed in the same cluster.
Example: Fruits
Compare two fruits:
- Apple → Red, round
- Cherry → Red, round
These are very similar, so they go into the same group.
Another comparison:
- Banana → Yellow, long
- Apple → Red, round
These are less similar, so they go into different groups.
- More similarity → Same group
- Less similarity → Different groups
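To compare fruits numerically, the features have to be encoded as numbers first. This sketch uses a made-up 0–1 encoding (redness, roundness) for the fruits above, so "more similar" becomes "smaller distance":

```python
import math

# Made-up numeric encoding of each fruit as (redness, roundness),
# both on a 0-1 scale — illustrative values only
fruits = {
    "apple":  (0.90, 0.90),
    "cherry": (0.95, 0.95),
    "banana": (0.10, 0.20),
}

def distance(a, b):
    """Euclidean distance between two encoded fruits."""
    return math.dist(fruits[a], fruits[b])

print(distance("apple", "cherry"))  # small → same group
print(distance("apple", "banana"))  # large → different groups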
Distance Measures in Clustering
Distance measure tells us how far apart two data points are.
Instead of saying “similar” or “different”, we calculate a number (distance).
- Small distance → Very similar
- Large distance → Very different
Common Distance Measure:
- Euclidean Distance is the most commonly used method.
- It is the straight-line distance between two points.
Example: Students
Let’s take two students:
- Student A → (2 study hours, 50 marks)
- Student B → (3 study hours, 55 marks)
Distance between them is small, so they are similar.
Another pair:
- Student A → (2, 50)
- Student C → (10, 90)
Distance is large, so they are very different.
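The student comparison above can be computed directly with the Euclidean formula, √((x₁−x₂)² + (y₁−y₂)²):

```python
import math

# Students as (study hours, marks), from the example above
A = (2, 50)
B = (3, 55)
C = (10, 90)

dist_AB = math.dist(A, B)  # sqrt(1^2 + 5^2)   = sqrt(26)   ≈ 5.10
dist_AC = math.dist(A, C)  # sqrt(8^2 + 40^2)  = sqrt(1664) ≈ 40.79
```

A and B are close (≈5.10), while A and C are far apart (≈40.79). Note that in practice features on very different scales (hours 0–10 vs. marks 0–100) are usually rescaled first, so that one feature does not dominate the distance.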
- Clustering → Groups similar data points
- Similarity → How alike two data points are
- Distance → How far apart two data points are
Small distance = High similarity
Large distance = Low similarity
Advantages of Clustering
- No labeled data needed
- Finds hidden patterns
- Useful for large datasets
- Easy to understand (especially K-means)
Disadvantages of Clustering
- Choosing the correct number of clusters is difficult
- Sensitive to noise and outliers
- Results may vary based on the algorithm
Applications of Clustering
- Customer segmentation
- Recommendation systems (Netflix, Amazon)
- Social network analysis
- Medical diagnosis
- Image segmentation