4.2 Clustering in Unsupervised Learning

Clustering in Unsupervised Learning:

  • Clustering is an unsupervised learning technique used to divide data into groups so that similar data points fall in the same group.
  • No labels are given; the algorithm itself finds groups based on similarity.
  • Groups are formed using similarity and distance measures, such as Euclidean distance.

Example: Shopping

Online stores group customers like:

  • Group 1 → People who buy electronics

  • Group 2 → People who buy clothes

  • Group 3 → People who buy groceries

This helps in giving recommendations.

  • Same group → More similar

  • Different groups → Less similar


Clustering helps to:

  • Discover hidden patterns in data

  • Organize large datasets

  • Make better decisions (business, healthcare, etc.)


Common methods: K-means, Hierarchical clustering, and DBSCAN
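To make the idea concrete, here is a minimal K-means sketch in plain Python: pick k starting centroids, assign each point to its nearest centroid, then move each centroid to the mean of its points and repeat. The sample points and the helper name `kmeans` are illustrative assumptions, not from the text.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: assign each point to the nearest
    centroid, then move each centroid to the mean of its points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (nearest by squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: centroid becomes the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# Two well-separated groups of points
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, k=2)
print(clusters)  # the three low points and the three high points separate
```

Hierarchical clustering and DBSCAN follow different strategies (merging nearest clusters step by step, and growing dense regions, respectively), but all three rely on the same notion of distance shown here.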


Similarity in Clustering:

  • Similarity means how much two data points are alike.
  • If two items are very similar, they will be placed in the same cluster.

Example: Fruits

Compare two fruits:

  • Apple → Red, round

  • Cherry → Red, round

These are very similar, so they go into the same group.

Another comparison:

  • Banana → Yellow, long

  • Apple → Red, round

These are less similar, so they go into different groups.

  • More similarity → Same group

  • Less similarity → Different groups


Distance Measures in Clustering

A distance measure tells us how far apart two data points are.

Instead of saying “similar” or “different”, we calculate a number (distance).

  • Small distance → Very similar

  • Large distance → Very different


Common Distance Measure: 

  • Euclidean Distance is the most commonly used method.
  • It is the straight-line distance between two points.
  • For two points (x₁, y₁) and (x₂, y₂): d = √((x₂ − x₁)² + (y₂ − y₁)²)



Example

Let’s take two students:

  • Student A → (2 study hours, 50 marks)

  • Student B → (3 study hours, 55 marks)

Distance between them is small, so they are similar.

Another pair:

  • Student A → (2, 50)

  • Student C → (10, 90)

Distance is large, so they are very different.
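The student distances above can be checked with a short Euclidean-distance sketch (the function name `euclidean` is just for illustration):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

student_a = (2, 50)   # (study hours, marks)
student_b = (3, 55)
student_c = (10, 90)

print(euclidean(student_a, student_b))  # ≈ 5.10  → small, similar
print(euclidean(student_a, student_c))  # ≈ 40.79 → large, different
```

A small distance means the two students would likely land in the same cluster; a large one means they would not.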

  • Clustering → Group similar data points

  • Similarity → How alike two data points are

  • Distance → How far two data points are

  • Small distance = High similarity

  • Large distance = Low similarity


Advantages of Clustering

  • No labeled data needed

  • Finds hidden patterns

  • Useful for large datasets

  • Easy to understand (especially K-means)

Disadvantages of Clustering

  • Choosing correct number of clusters is difficult

  • Sensitive to noise and outliers

  • Results may vary based on algorithm

Applications of Clustering

  • Customer segmentation

  • Recommendation systems (Netflix, Amazon)

  • Social network analysis

  • Medical diagnosis

  • Image segmentation

