1.13 K-Nearest Neighbors (KNN) in Machine Learning
- K-Nearest Neighbors, commonly called KNN, is a supervised machine learning algorithm used for both classification and regression problems.
- KNN classifies a data point based on the majority class of its K nearest neighbors, found using a distance calculation.
- KNN is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the entire dataset and performs computations only at prediction time.
- It works by finding the "k" closest data points (called neighbors) to a given input and making a prediction based on the majority class (for classification) or the average value (for regression).
- Classification: The class label of a data point is determined by the majority class among its k-nearest neighbors.
- Regression: The predicted value is the average of the values of the k-nearest neighbors.
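The regression case can be sketched in a few lines. This is a minimal illustration with made-up one-feature data, not a production implementation: the prediction for a query point is simply the mean of the targets of its k nearest training points.

```python
# Toy training data: (feature value, target value) -- illustrative numbers only
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (8.0, 30.0), (9.0, 32.0)]

def knn_regress(query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Sort training points by distance to the query and keep the k closest
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(y for _, y in nearest) / k

print(knn_regress(2.5))  # averages the targets of x = 2.0, 3.0, 1.0 -> 12.0
```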
Example: Imagine you're deciding what kind of fruit a new fruit is based on its shape and size. You compare it to fruits you already know.
- If k = 3, the algorithm looks at the 3 closest fruits to the new one.
- If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbors are apples.
How the KNN Algorithm Works
Assume we want to classify a new point.
Step 1: Choose value of K
- K = number of nearest neighbors to consider.
- Common values: 3, 5, 7.
Step 2: Calculate Distance
- Usually the Euclidean distance: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
Step 3: Find Nearest Neighbors
- Select the K points with the smallest distances to the new point; these are its nearest neighbors.
Step 4: Majority Voting
- For classification → choose the majority class
- For regression → take the average value
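The four steps above can be sketched as a small from-scratch classifier. The fruit features below are made-up numbers, chosen only to illustrate the distance-then-vote logic:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Step 2: Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (point, label) pairs.
    """
    # Step 3: sort training points by distance and keep the k closest
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    # Step 4: majority vote among the neighbors' labels
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Fruit example from above: features are (size, roundness) -- hypothetical values
fruits = [((7.0, 0.90), "apple"), ((7.5, 0.95), "apple"),
          ((18.0, 0.30), "banana"), ((7.2, 0.85), "apple")]
print(knn_classify(fruits, (7.1, 0.9), k=3))  # most neighbors are apples -> "apple"
```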
Example: Suppose a bank uses KNN to decide loan applications.
Features:
- Income
- Age
If a new person applies for a loan, KNN checks:
- People with similar income and age
- If most similar people were approved → approve
- If most were rejected → reject
How to Choose the Value of K
- Cross-Validation: A good way to find the best value of k is k-fold cross-validation. This means dividing the dataset into several parts: the model is trained on some of these parts and tested on the remaining ones, and the process is repeated for each part. The k value that gives the highest average accuracy during these tests is usually the best one to use.
- Elbow Method: In the Elbow Method we draw a graph showing the error rate or accuracy for different k values. As k increases, the error usually drops at first, but after a certain point it stops decreasing quickly. The point where the curve changes direction and looks like an "elbow" is usually the best choice for k.
- Odd Values for k: It's a good idea to use an odd number for k, especially in binary classification problems. This helps avoid ties when deciding which class is the most common among the neighbors.
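The idea behind both methods is to measure the error for each candidate k and pick the value where the error stops improving. A minimal sketch, using leave-one-out evaluation (each point is classified against all the others) on a tiny made-up dataset:

```python
import math
from collections import Counter

# Tiny labeled dataset: ((feature,), label) -- illustrative only
data = [((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"),
        ((3.0,), "B"), ((3.2,), "B"), ((3.4,), "B")]

def predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(k):
    """Leave-one-out error rate: classify each point using all the others."""
    wrong = sum(predict(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

for k in (1, 3, 5):
    print(k, loo_error(k))  # pick the k where the error stops improving
```

On this toy data, k = 1 and k = 3 classify every point correctly, while k = 5 fails badly because each point's neighborhood is dominated by the other class; too large a k washes out the local structure.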
Advantages of KNN
- Simple to understand
- No training time
- Works well for small datasets
Disadvantages of KNN
- Slow for large datasets
- Sensitive to irrelevant features
- Needs feature scaling (important!)
Always normalize data using:
- Min-Max Scaling
- Standardization
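Both scaling methods can be written directly from their definitions. A short sketch, with hypothetical income values to show why scaling matters (raw incomes would otherwise dwarf a feature like age in the distance calculation):

```python
def min_max_scale(values):
    """Min-Max Scaling: rescale to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center to mean 0 and scale to standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20_000, 45_000, 50_000, 60_000]  # hypothetical raw incomes
print(min_max_scale(incomes))  # [0.0, 0.625, 0.75, 1.0]
```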
Applications of the KNN Algorithm
Here are some real-life applications of the KNN algorithm.
- Recommendation Systems: Many recommendation systems, such as those used by Netflix or Amazon, rely on KNN to suggest products or content. KNN looks at user behavior and finds similar users. If user A and user B have similar preferences, KNN might recommend movies that user A liked to user B.
- Spam Detection: KNN is widely used in filtering spam emails. By comparing the features of a new email with those of previously labeled spam and non-spam emails, KNN can predict whether a new email is spam or not.
- Customer Segmentation: In marketing firms, KNN is used to segment customers based on their purchasing behavior. By comparing new customers to existing customers, KNN can group customers into segments with similar choices and preferences. This helps businesses target the right customers with the right products or advertisements.
- Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken words into text. The algorithm compares the features of the spoken input with those of known speech patterns, then predicts the most likely word or command based on the closest matches.
Uses of KNN:
1. Classification Problems: KNN is widely used for classifying data points into discrete categories, such as:
   - Handwriting recognition
   - Image classification
   - Spam email detection
2. Regression Problems: KNN can be used to predict continuous values, for instance:
   - Predicting house prices
   - Forecasting stock prices
3. Anomaly Detection: KNN can help in detecting outliers by comparing a data point to its nearest neighbors.
4. Recommendation Systems: KNN can be used in collaborative filtering for recommending products or services based on the similarity between users or items.