1.13 K-Nearest Neighbors (KNN) in Machine Learning
- K-Nearest Neighbors, commonly called KNN, is a supervised machine learning algorithm used for both classification and regression problems.
- KNN classifies a data point based on the majority class of its K nearest neighbors, found using a distance calculation.
- KNN is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the entire dataset and performs computations only at prediction time.
- It works by finding the "k" closest data points (called neighbors) to a given input and making a prediction based on the majority class (for classification) or the average value (for regression).
- Classification: The class label of a data point is determined by the majority class among its k-nearest neighbors.
- Regression: The predicted value is the average of the values of the k-nearest neighbors.
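The regression case can be sketched in a few lines. This is a minimal illustration with made-up one-feature data, not a production implementation: the prediction for a query point is simply the mean of the targets of its k nearest training points.

```python
# Toy training data: (feature value, target value) -- illustrative numbers only
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (8.0, 30.0), (9.0, 32.0)]

def knn_regress(query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Sort training points by distance to the query and keep the k closest
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(y for _, y in nearest) / k

print(knn_regress(2.5))  # averages the targets of x = 2.0, 3.0, 1.0 -> 12.0
```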
Example: Imagine you're deciding what kind of fruit a new fruit is based on its shape and size. You compare it to fruits you already know.
- If k = 3, the algorithm looks at the 3 closest fruits to the new one.
- If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbors are apples.
How the KNN Algorithm Works
Assume we want to classify a new point.
Step 1: Choose value of K
- K = number of nearest neighbors to consider.
- Common values: 3, 5, 7.
Step 2: Calculate Distance
- Usually the Euclidean distance: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
Step 3: Find Nearest Neighbors
- Select the K points with the smallest distances to the new point; these are its nearest neighbors.
Step 4: Majority Voting
- For classification → choose the majority class
- For regression → take the average value
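The four steps above can be sketched as a small from-scratch classifier. The fruit features below are made-up numbers, chosen only to illustrate the distance-then-vote logic:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Step 2: Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (point, label) pairs.
    """
    # Step 3: sort training points by distance and keep the k closest
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    # Step 4: majority vote among the neighbors' labels
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Fruit example from above: features are (size, roundness) -- hypothetical values
fruits = [((7.0, 0.90), "apple"), ((7.5, 0.95), "apple"),
          ((18.0, 0.30), "banana"), ((7.2, 0.85), "apple")]
print(knn_classify(fruits, (7.1, 0.9), k=3))  # most neighbors are apples -> "apple"
```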
Example: Suppose a bank uses KNN to decide loan applications.
Features:
- Income
- Age
If a new person applies for a loan, KNN checks:
- People with similar income and age
- If most similar people were approved → approve
- If most were rejected → reject
How to Choose the Value of K
- Cross-Validation: A good way to find the best value of k is k-fold cross-validation. This means dividing the dataset into several parts: the model is trained on some of these parts and tested on the remaining ones, and the process is repeated for each part. The k value that gives the highest average accuracy during these tests is usually the best one to use.
- Elbow Method: In the Elbow Method we draw a graph showing the error rate or accuracy for different k values. As k increases, the error usually drops at first, but after a certain point it stops decreasing quickly. The point where the curve changes direction and looks like an "elbow" is usually the best choice for k.
- Odd Values for k: It's a good idea to use an odd number for k, especially in binary classification problems. This helps avoid ties when deciding which class is the most common among the neighbors.
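The idea behind both methods is to measure the error for each candidate k and pick the value where the error stops improving. A minimal sketch, using leave-one-out evaluation (each point is classified against all the others) on a tiny made-up dataset:

```python
import math
from collections import Counter

# Tiny labeled dataset: ((feature,), label) -- illustrative only
data = [((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"),
        ((3.0,), "B"), ((3.2,), "B"), ((3.4,), "B")]

def predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(k):
    """Leave-one-out error rate: classify each point using all the others."""
    wrong = sum(predict(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

for k in (1, 3, 5):
    print(k, loo_error(k))  # pick the k where the error stops improving
```

On this toy data, k = 1 and k = 3 classify every point correctly, while k = 5 fails badly because each point's neighborhood is dominated by the other class; too large a k washes out the local structure.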
Advantages of KNN
- Simple to understand
- No training time
- Works well for small datasets
Disadvantages of KNN
- Slow for large datasets
- Sensitive to irrelevant features
- Needs feature scaling (important!)
Always normalize data using:
- Min-Max Scaling
- Standardization
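Both scaling methods can be written directly from their definitions. A short sketch, with hypothetical income values to show why scaling matters (raw incomes would otherwise dwarf a feature like age in the distance calculation):

```python
def min_max_scale(values):
    """Min-Max Scaling: rescale to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center to mean 0 and scale to standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20_000, 45_000, 50_000, 60_000]  # hypothetical raw incomes
print(min_max_scale(incomes))  # [0.0, 0.625, 0.75, 1.0]
```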
Applications of the KNN Algorithm
Here are some real-life applications of the KNN algorithm.
- Recommendation Systems: Many recommendation systems, such as those used by Netflix or Amazon, rely on KNN to suggest products or content. KNN looks at user behavior and finds similar users. If user A and user B have similar preferences, KNN might recommend movies that user A liked to user B.
- Spam Detection: KNN is widely used in filtering spam emails. By comparing the features of a new email with those of previously labeled spam and non-spam emails, KNN can predict whether a new email is spam or not.
- Customer Segmentation: In marketing firms, KNN is used to segment customers based on their purchasing behavior. By comparing new customers to existing customers, KNN can group customers into segments with similar choices and preferences. This helps businesses target the right customers with the right products or advertisements.
- Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken words into text. The algorithm compares the features of the spoken input with those of known speech patterns, then predicts the most likely word or command based on the closest matches.
Uses of KNN:
1. Classification Problems: KNN is widely used for classifying data points into discrete categories, such as:
   - Handwriting recognition
   - Image classification
   - Spam email detection
2. Regression Problems: KNN can be used to predict continuous values, for instance:
   - Predicting house prices
   - Forecasting stock prices
3. Anomaly Detection: KNN can help in detecting outliers by comparing a data point to its nearest neighbors.
4. Recommendation Systems: KNN can be used in collaborative filtering for recommending products or services based on the similarity between users or items.