2.11 Evaluating Model Performance (Machine Learning)
- Evaluating model performance is an important step in machine learning during model development and testing.
- It uses evaluation metrics to measure how well a model performs on data.
- Model evaluation helps answer questions such as:
1. Did the model learn meaningful patterns from the data?
2. Will the model perform well on new, unseen data?
3. Is the model overfitting or underfitting?
- Machine learning models must be tested carefully because a model that works well on training data may fail on new data.
Proper evaluation of a machine learning model helps in:
- Measuring the accuracy and reliability of predictions
- Avoiding overfitting (the model memorizes the training data)
- Avoiding underfitting (the model fails to learn patterns)
- Comparing multiple models to choose the best one
- Tuning hyperparameters to improve performance
Methods for Evaluating Model Performance
To evaluate machine learning models and detect overfitting, two methods are commonly used:
- Hold-Out Method
- Cross-Validation
1. Hold-Out Method
- The Hold-Out method is the simplest technique used to evaluate machine learning models.
- In this method, the dataset is split into two parts:
- Training dataset – used to train the model
- Testing dataset – used to evaluate the model
Usually, a larger portion is used for training and a smaller portion for testing.
Example
Suppose we have 1000 data records.
- Training data = 800 records
- Testing data = 200 records
Steps:
- Train the model using the 800 records
- Test the model using the 200 records
- Measure the model performance using evaluation metrics
Advantages
- Simple to implement
- Fast evaluation
Disadvantages
- Performance depends on how the data is split
- Results may vary if the split changes
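The 800/200 hold-out split described above can be sketched in plain Python (no ML library assumed; the integer records below are placeholders for real data):

```python
import random

# 1000 placeholder records; in practice each would be a (features, label) pair
records = list(range(1000))

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)  # shuffle first so the split is not biased by ordering

train = records[:800]    # 80% used for training
test = records[800:]     # 20% held out for testing

print(len(train), len(test))  # 800 200
```

Because the result depends on the shuffle, changing the seed changes which records land in the test set — exactly the split-dependence disadvantage noted above.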
2. Cross-Validation
- Cross-Validation is a more reliable evaluation method where the dataset is split multiple times to test the model.
- The most common type is K-Fold Cross-Validation.
K-Fold Cross-Validation:
In this method:
- The dataset is divided into K equal parts (folds).
- The model is trained K times.
- Each time, one fold is used for testing and the remaining K-1 folds are used for training.
- Finally, the average performance across all K runs is calculated.
Example (5-Fold Cross-Validation)
The dataset is divided into 5 folds; in each of the 5 runs, one fold is held out for testing and the remaining 4 are used for training.
The final accuracy = average of all 5 results.
Advantages
- Uses the entire dataset efficiently
- Gives more reliable performance results
- Reduces bias
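The K-fold procedure can be sketched in plain Python. Here `k_fold_splits` is a hypothetical helper (not from any library), and the per-fold accuracies are made-up numbers used only to show the averaging step:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# 5-fold split of 1000 records: each run trains on 800 and tests on 200
for train_idx, test_idx in k_fold_splits(1000, 5):
    print(len(train_idx), len(test_idx))  # 800 200 (printed once per fold)

# hypothetical per-fold accuracies; the final score is their average
fold_accuracies = [0.82, 0.79, 0.85, 0.80, 0.84]
final_accuracy = round(sum(fold_accuracies) / len(fold_accuracies), 4)
print(final_accuracy)  # 0.82
```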
Classification Model Evaluation Methods
Classification is used to categorize data into predefined classes or labels.
Examples:
- Email → Spam / Not Spam
- Image → Cat / Dog
- Loan → Approved / Rejected
To evaluate classification models, we use several metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
1. Accuracy
- Accuracy measures the percentage of correct predictions made by the model.
Formula
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
- TP = True Positive
- TN = True Negative
- FP = False Positive
- FN = False Negative
Example
Suppose a model tested 100 emails:
- Correct spam predictions = 40
- Correct non-spam predictions = 50
Total correct predictions = 90
Accuracy = 90 / 100 = 90%
Limitation
Accuracy does not work well with imbalanced datasets.
Example:
If 95% of emails are non-spam, a model that always predicts non-spam still achieves 95% accuracy, but the model is useless.
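The formula and both examples above translate directly to code. The 90-correct example does not say how the 10 errors split between FP and FN, so the 5/5 split below is an assumption (accuracy only needs the total number of errors):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 100 emails, 90 correct; the 5 FP / 5 FN split of the 10 errors is assumed
print(accuracy(tp=40, tn=50, fp=5, fn=5))   # 0.9

# imbalanced case: always predicting "non-spam" on 95 non-spam / 5 spam emails
print(accuracy(tp=0, tn=95, fp=0, fn=5))    # 0.95 -- yet the model is useless
```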
2. Precision
- Precision measures how many predicted positive cases are actually positive.
Formula
Precision = TP / (TP + FP)
Example
Suppose a model predicts 50 emails as spam.
Out of those:
- 40 are actually spam (TP)
- 10 are not spam (FP)
Precision = 40 / (40 + 10)
Precision = 0.80 (80%)
Interpretation
Out of all predicted spam emails, 80% are correct.
Limitation
Precision does not consider False Negatives.
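A minimal sketch of the precision formula, using the spam numbers above:

```python
def precision(tp, fp):
    """Of all predicted positives, the fraction that are truly positive."""
    return tp / (tp + fp)

# 50 emails predicted as spam: 40 truly spam (TP), 10 not spam (FP)
print(precision(tp=40, fp=10))  # 0.8
```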
3. Recall
- Recall measures how many actual positive cases the model correctly identifies.
Formula
Recall = TP / (TP + FN)
Example
Suppose there are 60 actual spam emails.
The model detects 40 of them (TP = 40) and misses 20 (FN = 20).
Recall = 40 / (40 + 20)
Recall ≈ 0.67 (67%)
Interpretation
The model detects 67% of all spam emails.
Limitation
High recall may produce more false positives.
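The same spam example, expressed as code:

```python
def recall(tp, fn):
    """Of all actual positives, the fraction the model found."""
    return tp / (tp + fn)

# 60 actual spam emails: 40 detected (TP), 20 missed (FN)
print(round(recall(tp=40, fn=20), 2))  # 0.67
```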
4. F1 Score
- F1 Score is the harmonic mean of Precision and Recall.
- It balances both metrics.
Formula
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Example
Precision = 0.80
Recall = 0.67
F1 Score = 2 × (0.80 × 0.67) / (0.80 + 0.67)
F1 Score ≈ 0.73
It is useful when:
- The dataset is imbalanced
- Both precision and recall are important
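The harmonic-mean calculation above, as a small function using the running example's precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# precision and recall from the spam example
print(round(f1_score(0.80, 0.67), 2))  # 0.73
```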
5. Confusion Matrix
- A Confusion Matrix is a table used to evaluate classification models.
- It shows the number of correct and incorrect predictions.
- For binary classification, it is a 2 × 2 matrix.
True Positive (TP)
Model predicts Yes and the actual value is Yes.
Example:
Model predicts Dog and the image is actually Dog.
True Negative (TN)
Model predicts No and the actual value is No.
Example:
Model predicts Not Dog and the image is actually Not Dog.
False Positive (FP)
Model predicts Yes, but the actual value is No.
Also called Type I Error.
Example:
Model predicts Dog, but the image is Not Dog.
False Negative (FN)
Model predicts No, but the actual value is Yes.
Also called Type II Error.
Example:
Model predicts Not Dog, but the image is actually Dog.
Example: Dog Image Classification
Example counts:
TP = 30
TN = 50
FP = 10
FN = 10
These values are used to calculate accuracy, precision, recall, and F1 score.
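The example counts above can be plugged into all four metrics at once — a plain-Python sketch of the formulas defined earlier in this section:

```python
# confusion-matrix counts from the dog-image example
tp, tn, fp, fn = 30, 50, 10, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)           # (30 + 50) / 100
precision = tp / (tp + fp)                           # 30 / 40
recall = tp / (tp + fn)                              # 30 / 40
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```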