1.11 Underfitting and Overfitting in Machine Learning
In machine learning, a model is considered good when it:
- Learns patterns from the training data.
- Performs well on new, unseen data.
- Does not simply memorize the training data.
- Does not ignore important patterns.
To check this, we compare performance on:
- Training data
- Validation or test data
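The train/test comparison above can be sketched in plain Python. This is a minimal holdout-split sketch; the dataset, split ratio, and function name are illustrative assumptions, not a standard API:

```python
import random

def train_test_split(data, test_ratio=0.25, seed=42):
    """Shuffle the data and hold out a fraction of it for testing."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Illustrative dataset: 20 (x, y) pairs from a simple linear rule.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))  # 15 training examples, 5 test examples
```

The model is fit only on `train`; the held-out `test` points stand in for "new, unseen data" when measuring performance.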
Two major problems that affect performance are underfitting and overfitting. These are closely related to bias and variance.
Bias
Bias is the error caused when a model is too simple to understand the real pattern in the data.
- High bias means the model makes strong assumptions.
- It ignores important relationships.
- It leads to underfitting.
Examples:
- Using a straight line (linear regression) to model data that actually follows a curve.
- Assuming all birds can fly: the model ignores birds like the ostrich and the penguin.
Result:
- Poor performance on training data.
- Poor performance on test data.
High bias = underfitting (typically paired with low variance).
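As a minimal sketch of high bias (the data points here are made up for illustration): a model that always predicts the training mean ignores the curve entirely, so its error stays large on both the training and the test data:

```python
# Curved data: y = x^2, split into training and test points.
train_x = [0, 1, 2, 3, 4, 5]
test_x = [6, 7, 8]
train_y = [x * x for x in train_x]
test_y = [x * x for x in test_x]

# A maximally simple (high-bias) model: always predict the training mean.
mean_prediction = sum(train_y) / len(train_y)

def mse(ys, prediction):
    """Mean squared error of a single constant prediction."""
    return sum((y - prediction) ** 2 for y in ys) / len(ys)

print(mse(train_y, mean_prediction))  # large training error
print(mse(test_y, mean_prediction))   # large test error too
```

Both errors are large because the model's assumption (a flat constant) cannot follow the quadratic pattern: the hallmark of underfitting.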
Variance
Variance is the error caused when a model learns too much from the training data, including noise.
- High variance means the model is too sensitive to the training data.
- It captures noise instead of the real pattern.
- It leads to overfitting.
Example: Fitting a very complex curve that passes through every training point.
Result:
- Very high accuracy on training data.
- Poor performance on test data.
High variance = overfitting (typically paired with low bias).
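A quick sketch of high variance, using a 1-nearest-neighbour "memorizer" (the noisy dataset here is a made-up illustration): it reproduces the training labels perfectly, noise included, and that memorized noise hurts it on new points:

```python
# True rule: label is 1 when x >= 5, else 0. The training set contains one
# noisy point (x=3 mislabelled as 1), which a memorizer faithfully learns.
train = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0),
         (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
test = [(2.6, 0), (3.4, 0), (5.5, 1), (8.5, 1)]

def predict_1nn(x):
    """Predict the label of the single closest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

def accuracy(points):
    return sum(predict_1nn(x) == y for x, y in points) / len(points)

print(accuracy(train))  # 1.0 -- every training point is memorized, noise included
print(accuracy(test))   # 0.5 -- the memorized noise misclassifies nearby test points
```

Training accuracy is perfect, but the test points near the noisy x=3 example inherit its wrong label: high training accuracy with poor test accuracy, the hallmark of overfitting.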
Underfitting
- Underfitting occurs when the model performs poorly on both the training data and new data.
- It happens when the model is too simple to capture the actual pattern in the data.
- An underfitting model cannot learn the underlying relationship in the data.
- Imagine data points forming a curve, but the model draws only a straight line: the line does not follow the pattern.
Example: The student didn't study enough and doesn't understand the basic formulas.
Characteristics
- High bias
- Low variance
- Poor training accuracy
- Poor testing accuracy
Reasons for Underfitting
- Model is too simple.
- Important features are missing.
- Very small training dataset.
- Too much regularization.
- Features are not properly scaled.
Note: An underfitting model has high bias and low variance.
Overfitting
- Overfitting occurs when the model performs well on the training data but poorly on new data.
- It happens when the model learns too much from the training data, including noise and outliers.
- An overfitting model memorizes the training data instead of learning general patterns.
Example: The student memorized the exact practice problems from the textbook but can't solve a slightly different problem on the actual test.
Characteristics
- Low bias
- High variance
- Very high training accuracy
- Low testing accuracy
Reasons for Overfitting
- Model is too complex.
- Too many features.
- Small training dataset.
- No regularization.
- Noise in training data.
Note: An overfitting model has low bias and high variance.
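The two notes above suggest a simple diagnostic: compare training error against validation error. A sketch of that check follows; the threshold values are illustrative assumptions, not fixed rules:

```python
def diagnose(train_error, val_error, high_error=0.2, gap=0.1):
    """Rough fit diagnosis from error rates (thresholds are illustrative)."""
    if train_error > high_error:
        # Poor even on data the model has seen: high bias.
        return "underfitting"
    if val_error - train_error > gap:
        # Fine on seen data, much worse on unseen data: high variance.
        return "overfitting"
    return "good fit"

print(diagnose(0.30, 0.32))  # underfitting: bad everywhere
print(diagnose(0.02, 0.25))  # overfitting: large train/validation gap
print(diagnose(0.05, 0.08))  # good fit: low error, small gap
```

In practice the sensible thresholds depend on the task and the noise level in the data, but the shape of the check (absolute error for bias, train-to-validation gap for variance) follows directly from the two notes above.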
- Underfitting: a straight line trying to fit a curved dataset; it cannot capture the data's patterns, leading to poor performance on both training and test sets.
- Overfitting: a squiggly curve passing through all training points; it fails to generalize, performing well on training data but poorly on test data.
- Appropriate fitting: a curve that follows the data trend without overcomplicating, capturing the true patterns in the data.
Bias–Variance Tradeoff: Target Board Analogy
- The center of the target = the true value (correct prediction)
- The dots = model predictions
- Bias = how far predictions are from the center
- Variance = how spread out the predictions are
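The target-board quantities can be computed directly (the sample "shots" below are made up for illustration): bias is how far the average prediction lands from the centre, and variance is how spread out the predictions are around their own average:

```python
# Treat the bullseye as the true value and each shot as a model prediction.
true_value = 10.0
predictions = [9.0, 11.0, 9.5, 10.5]  # illustrative "shots" on the target

mean_pred = sum(predictions) / len(predictions)
# Bias: distance of the average shot from the centre of the target.
bias = abs(mean_pred - true_value)
# Variance: average squared spread of the shots around their own mean.
variance = sum((p - mean_pred) ** 2 for p in predictions) / len(predictions)

print(bias, variance)  # 0.0 0.625 -- centred on average, with some spread
```

Here the shots average out exactly on the bullseye (zero bias) but are scattered around it (non-zero variance), matching the "low bias, high-ish variance" picture on the board.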
1) Low Bias – Low Variance
- Predictions are close to the true value and tightly grouped.
- Model is accurate and consistent.
Performance:
- Good training accuracy.
- Good test accuracy.
- Best-case scenario.
2) Low Bias – High Variance
- Predictions are around the true value on average, but they are widely spread.
- Model understands the pattern, but it changes a lot with small changes in the data.
Performance:
- Very high training accuracy.
- Poor test accuracy.
- Model is unstable.
This is overfitting. The model is too sensitive to the training data.
3) High Bias – Low Variance
- Predictions are tightly grouped, but far from the true value.
- Model is consistent, but consistently wrong.
Performance:
- Low training accuracy.
- Low test accuracy.
This is underfitting. The model is too simple to capture the pattern.
4) High Bias – High Variance
- Predictions are far from the center and are also widely spread.
- Model is inaccurate and inconsistent.
Performance:
- Very poor training accuracy.
- Very poor test accuracy.
This is the worst situation: the model is both too simple and unstable.