Using our wildest imagination, we can picture a dataset consisting of features X and labels Y, as on the left. Also imagine that we’d like to generalize this relationship to additional values of X - that we’d like to predict future values based on what we’ve already seen before.
With our imagination now undoubtedly spent, we can take a very simple approach to modeling the relationship between X and Y by just drawing a line through the general trend of the data.
A Simple Model
Our simple model isn’t the best at modeling the relationship - clearly there's information in the data that it's failing to capture.
We'll measure the performance of our model by looking at the mean-squared error between its output and the true values (displayed in the bottom bar chart). Our model is close to some of the training points, but overall there's definitely room for improvement.
The error on the training data is important for model tuning, but what we really care about is how it performs on data we haven't seen before, called test data. So let's check that out as well.
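If you'd like to follow along in code, here's a minimal sketch of this train/test comparison. The sine-shaped synthetic data, the noise level, and the 50/50 split are assumptions for illustration - they aren't the dataset behind the charts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: a noisy, nonlinear relationship between X and Y.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Our simple model: a straight line drawn through the general trend.
simple_model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, simple_model.predict(X_train))
test_mse = mean_squared_error(y_test, simple_model.predict(X_test))
print(f"train MSE: {train_mse:.3f}")
print(f"test MSE:  {test_mse:.3f}")
```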
Low Complexity & Underfitting
Uh-oh, it looks like our earlier suspicions were correct - our model is garbage. The test error is even higher than the train error!
In this case, we say that our model is underfitting the data: our model is so simple that it fails to adequately capture the relationships in the data. The high test error is a direct result of the lack of complexity of our model.
An underfit model is one that is too simple to accurately capture the relationships between its features X and label Y.
A Complex Model
Our previous model performed poorly because it was too simple. Let's try our luck with something more complex. In fact, let's get as complex as we can - let's train a model that predicts every point in our training data perfectly.
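Continuing the hypothetical sketch from above, one easy way to predict every training point perfectly is a model that simply memorizes the data. Here an unpruned decision tree stands in for "as complex as we can" - the interactive demo may well use a different model.

```python
from sklearn.tree import DecisionTreeRegressor

# An unpruned tree keeps splitting until (nearly) every training point sits
# in its own leaf, so it reproduces the training labels exactly.
complex_model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, complex_model.predict(X_train))
print(f"train MSE: {train_mse:.3f}")  # effectively zero - we'll look at the test error next
```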
Great! Now our training error is zero. As the old saying goes in Tennessee: Fool me once - shame on you. Fool me twice - er... you can't get fooled again ;).
High Complexity & Overfitting
Wait a second... Even though our training error from our model was effectively zero, the error on our test data is high. What gives?
Unsurprisingly, our model is too complicated. We say that it overfits the data. Instead of learning the true trends underlying our dataset, it memorized noise and, as a result, the model is not generalizable to datasets beyond its training data.
Overfitting refers to the case when a model is so specific to the data on which it was trained that it is no longer applicable to different datasets.
In situations where your training error is low but your test error is high, you've likely overfit your model.
Test Error Decomposition
Our test error can come as a result of both under- and over-fitting our data, but how do the two relate to each other?
In the general case, mean-squared error can be decomposed into three components: error due to bias, error due to variance, and error due to noise.
Or, mathematically:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$

We can't do much about the irreducible noise term, $\sigma^2$, but we can make use of the relationship between bias and variance to obtain better predictions.
Bias
Bias represents the difference between the average prediction and the true value:

$$\text{Bias} = \mathbb{E}\big[\hat{f}(x)\big] - f(x)$$
The $\mathbb{E}[\hat{f}(x)]$ term is a tricky one. It refers to the average prediction after the model has been trained over several independent datasets. We can think of the bias as measuring a systematic error in prediction.
These different model realizations are shown in the top chart, while the error decomposition (for each point of data) is shown in the bottom chart.
For underfit (low-complexity) models, the majority of our error comes from bias.
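Here's a rough sketch of how that expectation can be estimated in code: train the same simple model on many independently drawn datasets, average its predictions, and compare the average to the true function. The sine-shaped true function and the noise level are, again, our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = np.sin                                    # assumed "true" relationship
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)

# Train a low-complexity (underfit) model on many independent datasets.
predictions = []
for _ in range(200):
    X = rng.uniform(0, 10, size=(100, 1))
    y = f(X).ravel() + rng.normal(scale=0.3, size=100)
    predictions.append(LinearRegression().fit(X, y).predict(x_grid))

avg_prediction = np.mean(predictions, axis=0)   # estimate of E[f̂(x)]
bias = avg_prediction - f(x_grid).ravel()       # systematic error at each point
print(f"mean squared bias: {np.mean(bias ** 2):.3f}")
```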
Variance
As with bias, the notion of variance also relates to different realizations of our model. Specifically, variance measures how much, on average, predictions vary for a given data point:

$$\text{Variance} = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]$$
As you can see in the bottom plot, predictions from overfit (high-complexity) models show a lot more error from variance than from bias. It’s easy to imagine that any unseen data points will be predicted with high error.
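A matching sketch for variance, under the same illustrative assumptions: train a high-complexity model - again an unpruned decision tree standing in for "overfit" - on many independent datasets and measure how much its predictions spread out at each point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
f = np.sin                                    # same assumed "true" relationship
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)

# Train a high-complexity (overfit) model on many independent datasets.
predictions = []
for _ in range(200):
    X = rng.uniform(0, 10, size=(100, 1))
    y = f(X).ravel() + rng.normal(scale=0.3, size=100)
    predictions.append(DecisionTreeRegressor().fit(X, y).predict(x_grid))

variance = np.var(predictions, axis=0)        # Var[f̂(x)] at each grid point
print(f"average variance: {np.mean(variance):.3f}")
```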
Finding A Balance
To obtain our best results, we should work to find a happy medium between a model that is so basic it fails to learn meaningful patterns in our data, and one that is so complex it fails to generalize to unseen data.
In other words, we don't want an underfit model, but we don't want an overfit model either. We want something in between - something with enough complexity to learn the generalizable patterns in our data.
By trading some bias for variance (i.e. increasing the complexity of our model), and without going overboard, we can find a balanced model for our dataset.
Across Complexities
We just showed, at different levels of complexity, a sample of model realizations alongside their corresponding prediction error decompositions.
Let’s direct our focus to the error decompositions across model complexities.
For each level of complexity, we'll aggregate the error decomposition across all data points and plot the aggregated errors at that level of complexity.
This aggregation applied to our balanced model (i.e. the middle level of complexity) is shown to the left.
The Bias Variance Trade-off
Repeating this aggregation across our range of model complexities, we can see that the prediction error manifests itself as a U-shaped curve detailing the trade-off between bias and variance.
When a model is too simple (i.e. small values along the x-axis), it ignores useful information, and its error is dominated by bias.
When a model is too complex (i.e. large values along the x-axis), it memorizes non-general patterns, and its error is dominated by variance.
The ideal model aims to minimize both bias and variance. It lies in the sweet spot - not too simple, nor too complex. Achieving such a balance will yield the minimum error.
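To close the loop in code, here's a rough sketch of that U-shaped curve: we sweep polynomial degree as a stand-in for model complexity and aggregate the estimated bias² and variance at each level. The data-generating assumptions are the same illustrative ones as before.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
noise_sd = 0.3
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)

def f(x):
    return np.sin(2 * np.pi * x)              # assumed "true" relationship

for degree in [1, 3, 5, 9, 15]:               # model complexity, low to high
    predictions = []
    for _ in range(100):                      # many independent training sets
        X = rng.uniform(0, 1, size=(40, 1))
        y = f(X).ravel() + rng.normal(scale=noise_sd, size=40)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        predictions.append(model.fit(X, y).predict(x_grid))
    predictions = np.array(predictions)

    # Aggregate the decomposition across all grid points.
    bias_sq = np.mean((predictions.mean(axis=0) - f(x_grid).ravel()) ** 2)
    variance = np.mean(predictions.var(axis=0))
    total = bias_sq + variance + noise_sd ** 2
    print(f"degree {degree:2d}:  bias^2={bias_sq:.3f}  variance={variance:.3f}  total={total:.3f}")
```

The printed totals trace out the same story as the chart: high bias at low degrees, high variance at high degrees, and a minimum somewhere in between.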