MLU-explAIn

The Bias Variance Tradeoff

Jared Wilber & Brent Werness, January 2021

Prediction errors can be decomposed into two main subcomponents of interest: error from bias, and error from variance. The tradeoff between a model's ability to minimize bias and variance is foundational to training machine learning models, so it's worth taking the time to understand the concept.

Using our wildest imagination, we can picture a dataset consisting of features X and labels Y, as on the left. Also imagine that we’d like to generalize this relationship to additional values of X - that we’d like to predict future values based on what we’ve already seen before.

With our imagination now undoubtedly spent, we can take a very simple approach to modeling the relationship between X and Y by just drawing a line through the general trend of the data.

A Simple Model

Our simple model isn’t the best at modeling the relationship - clearly there's information in the data that it's failing to capture.

We'll measure the performance of our model by looking at the mean-squared error between its output and the true values (displayed in the bottom bar chart). Our model is close to some of the training points, but overall there's definitely room for improvement.

The error on the training data is important for model tuning, but what we really care about is how it performs on data we haven't seen before, called test data. So let's check that out as well.

Low Complexity & Underfitting

Uh-oh, it looks like our earlier suspicions were correct - our model is garbage. The test error is even higher than the train error!

In this case, we say that our model is underfitting the data: our model is so simple that it fails to adequately capture the relationships in the data. The high test error is a direct result of the lack of complexity of our model.

An underfit model is one that is too simple to accurately capture the relationships between its features X and label Y.
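The train and test measurements described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the article's charts: the sine-shaped ground truth `true_f`, the noise level, and the sample sizes are all assumptions made up for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def make_data(n, noise_sd=0.3):
    # Noisy samples of the assumed relationship on [0, 1].
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_f(x) + rng.normal(0.0, noise_sd, size=n)
    return x, y

def mse(y_true, y_pred):
    # Mean-squared error between labels and predictions.
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

# A degree-1 polynomial is a straight line fit by least squares.
line = Polynomial.fit(x_train, y_train, deg=1)

train_mse = mse(y_train, line(x_train))
test_mse = mse(y_test, line(x_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

Under these assumptions the noise variance is 0.3² = 0.09, so errors sitting well above that level on both train and test data are the signature of a model that is too simple for the curve it is trying to follow.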

A Complex Model

Our previous model performed poorly because it was too simple. Let's try our luck with something more complex. In fact, let's get as complex as we can - let's train a model that predicts every point in our training data perfectly.

Great! Now our training error is zero. As the old saying goes in Tennessee: Fool me once - shame on you. Fool me twice - er... you can't get fooled again ;).

High Complexity & Overfitting

Wait a second... Even though our training error from our model was effectively zero, the error on our test data is high. What gives?

Unsurprisingly, our model is too complicated. We say that it overfits the data. Instead of learning the true trends underlying our dataset, it memorized noise and, as a result, the model is not generalizable to datasets beyond its training data.

Overfitting refers to the case when a model is so specific to the data on which it was trained that it is no longer applicable to different datasets.

In situations where your training error is low but your test error is high, you've likely overfit your model.
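The same symptom can be reproduced in a small sketch. Here the "as complex as we can" model is assumed to be a polynomial of degree n-1 through n training points, which is one convenient way to hit every training point exactly; the article's own model may differ, and the data-generating function is again hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

n_train = 10
x_train = np.linspace(0.0, 1.0, n_train)
y_train = true_f(x_train) + rng.normal(0.0, 0.3, size=n_train)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_f(x_test) + rng.normal(0.0, 0.3, size=200)

# A degree n-1 polynomial has enough freedom to pass through
# all n training points, noise included.
wiggly = Polynomial.fit(x_train, y_train, deg=n_train - 1)

train_mse = float(np.mean((y_train - wiggly(x_train)) ** 2))
test_mse = float(np.mean((y_test - wiggly(x_test)) ** 2))
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.3f}")
```

The training error is zero up to floating-point rounding, while the test error is not: the curve has memorized the noise in ten specific points rather than the trend behind them.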

Test Error Decomposition

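Our test error can come as a result of both under- and over-fitting our data, but how do the two relate to each other?

In the general case, mean-squared error can be decomposed into three components: error due to bias, error due to variance, and error due to noise.

Or, mathematically, writing $f(x)$ for the true value, $\hat{f}(x)$ for the model's prediction, and $\sigma^2_\epsilon$ for the variance of the noise:

$$\text{Error}(x) = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2_\epsilon}_{\text{Irreducible error}}$$

We can’t do much about the irreducible noise term, but we can make use of the relationship between bias and variance to obtain better predictions.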
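For readers who want to see where the three components come from, here is a short derivation sketch. It assumes the standard setup (not spelled out in the article) in which labels are generated as $y = f(x) + \epsilon$ with zero-mean noise of variance $\sigma^2_\epsilon$ that is independent of the training data, and it writes $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$ as shorthand for the average prediction over training sets.

```latex
\begin{align*}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \hat{f}(x) + \epsilon)^2\big] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big]
     + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\big[f(x) - \hat{f}(x)\big]
     + \mathbb{E}[\epsilon^2] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2_\epsilon \\[6pt]
\mathbb{E}\big[(f(x) - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2\big] \\
  &= (\bar{f}(x) - f(x))^2
     + 2\,(f(x) - \bar{f}(x))\,\mathbb{E}\big[\bar{f}(x) - \hat{f}(x)\big]
     + \mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big] \\
  &= \underbrace{(\bar{f}(x) - f(x))^2}_{\text{Bias}^2}
     + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \bar{f}(x))^2\big]}_{\text{Variance}}
\end{align*}
```

Both cross terms vanish: the first because $\mathbb{E}[\epsilon] = 0$ and the noise is independent of the fitted model, the second because $\mathbb{E}[\bar{f}(x) - \hat{f}(x)] = 0$ by the definition of $\bar{f}(x)$.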

Bias

Bias represents the difference between the average prediction and the true value:

$$\text{Bias}(x) = \mathbb{E}[\hat{f}(x)] - f(x)$$

The term $\mathbb{E}[\hat{f}(x)]$ is a tricky one. It refers to the average prediction after the model has been trained over several independent datasets. We can think of the bias as measuring a systematic error in prediction.

These different model realizations are shown in the top chart, while the error decomposition (for each point of data) is shown in the bottom chart.

For underfit (low-complexity) models, the majority of our error comes from bias.

Variance

As with bias, the notion of variance also relates to different realizations of our model. Specifically, variance measures how much, on average, predictions vary for a given data point:

$$\text{Variance}(x) = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]$$

As you can see in the bottom plot, predictions from overfit (high-complexity) models show a lot more error from variance than from bias. It’s easy to imagine that any unseen data points will be predicted with high error.
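Because bias and variance are both defined over many realizations of a model, they can be estimated by simulation: retrain on many independent datasets, then compare the average prediction to the truth (bias) and measure the spread around that average (variance). The sketch below does this for a simple and a complex polynomial. The ground truth `true_f`, the noise level, and the polynomial model family are assumptions for illustration only.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=200, n_train=20, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance, averaged over a grid of x values."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_eval = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Each iteration is one "model realization": same x values, fresh noise.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        preds[i] = Polynomial.fit(x_train, y_train, deg=degree)(x_eval)
    mean_pred = preds.mean(axis=0)                        # average prediction
    bias_sq = np.mean((mean_pred - true_f(x_eval)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                 # spread across realizations
    return float(bias_sq), float(variance)

simple_bias_sq, simple_var = bias_variance(degree=1)
complex_bias_sq, complex_var = bias_variance(degree=10)
print(f"degree 1:  bias^2={simple_bias_sq:.4f}  variance={simple_var:.4f}")
print(f"degree 10: bias^2={complex_bias_sq:.4f}  variance={complex_var:.4f}")
```

In this setup the straight line's error is dominated by bias, while the degree-10 polynomial's error is dominated by variance, matching the pattern described for underfit and overfit models.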

Finding A Balance

To obtain our best results, we should work to find a happy medium between a model that is so basic it fails to learn meaningful patterns in our data, and one that is so complex it fails to generalize to unseen data.

In other words, we don’t want an underfit model, but we don’t want an overfit model either. We want something in between - something with enough complexity to learn the generalizable patterns in our data.

By trading some bias for variance (i.e. increasing the complexity of our model), and without going overboard, we can find a balanced model for our dataset.

Across Complexities

We just showed, at different levels of complexity, a sample of model realizations alongside their corresponding prediction error decompositions.

Let’s direct our focus to the error decompositions across model complexities.

For each level of complexity, we’ll aggregate the error decomposition across all data-points, and plot the aggregate errors at their level of complexity.

This aggregation applied to our balanced model (i.e. the middle level of complexity) is shown to the left.

The Bias Variance Trade-off

Repeating this aggregation across our range of model complexities, we can see the relationship between bias and variance in prediction errors manifests itself as a U-shaped curve detailing the trade off between bias and variance.

When a model is too simple (i.e. small values along the x-axis), it ignores useful information, and the error is composed mostly of that from bias.

When a model is too complex (i.e. large values along the x-axis), it memorizes non-general patterns, and the error is composed mostly of that from variance.

The ideal model aims to minimize both bias and variance. It lies in the sweet spot - not too simple, nor too complex. Achieving such a balance will yield the minimum error.

Many models have parameters that change the final learned models, called hyperparameters. Let's look at how these hyperparameters may be used to control the bias-variance tradeoff with two examples: LOESS Regression and K-Nearest Neighbors.
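The aggregation across complexities can be sketched as a simple sweep. Here polynomial degree stands in for "model complexity", which is an assumption of this example rather than the article's exact setup, and the data-generating function and noise level are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

NOISE_SD = 0.3  # assumed noise level; its square is the irreducible error

def true_f(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def decompose(degree, n_datasets=200, n_train=20, seed=0):
    """Return (bias^2, variance) aggregated over a grid of x values."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_eval = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, NOISE_SD, n_train)
        preds[i] = Polynomial.fit(x_train, y_train, deg=degree)(x_eval)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias_sq), float(variance)

total_error = {}
for degree in range(1, 11):
    bias_sq, variance = decompose(degree)
    total_error[degree] = bias_sq + variance + NOISE_SD ** 2
    print(f"degree {degree:2d}: bias^2={bias_sq:.4f}  "
          f"variance={variance:.4f}  total={total_error[degree]:.4f}")

# The sweet spot is the complexity with the smallest aggregate error.
best_degree = min(total_error, key=total_error.get)
print("lowest total error at degree", best_degree)
```

Reading the printed rows from top to bottom traces the U-shape: bias falls quickly as the degree grows, variance climbs steadily, and the lowest total sits at an intermediate degree rather than at either extreme.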

LOESS Regression

LOESS (LOcally Estimated Scatterplot Smoothing) regression is a nonparametric technique for fitting a smooth surface between an outcome and some predictor variables. The curve fitting at a given point is weighted by nearby data. This weighting is governed by a smoothing hyperparameter, which represents the proportion of neighboring data used to calculate each local fit.

Thus, the bias variance tradeoff for LOESS may be controlled via the smoothness parameter. When the smoothness is small, the amount of data we consider is insufficient for an accurate fit, resulting in large variance. However, if we make the smoothness too high (i.e. over-smoothed), we trade local information for global, resulting in large bias.

Below, a LOESS curve is fit to two variables. Randomize the training data to observe the effect different model realizations have on variance, and control the smoothness to observe the tradeoff between under- and over-fitting (and thus, bias and variance).
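To make the role of the smoothing hyperparameter concrete, here is a hand-rolled, simplified LOESS-style smoother: at each query point it fits a weighted straight line to the nearest `frac` proportion of the training data, using tricube weights. It is a sketch under assumptions (local linear fits, tricube kernel, no robustness iterations, made-up data), not the implementation used for the interactive chart.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def loess_predict(x_train, y_train, x_query, frac):
    """Local linear fit using the nearest `frac` proportion of training points."""
    k = max(3, int(np.ceil(frac * len(x_train))))
    preds = np.empty(len(x_query))
    for j, x0 in enumerate(x_query):
        dist = np.abs(x_train - x0)
        idx = np.argsort(dist)[:k]          # the k nearest neighbors of x0
        h = dist[idx].max()                 # local bandwidth
        w = (1 - (dist[idx] / h) ** 3) ** 3 if h > 0 else np.ones(k)  # tricube
        # Weighted least squares for intercept + slope, centered at x0,
        # so the intercept is the prediction at x0.
        design = np.column_stack([np.ones(k), x_train[idx] - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y_train[idx] * sw, rcond=None)
        preds[j] = beta[0]
    return preds

def bias_variance(frac, n_datasets=100, n_train=50, noise_sd=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_query = np.linspace(0.1, 0.9, 17)
    preds = np.empty((n_datasets, x_query.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        preds[i] = loess_predict(x_train, y_train, x_query, frac)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_query)) ** 2)
    return float(bias_sq), float(np.mean(preds.var(axis=0)))

rough_bias_sq, rough_var = bias_variance(frac=0.1)    # very local fits
smooth_bias_sq, smooth_var = bias_variance(frac=1.0)  # every point in every fit
print(f"frac=0.1: bias^2={rough_bias_sq:.4f}  variance={rough_var:.4f}")
print(f"frac=1.0: bias^2={smooth_bias_sq:.4f}  variance={smooth_var:.4f}")
```

A small `frac` leaves each local fit with only a handful of points, so predictions swing from one dataset to the next (high variance). A `frac` of 1.0 lets distant points pull on every fit, flattening the curve (high bias).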

K-Nearest Neighbors

K-Nearest Neighbors (KNN) classification is a simple technique for assigning class membership to a data point by a majority vote of its K nearest neighbors. For example, when K = 1, the data point is simply assigned to the class of that single nearest neighbor. If K = 69, the data point is assigned to the majority class of its 69 nearest neighbors.

We can observe the bias variance tradeoff in KNN directly by playing with the hyperparameter K. When K is small, only a small number of neighbors are considered during the classification vote. The resulting islands and jagged boundaries are a result of high variance, as classifications are determined by very localized neighborhoods. For large values of K, we see very smoothed regions that deviate sharply from the true decision boundary - go too high and every point is simply assigned the overall majority class. This is high bias. On the other hand, for medium values of K, we see smooth regions that follow along the true decision boundary.

Explore the trade-off for yourself below! The plot on the left shows the training data. The plot on the right shows decision regions based on the current value of K. Deeper colors reflect more confidence in the classification. Hover over a point to see its classification to the right, and the K-nearest neighbors used for consideration to the left.
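For readers without the interactive plots, the effect of K can be sketched with a minimal NumPy KNN classifier. The two-class dataset, its wavy "true" boundary, and the 15% label noise are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_points(n, flip=0.15):
    # Two classes split by a wavy boundary, with some labels flipped as noise.
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = (X[:, 1] > 0.5 + 0.25 * np.sin(2 * np.pi * X[:, 0])).astype(int)
    flips = rng.random(n) < flip
    y[flips] = 1 - y[flips]
    return X, y

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (binary labels, odd k)."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbors
    votes = y_train[nearest]                  # their labels, shape (n_query, k)
    return (votes.mean(axis=1) > 0.5).astype(int)

X_train, y_train = make_points(201)
X_test, y_test = make_points(500)

train_acc, test_acc = {}, {}
for k in (1, 15, 201):
    train_acc[k] = float(np.mean(knn_predict(X_train, y_train, X_train, k) == y_train))
    test_acc[k] = float(np.mean(knn_predict(X_train, y_train, X_test, k) == y_test))
    print(f"K={k:3d}: train accuracy={train_acc[k]:.3f}  test accuracy={test_acc[k]:.3f}")

# With K equal to the training set size, every query gets the same vote.
all_neighbors_pred = knn_predict(X_train, y_train, X_test, 201)
```

With K = 1 each training point is its own nearest neighbor, so training accuracy is perfect even though the labels contain noise. With K = 201 every neighbor votes on every query, so the classifier collapses to a single class. A medium K sits between those extremes.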

What About Double Descent?

You may have heard about the phenomenon known as Double Descent, in which the classic U-shaped bias variance curve (seen above and in textbooks everywhere) is followed by a second dip (shown below). Such a phenomenon must nullify the bias variance tradeoff we just spent so long explaining, right?

Fret not, dear reader! As we will detail in our next series of articles, Double Descent actually supports the classical notion of the bias variance trade off. Stay tuned to learn more!

It's Finally Over

Exhales deeply. It's finally over. Thanks for reading! We hope that the article is insightful no matter where you are along your machine learning journey, and that you came away with a better understanding of the bias variance tradeoff.

To make things compact, we skipped over some relevant topics (such as regularization), but stay tuned for more MLU-Explain articles, where we plan to explain those, and other, topics related to machine learning.

To learn more about machine learning, check out our self-paced courses, our YouTube videos, and the Dive into Deep Learning textbook. If you have any comments/ideas/etc. related to MLU-Explain articles, feel free to reach out directly. The code is available here.

References + Open Source

This article is a product of the following resources + the awesome people who made (& contributed to) them:

Reconciling modern machine learning practice and the bias-variance trade-off (Mikhail Belkin; Daniel Hsu; Siyuan Ma; Soumik Mandal, 2019).
The Elements of Statistical Learning (Trevor Hastie; Robert Tibshirani; Jerome Friedman, 2009).
Dive into Deep Learning (Aston Zhang; Zachary C. Lipton; Mu Li; Alexander J. Smola, 2020).
Understanding The Bias-Variance Tradeoff (Scott Fortmann-Roe, 2012).
LOESS Curve Fitting (statsdirect.com).
D3.js (Mike Bostock; Philippe Rivière).
Rough Notation (Preet Shihn).
KaTeX (Emily Eisenberg; Sophie Alpert).
Scrollama (Russel Goldenberg).
d3-regression (Harry Stevens).