THE BIAS-VARIANCE TRADEOFF

 

This page gives some additional information about the bias-variance tradeoff.

 ______________________________________________

 

A model is a set of estimators

We'll use regression as an example. The data are assumed to have been generated by a process:

y = f(x1, x2, ..., xp) + ε

where f is deterministic, and ε is random with zero mean. A regression model y* = f*(x1, x2, ..., xp) is built from the sample. Let x0 be a point of the feature space. The hope is that f*(x0) is close to y0 = f(x0), the true value of the regression function at x0.

Because of the randomness of ε, that is, the randomness of the sample, the model depends on the actual sample used to build it. Another sample would have led to a different model, and therefore a different response at x0. So the response of a model at any point is a random variable.

In the terminology of statistics, such a regression model therefore places at each point x0 of the feature space a random variable that is an estimator of y0, the true value of the regression function at this point. This estimator is denoted by f*(y, x = x0), or f*0 for short.
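The idea that the response at x0 is a random variable can be made concrete by simulation. The sketch below is only an illustration under stated assumptions: the true function true_f, the noise level, the sample size and the straight-line model are all hypothetical choices, not part of the original discussion. It rebuilds the model on many independent samples and records the response at x0 each time.

```python
import random
import statistics

def true_f(x):
    # Hypothetical "true" regression function f (unknown in practice).
    return x * x

def draw_sample(rng, n=30, noise_sd=0.3):
    # One sample from y = f(x) + eps, with eps ~ N(0, noise_sd^2).
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Ordinary least squares for the model y* = a + b*x (closed form).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def responses_at(x0, n_samples=2000, seed=0):
    # Rebuild the model on many independent samples; record f*(x0) each time.
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        a, b = fit_line(*draw_sample(rng))
        out.append(a + b * x0)
    return out

responses = responses_at(x0=0.5)
print("true value y0     :", true_f(0.5))
print("mean of f*(x0)    :", round(statistics.fmean(responses), 3))
print("std dev of f*(x0) :", round(statistics.pstdev(responses), 3))
```

The spread of the recorded responses is the variability of the estimator f*0, and the gap between their mean and y0 reflects the fact that a straight line cannot reproduce the curved true_f.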

Bias-variance decomposition

Here we focus on the response error at x0 (a more general study can be conducted over the entire space, taking into account the unconditional probability distribution p(x)).

The estimator f*0 is good if its realizations are close to the true value y0 in a probabilistic sense, for instance if its Mean Square Error (MSE):

MSE = E[(f*0 - y0)²]

is small.

It is easily shown that:

MSE = Bias² + Variance

where "Bias" = E[f*0] - y0 and "Variance" = E[(f*0 - E[f*0])²] are those of the response of the model, considered as an estimator of y0.
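The decomposition can be derived in a few lines by adding and subtracting the mean response. The sketch below writes m for E[f*0] (a symbol introduced here only for convenience) and uses the fact that y0 is a fixed number, so that the bias is m - y0 and E[f*0 - m] = 0:

```latex
\begin{aligned}
\mathrm{MSE} &= E\big[(f^*_0 - y_0)^2\big] \\
  &= E\big[\big((f^*_0 - m) + (m - y_0)\big)^2\big] \\
  &= E\big[(f^*_0 - m)^2\big] + 2\,(m - y_0)\,E\big[f^*_0 - m\big] + (m - y_0)^2 \\
  &= \underbrace{E\big[(f^*_0 - m)^2\big]}_{\text{Variance}}
     + \underbrace{(m - y_0)^2}_{\text{Bias}^2}
\end{aligned}
```

The cross term vanishes because m - y0 is a constant and the deviation of f*0 from its own mean has zero expectation.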

So the errors made by a model have two origins: its bias and its variance.

Models come in families

A data set never makes the choice of a particular model architecture obvious. The analyst will always consider several candidate models, and the goal is of course to select the model with the most accurate response (on new data).

For example, in the case of regression, it is common to have many candidate independent variables (the regressors). A large part of the effort of model building will consist in identifying an adequate subset of regressors to be incorporated into the model. But each subset of regressors will yield a model, so a family of models is to be considered. At point x0, each of these models will have its own bias, its own variance, and therefore its own error level (MSE).
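To see how each subset of regressors yields its own bias, variance and MSE at x0, here is a small simulation. It is a sketch under assumptions that are not in the text above: a hypothetical linear process with three candidate regressors (one of them irrelevant), a hypothetical point X0, and a hand-written least-squares helper named ols.

```python
import itertools
import random
import statistics

COEFS = (2.0, 0.5, 0.0)  # hypothetical true effects of x1, x2, x3 (x3 is irrelevant)
X0 = (1.0, 1.0, 1.0)     # hypothetical point where the models are compared

def true_f(x):
    return sum(c * v for c, v in zip(COEFS, x))

def ols(rows, ys):
    # Least squares via the normal equations, solved by Gauss-Jordan elimination.
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                k = a[r][c] / a[c][c]
                a[r] = [u - k * v for u, v in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]

def response_at_x0(subset, xs, ys):
    # Fit a linear model (with intercept) on the chosen regressors, predict at X0.
    rows = [[1.0] + [x[j] for j in subset] for x in xs]
    beta = ols(rows, ys)
    return beta[0] + sum(b * X0[j] for b, j in zip(beta[1:], subset))

def compare_subsets(n=30, n_samples=600, noise_sd=1.0, seed=1):
    rng = random.Random(seed)
    subsets = [s for k in range(4) for s in itertools.combinations(range(3), k)]
    responses = {s: [] for s in subsets}
    for _ in range(n_samples):
        xs = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        for s in subsets:
            responses[s].append(response_at_x0(s, xs, ys))
    y0 = true_f(X0)
    table = {}
    for s, r in responses.items():
        bias = statistics.fmean(r) - y0
        var = statistics.pvariance(r)
        table[s] = (bias, var, bias ** 2 + var)  # MSE = Bias^2 + Variance
    return table

table = compare_subsets()
for s, (bias, var, mse) in table.items():
    label = "+".join(f"x{j + 1}" for j in s) or "none"
    print(f"{label:9s} bias={bias:+.3f}  variance={var:.3f}  MSE={mse:.3f}")
best = min(table, key=lambda s: table[s][2])
```

In this setup, leaving out a useful regressor shows up as bias, while adding the irrelevant one only adds variance.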

The bias-variance tradeoff

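The bias-variance tradeoff principle states that within a given family of models, some models have a large bias but a small variance, others have a small bias but a large variance, and the model with the lowest error lies somewhere in between.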

Identifying this best model with certainty is of course impossible, as this would require knowing the true regression function f(x). But attempts can be made to identify models which are probably good. This is the object of "model selection" (see below).

Model complexity

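It is convenient to consider the number of parameters (the complexity of the model) as a way to sort models in a family. The bias-variance tradeoff then states that, in the family of models, models with few parameters tend to have a large bias and a small variance, while models with many parameters tend to have a small bias and a large variance.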

Consequently, the "best" model will always have a number of parameters that is neither too small nor too large. The analyst will have to find the proper tradeoff between bias and variance within this family of models, largely (but not only) by tuning the number of parameters.
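The effect of the number of parameters can be illustrated with a deliberately simple family. The sketch below assumes a hypothetical true function and a piecewise-constant model on [0, 1) with m bins, so that m is the number of parameters (one mean per bin). None of these choices come from the text above. It estimates Bias² and Variance at x0 for several values of m.

```python
import math
import random
import statistics

def true_f(x):
    # Hypothetical true regression function.
    return math.sin(2 * math.pi * x)

def fit_bins(xs, ys, m):
    # Piecewise-constant model on [0, 1): one parameter (a mean) per bin.
    sums, counts = [0.0] * m, [0] * m
    for x, y in zip(xs, ys):
        b = min(int(x * m), m - 1)
        sums[b] += y
        counts[b] += 1
    overall = statistics.fmean(ys)
    # An empty bin falls back to the overall mean of the sample.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def bias_variance(m, x0=0.2, n=60, n_samples=1000, noise_sd=0.3, seed=2):
    # Estimate Bias^2 and Variance of the response at x0 by repeated sampling.
    rng = random.Random(seed)
    responses = []
    for _ in range(n_samples):
        xs = [rng.random() for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        params = fit_bins(xs, ys, m)
        responses.append(params[min(int(x0 * m), m - 1)])
    bias = statistics.fmean(responses) - true_f(x0)
    return bias ** 2, statistics.pvariance(responses)

results = {m: bias_variance(m) for m in (1, 2, 4, 8, 16, 32)}
for m, (b2, var) in results.items():
    print(f"parameters={m:2d}  bias^2={b2:.4f}  variance={var:.4f}  MSE={b2 + var:.4f}")
best_m = min(results, key=lambda m: sum(results[m]))
print("lowest MSE at", best_m, "parameters")
```

With these settings the lowest MSE should be reached at an intermediate number of bins, with bias dominating for very few bins and variance dominating for many.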

Overparametrization and Overfitting

The true performance of a model is the one observed on new data that took no part in the construction of the model, not the performance observed on the design data.

For example, if f* is chosen in the family of polynomials, then higher-degree polynomials (large number of parameters) can get closer to the data points than lower-degree polynomials, thus leading to a lower quadratic error on the design set. In fact, if the design set contains n data points with distinct x values, it is well known that a polynomial of degree n - 1 will go exactly through the points, thus reducing the error on the design set to 0. But this polynomial undergoes oscillations that are both very large and strongly dependent on the exact positions of the points, thus leading to a model with a huge variance and very large response errors.
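The polynomial example can be checked numerically. The sketch below is an illustration only: it assumes a hypothetical linear true function, n = 8 equally spaced design points, and compares a straight line with the interpolating polynomial of degree n - 1 (built here with the Lagrange formula). Errors are averaged over many simulated design sets, both on the design points and on new points.

```python
import random
import statistics

def true_f(x):
    # Hypothetical true regression function.
    return 1.0 + 2.0 * x

def lagrange(xs, ys):
    # Polynomial of degree len(xs) - 1 passing exactly through all the points.
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            w = 1.0
            for k, xk in enumerate(xs):
                if k != i:
                    w *= (x - xk) / (xi - xk)
            total += yi * w
        return total
    return predict

def line(xs, ys):
    # Least-squares straight line (two parameters).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + b * (x - mx)

def mse(model, xs, ys):
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

def experiment(n=8, n_samples=100, noise_sd=0.2, seed=3):
    rng = random.Random(seed)
    design_x = [i / (n - 1) for i in range(n)]   # n distinct design points
    new_x = [(j + 0.5) / 40 for j in range(40)]  # new points, never used for fitting
    errors = {"line": ([], []), "interpolant": ([], [])}
    for _ in range(n_samples):
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in design_x]
        new_y = [true_f(x) + rng.gauss(0.0, noise_sd) for x in new_x]
        for name, fit in (("line", line), ("interpolant", lagrange)):
            model = fit(design_x, ys)
            errors[name][0].append(mse(model, design_x, ys))
            errors[name][1].append(mse(model, new_x, new_y))
    # Average (design-set error, new-data error) for each model.
    return {k: (statistics.fmean(d), statistics.fmean(t)) for k, (d, t) in errors.items()}

res = experiment()
for name, (design_err, new_err) in res.items():
    print(f"{name:12s} design-set MSE={design_err:.4f}  new-data MSE={new_err:.4f}")
```

The interpolant reaches zero error on the design set by construction, which is exactly why its design-set error says nothing about its behaviour on new data.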
 

Let's insist again: even a moderate overparametrization can cause the variance of the model to grow explosively. Because this phenomenon is masked by excellent performance on the design set, and becomes visible only when it is too late (that is, when the model is put to work on new data), it tends to be overlooked by newcomers to data modeling.

The bias-variance tradeoff is universal

We illustrated the bias-variance tradeoff with regression as an example. But the bias-variance tradeoff is universal and shows up under different guises in any kind of data modeling. Let's give a few additional examples:


The complexity of a Decision Tree is often adjusted a posteriori by a so-called pruning mechanism. The Tree is first purposely grown to an exaggerated depth. Then branches that are deemed superfluous are removed.

 

The list is endless. All models are subject to the bias-variance tradeoff. More precisely, any model belongs to a family of models, some exhibiting a large bias but a small variance, some exhibiting a small bias but a large variance, the "best" model being somewhere in between.

Sample size

Parametric and non parametric models

Among the models mentioned above:

Both these types of models are of course subject to the bias-variance tradeoff. But recall that parametric models are helped by a large amount of information made available to them through an assumed analytical form of the underlying distribution. Conversely, non parametric models are "on their own" and must compensate, whenever possible, for this lack of a priori information with information found in supplementary data.

So, in a situation where it is justified to use a parametric model, its non parametric cousin will suffer from a larger bias, a larger variance, or both (depending on the choices made by the analyst).

The "curse of dimensionality"

Quite generally, larger samples make for smaller variances. Unfortunately, practical considerations prohibit resorting to arbitrarily large samples to bypass the bias-variance problem.

Conversely, small samples make the bias-variance tradeoff even more acute. For a given bias, the variance of the model response is larger than for a model built from a larger sample.
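The effect of sample size on the variance can be observed directly. The sketch below assumes a hypothetical curved true function and a straight-line model (so that the bias at x0 is not zero), and estimates bias and variance of the response at x0 for three sample sizes. All numerical settings are illustrative assumptions.

```python
import random
import statistics

def true_f(x):
    # Hypothetical true regression function.
    return x * x

def response_at(x0, n, rng, noise_sd=0.3):
    # Draw one sample of size n, fit y* = a + b*x by least squares, return f*(x0).
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my + b * (x0 - mx)

def bias_and_variance(n, x0=0.5, n_samples=300, seed=6):
    rng = random.Random(seed)
    responses = [response_at(x0, n, rng) for _ in range(n_samples)]
    return statistics.fmean(responses) - true_f(x0), statistics.pvariance(responses)

summary = {n: bias_and_variance(n) for n in (10, 100, 1000)}
for n, (bias, var) in summary.items():
    print(f"n={n:4d}  bias={bias:+.3f}  variance={var:.5f}")
```

In this setup the bias stays roughly the same for all three sample sizes, since it comes from the shape of the model, while the variance shrinks as the sample grows.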

The sample size issue is both important and complex, as a new concept now steps in: that of the dimension of the data space. Is a 1000-observation sample large or small?

So sample size by itself means nothing. What really matters is not the number of observations, but the density of the observations in the feature space. For a given sample size, this density collapses as more dimensions are added, and therefore as more parameters are added to the model.

Conversely, if one wants to maintain the same density (and therefore maintain the accuracy of the model) when more dimensions are added, then the sample size should increase enormously (usually exponentially with the number of dimensions). This is known as the "curse of dimensionality".
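The collapse of density can be measured with a simple proxy. The sketch below assumes observations spread uniformly in the unit hypercube (a hypothetical setting) and computes, for a fixed sample of 1000 points, the average distance from a random query point to its nearest observation as the dimension d grows. It also prints how many points a regular grid with 10 points per axis would need, an arbitrary density target chosen for illustration.

```python
import math
import random
import statistics

def mean_nn_distance(d, n=1000, n_queries=20, seed=4):
    # Average distance from random query points to their nearest observation.
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    queries = [[rng.random() for _ in range(d)] for _ in range(n_queries)]
    return statistics.fmean([min(math.dist(q, p) for p in points) for q in queries])

distances = {d: mean_nn_distance(d) for d in (1, 2, 5, 10)}
for d, dist in distances.items():
    # 10 ** d: size of a grid with 10 points per axis (hypothetical density target).
    print(f"d={d:2d}  mean nearest-neighbour distance={dist:.4f}  grid size={10 ** d}")
```

With the same 1000 observations, the nearest observation moves further and further away from a typical point as dimensions are added, which is another way of saying that the sample becomes sparse.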


It would seem that in some cases, "number of parameters" and "data space dimension" are different issues. For example, in a Multilayer Perceptron (MLP), it would seem that neurons (and therefore parameters) can be added to the hidden layer without changing the dimension of the data space. But it is not so: the hidden layer of the MLP projects the data on an "intermediate" space, and the dimension of this space is what matters for the linear output layer, and therefore for the model accuracy.

Model selection

In the family of models that is being considered, how is the "best" model going to be identified? First, we will never be certain that we have identified the best model in the family, because of the random nature of the sample. But it is possible (and necessary) to identify models that are probably fairly good. This can be done in two ways:

 

The analyst then builds several models, and retains the model with the lowest predicted error level.
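One common way to predict the error level of each candidate model is k-fold cross-validation (see the related readings). The sketch below is an assumption-laden illustration, not a prescribed procedure: it uses a hypothetical true function, a single simulated sample of 200 observations, and a family of piecewise-constant models indexed by their number of bins m. The candidate with the lowest cross-validated error is retained.

```python
import math
import random
import statistics

def true_f(x):
    # Hypothetical true regression function (unknown to the analyst).
    return math.sin(2 * math.pi * x)

def fit_bins(xs, ys, m):
    # Piecewise-constant model on [0, 1) with m bins; returns a prediction function.
    sums, counts = [0.0] * m, [0] * m
    for x, y in zip(xs, ys):
        b = min(int(x * m), m - 1)
        sums[b] += y
        counts[b] += 1
    overall = statistics.fmean(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * m), m - 1)]

def cv_error(xs, ys, m, k=5):
    # k-fold cross-validation estimate of the mean squared prediction error.
    errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        model = fit_bins([x for x, _ in train], [y for _, y in train], m)
        errors.extend((model(x) - y) ** 2 for x, y in held_out)
    return statistics.fmean(errors)

rng = random.Random(5)
xs = [rng.random() for _ in range(200)]
ys = [true_f(x) + rng.gauss(0.0, 0.3) for x in xs]

candidates = (1, 2, 4, 8, 16, 32, 64)
predicted = {m: cv_error(xs, ys, m) for m in candidates}
for m, err in predicted.items():
    print(f"bins={m:2d}  predicted MSE={err:.4f}")
chosen = min(predicted, key=predicted.get)
print("retained model:", chosen, "bins")
```

Because each observation is predicted by a model that never saw it, the predicted error penalizes both the overly simple and the overly complex candidates.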

 

 

________________________________________________________

 

Related readings:

Estimation

Bootstrap

Cross-validation
