THE BIAS-VARIANCE TRADEOFF
This page gives some additional information about the bias-variance tradeoff.
______________________________________________
We'll use regression as an example. The data is assumed to have been generated by a process :
y = f(x1, x2 , ..., xp) + ε
where f is deterministic, and ε is random with 0 mean. A regression model y* = f *(x1, x2 , ..., xp) is built from the sample. Let x0 be a point of the feature space. Then f *(x0) is hoped to be close to y0 = f(x0), the true value of the regression function.
Because of the randomness of ε, that is, the randomness of the sample, the model depends on the actual sample used to build it. Another sample would have led to a different model, and therefore a different response at x0. So the response of a model at any point is a random variable.
In the terminology of Statistics, such a regression model therefore puts at each and every point x0 of the feature space a random variable that is an estimator of y0, the true value of the regression function at this point. This estimator is denoted by f *(y , x = x0), or f *0 for short.
Here we focus on the response error at x0 (although a more general study can be conducted on the entire space, taking into account the unconditional probability distribution p(x)).
The estimator f *0 is good if its realizations are close to the true value y0 in a probabilistic sense, that is, for instance, if its Mean Square Error (MSE) :
MSE = E[( f *0 - y0)²]
is small.
It is easily shown that :
MSE = Bias² + Variance
where "Bias" = E[ f *0] - y0 and "Variance" = E[( f *0 - E[ f *0])²] are those of the response of the model, considered as an estimator of y0.
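The decomposition can be checked numerically. The following is only a minimal sketch under stated assumptions : the true function f(x) = x², the noise level sigma, the sample size and the point x0 are all hypothetical choices, and the model is a plain straight-line fit. Many samples are drawn, one model is built per sample, and the responses at x0 are collected as realizations of the estimator f *0.

```python
import random
import statistics

def f(x):
    # Assumed "true" regression function (hypothetical, for illustration only).
    return x * x

def fit_line(xs, ys):
    # Ordinary least squares for the model y = a + b*x (closed form).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def simulate(x0=0.9, n=20, sigma=0.3, n_samples=2000, seed=0):
    # Draw many samples, build one model per sample, record its response at x0.
    rng = random.Random(seed)
    y0 = f(x0)
    preds = []
    for _ in range(n_samples):
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        preds.append(a + b * x0)
    bias2 = (statistics.fmean(preds) - y0) ** 2
    variance = statistics.pvariance(preds)
    mse = statistics.fmean([(p - y0) ** 2 for p in preds])
    return bias2, variance, mse

if __name__ == "__main__":
    bias2, variance, mse = simulate()
    print("Bias^2   :", bias2)
    print("Variance :", variance)
    print("MSE      :", mse, "(Bias^2 + Variance =", bias2 + variance, ")")
```

Over the simulated responses, the MSE and the sum Bias² + Variance agree up to rounding, whatever the values chosen for the assumed parameters.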
So the errors made by a model have two origins : the bias and the variance of its response.
A data set never makes the choice of a particular model architecture obvious. The analyst will always consider several candidate models, and the goal is of course to select the model with the most accurate response (on new data).
For example, in the case of regression, it is common to have many candidate independent variables (the regressors). A large part of the effort of model building will consist in identifying an adequate subset of regressors to be incorporated into the model. But each subset of regressors will yield a model, so a family of models is to be considered. At point x0, each of these models will have its own bias, its own variance, and therefore its own error level (MSE).
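The bias-variance tradeoff principle states that within a given family of models : some models have a large bias but a small variance, others have a small bias but a large variance, and the best model (the one with the lowest MSE) lies somewhere in between.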
Identifying this best model with certainty is of course impossible, as this would require knowing the true regression function f(x). But attempts can be made to identify models which are probably good. This is the object of "model selection" (see below).
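It is convenient to consider the number of parameters (the complexity of the model) as a way to sort models in a family. The bias-variance tradeoff then states that, in the family of models : models with few parameters tend to have a large bias but a small variance, whereas models with many parameters tend to have a small bias but a large variance.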
Consequently, the "best" model will always have a number of parameters that is neither too small nor too large. The analyst will have to find the proper tradeoff between bias and variance within this family of models, largely (but not only) by tuning the number of parameters.
The true performance of a model is that observed on new data that did not take part in the construction of the model, not the performance observed on the design data.
For example, if f * is chosen in the family of polynomials, then higher degree polynomials (large number of parameters) can get closer to the data points than lower degree polynomials, thus leading to a lower quadratic error. In fact, if the design set contains n data points (with distinct x values), it is well known that a polynomial of degree n - 1 will go exactly through the points, thus reducing the error on the design set to 0. But this polynomial undergoes oscillations that are both very large and strongly dependent on the exact positions of the points, thus leading to a model with a huge variance and very large response errors.
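The contrast between design error and error on new data can be sketched as follows. This is an illustration only : the true function sin(2x), the noise level and the equally spaced design points are assumptions, and the polynomial going exactly through the design points is built here with the Lagrange formula. It is compared with a simple straight-line fit.

```python
import math
import random

def f(x):
    # Assumed "true" regression function (hypothetical).
    return math.sin(2 * x)

def lagrange(xs, ys):
    # Polynomial going exactly through the points (xs[i], ys[i]).
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def fit_line(xs, ys):
    # Least-squares straight line, returned as a callable model.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + b * (x - mx)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def compare(n=10, sigma=0.3, n_new=500, seed=1):
    rng = random.Random(seed)
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]        # design points
    ys = [f(x) + rng.gauss(0, sigma) for x in xs]
    new_xs = [rng.uniform(-1, 1) for _ in range(n_new)]  # new data
    new_ys = [f(x) + rng.gauss(0, sigma) for x in new_xs]
    models = {"line": fit_line(xs, ys), "interpolant": lagrange(xs, ys)}
    # For each model : (error on the design set, error on new data).
    return {name: (mse(m, xs, ys), mse(m, new_xs, new_ys)) for name, m in models.items()}

if __name__ == "__main__":
    for name, (design_err, new_err) in compare().items():
        print(name, "design error :", design_err, "new data error :", new_err)
```

The interpolating polynomial reports a design error of 0, which says nothing about its error on the new data.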
Let us stress this again : even a moderate overparametrization can cause the variance of the model to grow explosively. Because this phenomenon is masked by excellent performance on the design set, and becomes visible only when it is too late (that is, when the model is put to work on new data), it tends to be overlooked by the newcomer to data modeling.
We illustrated the bias-variance tradeoff with regression as an example. But the bias-variance tradeoff is absolutely universal and shows up under different guises in any kind of data modeling. Let's give a few additional examples :
When data is not distributed around a straight line, but rather around a slightly convex curve, Simple Linear Regression may be replaced by several SLR models, each one operating in a different region of the regressor space. But into how many regions should this domain be partitioned ?
We use this example to illustrate the bias-variance tradeoff with an interactive animation.
Multiple Linear Regression has no adjustable degree, as polynomial regression does. The response surface is always a hyperplane, which will never undergo the oscillations that we just mentioned for polynomials. So MLR might seem to be safe from overfitting. But it's not : as the dimension of the space of regressors gets larger and larger (more and more variables incorporated into the model), this hyperplane will fit the data points better and better, thus making the Sum of Squared Residuals smaller and smaller. This improvement is measured on the design set only : each added variable also adds a parameter to be estimated, which increases the variance of the model response on new data.
One of the main questions in Multiple Linear Regression is therefore the identification of the best set of predictors, "best" in the sense of overall MSE, and therefore in the sense of the bias-variance tradeoff.
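The steady decrease of the Sum of Squared Residuals can be observed even when the added variables carry no information at all. The sketch below is a hedged illustration : the data generating process (one useful regressor plus pure noise regressors), the helper names and the use of normal equations are all assumptions made for the example.

```python
import random

def solve(a, b):
    # Solve the linear system a.x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ssr_of_fit(rows, ys):
    # Least squares through the normal equations (X'X) beta = X'y,
    # then the Sum of Squared Residuals on the design set.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, ys))

def ssr_by_model_size(n=30, max_noise_vars=10, seed=0):
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]                  # the only useful regressor
    ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in x1]
    noise = [[rng.gauss(0, 1) for _ in range(max_noise_vars)] for _ in range(n)]
    results = []
    for k in range(max_noise_vars + 1):
        # Model with intercept, x1, and the first k irrelevant regressors.
        rows = [[1.0, x1[i]] + noise[i][:k] for i in range(n)]
        results.append(ssr_of_fit(rows, ys))
    return results

if __name__ == "__main__":
    for k, ssr in enumerate(ssr_by_model_size()):
        print(k, "irrelevant regressors, SSR =", ssr)
```

Because the models are nested, the SSR cannot increase when a regressor is added, which is why the SSR alone cannot be used to choose the set of predictors.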
Ridge Regression is a variant of MLR that artificially introduces a bias on the parameters of the model, and consequently on its predictions too. It is expected that this bias will cause a reduction in the variance of the model parameters and predictions. The amount of bias is controlled by a "ridge parameter", which acts as if the number of parameters of the model were actually smaller than the number of regressors.
Adjusting the value of the ridge parameter is a convenient way of optimizing the bias-variance tradeoff.
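The effect of the ridge parameter can be sketched on the simplest possible case. This is a simplification, not the general method : a single regressor, no intercept, a fixed design, and an assumed true slope beta and noise level sigma. The ridge estimate of the slope is then sxy / (sxx + lam), where lam is the ridge parameter (lam = 0 gives ordinary least squares).

```python
import random
import statistics

def ridge_slope(xs, ys, lam):
    # Ridge estimate of the slope for the no-intercept model y = beta * x.
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def bias_variance_of_slope(lam, beta=2.0, n=20, sigma=1.0, n_samples=2000, seed=0):
    # Repeat the estimation over many samples sharing the same design points,
    # then measure the squared bias and the variance of the estimated slope.
    rng = random.Random(seed)
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    estimates = []
    for _ in range(n_samples):
        ys = [beta * x + rng.gauss(0, sigma) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    bias2 = (statistics.fmean(estimates) - beta) ** 2
    variance = statistics.pvariance(estimates)
    return bias2, variance

if __name__ == "__main__":
    for lam in (0.0, 1.0, 5.0, 20.0):
        bias2, variance = bias_variance_of_slope(lam)
        print("lam =", lam, "Bias^2 =", bias2, "Variance =", variance, "MSE =", bias2 + variance)
```

In this setting, increasing lam shrinks the estimate towards 0 : the variance goes down while the squared bias goes up, and the sum of the two is what has to be minimized.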
The standard linear discriminant functions are fine when the classes have identical covariance matrices, an almost academic situation. What of the more general situation ? Should the discriminant functions be made quadratic, as theory suggests ? Not necessarily so. If the data is sparse (we will return to the notion of "sparse data" in a moment), the linear model might provide more accurate predictions than the "augmented" model, because it contains fewer parameters than its bigger quadratic counterpart.
How deeply should a Decision Tree be developed ?
The adjustment of the complexity of a Decision Tree is often made a posteriori by a so-called pruning mechanism. The Tree is first purposely grown to an exaggerated depth. Then branches that are deemed superfluous are removed.
The list is endless. All models are subject to the bias-variance tradeoff. More precisely, any model belongs to a family of models, some exhibiting a large bias but a small variance, some exhibiting a small bias but a large variance, the "best" model being somewhere in between.
Among the models mentioned above, some are parametric and some are non parametric.
Both these types of models are of course subject to the bias-variance tradeoff. But recall that parametric models are helped by a large amount of information made available to them through an assumed analytical form of the underlying distribution. Conversely, non parametric models are "on their own" and must compensate, whenever possible, for this lack of a priori information with information found in supplementary data.
So, in a situation where it is justified to use a parametric model, its non parametric cousin will suffer from a larger bias, a larger variance, or both (depending on the choices made by the analyst).
Quite generally, larger samples make for smaller variances. Unfortunately, practical considerations prohibit resorting to arbitrarily large samples to bypass the bias-variance problem.
Conversely, small samples make the bias-variance tradeoff even more acute. For a given bias, the variance of the model response is larger than for a model built from a larger sample.
The sample size issue is both important and complex as a new concept now steps in : that of the dimension of the data space. Is a 1000-observation sample large or small ?
So sample size by itself means nothing. What really matters is not the number of observations, but the density of the observations in the feature space. This density collapses as more dimensions are added for a given sample size, and therefore as more parameters are added to the model.
Conversely, if one wants to maintain the same density (and therefore maintain the accuracy of the model) when more dimensions are added, then the sample size should increase enormously (usually exponentially with the number of dimensions). This is known as the "curse of dimensionality".
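The collapse of the density can be made concrete with a small experiment. This is a sketch under assumptions : observations are drawn uniformly in the unit hypercube, and the density is summarized by the average distance from an observation to its nearest neighbour. The helper required_sample_size is a hypothetical name that simply counts the points of a regular grid with a given number of points along each axis.

```python
import math
import random

def required_sample_size(points_per_axis, d):
    # Number of observations needed to keep the same spacing along each of d axes.
    return points_per_axis ** d

def mean_nn_distance(n, d, seed=0):
    # Average distance to the nearest neighbour for n uniform points in [0, 1]^d.
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n

if __name__ == "__main__":
    for d in (1, 2, 5, 10):
        print("d =", d,
              "mean nearest-neighbour distance (n = 200) :", mean_nn_distance(200, d),
              "sample size for 10 points per axis :", required_sample_size(10, d))
```

For a fixed sample size the observations drift apart as d grows, while the sample size needed to keep 10 points per axis is multiplied by 10 with each added dimension.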
It would seem that in some cases, "number of parameters" and "data space dimension" are different issues. For example, in a Multilayer Perceptron (MLP), it would seem that neurons (and therefore parameters) can be added to the hidden layer without changing the dimension of the data space. But it is not so : the hidden layer of the MLP projects the data on an "intermediate" space, and the dimension of this space is what matters for the linear output layer, and therefore for the model accuracy.
In the family of models that is being considered, how is the "best" model going to be identified ? First, we will never be certain that we have identified the best model in the family, because of the random nature of the sample. But it is possible (and necessary) to identify models that are probably fairly good. This can be done in two ways :
The analyst then builds several models, and retains the model with the lowest predicted error level.
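One common way of predicting the error level of each candidate model is k-fold cross-validation; it is shown here as an assumption, not necessarily the procedure this page has in mind. The sketch reuses the piecewise Simple Linear Regression example : the candidates differ by the number m of equal-width regions, and the data (a slightly convex curve plus noise, produced by the hypothetical helper make_data) is simulated.

```python
import random

def make_data(n=100, sigma=0.1, seed=0):
    # Simulated data around a slightly convex curve (assumed for the example).
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [x * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Least-squares line (a, b) for y = a + b*x; constant if all x are equal.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def region_of(x, m):
    # Index of the region containing x, for m equal-width regions on [0, 1].
    return min(int(x * m), m - 1)

def fit_piecewise(xs, ys, m):
    # One line per region; regions with fewer than 2 points use the global line.
    fallback = fit_line(xs, ys)
    lines = []
    for r in range(m):
        rx = [x for x in xs if region_of(x, m) == r]
        ry = [y for x, y in zip(xs, ys) if region_of(x, m) == r]
        lines.append(fit_line(rx, ry) if len(rx) >= 2 else fallback)
    return lines

def predict(lines, x):
    a, b = lines[region_of(x, len(lines))]
    return a + b * x

def cv_error(xs, ys, m, k=5):
    # k-fold cross-validation estimate of the mean squared prediction error.
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        tr_x = [xs[i] for i in range(n) if i not in held_out]
        tr_y = [ys[i] for i in range(n) if i not in held_out]
        lines = fit_piecewise(tr_x, tr_y, m)
        total += sum((predict(lines, xs[i]) - ys[i]) ** 2 for i in held_out)
    return total / n

def select_model(xs, ys, candidates=(1, 2, 3, 4, 6, 8), k=5):
    # Build every candidate and retain the one with the lowest predicted error.
    errors = {m: cv_error(xs, ys, m, k) for m in candidates}
    best = min(errors, key=errors.get)
    return best, errors

if __name__ == "__main__":
    xs, ys = make_data()
    best, errors = select_model(xs, ys)
    print("predicted errors :", errors)
    print("retained number of regions :", best)
```

Each observation is predicted by a model that did not see it, so the predicted error behaves like an error on new data rather than like the error on the design set.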