
# The bias-variance tradeoff for an estimator

The performance of an estimator θ* of a parameter θ is measured by its Mean Square Error (MSE), which can be shown to decompose as :

MSE(θ*) = Var(θ*) + Bias(θ*)²

Although the lack of bias is an attractive feature of an estimator, it does not guarantee the lowest possible value of the MSE. This minimum value is attained when a proper tradeoff is found between :

* The bias of the estimator,  and

* Its variance

so as to make the value of the above expression smallest.

As a matter of fact, it is commonly observed that introducing a certain amount of bias into an otherwise unbiased estimator can lead to a significant reduction of its variance, so much so that the MSE will be reduced and therefore the performance of the estimator will be improved.
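This decomposition can be checked numerically. The sketch below (assuming NumPy; the sample size, trial count, and seed are arbitrary choices) simulates the unbiased sample variance of a normal distribution many times and verifies that MSE = Var + Bias² :

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 10, 200_000, 1.0   # sample size, repetitions, true variance

# Many independent samples from N(0, 1), whose variance sigma2 = 1 we estimate
samples = rng.normal(0.0, 1.0, size=(trials, n))
s2 = samples.var(axis=1, ddof=1)       # unbiased estimator on each sample

bias = s2.mean() - sigma2
var = s2.var()
mse = np.mean((s2 - sigma2) ** 2)

# The decomposition MSE = Var + Bias² holds exactly for these sample statistics
print(bias, var, mse)
```

The identity holds exactly for the empirical quantities; the bias estimate hovers near zero here only because this particular estimator is unbiased.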

-----

In the Tutorial below, we show that of the two classic estimators of the variance :

* The sample variance (biased) :

s² = (1/n)·Σᵢ(xᵢ − x̄)²  (where x̄ is the sample mean)

* And the "corrected" sample variance (unbiased) :

S² = (1/(n − 1))·Σᵢ(xᵢ − x̄)²

the first one has the lower MSE of the two (despite its bias) when considering normal distributions.

We'll then identify a third estimator that is even better (lower MSE) than s² although its bias is the largest of the three.
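A quick simulation makes the comparison concrete. The sketch below (assuming NumPy; n = 10 and the trial count are arbitrary) computes the empirical MSE obtained when the sum of squared deviations is divided by n − 1, n, and n + 1. For normal data, the divisor n + 1 is known to give the lowest MSE of the three, despite producing the largest bias :

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, sigma2 = 10, 200_000, 1.0

x = rng.normal(0.0, 1.0, size=(trials, n))
# Sum of squared deviations from the sample mean, for each sample
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Empirical MSE when dividing by n-1 (unbiased), n (biased), n+1 (most biased)
mse = {d: np.mean((ss / d - sigma2) ** 2) for d in (n - 1, n, n + 1)}
print(mse)
```

With these settings the three empirical MSEs land close to their theoretical values 2/(n − 1), (2n − 1)/n², and 2/(n + 1), in decreasing order.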

A similar phenomenon is observed, for example, when estimating the parameter θ of the uniform distribution U[0, θ] and is illustrated here by an interactive animation.
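For U[0, θ], the same tradeoff can be sketched numerically (assuming NumPy; θ = 1 and n = 10 are arbitrary choices). The sample maximum is the maximum-likelihood estimator of θ but is biased low; rescaling it by (n + 1)/n removes the bias, while rescaling by (n + 2)/(n + 1) is known to minimize the MSE among multiples of the maximum :

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, theta = 10, 200_000, 1.0

x = rng.uniform(0.0, theta, size=(trials, n))
m = x.max(axis=1)                        # sample maximum: biased below theta

mse = {
    "max (biased)":             np.mean((m - theta) ** 2),
    "(n+1)/n * max (unbiased)": np.mean(((n + 1) / n * m - theta) ** 2),
    "(n+2)/(n+1) * max (best)": np.mean(((n + 2) / (n + 1) * m - theta) ** 2),
}
print(mse)
```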

# The bias-variance tradeoff for models

The bias-variance tradeoff (or "bias-variance dilemma") is a very important issue in data modeling. Ignoring it is a frequent cause of model failure, and although it has a deep theoretical rooting, it can be explained in simple terms.
-----

A model consists of :

• An architecture,  and
• Parameters.

For example, in polynomial regression :

• The architecture is a polynomial, unambiguously identified by its degree.
• The parameters are the coefficients of the polynomial.

Once the architecture (the degree) is decided upon, fitting the model consists in finding the appropriate values of the parameters (in this case, using the Least Squares approach).
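A minimal sketch of this two-step view (assuming NumPy; the data-generating curve and noise level are invented for illustration): choosing the architecture means choosing `degree`, and fitting means solving the least-squares problem for the coefficients :

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: a noisy nonlinear relationship
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

degree = 3                          # the architecture: a single number
coeffs = np.polyfit(x, y, degree)   # the parameters: least-squares fit
y_hat = np.polyval(coeffs, x)
print(coeffs)                       # degree + 1 coefficients
```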

-----

But the analyst first has to decide on the appropriate degree of the polynomial.

• If the data is highly nonlinear, a low-degree polynomial ("1" in the illustration below) will not have the flexibility needed to capture the global shape of the distribution. The polynomial line will most of the time be far from the data points, leading to large errors.
The model is then said to have a large bias because the bias of its predictions for a given x (blue dot) is high.
On the other hand, because of this very rigidity, the predictions of the model will depend only weakly on the particular sample that was used for building the model, and will therefore have a low variance (lower image of the illustration below).

• But too large a degree will make the polynomial line very sensitive to the details of the sample. Another sample would have led to a completely different model, with completely different predictions (lower image of the illustration below).
The model is then said to have a large variance because the variance of its predictions (for a given x) is large.
In good models, points that are far from the true regression line (green) make a large contribution to the quadratic error. But here, because of the flexibility conferred by its high degree, the polynomial line can now get close to these points (low bias), and the quadratic error measured on the design sample is low. So the model appears to perform well, but will in fact perform poorly on new data.
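The two regimes above can be made concrete by refitting the same two architectures on many independent samples and examining the predictions at one fixed point (a sketch assuming NumPy; the true curve, noise level, degrees 1 and 7, and evaluation point x₀ = 0.25 are all arbitrary choices) :

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 20)
true = lambda t: np.sin(2 * np.pi * t)   # the "true regression line"
x0 = 0.25                                # fixed evaluation point
n_models = 2000

stats = {}
for degree in (1, 7):                    # too rigid vs. very flexible
    preds = []
    for _ in range(n_models):
        # Each model is built on its own independent noisy sample
        y = true(x) + rng.normal(0.0, 0.3, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.asarray(preds)
    stats[degree] = (abs(preds.mean() - true(x0)), preds.var())

print(stats)  # (bias, variance) of the predictions at x0, per degree
```

With these settings, the degree-1 predictions cluster tightly but far from the true value (large bias, small variance), while the degree-7 predictions center on the true value but scatter much more (small bias, large variance).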

____________________

This is the essence of the bias-variance dilemma. In the example of the polynomial regression, it says that :

• A polynomial with too few parameters (too low a degree) will make large errors because of a large bias.
• A polynomial with too many parameters (too high a degree) will make large errors because of a large variance.

The degree of the "best" polynomial must therefore be somewhere "in-between".

_________________________________

This phenomenon is not specific to polynomial regression. In fact, it shows up under various guises in any kind of model. So, quite generally, the bias-variance tradeoff principle can be stated as follows :

 Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Identifying the best model requires identifying the proper "model complexity" (number of parameters).
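One common, simple way to search for that in-between complexity is to score each candidate degree on data that was not used for fitting (a sketch assuming NumPy; the curve, noise level, sample sizes, and degree range are invented for illustration) :

```python
import numpy as np

rng = np.random.default_rng(5)
true = lambda t: np.sin(2 * np.pi * t)

# One sample for fitting, an independent one for evaluation
x_fit = rng.uniform(0.0, 1.0, 40)
y_fit = true(x_fit) + rng.normal(0.0, 0.3, 40)
x_val = rng.uniform(0.0, 1.0, 200)
y_val = true(x_val) + rng.normal(0.0, 0.3, 200)

val_mse = {}
for degree in range(1, 9):               # candidate model complexities
    c = np.polyfit(x_fit, y_fit, degree)
    val_mse[degree] = np.mean((np.polyval(c, x_val) - y_val) ** 2)

best = min(val_mse, key=val_mse.get)
print(best, val_mse)  # the lowest score is expected at an intermediate degree
```

Degree 1 scores poorly because of its bias, the highest degrees because of their variance; an intermediate degree wins on the held-out data.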

This important issue is illustrated by an interactive animation that you'll find here.

We address some aspects of the bias-variance tradeoff in the next section :

• The model as a family of estimators.
• Decomposition of the model error into "Bias error" and "Variance error".
• The notion of "model complexity".
• Universality of the bias-variance tradeoff.
• Overparametrization and overfitting.
• The influence of sample size (parametric and non parametric models). The "Curse of dimensionality".
• Model selection.

__________________________________________________________________

# Tutorial

In this Tutorial, we compare the performance (MSE) of the two natural estimators of the variance of the normal distribution.

We show that the uncorrected (biased) estimator performs better than its corrected (unbiased) counterpart.

-----

We then recognize that these two estimators belong to a class of estimators, and identify the best (lowest MSE) estimator in the class. Its bias will turn out to be even larger than that of the uncorrected sample variance.
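For normal samples, this result can be derived in closed form. A sketch of the computation (using the standard fact that the sum of squared deviations from the sample mean is a scaled chi-square variable) :

```latex
% Let T = \sum_i (x_i - \bar{x})^2. For normal samples, T/\sigma^2 \sim \chi^2_{n-1},
% so E[T] = (n-1)\sigma^2 and Var(T) = 2(n-1)\sigma^4.
% For an estimator of the form \hat{\sigma}^2_c = T/c :
\mathrm{MSE}(c) = \mathrm{Var}(T/c) + \mathrm{Bias}(T/c)^2
               = \frac{2(n-1)\sigma^4}{c^2}
                 + \left(\frac{n-1}{c} - 1\right)^2 \sigma^4
% Setting d\,\mathrm{MSE}/dc = 0 yields c = n + 1, with minimum value
\mathrm{MSE}(n+1) = \frac{2\sigma^4}{n+1}
  \;<\; \mathrm{MSE}(n) = \frac{(2n-1)\sigma^4}{n^2}
  \;<\; \mathrm{MSE}(n-1) = \frac{2\sigma^4}{n-1}.
```

So the best estimator in the class divides by n + 1, and its bias, −2σ²/(n + 1), is indeed larger in magnitude than that of the uncorrected sample variance, −σ²/n.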

Tutorial outline :

* Comparing two estimators of the variance of the normal distribution
    * MSE of the corrected (unbiased) sample variance
    * MSE of the uncorrected (biased) sample variance : bias, variance, MSE
* An even better estimator of the variance
    * A class of estimators
    * Identifying the best estimator in the class
    * Properties of the best estimator
    * Comparing the properties of the three estimators

____________________________________________________