INTERACTIVE ANIMATION : MSE and BIAS-VARIANCE TRADEOFF
This animation illustrates both :
1) The concept of Mean Square Error (MSE) of a model prediction at a particular point in the space of predictors ("Local" mode of operation of the animation).
2) The concept of "Bias-variance tradeoff", that bears on the MSE of the model predictions averaged over the entire space of the predictors ("Global" mode of operation of the animation).
The animation opens in the "Local" mode.
In the upper frame are :
1) A red straight line that is the deterministic part of a data generating process.
2) A sample drawn from this process.
3) A black straight line, that is the Least Squares Line of this sample.
4) A point x0 in the range of the single regressor.
MSE of the model prediction at one position x0 ("Local" mode)
Influence of the position of the measurement point
* Click on "Go". The lower left frame displays the bias, variance and MSE of the Simple Linear Regression (SLR) model.
Notice that the bias converges to 0, in accordance with the property of SLR to be an unbiased estimator at any point of the regressor range when the standard conditions are fulfilled (which is the case here).
* Use your mouse to drag x0 to the right end of the range of x. The bias is still 0, but the variance increases and so does the local MSE, in accordance with the properties of SLR.
* Now drag x0 to the center of the range of x : the variance decreases, and so does the MSE of the model prediction. It can be shown that this variance is minimal when x0 is the barycenter of the measurement points.
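For reference, the standard SLR result behind these observations can be written as follows. The notation is ours, not the animation's: n is the sample size, x_i are the measurement points with mean \bar{x}, \sigma^2 is the error variance, and \hat{y}(x_0) is the model prediction at x0.

```latex
\mathrm{Var}\left[\hat{y}(x_0)\right]
  = \sigma^2 \left( \frac{1}{n}
      + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right),
\qquad
\mathrm{MSE}(x_0) = \mathrm{Bias}(x_0)^2 + \mathrm{Var}\left[\hat{y}(x_0)\right].
```

The variance is smallest at x0 = \bar{x}, where it equals \sigma^2 / n, and it grows quadratically with the distance between x0 and the barycenter.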
Influence of noise
Increase the noise level (the "error variance" in the terminology of regression). The same pattern appears as x0 is moved, with higher levels of variance and MSE: the prediction performance of the model degrades as the measurement errors increase.
Influence of sample size
Increase the sample size. The same pattern appears as x0 is moved, with lower levels of variance and MSE: the prediction performance of the model improves as the sample size increases.
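The three influences described so far (position of x0, noise, sample size) can be reproduced outside the animation with a small Monte Carlo experiment. The sketch below is only an illustration under assumed settings: the true line f(x) = 1 + 2x, the equally spaced design on [0, 1], the Gaussian noise and the function name local_mse_slr are our own choices, not the animation's.

```python
import numpy as np

def local_mse_slr(x0, n=10, sigma=0.5, n_rep=4000, seed=0):
    """Monte Carlo estimate of (bias, variance, MSE) of the SLR prediction at x0.

    Assumed setup (hypothetical): x equally spaced on [0, 1], true line
    f(x) = 1 + 2x, Gaussian noise with standard deviation sigma.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    f = lambda t: 1.0 + 2.0 * t
    X = np.column_stack([np.ones(n), x])           # design matrix [1, x]
    # One column of Y per simulated sample drawn from the process.
    Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(n, n_rep))
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (2, n_rep)
    preds = coef[0] + coef[1] * x0                 # prediction at x0, per sample
    bias = preds.mean() - f(x0)
    var = preds.var()
    return bias, var, bias**2 + var

for label, kwargs in [("center", dict(x0=0.5)),
                      ("right end", dict(x0=1.0)),
                      ("center, more noise", dict(x0=0.5, sigma=1.0)),
                      ("center, larger sample", dict(x0=0.5, n=40))]:
    b, v, m = local_mse_slr(**kwargs)
    print(f"{label:22s} bias={b:+.4f}  var={v:.4f}  mse={m:.4f}")
```

The bias stays close to 0 in every configuration, while the variance (and therefore the MSE) grows toward the end of the range and with the noise, and shrinks with the sample size.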
Click on the "Sample" button of the "Mask" box if you are annoyed by the flickering of the sample.
Influence of departure from linearity
Return "Size" to 10 and "Noise" to 5.
Increase the convexity of the red curve ("Convex.") to 1. The data generating process is no longer linear and, as a consequence, the SLR model is biased. The amount of bias depends on x0. Check that:
* The bias is negative in the central part of the range (red "-" sign to the left of the "Biases" bar).
* But it is positive at either end of the domain.
* It is 0 for two intermediate points.
The convexity of the curve has little influence on the variance.
So it appears that when the data generating process is non linear, the performance of the SLR model degrades, mostly because the model is now biased. The value of the bias depends sharply on the position of the measuring point x0.
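The sign pattern of the bias can be checked without any simulation, because the expected least squares line is simply the least squares line fitted to the noise-free curve. The sketch below makes several assumptions that are ours and not the animation's: the bias is taken as the expected prediction minus the true value, the red curve is f(x) = x - curvature * (x - 0.5)^2 (bulging upward, which is the combination that reproduces the signs listed above), and the design is equally spaced on [0, 1].

```python
import numpy as np

def slr_bias(x0, curvature=1.0, n=10):
    """Exact bias (expected prediction minus truth) of the SLR prediction at x0.

    Assumed setup (hypothetical, not the animation's): n points equally spaced
    on [0, 1] and a red curve f(x) = x - curvature * (x - 0.5)**2.
    """
    x = np.linspace(0.0, 1.0, n)
    f = lambda t: t - curvature * (t - 0.5) ** 2
    # Least squares is linear in y and the noise has zero mean, so the expected
    # fitted line is the least squares line through the noise-free values f(x).
    slope, intercept = np.polyfit(x, f(x), 1)
    return intercept + slope * x0 - f(x0)

for x0 in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(f"x0 = {x0:.1f}  bias = {slr_bias(x0):+.4f}")
```

With these settings the bias is positive at both ends, negative in the central part, and changes sign between 0.1 and 0.3 (and, by symmetry, between 0.7 and 0.9). The noise level does not enter the computation at all, which is why the bias cannot be reduced by a quieter measurement.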
Influence of model complexity
The convexity of the red curve suggests partitioning the domain of x into two subdomains and fitting one SLR model in each of them (a technique known as "Piecewise Linear Regression").
* In the "Model" box, click on the button labeled "2". The 1-line black model is still there, but we now also have a model made up of 2 blue lines, one in each of the two halves of the domain of x. A visual check suggests that this new model can accommodate the convexity of the red curve to a certain extent, which the 1-segment model can't do at all.
This is basically true, but things are not that simple.
Increase the convexity to level 2 just to make the differences between models more pronounced. Position x0 where the SLR model has its largest (negative) bias, around the middle of the range of x.
* The single segment model (black) is indeed severely handicapped by its large bias. Its variance, although smaller than that of the 2-segment model (blue), cannot restore the balance: the best model is the 2-segment model.
We're close to one end of a blue segment, and consequently in a region where the variance of the 2-segment model is rather large, but this large variance is not enough to cancel the superiority of the model.
The same happens at either end of the range of x which are regions where the SLR model is strongly positively biased.
All of the above is still true when the sample size is increased substantially: increasing the value of this parameter has no influence on the bias of a model, and reduces the variances of the two models in comparable proportions.
* On the other hand, if x0 is positioned close to a point where the bias of the 1-segment model is 0, this model becomes the best one again: its variance is of the same order of magnitude as that of the 2-segment model, which now has a comparatively large bias.
Position your mouse over a bar to see the exact value of the corresponding quantity.
So we see that the efficacy of the transition from a 1-segment model to a 2-segment model depends on where the measurements are made.
What about a 3-segment model ?
* Click on the "3" button of the "Models" box while keeping the other two models selected. Keep "Convex." at 2 and "Noise" at the moderate level 5.
* Explore the range of x, and notice that over most of the range, the bias of the 1-segment model is so large that it makes the model the worst of the three.
* Now you may unselect Model 1 so as to make the comparison between Models 2 and 3 easier.
You may make the bias, variance and MSE bars longer by clicking on the yellow background behind the group of bars that have become too short for comfortable reading.
Let's now explore the central region in more detail:
* When x0 is close to the middle of the central green segment, both the bias and variance of Model 3 are lower than the corresponding quantities for Model 2, and Model 3 is then much better than Model 2.
* But now drag x0 to the right so that it is both close to the middle of the blue segment and close to the end of the green segment. The bias of Model 2 is now low, the variance of Model 3 is large, and Model 2 is now the better of the two.
* Select Model 1 again (while keeping Models 2 and 3 selected too). In the two previous configurations, it is the worst of the three because of its large bias. But if you position x0 in a region of low bias for Model 1, this model becomes the best again (although only in a very narrow region), despite its relatively large variance.
The conclusion of this first part is that models with various levels of complexity have very different behaviors depending on where in the space of predictors they are used. In particular, which model is the best (lowest MSE) depends on where in space you are.
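This dependence on position can be made concrete with an exact computation, since the prediction of each piecewise model at x0 is a linear combination of the observations in the segment that contains x0. The sketch below is a minimal illustration under assumptions of our own: the curve f(x) = x - curvature * (x - 0.5)^2, 12 design points placed at the midpoints of 12 equal cells of [0, 1] (so that no point falls on a segment boundary), equal-width segments, and the hypothetical names local_terms and best_model.

```python
import numpy as np

def local_terms(x0, k, curvature=2.0, sigma=0.1, n=12):
    """Exact (bias, variance, MSE) at x0 of a k-segment piecewise linear fit.

    Assumed setup (hypothetical): design points at the midpoints of n equal
    cells of [0, 1], k equal-width segments fitted independently, and a true
    curve f(x) = x - curvature * (x - 0.5)**2 with noise std sigma.
    """
    x = (np.arange(n) + 0.5) / n
    f = lambda t: t - curvature * (t - 0.5) ** 2
    # Index of the segment containing each value (the right edge joins the last one).
    seg = lambda t: np.minimum((np.asarray(t, dtype=float) * k).astype(int), k - 1)
    inside = seg(x) == seg(x0)
    X = np.column_stack([np.ones(inside.sum()), x[inside]])
    # The prediction at x0 equals h @ y, where y holds the segment's observations.
    h = np.array([1.0, x0]) @ np.linalg.pinv(X)
    bias = h @ f(x[inside]) - f(x0)
    var = sigma**2 * (h @ h)
    return bias, var, bias**2 + var

def best_model(x0, models=(1, 2, 3), **kwargs):
    """Number of segments of the model with the lowest local MSE at x0."""
    mse = {k: local_terms(x0, k, **kwargs)[2] for k in models}
    return min(mse, key=mse.get)

for x0 in (0.50, 0.65, 0.79):
    print(f"x0 = {x0:.2f}  best model: {best_model(x0)} segment(s)")
```

Under these assumed settings the winner changes from the 3-segment model near x0 = 0.50 to the 2-segment model near x0 = 0.65, and to the 1-segment model near x0 = 0.79, where its bias almost vanishes. Other choices of curvature, noise and sample size move these regions around.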
Facing a new data set, the analyst has no way of visualizing the above phenomena, so a global quality criterion is needed. This criterion will be the average MSE of a model over the predictor space, weighted by the probability density of the vector of predictors.
For reasons of simplicity, we'll assume this density uniform over the domain of x. In other words, we'll assume that x0 is a random variable with a uniform distribution over the range of x.
Averaged MSE, bias-variance tradeoff ("Global" mode)
Click on "Reset", then on "Local" in the upper-left corner of the animation. We are now in the "Global" mode, and x0 vanishes.
* The "Biases" group of bars now displays the average biases of the models over the predictor space.
* The "Variances" group of bars now displays the average variances of the models over the predictor space.
* The "MSE" group of bars now displays the average MSEs of the models over the predictor space.
The average MSE is not equal to the average variance plus the square of the average bias. It is equal to the sum of the variance and the squared bias measured at one point, averaged over the whole space.
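In symbols, and with notation that is ours (p(x) is the density of x0, uniform here, and an overline denotes an average with respect to p):

```latex
\overline{\mathrm{MSE}}
  = \int \left[ \mathrm{Bias}(x)^2 + \mathrm{Var}(x) \right] p(x)\, dx
  = \overline{\mathrm{Bias}^2} + \overline{\mathrm{Var}}
  \;\geq\; \left( \overline{\mathrm{Bias}} \right)^2 + \overline{\mathrm{Var}} .
```

The inequality follows from Jensen's inequality, with equality only when the bias is the same at every point. A model whose local bias changes sign can therefore have an average bias close to 0 and still have a large average MSE.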
* Click on "Go". The average performances of the SLR model are displayed. Play with the noise level and sample size, and observe that these performances vary as expected.
Now return to the default settings ("Noise" = 5 and "Size" = 10).
* Keep the data generator linear ("Convex." = 0), and select Model 2 in addition to the already selected Model 1. Its average bias is of course 0, but its average variance is larger than that of Model 1. This is because each half of the model takes into account only a fraction of the data.
Just to be sure, now select Model 3, and notice that its average variance is even larger than that of Model 2.
* Now increase the convexity to "1". This small change causes a drastic change in the performances of the models.
- The average bias of Model 1 increases sharply in absolute value (note that all models have a negative average bias despite their strong positive bias at the end of each segment).
- Because the variances are not affected much by the convexity, Model 2 becomes the best model because of its low average bias. This is because this model is more flexible than the SLR model, and can therefore accommodate the convexity of the data generator to a certain extent.
- Model 3 is the worst because of its large average variance.
This configuration is quite representative of the bias-variance tradeoff :
- The simplest model (Model 1) is poor because of its large bias. Its low variance cannot compensate for this large bias, and the average MSE is large.
- The most complex model (Model 3) is poor because of its large variance. Its low bias cannot compensate for this large variance, and the average MSE is large.
- The best model (Model 2) is somewhere in-between. Its bias is not the lowest, neither is its variance, but its average MSE is the lowest because of a good tradeoff between these two quantities.
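The same ranking can be obtained numerically by averaging the exact local squared bias and variance over a fine grid of x0 values, which mimics a uniform density on the range of x. The sketch below is an illustration under assumed settings of our own (curve f(x) = x - curvature * (x - 0.5)^2, 12 design points at cell midpoints, noise standard deviation 0.15, hypothetical function name average_terms). It reports the average of the squared bias, since that is the quantity that enters the average MSE.

```python
import numpy as np

def average_terms(k, curvature=1.0, sigma=0.15, n=12, grid=600):
    """Average squared bias, variance and MSE of a k-segment model over [0, 1].

    Assumed setup (hypothetical): design points at the midpoints of n equal
    cells, k equal-width segments fitted independently, x0 uniform on [0, 1]
    (approximated by a midpoint grid), f(x) = x - curvature * (x - 0.5)**2.
    """
    x = (np.arange(n) + 0.5) / n
    x0 = (np.arange(grid) + 0.5) / grid
    f = lambda t: t - curvature * (t - 0.5) ** 2
    seg = lambda t: np.minimum((t * k).astype(int), k - 1)
    bias = np.empty(grid)
    var = np.empty(grid)
    for s in range(k):
        tr, te = seg(x) == s, seg(x0) == s
        X = np.column_stack([np.ones(tr.sum()), x[tr]])
        # Row j of H holds the weights that turn the segment's y into the
        # prediction at the j-th grid point of this segment.
        H = np.column_stack([np.ones(te.sum()), x0[te]]) @ np.linalg.pinv(X)
        bias[te] = H @ f(x[tr]) - f(x0[te])
        var[te] = sigma**2 * np.sum(H**2, axis=1)
    return np.mean(bias**2), np.mean(var), np.mean(bias**2 + var)

for curvature in (0.0, 1.0, 2.0):
    print(f"curvature = {curvature}")
    for k in (1, 2, 3):
        b2, v, m = average_terms(k, curvature=curvature)
        print(f"  Model {k}: avg bias^2 = {b2:.5f}  avg var = {v:.5f}  avg MSE = {m:.5f}")
```

With these assumed values, Model 1 has the lowest average MSE when the curve is a straight line, Model 2 wins at curvature 1 with Model 3 last, and at curvature 2 Model 1 drops to last place while Model 2 stays first. The thresholds depend on the ratio between curvature and noise, so other settings give other rankings.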
* Increase the convexity up to 2. The average bias of Model 1 increases again, and this model now becomes the worst of the three. The best one is still Model 2, and Model 3 is between the two.
Model 1 is now out of the race. Unselect it, and make the bars of the display longer by clicking on the yellow backgrounds behind the bars.
* Increase the convexity up to its maximum level (3). The MSEs of the two remaining models (2 and 3) are now influenced by two parameters: the noise level and the sample size.
Observe that for a given sample size, the performance of Model 3 degrades faster than that of Model 2 when the noise level is increased, because of the fast increase of its variance. For various sample sizes, find the noise level that makes Models 2 and 3 roughly equivalent.
In conclusion, when one considers the overall performance of models over the predictor space, a tradeoff between low bias and low variance appears clearly. Models that are too simple are strongly biased, whereas models that are too complex have too large a variance. The best model is somewhere in between these two extremes.
The optimal complexity depends sharply on :
* The nature of the process that generated the data (here, departure from linearity).
* The noise level on the measurements ("error variance").
* The sample size.
Generally, the only way to determine the optimal complexity is to compare different models with different complexities by cross-validation or bootstrap.
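As a rough illustration of the cross-validation route, the sketch below compares 1-, 2- and 3-segment models by 5-fold cross-validation on a single simulated sample. Everything specific in it is an assumption of ours: the simulated curve and noise level, the sample size of 60, the equal-width segments, and the helper names fit_predict and cv_mse.

```python
import numpy as np

def fit_predict(x_tr, y_tr, x_te, k):
    """Fit one least squares line per segment and predict at x_te."""
    seg = lambda t: np.minimum((t * k).astype(int), k - 1)
    pred = np.empty_like(x_te)
    for s in range(k):
        tr, te = seg(x_tr) == s, seg(x_te) == s
        if te.any():
            slope, intercept = np.polyfit(x_tr[tr], y_tr[tr], 1)
            pred[te] = intercept + slope * x_te[te]
    return pred

def cv_mse(x, y, k, n_folds=5, seed=0):
    """K-fold cross-validated mean squared prediction error of the k-segment model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    sq_err = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        pred = fit_predict(x[train], y[train], x[held_out], k)
        sq_err.append((pred - y[held_out]) ** 2)
    return float(np.mean(np.concatenate(sq_err)))

# Simulated sample (hypothetical settings): strongly curved process, low noise.
rng = np.random.default_rng(1)
n = 60
x = (np.arange(n) + 0.5) / n
y = x - 4.0 * (x - 0.5) ** 2 + rng.normal(0.0, 0.05, size=n)

scores = {k: cv_mse(x, y, k) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, "-> selected number of segments:", best_k)
```

The cross-validated error estimates the average MSE plus the error variance, which is the same for all models, so ranking models by it is a practical stand-in for ranking them by average MSE. With only one sample, the ranking of models with similar scores remains uncertain.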
1) The bias-variance tradeoff was illustrated by 1-dimensional data. So it's not directly linked to the dimension of the data space (number of variables), but rather to the complexity of the models (number of parameters). However, many models are such that the number of their parameters is a direct consequence of the data space dimension, the most obvious example being Multiple Linear Regression. Preliminary steps of variable selection or dimensionality reduction are then efficient means of reducing the number of parameters of the models, while losing as little information as possible about the data.
2) In the family of "Piecewise Linear Regression" models, the number of parameters increases by 2 units when replacing a model by the next model in the complexity scale (each partial model being characterized by one slope and one intercept).
Had we illustrated the bias-variance dilemma using polynomial regression, the number of parameters would have increased by one unit only from one model to the next (increase of the degree of the polynomial), therefore allowing a finer tuning of the complexity. The same can be said for Multiple Linear Regression.
But the complexity of a model can also be made continuously adjustable by methods such as the penalization of the Sum of Squares. The most popular of these techniques is Ridge Regression. The complexity of the model is then a real number that is smaller than the (integer) number of parameters. It is called the "effective number of parameters", and its value is determined by an additional parameter (the "ridge parameter"), whose value is defined by the analyst.
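For Ridge Regression, the effective number of parameters is usually defined as the trace of the matrix that maps the observations to the fitted values, which can be computed from the singular values d_j of the predictor matrix as the sum of d_j^2 / (d_j^2 + ridge parameter). The sketch below is a minimal illustration of that formula. The random predictor matrix and the function name ridge_effective_df are ours, and the columns are assumed to be centered with no penalized intercept.

```python
import numpy as np

def ridge_effective_df(X, ridge_param):
    """Effective number of parameters of ridge regression on predictors X.

    Equals trace(X (X'X + ridge_param * I)^-1 X'), computed from the
    singular values of X. Assumes centered columns and no intercept column.
    """
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + ridge_param)))

# Hypothetical data: 50 observations of 5 centered predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"ridge parameter = {lam:6.1f}  effective parameters = {ridge_effective_df(X, lam):.3f}")
```

With a ridge parameter of 0 the effective number equals the number of predictors (5 here), and it decreases continuously toward 0 as the ridge parameter grows, which is what makes the complexity continuously adjustable.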