Interactive animation

Least Squares line

 The Simple Linear Regression model is represented by a straight line, called the "Least Squares Line". This line is a condensed graphic representation of the distribution of the sample in the (x, y) plane. It is also used to predict "y" for new values of "x".

 

The very name of this line tells how it is determined. For any straight line D in the plane :

    * Measure the vertical distance from each point of the sample to the line D,

    * Square this value,

    * Add the results for all points in the sample.


 It can be shown that there is one, and only one line for which this quantity is minimal. This is the Least Squares Line.
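The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names and the five sample points are made up for the example, and the slope and intercept formulas are the standard closed-form result for a straight line, not something taken from the animation.

```python
def sum_of_squares(a, b, xs, ys):
    """Sum of the squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Intercept a and slope b of the line that minimizes sum_of_squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical sample, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
best = sum_of_squares(a, b, xs, ys)
print("intercept:", a, "slope:", b, "sum of squares:", best)

# Moving the line slightly in any direction should increase the sum of squares.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(da, db, sum_of_squares(a + da, b + db, xs, ys))
```

Every perturbed line printed by the loop has a larger sum of squares than the fitted line, which is what "one, and only one line" means in practice.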

 

 

The following interactive figure illustrates the concept of Least Squares Line.

 

 

 


 

 

The number of points can be changed in the "Reset" mode only. "Noise" is in arbitrary units.

 

Drag the green cursors to move the "candidate" line until you get the lowest possible value in the mobile display.

This value is a modified version of the sum of the squares of the distances between the points and the line :

    * First, this sum is divided by the number of points, in order to obtain the average value of the squares of the distances of the points to the line.

    * Then, one takes the square root of this new quantity in order to obtain not the square of a distance, but something akin to a distance, which is easier to visualize (this is pretty much what we do when switching from variance to Standard Deviation). This last quantity is then displayed. It resembles the average distance from the points to the line, but it is not equal to it : it is the root mean square of the vertical distances, which is never smaller than their plain average.
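As a rough sketch of the two steps above, the following Python snippet computes the displayed quantity for a candidate line and compares it with the plain average of the vertical distances. The function names and the four data points are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def rms_distance(a, b, xs, ys):
    """Square root of the average squared vertical distance to the line y = a + b*x."""
    n = len(xs)
    mean_square = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
    return math.sqrt(mean_square)


def mean_abs_distance(a, b, xs, ys):
    """Plain average of the vertical distances, shown for comparison only."""
    n = len(xs)
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / n


# Hypothetical points; the candidate line is the horizontal axis (a = 0, b = 0),
# so the vertical distances are simply 1, 1, 3 and 3.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, -1.0, 3.0, -3.0]

displayed = rms_distance(0.0, 0.0, xs, ys)      # sqrt((1 + 1 + 9 + 9) / 4) = sqrt(5)
average = mean_abs_distance(0.0, 0.0, xs, ys)   # (1 + 1 + 3 + 3) / 4 = 2
print(displayed, average)
```

The two numbers differ because squaring gives more importance to the points that are far from the line.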

____________________

 

For a given sample, try several starting positions for the line. You'll easily convince yourself that you always end up with the same final line : there is only one line such that any small change of the position of the line always causes an increase of the sum of squares. This is a very important property. It is linked to the fact that we are trying to account for the sample with a straight line.
In more complex situations, a more complex shape may be appropriate. It may then happen that several different "curved" lines are such that any small change of the position or shape of a curve will cause an increase of the sum of squares. This is what happens, for instance, with Neural Networks.
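The experiment with several starting positions can be imitated numerically. The sketch below replaces dragging the cursors by hand with plain gradient descent on the average squared distance; this substitution, the step size, the number of steps and the sample are all assumptions made for the illustration.

```python
def descend(a, b, xs, ys, step=0.04, n_steps=5000):
    """Repeatedly nudge the line y = a + b*x downhill on the mean squared distance."""
    n = len(xs)
    for _ in range(n_steps):
        residuals = [(a + b * x) - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(residuals) / n
        grad_b = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        a -= step * grad_a
        b -= step * grad_b
    return a, b


# Hypothetical sample, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Three very different starting lines, given as (intercept, slope).
starts = [(0.0, 0.0), (10.0, -3.0), (-5.0, 4.0)]
finals = [descend(a0, b0, xs, ys) for a0, b0 in starts]
for start, final in zip(starts, finals):
    print(start, "->", final)
```

All three runs end on the same line. The step size was chosen small enough for this particular sample; a different sample may need a smaller one.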

 

Interactive animation

Least Squares  (Weighted)

 

One of the standard assumptions of Simple Linear Regression is that the variance of the noise is constant throughout the range of the independent variable x (homoscedasticity). This condition is required for the Least Squares Line to be the best linear predictor.
This assumption is not always met in practice. Actually, it is quite common to observe situations where the variance of the noise depends on the value of x (heteroscedasticity). The most common case is when the noise gets larger as x gets larger.


In such a situation, the ordinary Least Squares line is not the best linear prediction model anymore, and has to be replaced by the Weighted Least Squares line. The general idea is that an observation with a large noise variance should be given less importance in defining the prediction line than an observation with a small noise variance. This goal is met by resorting to a modified version of the sum of the squares of the residuals (SSR). More specifically, if (xi, yi) are the observations :

    * The ordinary Least Squares line minimizes the quantity :

SSR = Σi (yi* - yi)²

 where  yi* = a + b.xi is the predicted value for observation #i, whereas
 

    *  The Weighted Least Squares line minimizes the quantity :

SSRw = Σi wi.(yi* - yi)²

 where the wi are appropriate "weights" whose role is to reduce the influence of residuals in regions of high noise variance.

 

How are the wi determined ? It can be shown that the weight assigned to observation i should be inversely proportional to the noise variance at xi.

wi = k / var(yi)


where k is a proportionality constant.


We treat here a simpler but similar problem that should give you a feel for the use of the inverses of local noise variance as weights.

Of course, the difficulty is to determine the value var(yi) of the noise variance at each observation. A frequently used approximation is to assume that the variance var(yi) is simply proportional to xi : if xj is twice as large as xi, then var(yj ) is assumed to be twice as large as var(yi).

var(yi) = c.xi


where c is a proportionality constant. The weights wi are then also inversely proportional to xi :

wi ~ 1 / xi 

We give here a quite realistic example for which this assumption is fully justified.
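To make the role of the weights concrete, here is a minimal Python sketch of a weighted fit with wi = 1 / xi. It assumes that all xi are strictly positive, uses the standard closed-form solution based on weighted means, and runs on a small made-up sample; the function names are hypothetical.

```python
def weighted_ssr(a, b, xs, ys, ws):
    """Weighted sum of squared residuals for the line y = a + b*x."""
    return sum(w * ((a + b * x) - y) ** 2 for x, y, w in zip(xs, ys, ws))


def weighted_least_squares_line(xs, ys, ws):
    """Intercept a and slope b minimizing weighted_ssr (closed form)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w   # weighted mean of x
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w   # weighted mean of y
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical sample with strictly positive x values.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [2.2, 3.9, 8.6, 14.5]

weights = [1.0 / x for x in xs]   # wi ~ 1 / xi
uniform = [1.0] * len(xs)         # equal weights give the ordinary line

a_wls, b_wls = weighted_least_squares_line(xs, ys, weights)
a_ols, b_ols = weighted_least_squares_line(xs, ys, uniform)
print("weighted line:", a_wls, b_wls)
print("ordinary line:", a_ols, b_ols)
```

Because only the ratios between weights matter, the proportionality constant can be left out of the computation.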

 

The following animation illustrates the concept of Weighted Least Squares regression.

 

 


 



The illustration shows :

    * A grey, fixed "true" regression line that passes through the origin (just for convenience),

    * A set of observations generated from this line with a noise level proportional to x,

    * The blue "Ordinary Least Squares" line (LS),

    * The red "Weighted Least Squares" line (WLS),

    * The "true" value to be predicted for the value of x where the green slider currently stands (thick black tick to the left of the y axis).

    * For each of the two models (ordinary and weighted), the average of the predictions made so far for the value of x where the slider currently stands (thick blue and red ticks to the left of the y axis).

 

After clicking on "Go", each "average prediction" tick is sandwiched between two thin ticks, each positioned one standard deviation away from the average of the predictions made so far. All those ticks quickly converge toward their final positions.

 

1) In the "Next" mode, compare the LS and WLS lines near the origin, where the noise level is lowest. Observe that the WLS line places more emphasis on staying close to the data points in this region than the LS line does.
 

2) In the "Run" mode, observe that both the LS and the WLS average ticks ultimately line up with the fixed black tick (y value to be predicted) : both the LS and WLS predictors are unbiased. The means of their distributions ("expected values") are just the (common) value to be predicted.
 

3) Observe that the standard deviation of the WLS prediction distribution ultimately becomes smaller than that of the LS prediction distribution. The magnitude of this improvement is certainly not breathtaking, but it is clearly visible. Its numerical value is posted in the "Results" frame ("Std. Dev. ratio").
   Equivalently, observe the predicted values for both LS and WLS ("Next" mode), and notice that the red horizontal line (WLS prediction) is more often than not sandwiched between the black (true value) and blue (LS prediction) lines, meaning that the WLS prediction is more often than not more accurate than the LS prediction. When these predictions are on either side of the value to be predicted, the WLS prediction is more often than not closer to this value than the LS prediction.

Equivalently, observe that the WLS line more often than not sits within the angle between the true regression line and the LS line, meaning that it is "closer" to the regression line than the LS line is. When the LS and WLS lines are on either side of the regression line, the WLS line is, more often than not, closer to the regression line than the LS line.
 

4) Conduct several runs with a different position of the slider each time ("Reset"). Notice that there is one position where the standard deviations of LS and WLS are equal : both models have identical predicting performances for this position.
Conversely, the difference in standard deviations is largest at either end of the range of the independent variable x : resorting to WLS (instead of just LS) becomes more advantageous for predictions to be made near extreme values of x. The improvement is particularly noticeable for small values of x, which is to be expected as WLS is particularly careful to stay close to data points in regions of low noise level.
 

5) For a given number of points and slider position, conduct several runs with a different noise level each time. Observe that, although standard deviations become larger with increased noise level, their ratio remains constant. So the degree of improvement from LS to WLS does not depend on the absolute noise level, only on how this noise varies along the x axis (up to a scale factor).
 

6) Increase the number of points ("Reset" mode), and observe that the ratio of the standard deviations decreases, that is,  the advantage of WLS over LS becomes more perceptible. This is to be expected as :

    * increasing the number of points may be construed as just extending the data set further to the right with additional points, and then changing the x scale to confine the data set within the same range,

    * and we previously noted that the edge of WLS over LS increases as you move further to the right.
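Without the animation at hand, observations 2), 3) and 5) can be checked with a small simulation. The sketch below is an assumption-laden stand-in for the animation, not a description of it: it takes x = 1, 2, ..., n, a "true" line through the origin with an arbitrary slope of 2, Gaussian noise whose variance is proportional to x, and weights 1 / x. All names and default values are hypothetical.

```python
import math
import random
import statistics


def fit_weighted(xs, ys, ws):
    """Intercept and slope minimizing sum(w * (a + b*x - y)**2)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def simulate(x0, noise_level, n_points=10, n_runs=2000, slope=2.0, seed=0):
    """Repeat the experiment n_runs times and summarize both predictions at x0."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, n_points + 1)]
    ls_weights = [1.0] * n_points          # ordinary Least Squares
    wls_weights = [1.0 / x for x in xs]    # weights inversely proportional to x
    ls_preds, wls_preds = [], []
    for _ in range(n_runs):
        # Noise standard deviation grows like sqrt(x), so its variance grows like x.
        ys = [slope * x + rng.gauss(0.0, noise_level * math.sqrt(x)) for x in xs]
        a, b = fit_weighted(xs, ys, ls_weights)
        ls_preds.append(a + b * x0)
        a, b = fit_weighted(xs, ys, wls_weights)
        wls_preds.append(a + b * x0)
    return {
        "true": slope * x0,
        "ls_mean": statistics.mean(ls_preds),
        "wls_mean": statistics.mean(wls_preds),
        "ls_std": statistics.pstdev(ls_preds),
        "wls_std": statistics.pstdev(wls_preds),
    }


low = simulate(x0=1.0, noise_level=1.0)
high = simulate(x0=1.0, noise_level=3.0)
print(low)
print("Std. Dev. ratio, low noise :", low["wls_std"] / low["ls_std"])
print("Std. Dev. ratio, high noise:", high["wls_std"] / high["ls_std"])
```

Both averages land close to the true value, the WLS standard deviation is the smaller of the two, and the ratio is the same for both noise levels because the same seed is reused and the predictions depend linearly on the noise. Changing `x0` or `n_points` lets you explore observations 4) and 6) in the same way.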

_________________________________

 

We present here the mathematical derivation of the coefficients (slope and intercept) of the Weighted Least Squares line.

 

"Leave One Out"

See "Validation".
