Interactive animation
The Simple Linear Regression model is represented by a straight line, called the "Least Squares Line". This line is a condensed graphic representation of the distribution of the sample in the (x, y) plane. It is further used to predict "y" for new values of "x".
The very name of this line tells how it is determined. For any straight line D in the plane:
* Measure the vertical distance from a point to the line D,
* Square this value,
* Add the results for all points in the sample.
It can be shown that there is one, and only one, line for which this quantity is minimal. This is the Least Squares Line.
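The three steps above, and the claim that a single line minimizes the result, can be sketched in a few lines of Python. This is only an illustration: the function names `sum_of_squares` and `least_squares_line` and the sample data are invented for the example, and the closed-form solution used is the standard textbook one, not something taken from the animation.

```python
def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b) for the line minimizing sum_of_squares (standard closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical sample, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

a, b = least_squares_line(xs, ys)
best = sum_of_squares(xs, ys, a, b)

# Any small move away from (a, b) should give a larger sum of squares
for da, db in [(0.01, 0.0), (-0.01, 0.0), (0.0, 0.01), (0.0, -0.01)]:
    print(sum_of_squares(xs, ys, a + da, b + db) > best)
```

Every nudged candidate line gives a larger sum of squares than the fitted one, which is what the uniqueness property predicts.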
The following interactive figure illustrates the concept of Least Squares Line.
The number of points can be changed in the "Reset" mode only. "Noise" is in arbitrary units.
Drag the green cursors to move the "candidate" line until you get the lowest possible value in the mobile display.
This value is a modified version of the sum of the squares of the distances between the points and the line :
* First, this sum is divided by the number of points, in order to obtain the average value of the squares of the distances of the points to the line.
* Then, the square root of this new quantity is taken, in order to obtain not the square of a distance but something akin to a distance, which is easier to visualize (much as when switching from variance to Standard Deviation). This last quantity is the one displayed. It resembles the average distance from the points to the line, but it is not the same thing: it is the root mean square of the distances.
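The difference between the displayed value and a plain average distance is easy to see on a small example. The sketch below is illustrative only; `rms_distance` and `mean_abs_distance` are names chosen here, and the four points are made up so that the arithmetic can be checked by hand.

```python
import math


def rms_distance(xs, ys, a, b):
    """Square root of the average squared vertical distance to y = a + b*x."""
    residuals = [a + b * x - y for x, y in zip(xs, ys)]
    mean_square = sum(r * r for r in residuals) / len(residuals)
    return math.sqrt(mean_square)


def mean_abs_distance(xs, ys, a, b):
    """Plain average of the vertical distances to y = a + b*x."""
    return sum(abs(a + b * x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical sample: three points on the line y = x, one point 4 units above it
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 7.0]

rms = rms_distance(xs, ys, 0.0, 1.0)       # sqrt((0 + 0 + 0 + 16) / 4)
mad = mean_abs_distance(xs, ys, 0.0, 1.0)  # (0 + 0 + 0 + 4) / 4
print(rms, mad)
```

Here the root mean square is 2.0 while the average distance is 1.0: squaring gives more weight to the one distant point.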
____________________
For a given sample, try several starting positions
for the line. You'll easily convince yourself that you always end up with the
same final line : there is only one line such that any small change of
the position of the line always causes an increase of the sum of squares. This
is a very important property. It is linked to the fact that we are trying to
account for the sample with a straight line.
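The same experiment can be imitated numerically. In the sketch below, plain gradient descent stands in for dragging the cursors by hand; this is an assumption made for the illustration, not a description of how the animation works. The function name `descend`, the step size, the number of steps and the data are all chosen for this example.

```python
def descend(xs, ys, a, b, step=0.05, n_steps=5000):
    """Move the line y = a + b*x downhill on the mean squared distance."""
    n = len(xs)
    for _ in range(n_steps):
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(residuals) / n
        grad_b = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        a -= step * grad_a
        b -= step * grad_b
    return a, b


# Hypothetical sample, for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.3, 0.8, 2.1, 2.9, 4.2]

# Three very different starting positions (intercept, slope) for the candidate line
starts = [(-5.0, 3.0), (10.0, -2.0), (0.0, 0.0)]
finals = [descend(xs, ys, a0, b0) for a0, b0 in starts]
for a, b in finals:
    print(round(a, 6), round(b, 6))
```

All three runs end on the same intercept and slope, as expected when the sum of squares has a single minimum. The step size was picked to suit this particular data set; other data may need a smaller one.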
In more complex situations,
a more complex shape may be appropriate. It may then happen that several
different "curved" lines are such that any small change of the position
or shape of a curve will cause an increase of the sum of squares. This
is what happens, for instance, with Neural
Networks.
Interactive animation
One of the standard assumptions of Simple Linear
Regression is that the variance of the noise is constant throughout the
range of the independent variable x (homoscedasticity). This condition
is necessary for the Least
Squares Line to be the best predictor.
This assumption
is not always met in practice. Actually, it is quite common to observe situations
where the variance of the noise depends on the value of x (heteroscedasticity).
The most common case is when the noise gets larger as x gets larger.
In such a situation, the ordinary Least Squares line
is not the best linear prediction model anymore, and
has to be replaced by the Weighted Least Squares line. The general idea
is that an observation with a large noise variance should be given
less importance in defining the prediction line than an observation with
a small noise variance. This goal is met by resorting to a modified version
of the sum of the squares of the residuals (SSR). More specifically, if (xi,
yi) are the observations :
* The ordinary Least Squares line minimizes the quantity:
SSR = Σi (yi* - yi)²
where yi* = a + b.xi is the predicted value for observation #i,
whereas
* The Weighted Least Squares line minimizes the quantity:
SSRw = Σi wi.(yi* - yi)²
where the wi are appropriate "weights" whose role is to reduce the influence of residuals in regions of high noise variance.
How are the wi determined? It can be shown that the weight assigned to observation i should be inversely proportional to the noise variance at xi:
wi = k / var(yi)
where k is a proportionality constant.
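For reference, once the weights are known, minimizing SSRw with respect to a and b has a standard closed-form solution. It is written below in LaTeX; the weighted means \bar{x}_w and \bar{y}_w are notation introduced for this sketch and do not appear elsewhere in the text.

```latex
\[
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad
\bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}
\]
\[
b = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}
         {\sum_i w_i (x_i - \bar{x}_w)^2}, \qquad
a = \bar{y}_w - b\,\bar{x}_w
\]
```

Multiplying every w_i by the same constant leaves a and b unchanged, which is why the value of k does not matter. With all w_i equal, the formulas reduce to those of the ordinary Least Squares line.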
We treat here
a simpler but similar problem that should give you a feel for the use of the
inverses of local noise variance as weights.
Of course, the difficulty is to determine the value var(yi) of the noise variance at each observation. A frequently used approximation is to assume that the variance var(yi) is simply proportional to xi: if xj is twice as large as xi, then var(yj) is assumed to be twice as large as var(yi):
var(yi) = c.xi
where c is a proportionality constant. The weights wi are then inversely proportional to xi:
wi ~ 1 / xi
We give here a quite realistic example for which this assumption is fully justified.
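As a rough illustration of weights of the form 1 / xi, the sketch below fits a weighted line with the usual closed-form solution based on weighted means. The function name `weighted_least_squares_line` and the data are invented for this example, and the x values are assumed strictly positive so that 1 / x is defined.

```python
def weighted_least_squares_line(xs, ys, ws):
    """Return (a, b) minimizing sum of w * (a + b*x - y)**2 over the sample."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w  # weighted mean of x
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w  # weighted mean of y
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical sample with positive x values, for illustration only
xs = [1.0, 2.0, 4.0, 8.0]
ys = [1.1, 1.8, 4.6, 7.0]

ws = [1.0 / x for x in xs]  # weights inversely proportional to x
a_wls, b_wls = weighted_least_squares_line(xs, ys, ws)

# Equal weights give back the ordinary Least Squares line
a_ls, b_ls = weighted_least_squares_line(xs, ys, [1.0] * len(xs))

print("WLS:", a_wls, b_wls)
print("LS: ", a_ls, b_ls)
```

Only the relative sizes of the weights matter: multiplying all of them by the same constant gives the same line.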
The following animation illustrates the concept of Weighted Least Squares regression.
The illustration shows:
* A grey, fixed "true" regression line that passes through the origin (just for convenience),
* A set of observations generated from this line with a noise level proportional to x,
* The blue "Ordinary Least Squares" line (LS),
* The red "Weighted Least Squares" line (WLS),
* The "true" value to be predicted for the value of x where the green slider currently stands (thick black tick to the left of the y axis),
* The average of the values predicted so far by each of the two models (ordinary and weighted) for the value of x where the slider currently stands (thick blue and red ticks to the left of the y axis).
After clicking on "Go", each "average prediction" tick is sandwiched between two thin ticks, each positioned one standard deviation away from the average of the predictions obtained so far. All those ticks quickly converge toward their final positions.
1) In the "Next" mode, compare the LS and
WLS lines near the origin, where the noise level is lowest. Observe that the
WLS places more emphasis on staying close to the data points in this region
than the LS does.
2) In the "Run" mode, observe that both
the LS and the WLS average ticks ultimately line up with the fixed black tick
(y value to be predicted) : both the LS and WLS predictors are unbiased.
The means of their distributions ("expected values") are just the
(common) value to be predicted.
3) Observe that the standard deviation of the WLS
prediction distribution becomes ultimately smaller than that of the LS
prediction distribution. The magnitude of this improvement is certainly not
breathtaking, but it is clearly visible. Its numerical value is posted in the "Results"
frame ("Std. Dev. ratio").
Equivalently, observe the predicted
values for both LS and WLS ("Next" mode), and notice that the
red horizontal line (WLS prediction) is more often than not sandwiched between the black
(true value) and blue (LS prediction) lines, meaning that the WLS prediction is
more often than not more accurate than the LS prediction. When these predictions
are on either side of the value to be predicted, the WLS prediction is more
often than not closer to this value than the LS prediction.
Equivalently, observe that the WLS line more often
than not sits within the angle between the true regression line and the LS line, meaning
that it is "closer" to the true regression line than the LS line is. When the LS and
WLS lines are on either side of the true regression line, the WLS line is, more
often than not, closer to it than the LS line.
4) Conduct several runs with a different position
of the slider each time ("Reset"). Notice that there is one position where the standard
deviations of LS and WLS are equal : both models have identical predicting
performances for this position.
Conversely, the difference in standard deviations
is largest at either end of the range of the independent variable x:
resorting to WLS (instead of just LS) becomes more advantageous for predictions
to be made near extreme values of x. The improvement is particularly
noticeable for small values of x, which is to be expected as WLS is particularly
careful to stay close to data points in regions of low noise level.
5) For a given number of points and slider position,
conduct several runs with a different noise level each time. Observe that, although
standard deviations become larger with increased noise level, their ratio remains
constant. So the degree of improvement from LS to WLS does not depend on the
absolute noise level, only on how this noise varies along the x
axis (up to a scale factor).
6) Increase the number of points ("Reset" mode), and observe that the ratio of the standard deviations decreases, that is, the advantage of WLS over LS becomes more perceptible. This is to be expected as:
* increasing the number of points may be construed as just extending the data set further to the right with additional points, and then changing the x scale to confine the data set within the same range,
* and we previously noted that the edge of WLS over LS increases as you move further to the right.
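Several of the observations above (unbiasedness of both predictors, the smaller spread of the WLS predictions, and the independence of the ratio from the absolute noise level) can be reproduced with a small simulation. The sketch below is not the code of the animation. It assumes a true line y = x, points at x = 1, 2, ..., 10, Gaussian noise whose variance is proportional to x, and weights 1 / x; the names `fit_line` and `simulate` and all parameter values are choices made for this example.

```python
import math
import random
import statistics


def fit_line(xs, ys, ws):
    """Weighted least squares fit of y = a + b*x (equal weights give ordinary LS)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def simulate(x0, noise, n_points=10, n_runs=4000, seed=0):
    """Return (LS mean, LS std dev, WLS mean, WLS std dev) of the predictions at x0."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, n_points + 1)]
    equal_w = [1.0] * n_points
    inverse_x_w = [1.0 / x for x in xs]
    ls_preds, wls_preds = [], []
    for _ in range(n_runs):
        # Assumption: true line y = x, noise variance proportional to x
        ys = [x + rng.gauss(0.0, noise * math.sqrt(x)) for x in xs]
        a, b = fit_line(xs, ys, equal_w)
        ls_preds.append(a + b * x0)
        a, b = fit_line(xs, ys, inverse_x_w)
        wls_preds.append(a + b * x0)
    return (statistics.mean(ls_preds), statistics.stdev(ls_preds),
            statistics.mean(wls_preds), statistics.stdev(wls_preds))


# Predict at a small x, where the true value is y = 1
ls_mean, ls_sd, wls_mean, wls_sd = simulate(x0=1.0, noise=1.0)
ratio = wls_sd / ls_sd
print("means:", ls_mean, wls_mean)
print("std dev ratio (WLS / LS):", ratio)

# Same random draws, three times the noise level
_, ls_sd_3, _, wls_sd_3 = simulate(x0=1.0, noise=3.0)
print("ratio at 3x noise:", wls_sd_3 / ls_sd_3)
```

Under these assumptions both averages settle near the true value, the WLS standard deviation is smaller than the LS one, and tripling the noise level leaves the ratio unchanged. Changing `x0` or `n_points` is a simple way to explore points 4) and 6).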
_________________________________
We present here the mathematical derivation of the coefficients (slope and intercept) of the Weighted Least Squares line.
See "Validation".