Interactive animation
In mathematics, the "slope" is the tangent of the angle between a straight line and the x axis.
In data modeling, the term "slope" arises in the context of Simple Linear Regression (SLR), where it refers to the slope of the Least Squares Line (LSL).
Two parameters are needed to define a straight line unambiguously; the other parameter is usually the intercept.
The slope has a simple interpretation: suppose you move a distance dx to the right along the x axis. Then the corresponding point on the straight line goes up (or down) by a quantity dy, with:
dy = Slope × dx

So the slope tells how fast y changes when x is changed (although its numerical value depends on the units for x and y).
The LSL depends on the particular sample at hand, and so does its slope, which must therefore be considered a random variable. Under the standard assumptions of SLR, the distribution of the slope is well understood and can be calculated exactly.
The following figure illustrates the distribution of the slope under various "experimental" conditions. You need Flash Player to view it. If you don't have it, you can download it for free at www.macromedia.com/downloads/ .
The illustration initially shows:
* a regression line (in red),
* a sample,
* the corresponding LSL (in blue), together with the current slope (also in blue).
To choose another regression line, click on "New".
The points in the illustration are equally spaced along the x axis. This may look like a severe limitation, but it is not:
* First, it is not an unusual situation in real life.
* More importantly, SLR does not consider x as a random variable (only y is random). The distribution of the slope depends only on the number of points, the standard deviation of the x values of the sample, and the noise level, and these three quantities remain constant from one sample to the next. The detailed positions of the points do not matter, so keeping the points equally spaced, although a limitation, is not a severe one.
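The claim that only the number of points, the x standard deviation and the noise level matter can be checked with a short sketch. It assumes the standard result that the least-squares slope is a linear combination of the y values, with weights (x_i - mean_x) / Sxx; the function names, the two designs and the numerical values below are illustrative choices, not part of the original illustration.

```python
def slope_variance(xs, sigma):
    """Exact variance of the least-squares slope for fixed x positions.

    The slope is b = sum(w_i * y_i) with w_i = (x_i - mean_x) / Sxx, so for
    uncorrelated noise of common variance sigma**2,
    Var(b) = sigma**2 * sum(w_i**2).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    weights = [(x - mean_x) / sxx for x in xs]
    return sigma ** 2 * sum(w * w for w in weights)


def rescale_to_sd(xs, target_sd):
    """Stretch xs about its mean so that its (divisor n) standard deviation is target_sd."""
    n = len(xs)
    mean_x = sum(xs) / n
    sd = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    return [mean_x + (x - mean_x) * target_sd / sd for x in xs]


equally_spaced = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
irregular = [0.0, 0.2, 0.5, 3.0, 3.1, 7.0]  # arbitrary illustrative positions

# Give both designs the same number of points and the same x standard deviation.
target_sd = 2.0
design_a = rescale_to_sd(equally_spaced, target_sd)
design_b = rescale_to_sd(irregular, target_sd)

sigma = 1.5  # noise standard deviation (illustrative value)
var_a = slope_variance(design_a, sigma)
var_b = slope_variance(design_b, sigma)
print("equally spaced design:", var_a)
print("irregular design:     ", var_b)
```

Both designs give the same slope variance, although the individual x positions differ.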
The bottom frame shows a gaussian curve, the theoretical distribution of the slope.
* The mean of the gaussian is positioned at the value of the slope of the true red regression line (which is unknown in real life). This is a consequence of the fact that the slope of the LSL is an unbiased estimator of the slope of the regression line.
* The standard deviation of the gaussian is the theoretical standard deviation of the distribution of the slope of the LSL.
Click on "Go" and observe the distribution of the slope progressively build up.
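For readers who cannot run the animation, the same experiment can be approximated by a small Monte Carlo simulation. This is only a sketch under assumptions: the regression line, the noise level, the number of points and the seed are arbitrary illustrative values, and `fit_line` and `simulate_slopes` are hypothetical helper names.

```python
import random
import statistics


def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def simulate_slopes(true_a, true_b, xs, sigma, n_samples, seed=0):
    """Draw n_samples noisy samples around the true line and collect the LSL slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        ys = [true_a + true_b * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_line(xs, ys)[1])
    return slopes


xs = [float(i) for i in range(10)]      # equally spaced points
true_a, true_b, sigma = 1.0, 0.5, 2.0   # illustrative regression line and noise level
slopes = simulate_slopes(true_a, true_b, xs, sigma, n_samples=20000)

# Theoretical standard deviation of the slope: sigma / sqrt(Sxx).
mean_x = sum(xs) / len(xs)
sxx = sum((x - mean_x) ** 2 for x in xs)
theoretical_sd = sigma / sxx ** 0.5

print("mean of simulated slopes:", statistics.fmean(slopes))
print("SD of simulated slopes:  ", statistics.pstdev(slopes))
print("theoretical SD:          ", theoretical_sd)
```

The simulated mean should settle close to the true slope, and the simulated standard deviation close to the theoretical one, as the number of samples grows.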
________________________________________
The variance (or standard deviation) of the slope's distribution is a fundamental quantity in SLR. It measures the uncertainty about the slope of the regression line, that is, about the strength of the link between the independent variable x and the response variable y. It is the basis of the test that decides whether the existence of a functional link between x and y is a credible assumption.
If you're already somewhat knowledgeable about SLR, you may be surprised that horizontal regression lines are not banned from the above illustration, as they depict situations where y does not depend on x. The reason is that we are not addressing here the issue of the link between x and y, but only that of the slope of the LSL, which is unambiguously defined whether or not there is a link between x and y.
________________________________________
Unlike the intercept's distribution, the slope's distribution does not depend on the positions of the x and y axes (this is why neither axis is adjustable in the illustration). In other words:
* adding the same quantity to all the x values,
* and/or adding the same quantity to all the y values,
leaves the slope unchanged.
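The invariance can be verified on a single sample. In the minimal sketch below, the data and the two shift amounts are made-up illustrative values, and `fit_line` is a hypothetical helper that applies the usual least-squares formulas.

```python
import math


def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

a0, b0 = fit_line(xs, ys)
# Shift every x by +10 and every y by -3 (arbitrary amounts).
a1, b1 = fit_line([x + 10.0 for x in xs], [y - 3.0 for y in ys])

print("slope before / after shift:    ", b0, b1)
print("intercept before / after shift:", a0, a1)
print("slopes equal:", math.isclose(b0, b1, rel_tol=1e-9))
```

The slope is the same in both fits, while the intercept changes with the position of the axes.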
You may simulate a translation of the y axis by translating the sample range (use the "Left" and "Right" controls, and keep the difference "Right - Left" constant), leaving all other parameters constant. You may do so while retaining the same regression line by first clicking on the small "Reset" button at the bottom right corner of the illustration.
Notice that the slope's distribution is unchanged when you translate the sample's range.
____________________________
Change the number of points (all other parameters being held fixed), and observe that the standard deviation of the slope's distribution decreases as the number of points increases. Increasing the number of points reduces the uncertainty about the true position of the regression line.
____________________________
Change the range of the sample (all other parameters being held fixed). Observe that the standard deviation of the slope's distribution always decreases when the range increases. The situation is similar to that of a direction in space defined by a pipe: the direction (regression line) is more accurately defined by a longer pipe.
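The effect of the number of points and of the range can both be tabulated from the theoretical standard deviation of the slope, sigma / sqrt(Sxx), where Sxx is the sum of squared deviations of the x values from their mean. The sketch below assumes equally spaced points, as in the illustration; the ranges, point counts and noise level are illustrative values and the helper names are hypothetical.

```python
def equally_spaced(left, right, n):
    """Return n equally spaced x positions from left to right inclusive (n >= 2)."""
    step = (right - left) / (n - 1)
    return [left + i * step for i in range(n)]


def slope_sd(xs, sigma):
    """Theoretical standard deviation of the LSL slope: sigma / sqrt(Sxx)."""
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sigma / sxx ** 0.5


sigma = 1.0  # noise standard deviation (illustrative value)

# More points on a fixed range [0, 10].
sd_by_n = {n: slope_sd(equally_spaced(0.0, 10.0, n), sigma) for n in (5, 10, 20, 40)}

# Wider range with a fixed number of points (10).
sd_by_range = {w: slope_sd(equally_spaced(0.0, w, 10), sigma) for w in (5.0, 10.0, 20.0, 40.0)}

for n, sd in sd_by_n.items():
    print(f"n = {n:2d}, range = 10   -> SD of slope = {sd:.4f}")
for w, sd in sd_by_range.items():
    print(f"n = 10, range = {w:4.1f} -> SD of slope = {sd:.4f}")
```

With a fixed number of equally spaced points, Sxx grows as the square of the range, so doubling the range halves the standard deviation of the slope.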
____________________________
Notice that the variance of the slope's distribution does not depend at all on the regression line (for a given set of values of the parameters): click repeatedly on "New", and observe that although the position of the gaussian curve shifts to reflect the slope of the regression line, its variance remains constant.
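One way to see why the regression line plays no role is to note that the estimation error b - B depends only on the noise and on the x positions. The sketch below replays the same noise draws (same seed) around two very different regression lines and compares the errors; the lines, the seed and the helper names are illustrative assumptions.

```python
import random


def fit_slope(xs, ys):
    """Return the slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_errors(true_a, true_b, xs, sigma, n_samples, seed):
    """Return the errors (b - true_b) over n_samples simulated samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_samples):
        ys = [true_a + true_b * x + rng.gauss(0.0, sigma) for x in xs]
        errors.append(fit_slope(xs, ys) - true_b)
    return errors


xs = [float(i) for i in range(10)]

# Same seed, hence same noise, around a horizontal line and around a steep one.
errors_flat = slope_errors(0.0, 0.0, xs, 1.0, 1000, seed=42)
errors_steep = slope_errors(-3.0, 2.5, xs, 1.0, 1000, seed=42)

max_gap = max(abs(e1 - e2) for e1, e2 in zip(errors_flat, errors_steep))
print("largest difference between the two error sequences:", max_gap)
```

The two error sequences agree up to floating-point rounding, so the spread of the slope around its true value is the same for both lines.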
________________________________________________________________________
Basic results about the Slope in Simple Linear Regression
1) The equation of the LSL is denoted :
y = a + bx
The slope is therefore "b" (and "a" is the intercept).
2) Value of "b"
b = Cov(x, y)/Var(x)
The slope can also be expressed as:
b = r · sy / sx
where:
* r is the correlation coefficient of x and y.
* sx and sy are the standard deviations of x and of y.
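The two expressions of the slope can be compared numerically. The sketch below uses made-up data and writes r for the correlation coefficient (a symbol chosen here for illustration); variances and the covariance all use the same divisor n, which is what makes the two expressions agree.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]  # illustrative data

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance and variances, all computed with the same divisor n.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

s_x = math.sqrt(var_x)
s_y = math.sqrt(var_y)
r = cov_xy / (s_x * s_y)  # correlation coefficient of x and y

b_from_cov = cov_xy / var_x   # b = Cov(x, y) / Var(x)
b_from_r = r * s_y / s_x      # b = r * sy / sx

print("slope from covariance: ", b_from_cov)
print("slope from correlation:", b_from_r)
```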
3) Properties of the slope "b" as an estimator
In this section, no assumption is made about the noise distribution, other than:
* the lack of correlation of the noise between any two observations.
* the variance σ² of the noise being the same for all observations (homoscedasticity).
In particular, it is not assumed that the noise is gaussian.
3-a) "b" is an unbiased estimator of the slope B of the true regression line.
E[b] = B
where E denotes the expectation.
3-b) The variance of "b" is:
Var(b) = σ² / (n · sx²)
where n is the number of observations in the sample and sx² is the variance of the x values (computed with divisor n).
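Since these two properties do not require gaussian noise, they can be checked by simulation with a deliberately non-gaussian noise. The sketch below uses uniform noise scaled to have variance sigma², and compares the simulated mean and variance of the slope with B and with sigma² divided by (n times the variance of x). All numerical values, the seed and the helper name are illustrative assumptions.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Return the slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(1)
xs = [float(i) for i in range(8)]
true_a, true_b, sigma = 2.0, -1.0, 1.0

# A uniform variable on [-h, h] has variance h**2 / 3, so h = sigma * sqrt(3).
half_width = sigma * 3 ** 0.5

slopes = []
for _ in range(20000):
    ys = [true_a + true_b * x + rng.uniform(-half_width, half_width) for x in xs]
    slopes.append(fit_slope(xs, ys))

n = len(xs)
mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n   # variance of x with divisor n
theoretical_var = sigma ** 2 / (n * var_x)

print("mean of slopes:     ", statistics.fmean(slopes), "(true slope:", true_b, ")")
print("variance of slopes: ", statistics.pvariance(slopes))
print("theoretical variance:", theoretical_var)
```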
3-c) The slope "b" and the intercept "a" are generally correlated.
Cov(a, b) = - x̄ · σ² / (n · sx²)
where x̄ is the mean of the x values of the sample.
Note that when x̄ > 0, the covariance is negative: a lower slope usually (but not always) corresponds to a larger intercept, which is quite intuitive.
Only when x̄ = 0 (that is, when the y axis is positioned on the x-average of the sample) are the slope and the intercept uncorrelated. Recall that this is also the situation that makes the variance of the intercept smallest.
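The covariance between intercept and slope can be computed exactly, without simulation, by writing both estimators as linear combinations of the y values. The sketch below does so for an arbitrary set of x positions, compares the result with the closed form -mean_x · sigma² / (n · variance of x), and repeats the computation after centering the x values. The x positions, the noise level and the function name are illustrative assumptions.

```python
import math


def intercept_slope_covariance(xs, sigma):
    """Exact Cov(a, b) for uncorrelated noise of common variance sigma**2.

    b = sum(slope_w[i] * y_i) and a = mean_y - b * mean_x = sum(intercept_w[i] * y_i),
    so Cov(a, b) = sigma**2 * sum(intercept_w[i] * slope_w[i]).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope_w = [(x - mean_x) / sxx for x in xs]
    intercept_w = [1.0 / n - mean_x * w for w in slope_w]
    return sigma ** 2 * sum(u * w for u, w in zip(intercept_w, slope_w))


xs = [1.0, 2.0, 4.0, 7.0]  # illustrative positions with a positive mean
sigma = 2.0

cov_ab = intercept_slope_covariance(xs, sigma)

n = len(xs)
mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
closed_form = -mean_x * sigma ** 2 / (n * var_x)

# Put the y axis on the x-average of the sample: the covariance vanishes.
centered = [x - mean_x for x in xs]
cov_centered = intercept_slope_covariance(centered, sigma)

print("Cov(a, b) from weights:     ", cov_ab)
print("Cov(a, b) from closed form: ", closed_form)
print("Cov(a, b) after centering x:", cov_centered)
```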
4) Distribution of "b" under the assumption of a gaussian noise
The noise is assumed to be normally distributed as N(0, σ²).
4-1) "b" is normally distributed. So:
b ~ N(mean, variance)
with "mean" and "variance" as in the previous paragraph (recall that these values do not depend on the nature of the noise).
4-2) "b" is an efficient estimator of B
No other unbiased estimator of B has a lower variance than "b".
4-3) "b" and any residual ui are independent variables.
4-4) "b" and ȳ (the sample mean of y) are independent gaussian variables (there is no equivalent statement for the intercept "a").