Simple Linear Regression
The simplest and most popular regression technique.
Simple Linear Regression (SLR) is a special regression model where :
As for any predictive technique, SLR has two main goals :
___________________________________
Simple Linear Regression addresses the following question :
These measurements lead to the following scatter plot :

As a first step, SLR attempts to represent the fact that the experimental data points are approximately lined up. It does so by identifying the "best straight line" running through the cloud of points. This line is called the "Least Squares Line" (LSL) and is described by a slope b and an intercept a. These are the first two parameters of the SLR model.
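As a minimal sketch of this first step, the slope b and the intercept a can be computed from the classical closed-form least squares expressions. The function name `least_squares_line` and the sample measurements are illustrative assumptions, not taken from the tutorials.

```python
def least_squares_line(x, y):
    """Return (a, b): intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = s_xy / s_xx            # slope
    a = y_mean - b * x_mean    # intercept: the line runs through the point of means
    return a, b


# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(x, y)
print(f"intercept a = {a:.4f}, slope b = {b:.4f}")
```

Only the standard library is used here, so the arithmetic stays visible.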
It is then possible to quantify the fit of the LSL to the data. The result is a number between 0 and 1 called the Coefficient of determination, which is denoted R².
The next step is to consider that the data points are not perfectly aligned because the measurements are corrupted by random errors (the x values are fixed and accurately known). Without these errors, the data points would all sit on an (unknown) straight line : the Regression Line.
With some rather loose assumptions about the probability distributions of the errors, SLR then calculates some properties of the distributions of the parameters (mean, variance, covariance), and shows that the LSL is a good estimate of the true Regression Line.
SLR can also estimate the variance of the errors from the residuals of the data set with respect to the LSL. This estimated variance is the third and last parameter of the SLR model.
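The estimate described above can be sketched as follows, assuming the standard result that the sum of squared residuals divided by n - 2 is an unbiased estimate of the error variance. The function name `estimate_error_variance` and the data are hypothetical.

```python
def estimate_error_variance(x, y):
    """Return SSR / (n - 2), with SSR the sum of squared residuals about the LSL."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 points and equal-length x and y")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    # Residuals: observed y minus the value predicted by the least squares line.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in residuals)
    # Two parameters (a and b) were estimated, hence the divisor n - 2.
    return ssr / (n - 2)


# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
s2 = estimate_error_variance(x, y)
print(f"estimated error variance = {s2:.6f}")
```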
So far, little was assumed about the distributions of the errors. We now further assume that the errors are normally distributed.
It is then possible to completely determine the distributions of the parameters, of the predictions and of the residuals, and assign confidence intervals to the calculated values.
As for any model, it is necessary to ask whether the model just built is significant or not. In the case of SLR, the question is to assess how plausible it is that, assuming that there is no linear relationship between x and y, the points in the data set exhibit the observed degree of alignment.
SLR is an exceptional case where the problem is completely solved by appropriately designed tests.
Not all observations contribute equally to the properties of the model. It is important to identify those observations that have a particularly large influence on the model, if only to check that they are not corrupted by large errors.
____________________________________
SLR is definitely the most popular univariate regression technique, for several good reasons :
1) The "best straight line" (the "Least Squares Line") is appealing to intuition and is easily calculated.
2) The parameters of the model thus obtained have good statistical properties, and can easily be interpreted.
3) With rather loose additional assumptions about the measurement errors, SLR is completely described by theory. In particular, touchy questions like model assessment and power of generalization are solved exactly without having to resort to cumbersome validation techniques.
Yet some difficult problems are beyond the reach of SLR. In such cases, other regression techniques such as splines or Neural Networks should be considered.
_______________________________________
More than one predictor x may be incorporated into a model that is linear both in the parameters and in the predictors {x1, x2, ..., xp} :
y = a0 + a1x1 + a2x2 + ... + apxp
We then have a Multiple Linear Regression (MLR) model.
Many of the results of SLR carry over to MLR. But two new problems must now be considered :
These two problems are not completely independent of each other.
These issues are of great practical importance, and justify giving MLR a separate treatment in this Glossary.
____________________________________________________________
Tutorial 1
The first tutorial is an overview of the issues that will be developed in subsequent Tutorials. It is therefore an extended and commented Table of Contents.
OVERVIEW OF SIMPLE LINEAR REGRESSION
The Least Squares Line (LSL)
Definition
Uniqueness
Why squares ?
The LSL is not symmetrical in x and y
LSL and First Principal Component
Determination of the LSL
Coefficient of determination
Geometric interpretation of SLR
Statistical properties of the parameters of the LSL
The Simple Linear Model (SLM)
Estimation of the parameters of the Simple Linear Model
Regression line
Variance of the measurement errors
Predictions, observations and residuals
Variance of the predictions, covariance between predictions and observations
Variance and covariance of the residuals
The normality assumption
Distributions and confidence intervals
Testing the significance of the model
Leverage and influential observations
_______________________________________________________________________
Tutorial 2
We then calculate the values of the slope and intercept of the Least Squares Line. This is a problem in geometry, with no probabilistic issues.
CALCULATING THE PARAMETERS OF THE LEAST SQUARES LINE
The sum of the squared residuals
The normal equations
The Least Squares Line (LSL)
The slope b
The intercept a
The extremum is a minimum
The minimum is unique
Standardized variables
The LSL runs through the barycenter
The LSL and the slope are independent of the positions of the axes
The residuals
The sum of the residuals is 0
The residuals are orthogonal to x
The residuals are orthogonal to the predictions
Special case b = 0
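Several of the properties listed above can be checked numerically. The sketch below fits the LSL on hypothetical data and evaluates the quantities that should vanish. The variable names are illustrative assumptions.

```python
import math

# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b = s_xy / s_xx            # slope of the least squares line
a = y_mean - b * x_mean    # intercept of the least squares line

predictions = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predictions)]

# Each of these should be zero up to floating-point rounding.
sum_of_residuals = sum(residuals)
residuals_dot_x = sum(e * xi for e, xi in zip(residuals, x))
residuals_dot_predictions = sum(e * p for e, p in zip(residuals, predictions))
# The line evaluated at the mean of x should equal the mean of y (barycenter).
barycenter_gap = (a + b * x_mean) - y_mean

for name, value in [
    ("sum of residuals", sum_of_residuals),
    ("residuals . x", residuals_dot_x),
    ("residuals . predictions", residuals_dot_predictions),
    ("barycenter gap", barycenter_gap),
]:
    print(f"{name}: {value:.2e}  (zero within tolerance: {math.isclose(value, 0.0, abs_tol=1e-9)})")
```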
____________________________________________________________
Tutorial 3
We now quantify the quality of the fit of the LSL to the data with the Coefficient of Determination, and identify its relationship with the Correlation Coefficient.
COEFFICIENT OF DETERMINATION R²
The trivial model
The Coefficient of determination R²
Explained variance and Coefficient of determination
Decomposition of the Total Sum of Squares SST
Interpretation of the decomposition of SST
Explained variation
Residual variation
Second definition of R²
Coefficient of determination and Coefficient of correlation
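The decomposition of SST and the link between R² and the correlation coefficient can be illustrated with a short sketch. It assumes the naming used on this page (SSE for the explained sum of squares, SSR for the residual sum of squares), which is the reverse of the convention found in some textbooks. The data and variable names are hypothetical.

```python
import math

# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_mean - b * x_mean
predictions = [a + b * xi for xi in x]

sst = sum((yi - y_mean) ** 2 for yi in y)                   # total sum of squares
sse = sum((pi - y_mean) ** 2 for pi in predictions)         # explained sum of squares
ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) # residual sum of squares

r_squared = sse / sst               # first definition: explained share of SST
r_squared_alt = 1.0 - ssr / sst     # second definition: 1 minus residual share
correlation = s_xy / math.sqrt(s_xx * sst)  # Pearson correlation of x and y

print(f"SST = {sst:.4f}, SSE + SSR = {sse + ssr:.4f}")
print(f"R^2 = {r_squared:.6f}, r^2 = {correlation ** 2:.6f}")
```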
_________________________________________________________
Tutorial 4
We now give a geometric interpretation of the foregoing results that provides a more intuitive understanding of Simple Linear Regression.
GEOMETRIC INTERPRETATION OF SLR
The representation space, or "space of variables"
SLR as a projection
The sum of the residuals is 0
The residual vector is orthogonal to x
The residual vector is orthogonal to the predictions yi*
The mean of the predictions is equal to the mean of the observations
The LSL runs through the barycenter of the data set
Decomposition of the Total Sum of Squares
_______________________________________________________________
Tutorial 5
The foregoing text dealt with a problem in geometry, of which the LSL appeared to be an intuitively satisfying solution. The notion of randomness appeared nowhere. We now broaden the scope of SLR. The data set {xi, yi} is still considered to be a set of measurements of y for certain fixed values of x, but corrupted by random errors. As a consequence, the slope and intercept are now random variables, whose statistical properties we establish.
STATISTICAL PROPERTIES OF THE PARAMETERS OF THE LSL
Random errors
The parameters of the LSL are unbiased estimators
Assumption #1 : errors have 0 mean
The slope b is unbiased
The intercept a is unbiased
Variances of the parameters of the LSL
Additional assumptions about the errors ei
Variance of the slope b
Variance of the intercept a
Covariance of the slope and the intercept
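For reference, the variance and covariance results listed above take the following classical form. This assumes that the additional assumptions are the usual ones: the errors are uncorrelated and share a common variance. The symbols σ² (error variance) and S_xx are notational assumptions made here and may differ from those used in the tutorial.

```latex
\[
\operatorname{Var}(b) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Var}(a) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right), \qquad
\operatorname{Cov}(a, b) = -\frac{\sigma^2 \, \bar{x}}{S_{xx}},
\]
\[
\text{where } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\quad \text{and} \quad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\]
```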
______________________________________________________
Tutorial 6
We then address the issue of estimating the variance of the measurement errors.
UNBIASED ESTIMATION OF THE VARIANCE OF THE ERRORS
Unbiased estimation of the variance s² of the measurement errors ei
This Table of Contents is very short. The topic is a bit more involved, although not really difficult. The unexpectedly simple final result is important from both the theoretical and the practical points of view.
____________________________________________________________
Tutorial 7
Up to now, we have dealt with the properties of the parameters of the Regression Line (slope, intercept, variance of the errors) and with those of the LSL.
We now address the issue of the properties of the predictions, the observations and the residuals. These quantities are related to the LSL only, not to the true Regression Line.
Recall that the only assumptions about the measurement errors are :
The most important properties and relationships between these quantities are analyzed in this Tutorial.
PREDICTIONS, OBSERVATIONS AND RESIDUALS
Preliminary results
The slope and the errors are uncorrelated
The slope and the mean of the observations are uncorrelated
Predictions
Expectation of a prediction
Covariance of two predictions
Variance of a prediction
Covariance between predictions and observations
The residuals
Expectation of a residual
Covariance of two residuals
Variance of a residual
_________________________________________________________
Tutorial 8
So far, we introduced no assumption about the nature of the distribution of the errors. We now formulate the natural assumption that the errors are normally distributed. It is then possible to calculate the distributions of the parameters, of the residuals and of the predictions, as well as confidence intervals for these quantities. The distribution of the estimated error variance may also be calculated, as well as its variance.
DISTRIBUTIONS AND CONFIDENCE INTERVALS UNDER THE NORMALITY ASSUMPTION
Assumed distribution of the errors ei
Slope
Distribution of the slope
Confidence intervals on the slope (no demonstration)
Intercept
Distribution of the intercept
Confidence intervals on the intercept (no demonstration)
Predictions
Distribution of the predictions
Confidence intervals on the predictions (no demonstration)
Distribution of s²*
Distribution is Chi-square (no demonstration)
Variance
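As an indication of the kind of results listed above, the classical expressions for the slope and for the estimated error variance are given below. They assume independent, normally distributed errors with common variance. The symbols β (true slope), σ² (true error variance), σ̂² (the estimate written s²* on this page) and S_xx are notational assumptions made here.

```latex
\[
b \sim \mathcal{N}\!\left( \beta, \ \frac{\sigma^2}{S_{xx}} \right), \qquad
\frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{\,n-2}, \qquad
\operatorname{Var}\!\left( \hat{\sigma}^2 \right) = \frac{2\,\sigma^4}{n-2}
\]
\[
\text{Confidence interval on the slope at level } 1-\alpha : \quad
b \ \pm \ t_{\,n-2,\ 1-\alpha/2} \ \frac{\hat{\sigma}}{\sqrt{S_{xx}}},
\qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
```

The Student t quantile with n - 2 degrees of freedom replaces the normal quantile because σ² is unknown and is replaced by its estimate.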
______________________________________________________
Tutorial 9
The normality assumption allows devising two tests meant to figure out if the model is significant. They both bear on the hypothesis b = 0 (no linear relationship between y and x). They will be shown to be equivalent.
TESTING THE SIGNIFICANCE OF SLR
Testing b = 0
F test on R²
Distribution of the explained variance (SSE)
Distribution of the residual variance (SSR) (no demonstration)
SSE and SSR are independent (no demonstration)
The F test
The ANOVA table
The two tests are equivalent
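The equivalence of the two tests can be illustrated numerically: in SLR the square of the t statistic for the slope equals the F statistic. The sketch below computes both on hypothetical data, assuming the usual definitions (t is the slope divided by its estimated standard error, F is SSE divided by SSR / (n - 2)). It does not compute p-values. The function name `significance_statistics` is an illustrative assumption.

```python
import math


def significance_statistics(x, y):
    """Return (t, F) for the hypothesis b = 0 in simple linear regression."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 points and equal-length x and y")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    predictions = [a + b * xi for xi in x]
    sse = sum((pi - y_mean) ** 2 for pi in predictions)          # explained, 1 d.o.f.
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))  # residual, n - 2 d.o.f.
    if ssr == 0:
        raise ValueError("perfect fit: the test statistics are undefined")
    s2 = ssr / (n - 2)                 # estimated error variance
    t = b / math.sqrt(s2 / s_xx)       # slope divided by its estimated standard error
    f = sse / s2                       # ratio of explained to residual mean squares
    return t, f


# Hypothetical measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t_stat, f_stat = significance_statistics(x, y)
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```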
_____________________________________________________
Tutorial 10
Every data point affects the estimates of the slope, the intercept, the predicted values and the estimated error variance to some degree. But some do so more than others. It is important to identify these points.
We address here the question of identifying these "exceptional" data points that have abnormally high contributions to some aspects of the model. This section makes heavy use of the concepts of "standardized residuals" and "studentized residuals", which we go over in some detail, as there is some confusion both in the literature and in software about the meaning of these terms.
The demonstrations of most of the results are very technical, and are therefore not given.
LEVERAGE OBSERVATIONS AND INFLUENTIAL OBSERVATIONS
Observations that affect the predictions of the model
The notion of "leverage observation"
Leverage
Large residuals
Standardized residuals
"Internally" studentized residuals
Leave-one-out residuals
Leave-one-out estimated error variance
"Externally" studentized residuals
DFFITSi
Leave-one-out predictions
DFFITSi
Double outliers
Cook's distance
Observations that affect the parameters of the model
The notion of "influential observation"
DFBETASij
Covariance Ratio
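A few of the diagnostics listed above can be sketched for the SLR case. The definitions used below are common ones and are assumptions here, since terminology varies between sources: leverage h_i = 1/n + (x_i - mean of x)² / S_xx, internally studentized residual e_i / (s * sqrt(1 - h_i)), and Cook's distance r_i² * h_i / (2 * (1 - h_i)). The function name `influence_diagnostics` and the data are hypothetical.

```python
import math


def influence_diagnostics(x, y):
    """Return (leverages, internally studentized residuals, Cook's distances)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 points and equal-length x and y")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - 2)   # estimated error variance
    if s2 == 0:
        raise ValueError("perfect fit: studentized residuals are undefined")
    # Leverage grows with the squared distance of x_i from the mean of x.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / s_xx for xi in x]
    if any(h >= 1.0 for h in leverages):
        raise ValueError("a leverage equal to 1 makes the diagnostics undefined")
    studentized = [e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(residuals, leverages)]
    # The model has 2 parameters (slope and intercept), hence the factor 2.
    cooks = [r ** 2 * h / (2.0 * (1.0 - h)) for r, h in zip(studentized, leverages)]
    return leverages, studentized, cooks


# Hypothetical data: the last point lies far from the others along x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 12.0]
leverages, studentized, cooks = influence_diagnostics(x, y)
for i, (h, r, d) in enumerate(zip(leverages, studentized, cooks)):
    print(f"point {i}: leverage = {h:.3f}, studentized residual = {r:.3f}, Cook's D = {d:.3f}")
```

In SLR the leverages always sum to 2, the number of parameters, which gives a quick sanity check on the computation.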
_______________________________________________________