Tutorials

Simple Linear Regression

The simplest and most popular regression technique.

 

Simple Linear Regression (SLR) is a special regression model where :

 

As for any predictive technique, SLR has two main goals :

___________________________________

Simple Linear Regression addresses the following question :

 

These measurements lead to the following scatter plot :
 


 

Least Squares Line

As a first step, SLR attempts to represent the fact that the experimental data points are approximately lined up. It does so by identifying the "best straight line" running through the cloud of points. This line is called the "Least Squares Line" (LSL) and is described by a slope b and an intercept a. These are the first two parameters of the SLR model.
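
As an illustration, here is a minimal sketch of how the slope b and the intercept a can be computed from a data set. The function name `least_squares_line` and the five data points are made up for this example; only the closed-form least squares formulas are standard.

```python
def least_squares_line(x, y):
    """Return (a, b): intercept and slope of the Least Squares Line y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = y_mean - b * x_mean    # intercept: the line runs through (x_mean, y_mean)
    return a, b


# Hypothetical measurements, roughly lined up
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

The function fails with a division by zero when all x values are identical, which is the one case where no Least Squares Line of this form exists.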

Coefficient of determination R²

It is then possible to quantify the fit of the LSL to the data. The result is a number between 0 and 1 called the Coefficient of determination, which is denoted R².
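
A possible sketch of the calculation, assuming the usual definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data are illustrative only.

```python
def r_squared(x, y):
    """Coefficient of determination of the Least Squares Line fitted to (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    # Residual variation: squared distances between observations and the LSL
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation: squared distances between observations and their mean
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_residual / ss_total


# Hypothetical measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(x, y)
print(f"R^2 = {r2:.4f}")
```

A value close to 1 means that the points are almost perfectly lined up. The ratio is undefined when all y values are equal, since the total variation is then 0.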

Statistical properties of the parameters

The next step is to consider that the data points are not perfectly aligned because the measurements are corrupted by random errors (the x values are fixed and accurately known). Without these errors, the data points would all sit on an (unknown) straight line : the Regression Line.

With some rather loose assumptions about the probability distributions of the errors, SLR then calculates some properties of the distributions of the parameters (mean, variance, covariance), and shows that the LSL is a good estimate of the true Regression Line.

Estimation of the variance of the errors

SLR can also estimate the variance of the errors from the residuals of the data set with respect to the LSL. This estimated variance is the third and last parameter of the SLR model.
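
The sketch below shows one way to carry out this estimation. It assumes the standard unbiased estimator, in which the sum of the squared residuals is divided by n - 2 (two degrees of freedom are used up by the slope and the intercept). The function name and the data are hypothetical.

```python
def error_variance_estimate(x, y):
    """Estimate the variance of the errors from the residuals about the LSL."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 points are needed to estimate the variance")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Divide by n - 2, not n: two parameters were estimated from the same data
    return sum(e * e for e in residuals) / (n - 2)


# Hypothetical measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

s2 = error_variance_estimate(x, y)
print(f"estimated error variance = {s2:.5f}")
```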

The normality assumption

So far, little was assumed about the distributions of the errors. We now further assume that the errors are normally distributed.

Distributions and confidence intervals

It is then possible to completely determine the distributions of the parameters, of the predictions and of the residuals, and assign confidence intervals to the calculated values.

Validity of the model

As for any model, it is necessary to ask whether the model just built is significant or not. In the case of SLR, the question is to assess how plausible it is that, assuming that there is no linear relationship between x and y, the points in the data set exhibit the observed degree of alignment.

SLR is an exceptional case where the problem is completely solved by appropriately designed tests.

Outstanding observations

Not all observations contribute equally to the properties of the model. It is important to identify those observations that have a particularly large influence on the model, if only to check that they are not corrupted by large errors.

____________________________________

SLR is definitely the most popular univariate regression technique, for several good reasons :

    1) The "best straight line" (the "Least Squares Line") is appealing to intuition and is easily calculated.

    2) The parameters of the model thus obtained have good statistical properties, and can easily be interpreted.

    3) With rather loose additional assumptions about the measurement errors, SLR is completely described by theory. In particular, touchy questions like model assessment and power of generalization are solved exactly without having to resort to cumbersome validation techniques.

 

Yet, some difficult problems are out of reach of SLR. Then, other regression techniques like splines or Neural Networks should certainly be considered.

_______________________________________

More than one predictor x may be incorporated into a model that is linear both in the parameters and in the predictors {x1, x2, ..., xp} :

y = a0 + a1x1 + a2x2 + ... + apxp

We then have a Multiple Linear Regression (MLR) model.
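
To make the formula concrete, here is a rough sketch that fits such a model by solving the normal equations with plain Gaussian elimination. The helper names `solve_linear_system` and `fit_mlr`, and the noise-free data, are assumptions of this example. The approach is only meant for small, well-conditioned problems; it is not a substitute for a proper numerical library.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so that the inputs are left untouched
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_mlr(rows, y):
    """rows: one tuple (x1, ..., xp) per observation. Returns [a0, a1, ..., ap]."""
    X = [[1.0] + list(r) for r in rows]          # leading 1 carries the constant a0
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


# Hypothetical data generated without noise from y = 1 + 2*x1 - 0.5*x2
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]

coeffs = fit_mlr(rows, y)
print([round(c, 6) for c in coeffs])
```

Because the data contain no noise, the fit should recover the three coefficients used to generate them, up to rounding error.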

Many of the results in SLR carry over to MLR. But two new problems are now to be considered :

 

These two problems are not completely independent from each other.
These issues are of great practical importance, and justify giving MLR a separate treatment in this Glossary.

____________________________________________________________

 

Tutorial 1

 

The first tutorial is an overview of the issues that will be developed in subsequent Tutorials. It is therefore an extended and commented Table of Contents.

 

OVERVIEW OF SIMPLE LINEAR REGRESSION

The Least Squares Line (LSL)

Definition

Uniqueness

Why squares ?

The LSL is not symmetrical in x and y

LSL and First Principal Component

Determination of the LSL

Coefficient of determination

Geometric interpretation of SLR

Statistical properties of the parameters of the LSL

The Simple Linear Model (SLM)

Estimation of the parameters of the Simple Linear Model

Regression line

Variance of the measurement errors

Predictions, observations and residuals

Variance of the predictions, covariance between predictions and observations

Variance and covariance of the residuals

The normality assumption

Distributions and confidence intervals

Testing the significance of the model

Leverage and influential observations

TUTORIAL

 _______________________________________________________________________

 

Tutorial 2

 

We then calculate the values of the slope and intercept of the Least Squares Line. This is a problem in geometry, with no probabilistic issues.
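
For reference, the calculation can be summarized as follows. This is the standard derivation, written with x̄ and ȳ for the means of the x and y values; the notation is ours and may differ from that of the Tutorial.

```latex
% Sum of the squared residuals for a candidate line y = a + b x
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\[4pt]
% Normal equations: both partial derivatives vanish at the extremum
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} x_i \, (y_i - a - b x_i) = 0 \\[4pt]
% Solution of the normal equations
b &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \, \bar{x}
\end{align*}
```

The first normal equation states that the sum of the residuals is 0, and the second that the residuals are orthogonal to x. The expression of a shows that the LSL runs through the barycenter (x̄, ȳ).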

 

CALCULATING THE PARAMETERS OF THE LEAST SQUARES LINE

The sum of the squared residuals

The normal equations

The Least Squares Line (LSL)

The slope b

The intercept a

The extremum is a minimum

The minimum is unique

Standardized variables

The LSL runs through the barycenter

The LSL and the slope are independent of the positions of the axes

The residuals

The sum of the residuals is 0

The residuals are orthogonal to x

The residuals are orthogonal to the predictions

Special case b = 0

TUTORIAL

____________________________________________________________


Tutorial 3

 

We now quantify the quality of the fit of the LSL to the data with the Coefficient of Determination, and identify its relationship with the Correlation Coefficient.
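
The main relationships can be sketched as follows. We write y_i^* for the predictions, SSE for the explained sum of squares and SSR for the residual sum of squares; r denotes the correlation coefficient between x and y. These are the standard identities, stated here without proof.

```latex
\begin{align*}
% Decomposition of the Total Sum of Squares
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2
  &= \underbrace{\sum_{i=1}^{n} (y_i^* - \bar{y})^2}_{\mathrm{SSE}}
   + \underbrace{\sum_{i=1}^{n} (y_i - y_i^*)^2}_{\mathrm{SSR}} \\[4pt]
% Two equivalent expressions of the Coefficient of determination
R^2 &= \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} \\[4pt]
% Link with the Correlation Coefficient
R^2 &= r^2,
\qquad
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\end{align*}
```

The cross term disappears from the decomposition because the residuals sum to 0 and are orthogonal to the predictions.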

 

COEFFICIENT OF DETERMINATION R²

The trivial model

The Coefficient of determination R²

Explained variance and Coefficient of determination

Decomposition of the Total Sum of Squares SST

Interpretation of the decomposition of SST

Explained variation

Residual variation

Second definition of R²

Coefficient of determination and Coefficient of correlation

TUTORIAL

 _________________________________________________________


Tutorial 4

 

 We now give a geometric interpretation of the foregoing results that provides a more intuitive understanding of Simple Linear Regression.
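
The following sketch illustrates the projection point of view numerically. The observations are treated as a single vector y in an n-dimensional space, and the predictions are obtained by projecting y onto the plane spanned by the constant vector and the vector x. The helper names and the data are assumptions of this example.

```python
def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_model_plane(x, y):
    """Orthogonal projection of y onto the plane spanned by (1, ..., 1) and x."""
    n = len(x)
    ones = [1.0] * n
    x_mean = dot(x, ones) / n
    # Centered x is orthogonal to the constant vector, so the two
    # projection coefficients can be computed independently.
    xc = [xi - x_mean for xi in x]
    c_ones = dot(y, ones) / dot(ones, ones)   # equals the mean of the observations
    c_xc = dot(y, xc) / dot(xc, xc)           # equals the slope b of the LSL
    return [c_ones + c_xc * xci for xci in xc]


# Hypothetical measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

predictions = project_onto_model_plane(x, y)
residuals = [yi - pi for yi, pi in zip(y, predictions)]

print("sum of residuals       :", round(sum(residuals), 10))
print("residuals . x          :", round(dot(residuals, x), 10))
print("residuals . predictions:", round(dot(residuals, predictions), 10))
```

The three printed quantities are 0 up to rounding error, which is what orthogonality of the residual vector to the plane means.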

 

GEOMETRIC INTERPRETATION OF SLR

The representation space, or "space of variables"

SLR as a projection

The sum of the residuals is 0

The residual vector is orthogonal to x

The residual vector is orthogonal to the predictions yi*

The mean of the predictions is equal to the mean of the observations

The LSL runs through the barycenter of the data set

Decomposition of the Total Sum of Squares

TUTORIAL

_______________________________________________________________

 

Tutorial 5

 

The foregoing text dealt with a problem in geometry, of which the LSL appeared to be an intuitively satisfying solution. The notion of randomness appeared nowhere. We now broaden the scope of SLR. The data set {xi, yi} is still considered to be a set of measurements of y for certain fixed values of x, but corrupted by random errors. As a consequence, the slope and intercept are now random variables, whose statistical properties we establish.
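
A small simulation can make this change of viewpoint tangible. The sketch below repeatedly generates measurements around a made-up "true" Regression Line, fits the LSL each time, and compares the empirical mean and variance of the slope with the theoretical values (true slope, and error variance divided by the sum of squared deviations of x). Gaussian errors with equal variance are used for convenience only; the true parameters, the seed and the number of repetitions are arbitrary choices.

```python
import random


def least_squares_line(x, y):
    """Return (a, b): intercept and slope of the Least Squares Line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - b * x_mean, b


random.seed(0)                             # fixed seed so that the run is reproducible
true_a, true_b, sigma = 1.0, 2.0, 0.5      # hypothetical Regression Line and error s.d.
x = [float(i) for i in range(10)]          # fixed, accurately known x values
x_mean = sum(x) / len(x)
sxx = sum((xi - x_mean) ** 2 for xi in x)

slopes = []
for _trial in range(5000):
    # New random errors at each repetition, same x values
    y = [true_a + true_b * xi + random.gauss(0.0, sigma) for xi in x]
    a_hat, b_hat = least_squares_line(x, y)
    slopes.append(b_hat)

mean_b = sum(slopes) / len(slopes)
var_b = sum((b - mean_b) ** 2 for b in slopes) / (len(slopes) - 1)
theoretical_var_b = sigma ** 2 / sxx

print(f"mean of the slopes     : {mean_b:.4f} (true slope {true_b})")
print(f"variance of the slopes : {var_b:.6f} (theory {theoretical_var_b:.6f})")
```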

 

STATISTICAL PROPERTIES OF THE PARAMETERS OF THE LSL

Random errors

The parameters of the LSL are unbiased estimators

Assumption #1 : errors have 0 mean

The slope b is unbiased

The intercept a is unbiased

Variances of the parameters of the LSL

Additional assumptions about the errors ei

Variance of the slope b

Variance of the intercept a

Covariance of the slope and the intercept

TUTORIAL

______________________________________________________

 

 

Tutorial 6

 

We then address the issue of estimating the variance of the measurement errors.

 

UNBIASED ESTIMATION OF THE VARIANCE OF THE ERRORS

 Unbiased estimation of the variance s² of the measurement errors ei

TUTORIAL


 

This Table of Contents is very short. The topic is a bit more involved, although not really difficult. The unexpectedly simple final result is important from both theoretical and practical points of view.

____________________________________________________________

 

 

Tutorial 7

 

Up to now, we have dealt with the properties of the parameters of the Regression Line (slope, intercept, variance of the errors) and with those of the LSL.

We now address the issue of the properties of the predictions, the observations and the residuals. These quantities are related to the LSL only, not to the true Regression Line.

 

Recall that the only assumptions about the measurement errors are :

 

The most important properties and relationships between these quantities are analyzed in this Tutorial.
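
As a reference, the main results can be summarized as below. The summary assumes errors with zero mean, a common variance σ² and no correlation between them. The notation is ours : α and β are the intercept and slope of the true Regression Line, y_i^* are the predictions, r_i are the residuals, and h_ij is introduced only to shorten the formulas.

```latex
\begin{align*}
h_{ij} &= \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^{n} (x_k - \bar{x})^2} \\[4pt]
% Predictions
\mathrm{E}[y_i^*] &= \alpha + \beta x_i \\
\mathrm{Cov}(y_i^*, y_j^*) &= \sigma^2 h_{ij},
\qquad
\mathrm{Var}(y_i^*) = \sigma^2 h_{ii} \\
\mathrm{Cov}(y_i^*, y_j) &= \sigma^2 h_{ij} \\[4pt]
% Residuals r_i = y_i - y_i^*
\mathrm{E}[r_i] &= 0 \\
\mathrm{Cov}(r_i, r_j) &= -\sigma^2 h_{ij} \quad (i \neq j),
\qquad
\mathrm{Var}(r_i) = \sigma^2 (1 - h_{ii})
\end{align*}
```

Note that the residuals do not all have the same variance, and that they are correlated with each other, even though the errors are not.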

 

PREDICTIONS, OBSERVATIONS AND RESIDUALS

Preliminary results

The slope and the errors are uncorrelated

The slope and the mean of the observations are uncorrelated

Predictions

Expectation of a prediction

Covariance of two predictions

Variance of a prediction

Covariance between predictions and observations

The residuals

Expectation of a residual

Covariance of two residuals

Variance of a residual

TUTORIAL

_________________________________________________________

 

 

Tutorial 8

 

So far, we introduced no assumption about the nature of the distribution of the errors. We now formulate the natural assumption that the errors are normally distributed. It is then possible to calculate the distributions of the parameters, of the residuals and of the predictions, as well as confidence intervals for these quantities. The distribution of the estimated error variance may also be calculated, as well as its variance.
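
The sketch below shows how confidence intervals on the slope and the intercept might be computed under the normality assumption, using Student's t distribution with n - 2 degrees of freedom. To stay within the standard library, the critical value is passed in by the caller (here 3.182, the tabulated 97.5 % quantile for 3 degrees of freedom). The function name and the data are hypothetical.

```python
import math


def slope_and_intercept_intervals(x, y, t_crit):
    """Two-sided confidence intervals on the slope and intercept of the LSL.

    t_crit is the Student t quantile for n - 2 degrees of freedom at the
    desired confidence level; it must be supplied by the caller.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_residual / (n - 2)                         # estimated error variance
    se_b = math.sqrt(s2 / sxx)                         # standard error of the slope
    se_a = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))   # standard error of the intercept
    return {
        "slope": (b - t_crit * se_b, b + t_crit * se_b),
        "intercept": (a - t_crit * se_a, a + t_crit * se_a),
    }


# Hypothetical measurements: n = 5, hence n - 2 = 3 degrees of freedom
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intervals = slope_and_intercept_intervals(x, y, t_crit=3.182)
for name, (low, high) in intervals.items():
    print(f"95% interval on the {name}: [{low:.3f}, {high:.3f}]")
```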

 

DISTRIBUTIONS AND CONFIDENCE INTERVALS

UNDER THE NORMALITY ASSUMPTION

Assumed distribution of the errors ei

Slope

Distribution of the slope

Confidence intervals on the slope   (no demonstration)

Intercept

Distribution of the intercept

Confidence intervals on the intercept  (no demonstration)

Predictions

Distribution of the predictions

Confidence intervals on the predictions  (no demonstration)

Distribution of s²* 

Distribution is Chi-square   (no demonstration)

Variance

TUTORIAL

 ______________________________________________________

 

Tutorial 9

 

The normality assumption allows devising two tests meant to figure out if the model is significant. They both bear on the hypothesis b = 0 (no linear relationship between y and x). They will be shown to be equivalent.
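
The equivalence of the two tests can be checked numerically. In the sketch below, the t statistic is the slope divided by its estimated standard error, and the F statistic is the ratio of the explained sum of squares (SSE) to the residual sum of squares (SSR) divided by n - 2. The function name and the data are made up; turning the statistics into p-values requires tables or a statistics library and is left out.

```python
import math


def significance_statistics(x, y):
    """Return (t, F) for the hypothesis b = 0 in Simple Linear Regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    a = y_mean - b * x_mean
    ss_explained = sum((a + b * xi - y_mean) ** 2 for xi in x)            # SSE
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # SSR
    s2 = ss_residual / (n - 2)              # estimated error variance
    t_stat = b / math.sqrt(s2 / sxx)        # compare with Student t, n - 2 d.f.
    f_stat = ss_explained / s2              # compare with F, (1, n - 2) d.f.
    return t_stat, f_stat


# Hypothetical measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

t_stat, f_stat = significance_statistics(x, y)
print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

The square of t coincides with F, which is why both tests always lead to the same conclusion in SLR.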

 

TESTING THE SIGNIFICANCE OF  SLR

Testing b = 0

F test on R²

Distribution of the explained variance (SSE)

Distribution of the residual variance (SSR) (no demonstration)

SSE and SSR are independent (no demonstration)

The F test

The ANOVA table

The two tests are equivalent

TUTORIAL

_____________________________________________________

 

 

Tutorial 10

 

Every data point affects the estimates of the slope, the intercept, the predicted values and the estimated error variance to some degree. But some do so more than others. It is important to identify these points.

We address here the question of identifying these "exceptional" data points that have abnormally high contributions to some aspects of the model. This section makes heavy use of the concepts of "standardized residuals" and "studentized residuals", which we go over in some detail, as there is some confusion both in the literature and in software about the meaning of these terms.


The demonstrations of most of the results are very technical, and are therefore not given.
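
Even without the demonstrations, the quantities themselves are easy to compute. The sketch below uses one common set of definitions; since naming varies between authors and software packages, the labels used here should be read as assumptions and checked against the definitions given in the Tutorial. The data set is hypothetical, with one point placed far from the others in x.

```python
import math


def fit(x, y):
    """Return intercept, slope, mean of x and sum of squared deviations of x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - b * x_mean, b, x_mean, sxx


def diagnostics(x, y):
    """Per-observation diagnostics for Simple Linear Regression (needs n >= 4)."""
    n = len(x)
    a, b, x_mean, sxx = fit(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - 2)
    rows = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx               # leverage
        internal = e / math.sqrt(s2 * (1.0 - h))
        # Error variance estimated without this observation (no refit needed)
        s2_loo = ((n - 2) * s2 - e * e / (1.0 - h)) / (n - 3)
        external = e / math.sqrt(s2_loo * (1.0 - h))
        rows.append({
            "leverage": h,
            "standardized": e / math.sqrt(s2),               # residual / s
            "internally_studentized": internal,
            "s2_loo": s2_loo,
            "externally_studentized": external,
            "dffits": external * math.sqrt(h / (1.0 - h)),
            "cook": internal ** 2 * h / (2.0 * (1.0 - h)),   # 2 = number of parameters
        })
    return rows


# Hypothetical data: the last point lies far from the others in x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 19.0]

diag = diagnostics(x, y)
for xi, row in zip(x, diag):
    print(f"x = {xi:4.1f}  leverage = {row['leverage']:.3f}  Cook = {row['cook']:.3f}")
```

In SLR the leverages always add up to 2, the number of parameters of the line, so a leverage much larger than 2/n signals an observation that deserves a closer look.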

 

LEVERAGE OBSERVATIONS AND INFLUENTIAL OBSERVATIONS

Observations that affect the predictions of the model

The notion of "leverage observation"

Leverage

Large residuals

Standardized residuals

"Internally" studentized residuals

Leave-one-out residuals

Leave-one-out estimated error variance

"Externally" studentized residuals

DFFITSi

Leave-one-out predictions

DFFITSi

Double outliers

Cook's distance

Observations that affect the parameters of the model

The notion of "influential observation"

DFBETASij

Covariance Ratio

TUTORIAL

 _______________________________________________________

 

 

 

Related readings

Regression

Least Squares

Multiple Linear Regression

Neural Networks

 
