Tutorials

Simple Linear Regression

The simplest and most popular regression technique.

Simple Linear Regression (SLR) is a special regression model where:

• There is only one (numerical) independent variable.
• The model is linear both in the independent variable and, more importantly, in the parameters.

As with any predictive technique, SLR has two main goals:

• Build a model whose parameters can be interpreted in terms of some properties of the population from which the sample was extracted. Of course, it is expected that the model parameters will be good estimates of the corresponding parameters of the population.
• Use the model for making predictions.

___________________________________

Simple Linear Regression addresses the following situation:

• A quantity y is measured,
• for a certain number of values of another quantity x.

These measurements lead to a scatter plot of y against x.

# Least Squares Line

As a first step, SLR attempts to capture the fact that the experimental data points are approximately lined up. It does so by identifying the "best straight line" running through the cloud of points. This line is called the "Least Squares Line" (LSL) and is described by a slope b and an intercept a. These are the first two parameters of the SLR model.
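As an illustrative sketch (with made-up data; the closed-form formulas themselves are derived in Tutorial 2), the slope and intercept of the LSL can be computed as follows:

```python
# Sketch: closed-form least-squares estimates on a small made-up data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope b = Sxy / Sxx, intercept a = y_bar - b * x_bar.
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar

print(b, a)  # slope ≈ 1.99, intercept ≈ 0.05
```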

# Coefficient of determination R²

It is then possible to quantify the fit of the LSL to the data. The result is a number between 0 and 1 called the Coefficient of determination, which is denoted R².
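For instance (a sketch with made-up data), R² can be obtained by comparing the residual variation with the total variation of the observations:

```python
# Sketch: Coefficient of determination for a least-squares fit (made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                        # predictions
sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
r2 = 1.0 - ssr / sst

print(r2)  # close to 1: the points are nearly aligned
```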

# Statistical properties of the parameters

The next move is to consider that the reason why the data points are out of perfect alignment is that the measurements are spoiled by random errors (the x values are fixed and accurately known). Without these errors, the data points would all sit on an (unknown) straight line: the Regression Line.

With some rather loose assumptions about the probability distributions of the errors, SLR then calculates some properties of the distributions of the parameters (mean, variance, covariance), and shows that the LSL is a good estimate of the true Regression Line.

# Estimation of the variance of the errors

SLR can also estimate the variance of the errors from the residuals of the data set with respect to the LSL. This estimated variance is the third and last parameter of the SLR model.
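As a sketch (made-up data; the n − 2 divisor is established in Tutorial 6), the error variance is estimated from the sum of the squared residuals:

```python
# Sketch: unbiased estimate of the error variance from the residuals.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# Dividing by n - 2 (not n) makes the estimator unbiased: two degrees of
# freedom are consumed by estimating the slope and the intercept.
s2 = sum(r ** 2 for r in residuals) / (n - 2)

print(s2)
```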

# The normality assumption

So far, little was assumed about the distributions of the errors. We now further assume that the errors are normally distributed.

## Distributions and confidence intervals

It is then possible to completely determine the distributions of the parameters, of the predictions and of the residuals, and assign confidence intervals to the calculated values.

## Validity of the model

As for any model, it is necessary to ask whether the model just built is significant. In the case of SLR, the question is to assess how plausible it is that the observed degree of alignment of the data points could arise even though there is no linear relationship between x and y.

SLR is an exceptional case where the problem is completely solved by appropriately designed tests.

# Outstanding observations

Not all observations contribute equally to the properties of the model. It is important to identify those observations that have a particularly large influence on the model, if only to check that they are not corrupted by large errors.

____________________________________

SLR is definitely the most popular univariate regression technique, for several good reasons:

1) The "best straight line" (the Least Squares Line) is appealing to intuition and is easily calculated.

2) The parameters of the model thus obtained have good statistical properties, and can easily be interpreted.

3) With rather loose additional assumptions about the measurement errors, SLR is completely described by theory. In particular, delicate questions like model assessment and generalization power are solved exactly, without having to resort to cumbersome validation techniques.

Yet, some difficult problems are out of reach of SLR. For these, other regression techniques like splines or Neural Networks should certainly be considered.

_______________________________________

More than one predictor x may be incorporated into a model that is linear both in the parameters and in the predictors {x1, x2, ..., xp}:

y = a0 + a1x1 + a2x2 + ... + apxp

We then have a Multiple Linear Regression (MLR) model.
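As a sketch of how such a model is fitted (made-up data with p = 2 predictors, generated exactly from y = 1 + 2·x1 + 3·x2 so that the fit should recover these coefficients), the least-squares solution can be obtained from the normal equations:

```python
# Sketch: Multiple Linear Regression via the normal equations (X^T X) a = X^T y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
# Noise-free data from y = 1 + 2*x1 + 3*x2, so least squares should
# recover the coefficients [1, 2, 3] exactly (up to rounding).
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

# Design matrix with a leading column of ones for the intercept a0.
X = [[1.0, u, v] for u, v in zip(x1, x2)]
n, p = len(X), len(X[0])

# Build the normal equations.
XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
       for i in range(p)]
Xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]

# Solve by Gauss-Jordan elimination with partial pivoting.
M = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
for col in range(p):
    pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
    M[col], M[pivot] = M[pivot], M[col]
    for r in range(p):
        if r != col:
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
coeffs = [M[i][p] / M[i][i] for i in range(p)]

print([round(c, 6) for c in coeffs])  # ≈ [1.0, 2.0, 3.0]
```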

Many of the results of SLR carry over to MLR. But two new problems must now be considered:

• The possible (and harmful) linear coupling between the independent variables (collinearity).
• The selection of those of the available independent variables that should be incorporated into the model.

These two problems are not completely independent of each other. They are of great practical importance, and justify giving MLR a separate treatment in this Glossary.

____________________________________________________________

 Tutorial 1

The first tutorial is an overview of the issues that will be developed in subsequent Tutorials. It is therefore an extended and commented Table of Contents.

OVERVIEW OF SIMPLE LINEAR REGRESSION

• The Least Squares Line (LSL)
• Definition
• Uniqueness
• Why squares?
• The LSL is not symmetrical in x and y
• LSL and First Principal Component
• Determination of the LSL
• Coefficient of determination
• Geometric interpretation of SLR
• Statistical properties of the parameters of the LSL
• The Simple Linear Model (SLM)
• Estimation of the parameters of the Simple Linear Model
• Regression line
• Variance of the measurement errors
• Predictions, observations and residuals
• Variance of the predictions, covariance between predictions and observations
• Variance and covariance of the residuals
• The normality assumption
• Distributions and confidence intervals
• Testing the significance of the model
• Leverage and influential observations

_______________________________________________________________________

 Tutorial 2

We then calculate the values of the slope and intercept of the Least Squares Line. This is a problem in geometry, with no probabilistic issues.

CALCULATING THE PARAMETERS OF THE LEAST SQUARES LINE

• The sum of the squared residuals
• The normal equations
• The Least Squares Line (LSL)
• The slope b
• The intercept a
• The extremum is a minimum
• The minimum is unique
• Standardized variables
• The LSL runs through the barycenter
• The LSL and the slope are independent of the positions of the axes
• The residuals
• The sum of the residuals is 0
• The residuals are orthogonal to x
• The residuals are orthogonal to the predictions
• Special case b = 0
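The three properties of the residuals listed in this Table of Contents can be checked numerically (a sketch with made-up data):

```python
# Sketch: the residuals of the LSL sum to 0 and are orthogonal
# both to x and to the predictions.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]            # predictions
e = [yi - yh for yi, yh in zip(y, y_hat)]   # residuals

# The three properties hold up to floating-point rounding:
print(sum(e))                                    # sum of residuals ≈ 0
print(sum(ei * xi for ei, xi in zip(e, x)))      # residuals orthogonal to x
print(sum(ei * yh for ei, yh in zip(e, y_hat)))  # residuals orthogonal to predictions
```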

____________________________________________________________

 Tutorial 3

We now quantify the quality of the fit of the LSL to the data with the Coefficient of Determination, and identify its relationship with the Correlation Coefficient.

COEFFICIENT OF DETERMINATION R²

• The trivial model
• The Coefficient of determination R²
• Explained variance and Coefficient of determination
• Decomposition of the Total Sum of Squares SST
• Interpretation of the decomposition of SST
• Explained variation
• Residual variation
• Second definition of R²
• Coefficient of determination and Coefficient of correlation
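The relationship between the Coefficient of determination and the Coefficient of correlation announced above can be verified numerically (a sketch with made-up data): in SLR, R² equals the squared correlation coefficient r².

```python
import math

# Sketch: for SLR, R² equals the squared correlation coefficient r².
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)  # correlation coefficient
b = Sxy / Sxx
a = y_bar - b * x_bar
ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1.0 - ssr / Syy            # coefficient of determination

print(r2, r ** 2)  # the two numbers coincide
```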

_________________________________________________________

 Tutorial 4

We now give the foregoing results a geometric interpretation that provides a more intuitive understanding of Simple Linear Regression.

GEOMETRIC INTERPRETATION OF SLR

• The representation space, or "space of variables"
• SLR as a projection
• The sum of the residuals is 0
• The residual vector is orthogonal to x
• The residual vector is orthogonal to the predictions yi*
• The mean of the predictions is equal to the mean of the observations
• The LSL runs through the barycenter of the data set
• Decomposition of the Total Sum of Squares

_______________________________________________________________

 Tutorial 5

The foregoing text dealt with a problem in geometry, of which the LSL appeared to be an intuitively satisfying solution. The notion of randomness appeared nowhere. We now broaden the scope of SLR. The data set {xi, yi} is still considered to be a set of measurements of y for certain fixed values of x, but corrupted by random errors. As a consequence, the slope and intercept are now random variables, whose statistical properties we establish.

STATISTICAL PROPERTIES OF THE PARAMETERS OF THE LSL

• Random errors
• The parameters of the LSL are unbiased estimators
• Assumption #1: errors have 0 mean
• The slope b is unbiased
• The intercept a is unbiased
• Variances of the parameters of the LSL
• Additional assumptions about the errors εi
• Variance of the slope b
• Variance of the intercept a
• Covariance of the slope and the intercept
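The unbiasedness of the slope can be illustrated by a small Monte Carlo simulation (a sketch with made-up parameter values): many samples are generated from a known true line, and the average of the estimated slopes lands close to the true slope.

```python
import random

random.seed(0)

# True model: y = alpha + beta * x + error, with zero-mean errors.
alpha, beta, sigma = 1.0, 2.0, 0.5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
x_bar = sum(x) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)

slopes = []
for _ in range(20000):
    y = [alpha + beta * xi + random.gauss(0.0, sigma) for xi in x]
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
    slopes.append(b)

mean_b = sum(slopes) / len(slopes)
print(mean_b)  # close to the true slope beta = 2.0
```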

______________________________________________________

 Tutorial 6

We then address the issue of estimating the variance of the measurement errors.

UNBIASED ESTIMATION

OF THE VARIANCE OF THE ERRORS

• Unbiased estimation of the variance σ² of the measurement errors εi

This Table of Contents is very short. The topic is a bit more involved, although not really difficult. The unexpectedly simple final result is important from both a theoretical and a practical point of view.

____________________________________________________________

 Tutorial 7

Up to now, we dealt with the properties of the parameters of the regression line (slope, intercept, variance of the errors) and with those of the LSL.

We now address the issue of the properties of the predictions, the observations and the residuals. These quantities are related to the LSL only, not to the true Regression Line.

Recall that the only assumptions about the measurement errors are:

• All errors have the same variance (homoskedasticity).
• Errors are pairwise uncorrelated.

The most important properties and relationships between these quantities are analyzed in this Tutorial.

PREDICTIONS, OBSERVATIONS AND RESIDUALS

• Preliminary results
• The slope and the errors are uncorrelated
• The slope and the mean of the observations are uncorrelated
• Predictions
• Expectation of a prediction
• Covariance of two predictions
• Variance of a prediction
• Covariance between predictions and observations
• The residuals
• Expectation of a residual
• Covariance of two residuals
• Variance of a residual

_________________________________________________________

 Tutorial 8

So far, we introduced no assumption about the nature of the distribution of the errors. We now formulate the natural assumption that the errors are normally distributed. It is then possible to calculate the distributions of the parameters, of the residuals and of the predictions, as well as confidence intervals for these quantities. The distribution of the estimated error variance may also be calculated, as well as its variance.

DISTRIBUTIONS AND CONFIDENCE INTERVALS

UNDER THE NORMALITY ASSUMPTION

• Assumed distribution of the errors εi
• Slope
• Distribution of the slope
• Confidence intervals on the slope (no demonstration)
• Intercept
• Distribution of the intercept
• Confidence intervals on the intercept (no demonstration)
• Predictions
• Distribution of the predictions
• Confidence intervals on the predictions (no demonstration)
• Distribution of σ²*
• Distribution is Chi-square (no demonstration)
• Variance
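As a sketch (made-up data; the 95% Student quantile for n − 2 = 3 degrees of freedom, ≈ 3.182, is taken from a t table rather than computed), a confidence interval on the slope is built from its standard error:

```python
import math

# Sketch: t-based confidence interval on the slope under normal errors.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
a = y_bar - b * x_bar

# Residual-based estimate of the error variance (n - 2 degrees of freedom).
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
se_b = math.sqrt(s2 / Sxx)  # standard error of the slope

t_crit = 3.182  # 97.5% Student quantile, 3 degrees of freedom (from a table)
ci = (b - t_crit * se_b, b + t_crit * se_b)

print(ci)  # 95% confidence interval on the slope
```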

______________________________________________________

 Tutorial 9

The normality assumption allows devising two tests of the significance of the model. They both bear on the hypothesis β = 0 (no linear relationship between y and x). They will be shown to be equivalent.

TESTING THE SIGNIFICANCE OF SLR

• Testing β = 0
• F test on R²
• Distribution of the explained variance (SSE)
• Distribution of the residual variance (SSR) (no demonstration)
• SSE and SSR are independent (no demonstration)
• The F test
• The ANOVA table
• The two tests are equivalent
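The equivalence of the two tests can be seen numerically (a sketch with made-up data, using the document's convention that SSE is the explained sum of squares and SSR the residual one): the F statistic of the ANOVA table equals the square of the t statistic for β = 0.

```python
import math

# Sketch: the t test on the slope and the F test are equivalent (F = t²).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / Sxx
a = y_bar - b * x_bar

sse = b ** 2 * Sxx  # explained sum of squares (1 degree of freedom)
ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual
s2 = ssr / (n - 2)

t = b / math.sqrt(s2 / Sxx)  # t statistic for beta = 0
F = sse / s2                 # F statistic of the ANOVA table

print(F, t ** 2)  # identical up to rounding
```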

_____________________________________________________

 Tutorial 10

Every data point affects the estimates of the slope, the intercept, the predicted values and the estimated error variance to some degree. But some do so more than others. It is important to identify these points.

We address here the question of identifying those "exceptional" data points that have abnormally high contributions to some aspects of the model. This section makes heavy use of the concepts of "standardized residuals" and "studentized residuals", which we go over in some detail, as there is some confusion both in the literature and in software about the meaning of these terms.

The demonstrations of most of the results are very technical, and are therefore not given.

LEVERAGE OBSERVATIONS AND INFLUENTIAL OBSERVATIONS

• Observations that affect the predictions of the model
• The notion of "leverage observation"
• Leverage
• Large residuals
• Standardized residuals
• "Internally" studentized residuals
• Leave-one-out residuals
• Leave-one-out estimated error variance
• "Externally" studentized residuals
• DFFITSi
• Leave-one-out predictions
• DFFITSi
• Double outliers
• Cook's distance
• Observations that affect the parameters of the model
• The notion of "influential observation"
• DFBETASij
• Covariance Ratio
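As a sketch (made-up data, with one deliberately isolated point), the leverage of each observation has a simple closed form in SLR, and the leverages always sum to 2, the number of parameters of the line:

```python
# Sketch: leverages in SLR. An x value far from the others gets high leverage.
x = [1.0, 2.0, 3.0, 4.0, 10.0]  # the last point is far from the others

n = len(x)
x_bar = sum(x) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of observation i: h_i = 1/n + (x_i - x_bar)^2 / Sxx.
h = [1.0 / n + (xi - x_bar) ** 2 / Sxx for xi in x]

print(h)       # the isolated point carries by far the largest leverage
print(sum(h))  # always equals 2 in SLR
```

Note that leverage depends on the x values only, not on the observations y: a high-leverage point is not necessarily influential, which is why the Tutorial treats the two notions separately.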

_______________________________________________________
