Tutorials

Multiple linear regression (MLR)

If you're not familiar with Linear Regression, we suggest that you first read the entry on Simple Linear Regression (SLR).

Principle of Multiple Linear Regression

Simple Linear Regression was trying to "explain" the values of a variable y using the values of another variable x, these two variables being assumed to entertain a linear relation :

y = ax + b + ε(x)

where ε is a random "noise" that depends a priori on x.

The generation of data

Multiple Linear Regression (MLR) addresses just about the same problem except that now, the "response variable" y is supposed to be explained not by just one variable x, but by several variables {xj}. If we slightly change the foregoing notations, we assume that the linear relation between y and the {xj}s is :

y = β0 + β1x1 + β2x2 + ... + βpxp + ε (x)

where :

    * p is the number of the so-called "independent" variables, or "regressors",

    * ε(x) is a random noise (e.g. measurement errors) whose nature will be made more explicit later, but whose properties depend a priori on the point of the data space defined by the values of the xj.

The data is made of n measurements yi, i = 1, 2, ..., n taken for n sets of values {xij} of the independent variables :

yi  = β0 + β1xi1 + β2xi2 + ... + βpxip + ε(x)i 

In this expression :

    * The βi are fixed but unknown numbers.

    * The ε(x)i are n realizations of the ε(x).

The model

Simple Linear Regression was drawing a straight line that best fit the data (in the Least Squares sense) in the (x, y) plane. Multiple Linear Regression will do just the same, but visual representation is now impossible, except, just barely, when there are only two independent variables x1 et x2 : Multiple Linear Regression will then draw a plane that best fits the cloud of data point in the (x1, x2, y) space.

 

 

In this illustration, the Least Squares plane minimizes the sum of the squares of the lengths of the blue segments parallel to the y axis. These (signed) lengths are called the residuals of the model. The Least Squares plane is then the plane that minimizes the sum of the squared residuals.

 

In higher dimensions, we have to be satisified with saying that Multiple Linear Regression will determine the hyperplane of dimension p that minimizes the sum of the squares of the distances (measure parallel to the y axis) between the hyperplane and the data points.

The equation ot this hyperplane is :

y* = β*0 + β1*x1 + β2*x2 + ... + βp*xp

where the βj* are called the estimated parameters of the model.

Adjusted values

The n values taken by y* for the various set of values of the {xj} are called the adjusted values of the model. So, the ith adjusted value yi* is :

y*i  = β0* + β1*xi1 + β2*xi2 + ... + βp*xip

In the above illustration, these are the (signed) heights or the points in the LS plane vertical of the data points.

Predictions

After being built on available data, the model will probably be used for predicting the values taken by the response variable for new data points {x} that are not in the original data set (predictive modeling). The value of y predicted for the new data point  xn+1 = {xn+1,1, ..., xn+1,p} will be :

y*n+1  = β0* + β1*xn+1,1 + β2*xn+1,2 + ... + βp*xn+1,p

and y*n+1 will be called the prediction of the model for  xn+1.

Similarities between Simple and Multiple Linear Regressions

We will see that the similarity between the two problems translates into similarities between the solutions. We will calculate the estimators β*i of the coefficients βi, we will show that these estimators are unbiased, and we will calculate their covariance matrix.

We will then assume that ε is normally distributed, and deduce that the estimators β*i are also normally dsitributed. This result will lead to building tests and confidence intervals on the values of these estimated parameters.

We will then address the issue of the predictions of the model on new data, and calculate confidence intervals on these predictions.

In other words, the path followed by Multiple Linear Regression is quite similar to that of Simple Linear Regression, with similar results.

 

So why a special entry on Multiple Linear Regression ?

Differences between Simple and Multiple Linear Regressions

There are at least three reasons that plead in favor of a special treatment of Multiple Linear Regression.

Matrix calculations, linear algebra

We studied Simple Linear Regression using "ordinary" equations and indeed, the same could be done with Multiple Linear Regression. But because there are now p independent variables instead of just one, calculations become exagerately cumbersome. Fortunately, Linear Algebra is here to give us a helping hand, and provides for extremely compact and elegant calculations. As a matter of fact, Linear Algebra will be our almost exclusive tool throughout the Tutorials, and Multiple Linear Regression is an excellent pedagogic means for a soft introduction to several important aspects of Linear Algebra.

Variable selection

This point is more important for the analyst, and Multiple Linear Regression may very well be the first example he will meet of the bias-variance tradeoff that we now summarize in a few words.

Any model (whether predictive or descriptive) built from a data sample should not incorporate "too many" parameters (here, the β*j ). Beyond a certain limit :

These issues are of great practical importance, and the analyst will therefore have to spend considerable effort to select among the numerous candidate independent variables those that will ultimately be retained in the final model.  

Collinearity of the predictors

Another source of variance of both the parameters and the predictions of a Multiple Linear Regression model is the possible collinearity of the predictors.

Even a thorough selection of the independent variables cannot completely eradicate collinearity. Consequently, Multiple Linear Regression has developped specific techniques, like Ridge Regression, for circumventing this problem.

Generalized Least Squares

A key assumption of the basic Multiple Linear Regression model is that the mesurement errors ε :

    * All have the same variance (homoskedasticity),

    * And are uncorrelated,

which translates into :

Σ = σ²In

where Σ is the covariance matrix of the errors, and In is the order n unit matrix.

This assumption is very strong indeed, and it is rarely the case that it is fully satisfied. It is therefore natural to ask what can be salvaged from MLR when the deterministic part of the data is still linear, but when the covariance matrix of the errors is not proportional to In anymore.

Actually, a great deal can be salvaged as explained in the entry about Generalized Least Squares.

Polynomial regression

In the expression "linear regression", the term "linear" does not make reference to the fact that the model is linear in each of the regressors, but to the fact that it is linear in the parameters. The model

y = β0 + β1.log(x1) + β2.exp(x2

which is definitely not linear in the variables, is linear in the parameters, and all the results pertaining to MLR are applicable to the model.

Introducing non linear functions of the variables in a MLR model is motivated by the desire to make the model cling to the true regression function more closely when it is anticipated that this function is not just a hyperplane, but is curved. It can therefore be said that a "linear model" can accommodate "non linear" regression functions.

When no information about the nature of the non linearity of the true regression function is available, it is customary to simply add monoms in the regressors to the basic model. For example, one might consider the model

y = β0 + β1x1 + β2x2 + β3 x1x2 + β4 x12 + β5 x22 

which is certainly more flexible than the basic model linear in the regressors, as the latter is just a special case of the former with the last three parameters forced to take the value 0.

Such a model is called a polynomial model for obvious reasons.

One must refrain from adding terms of higher and higher degrees to the model because of the bias-variance tradeoff, that we already mentioned within the context of variable selection. As a matter of fact, building a polynomial model must go along with a thorough campaign of variable selection, as the regressors are highly correlated by very nature.

 

________________________________________________________

 

 

 

Tutorial 1

 

In this Tutorial, we first establsh the definitive notations we'll use throughout the sequel, and enunciate the assumption of non collinearity of the data.

We then give a geometric description of the problem in the space of variables that we will use many times later on. Most of the calculations that we will develop will be initiated by preliminary considerations about this geometric representation.

We then calculate the estimated parameters β*j of the model by the Least Squares method. Because this result is so important, we give several demonstrations and get accustomed in the process with the geometric representation of Multiple Linear Regression as well as with some basic notions of Linear Algebra.

This part is of a purely geometric nature, and contains no reference to Statistics.

 

 

 

 

FITTING THE MODEL BY THE LEAST SQUARES METHOD

The data matrix

Intercept and final notations

The data matrix

Assumptions about the data matrix

Minimizing the quadratic error

The space of variables

The space of  solutions

The space of residuals

Least Squares and orthogonal projection

Invariance property of the adjusted values

Calculating the Least Squares estimator

Matrix calculus

Position of the extremum

The extremum is a minimum

Geometric methods

By the properties of a projection operator

By decomposition of a vector on two orthogonal spaces

By orthogonality of the solutions and residuals spaces

Hat matrix

Definition of the hat matrix

A first word on leverages

Invariance property of the hat matrix

Orthogonal projections and components of the vector of adjusted values

 

TUTORIAL

_______________________________________________________________________

 

 

Tutorial 2

 

The first Tutorial was purely geometric. We now introduce some elements of Statistics by considering that the measurement errors ε(x)i are random variables. The measured values yi are therefore also random, and so are the estimated parameters βj*.

We first formulate some assumptions about the statistical properties of the errors, and then calculate the basic statistical properties of the vector of estimated parameters :

    * Its mean,

    * And its covariance matrix.

These results are established using Linear Algebra exclusively, and enunciated in matrix form. But result about individual parameters (e.g. the correlation coefficient between two estimated parameters) can easily be derived from the matrix form.

-----

The Least Squares estimator β* of the vector of parameters β is the best of all linear unbiased estimators of β. For this reason, it is sometimes refered to as the BLUE (Best Linear Unbiased Estimator) of β. This result is known as the "Gauss-Markov theorem", and is demonstrated here.

 

 

 

STATISTICAL PROPERTIES OF THE ESTIMATED PARAMETERS

Errors are random variables

The vector of estimated parameters is random

Centering the errors

Homoskedasticity

Uncorrelated errors

Covariance matrix of the error vector

Normality assumption ?

The vector of estimated parameters is unbiased

 Covariance matrix of the estimated parameters

The general case

Special case : orthogonal variables

TUTORIAL

_______________________________________________________________

 

 

Tutorial 3

 

The preceding Tutorial was addressing the statistical properties of β*, the vector of the estimated parameters.

We now move on to establishing the statistical properties of :

    * The residuals,

    * The adjusted values,

    * and the predictions

of the model.

The vector of residuals will play a particularly important role, and we first describe some of its geometric properties before going over its statistical properties.

-----

We will conclude with a question of prime importance for the analyst : how can the variance of the errors ε be estimated ? We will discover an unbiased estimator of this variance.

Using this result, we will finally find  estimators for the variances (and therefore standard deviations) of the parameters βj*, a result of great importance for a further evaluation of the quality of the model.

 

 

 

RESIDUALS, ADJUSTED VALUES AND PREDICTION ERRORS

ESTIMATED VARIANCE OF THE ERRORS AND OF THE PARAMETERS

Residuals

Definition of the residuals

Properties of the vector of residuals

Projection of the vector of measurements

Invariance property of the residuals

Projection of the vector of errors

Orthogonality of the residuals and the adjusted values

Expectation of the vector of residuals

Covariance matrix of the residuals

Adjusted values

Expectations of the adjusted values

Covariance matrix of the adjusted values

Covariance of the residuals and the adjusted values

Properties of the prediction errors

Expectations of the prediction errors

Variances of the prediction errors

First form

Second form

Unbiased estimation of the variance of the errors

Estimation of the variances of the parameters

TUTORIAL

 _______________________________________________________

 

 

Tutorial 4

 

The linear model has now been fitted to the data by the Ordinary Least Squares (OLS) method. The result is the best possible model in the sense of Least Squares. Yet, "best possible" does not mean "good", and the need is felt for a measure of the quality of the fit.

The most popular criterion of fit for a linear regression model is the coefficient of determination R² as described in this Tutorial. It exists in several forms whose properties need to be understood for correctly interpreting the result of a Linear Regression.

 

 

COEFFICIENT OF DETERMINATION

NON CENTERED R², CENTERED R², ADJUSTED R²

Non centered coefficient of determination

Non centered R ²

Interpretation of the non centered R²

Drawback of the non centered R ²

Centered coefficient of determination

Model with a constant

Definition of the centered R ²

Multiple correlation coefficient

Geometric interpretation

Model without a constant

Adjusted coefficient of determination

The biais-variance tradeoff

Adjusted R²

TUTORIAL

____________________________________________

 

 

Tutorial 5

 

We have so far formulated the following assumptions on data :

    * There is a linear relation between the independent variables {xj} and the response variable y.

    * The measurement errors εi have 0 mean and uniform variance σ² (homoskedasticity),

We did not assume any specific probability distribution for the measurement errors and yet, the Least Squares method proved powerful enough to yield some important results about the statistical properties of the estimators.

 

We now keep the same assumptions, but add a new one : the εi are normal (or "gaussian") random variables. In vector form, this can be summarized by :

ε~N(0, σ²In)

where In is the unit matrix of order n.

This assumption opens new avenues for the study of Multple Linear Regression :

  1. The parameters can now be estimated by a method, namely the method of Maximum Likelihood, that is more general that the method of Least Squares.
  2. The probability distributions of the parameters, estimated variance and model predictions will be fully determined. So we will be able to elaborate confidence intervals on the values of these parameters, as well as on the values predicted by the model.
  3. We will also be able to devise tests on the values of these parameters, and so start studying the roles of individual variables in the behavior and performances of the model, as well as assess the overal credibility of the regression model.

_________

In his Tutorial, we examine the consequences of the normal assumption on the distributions of the various estimators we have already encountered. Confidence intervals and tests will be described in the next Tutorials.

 

 

 

THE NORMALITY ASSUMPTION

Estimation of the parameters by the Method of Maximum Likelihood

The log-likelihood

Estimation of the parameters

Estimation of the error variance

Probability distribution of the estimated parameters (error variance is known)

Distribution of the estimated parameters

Distribution of the estimated error variance

Independence of the estimated parameters and the estimated error variance

Distributions of the estimated parameters (error variance is unknown)

Distributions of the prediction errors

Error variance is known

Error variance is inknown

TUTORIAL

________________________________________________________________________

 

 

Tutorial 6

 

Under the normality assumption, we have now calculated the probability distributions of the estimated parameters, of the estimated variance of the errors, and of the model predictions. It is therefore straightforward to build confidence intervals on the these quantities, and this is what we do in this short Tutorial.

The notion of "Confidence Region" for several parameters considered simultaneously, although conceptually simple, is a bit difficult from a theoretical standpoint, and will just be shortly explained in a note. However as some software incoporate displaying confidence ellipses for pairs of parameters, we describe some simple rules of interpretations.

 

 

CONFIDENCE INTERVALS

Confidence intervals on estimated parameters

Confidence intervals

Note on confidence regions

Confidence intervals on the estimated variance

Confidence intervals on predictions

TUTORIAL

 _______________________________________________

 

A first MLR model has been built. That's the easiest part of the analyst's job : all it takes is click on the "Go" button. Now begins the long, hard and uncertain part : improving the model until it is believed that no additional effort can possibly make the final model better. Then, and only then can the model be delivered to it's end user, together with estimates of its ultimate performances and all the information about the data that was collected during the improvement phase.

We somewhat arbitrarily divide this improvement phase into three parts :

    * Variable selection.

    * Checking the assumptions behind the theory of MLR, and implementing some corrective actions if they are not.

    * Detecting observations that have a particularly strong influence on the model structure and performance (this question is adressed here).

 

The distinction between these three phases is indeed arbitrary, as the parts somewhat overlap and will have to be implemented in turn repetitively (and not necessarily in this order) each time a new model is built until the final model is obtained. A general and imperative rule to be kept in mind is that nothing can be said about a model by just analyzing other models, bigger or smaller. A model has to be fitted before any conclusion can be drawn about the model.

 

 

 

The next three Tutorials are dedicated to the problem of variable selection.

    * Tutorial 7 describes the fundamental test on nested models whose purpose is to determine whether adding variables to a model (or deleting variables from a model) causes a significant change in the performances of the model. Although the test is not by itself a method of variable selection, it is the main tool used by one of the most popular strategies of variable selection.

    * Tutorial 8 describes various quality criteria (adjusted R², Mallows' Cp, AIC, BIC) used for assessing the quality of a MLR model. These criteria will later be used by various strategies of variable selection.

    * Tutorial 9 addresses the question of variable selection per se. It shows how the various tools developed in the preceding two Tutorials are used for selecting a good set of regressors.

 

 

 

Tutorial 7

 

This Tutorial addresses the following question : "Build a first model with a certain set of independent variables. Then remove some variables from this set and build a new, reduced model (the two models are then said to be nested). Are the two models significantly different ?".

In order to answer this question, we must give a precise meaning to the expression "significantly different". This will be achieved by identifying a statistic whose distribution will be known when the two models are actually identical. A test will then be deduced from the distribution of this statistic.

This test on nested models can be built as a Likelihood Ratio Test (LRT). As is often the case with LRTs, calculating the test statistic leads to rather cumbersome calculations that we will not go into. Fortunately, this same statistic can also be obtained by simpler and more intuitive geometric arguments that we'll develop in detail.

Identifying the distribution of the statistic will rely on Cochran's theorem.

-----

A special case of this situation is when just one variable is removed from the complete set of variables. We can then either use the test on nested models, or else use the more familiar Student's test on the coefficient of the removed variable : the two tests will turn out to be equivalent.

Another special case considers removing all the independent variables of the model. The corresponding test, called Fisher's global test, or the "R² test", determines whether the regression model just built has any significance at all.

 

 

TEST ON NESTED MODELS

Comparing models : preliminary considerations

Variable selection and bias-variance tradeoff

The three ways of comparing models

Quality criteria

Simulation

Test on nested models

Test on nested models

Nested models

The test hypothesis

The test is a Likelihood Ratio test

The new projection space

Construction of the test statistic

Distribution of the test statistic

Orthogonality of the numerator and the denominator

Independence of the numerator and the denominator

Distribution of the test statistic

Distribution of the numerator

Distribution of the denominator

Distribution of the statistic

The test

Student's t test on a single variable

The nested modesl test

Student's t test

The two tests are equivalent

Fisher's global test on all variables

TUTORIAL

__________________________________________________________

  

 

Tutorial 8

 

A given set of design data will always lead to constructing several models, that will differ by their :

    * Sets of independent variables,

    * And possibly by the methods used for calculating the model parameters.

Naturally, the question will then be raised to identify the "best" model.

 

This Tutorial is devoted to the description of some of the techniques used for assessing the "quality" of a MLR model.

 

 

 

COMPARING MODELS IN MULTIPLE LINEAR REGRESSION

Three definitions of the "quality" of a model

The ideal solution : data is plentiful

A more realistic situation : data is scarce

Cross-validation

Test between nested models

Adjusted R²

R² always increases when variables are added

Adjusted R²

Mallows'Cp

Theory

Interpretating and using Mallows' Cp

Penalized likelihood

General principle

Akaike's Information Criterion (AIC)

Bayesian Information Criterion (BIC)

Comparing criteria

TUTORIAL

____________________________________________________

 

 

Tutorial 9

 

We refer again to the bias-variance tradeoff to insist that, for a given level of performance on the design data, small models (e.g. with few variables) should be prefered to large ones. Selecting the variables (among the plethora of available variables) that will be incorporated in the final model is a central issue of any modeling action.

In this Tutorial, we show how the tests and quality criteria that we described in the previous Tutorials can be used for selecting a reduced, but effective set of independent variables.

-----

Another role of variable selection is to combat collinearity by removing variables that are strongly correlated with other variables. But variable selection alone may not be enough to circumvent this serious problem. One then has to resort to techniques specifically designed to address the issue of near or complete collinearity, like Ridge Regression.

 

 

 

VARIABLE SELECTION IN MULTIPLE LINEAR REGRESSION

Variable selection as an optimization problem

Using quality criteria

Using tests on nested models

Exhaustive search

Greedy search

A naive idea

Greedy algorithms

Forward selection

Forward selection

Using the test on nested models

Using quality criteria

Backward selection

Stepwise selection

Caveat

Suboptimal solution

Different selection procedures

Different starting points

Stability of the selection

Interpretability

Analyst's jugement

 

TUTORIAL

______________________________________________

 

 

Tutorial 10

 

The validity of a MLR model, including tests on parameters and the various variable selection procedures, rely on a set of rather stringent assumptions :

    1) The process that generated the data is indeed linear.

    2) The design matrix X is full rank.

    3) The errors are uncorrelated and have all the same variance (homoskedasticity).

    4) Additionally, these errors may be assumed to be normally distributed.

 

Although software will always produce estimated parameters and adjusted values (provided that the design matrix be full rank), the statistical properties of these quantities (bias, covariance matrix, probability distribution leading to confidence intervals and tests) can be infered from theory only if the above assumptions are indeed justified.

A fitted model provides means of checking these assumptions to a certain extent. But because a model provides only partial information about the process that generated the data, every new model should be used, among other things, to provide additional circumstantial evidence about this process.

_____

The most widely used quantities for checking the MLR assumptions are the residuals. The residuals are the essential ingredients of many :

    * Diagrams (or "plots"), that allow visual identification of various sorts of departures from the standard assumptions.

    * "Coefficients", of "indicators", whose numerical values may point to a problem with the model.

    * Tests, when the distribution of these indicators are known, at least approximately.


The Tutorial is organized by themes, not by validation techniques. This accounts for the same technique to be sometimes mentioned in several places, but each time with a different problem in mind.

 

VALIDATION OF A MLR MODEL :

CHECKING THE ASSUMPTIONS

Foreword : a vicious circle

Linearity

Plots

Residuals vs adjusted values

Adjusted values vs observed values

Adjusted values vs regressors

"Residual plus component" plot (or "partial residuals plots")

Regressor transformations

Polynomial regression

Tukey's power ladder

Box-Cox transformations

Independence of the errors

Runs test (Wald-Wolfowitz test)

Durbin-Watson test

Normality of the errors

Tests

Caveat

Normality tests

Skewness test

Jarque-Bera test

Histogram of the residuals

Q-Q plot

Another caveat : supernormality of the residuals

Homoskedasticity

Detecting heteroskedasticity

Plots

Residuals vs regressors

Residuals vs adjusted values

Adjusted values vs observed values

Indicators

Tests

Goldfeld-Quandt

Breusch-Pagan

Variance stabilization

Variance stabilization via transformation of the response variable

A priori transformation

Box-Cox transformation

The Likelihood method

The Atkinson method

 

TUTORIAL

 

 ____________________________________________________________

 

Related readings :

Simple Linear Regression

Least Squares estimation

Bias-variance tradeoff

Generalized Least Squares

Download this Glossary