Tutorials

Multiple linear regression (MLR)

If you're not familiar with Linear Regression, we suggest that you first read the entry on Simple Linear Regression (SLR).

# Principle of Multiple Linear Regression

Simple Linear Regression tried to "explain" the values of a variable y using the values of another variable x, the two variables being assumed to be linked by a linear relation:

y = ax + b + ε(x)

where ε is a random "noise" that depends a priori on x.

## The generation of data

Multiple Linear Regression (MLR) addresses just about the same problem except that now, the "response variable" y is supposed to be explained not by just one variable x, but by several variables {xj}. If we slightly change the foregoing notations, we assume that the linear relation between y and the {xj}s is:

y = β0 + β1x1 + β2x2 + ... + βpxp + ε(x)

where:

* p is the number of the so-called "independent" variables, or "regressors",

* ε(x) is a random noise (e.g. measurement errors) whose nature will be made more explicit later, but whose properties depend a priori on the point of the data space defined by the values of the xj.

The data consists of n measurements yi, i = 1, 2, ..., n taken for n sets of values {xij} of the independent variables:

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ε(x)i

In this expression:

* The βj are fixed but unknown numbers.

* The ε(x)i are n realizations of the random noise ε(x).
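As a concrete illustration, here is a minimal simulation of this data-generating process, written with NumPy. All numbers (n, p, the true parameters, the noise level) are hypothetical, chosen for illustration only:

```python
import numpy as np

# Simulate y_i = beta0 + beta1*x_i1 + beta2*x_i2 + eps_i for n = 50
# observations of p = 2 regressors. All values are illustrative.
rng = np.random.default_rng(0)

n, p = 50, 2
X = rng.uniform(0, 10, size=(n, p))      # the n sets of values {x_ij}
beta_true = np.array([1.0, 2.0, -0.5])   # beta0, beta1, beta2 (unknown in practice)
eps = rng.normal(0, 1.0, size=n)         # the n realizations of the noise

y = beta_true[0] + X @ beta_true[1:] + eps
print(y[:3])
```

In practice, of course, the analyst observes only X and y; the βj and the noise are unknown.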

## The model

Simple Linear Regression drew the straight line that best fits the data (in the Least Squares sense) in the (x, y) plane. Multiple Linear Regression does just the same, but visual representation is now impossible, except, just barely, when there are only two independent variables x1 and x2: Multiple Linear Regression then draws the plane that best fits the cloud of data points in the (x1, x2, y) space.

In this illustration, the Least Squares plane minimizes the sum of the squares of the lengths of the blue segments parallel to the y axis. These (signed) lengths are called the residuals of the model. The Least Squares plane is then the plane that minimizes the sum of the squared residuals.

In higher dimensions, we have to be satisfied with saying that Multiple Linear Regression determines the hyperplane of dimension p that minimizes the sum of the squares of the distances (measured parallel to the y axis) between the hyperplane and the data points.

The equation of this hyperplane is:

y* = β0* + β1*x1 + β2*x2 + ... + βp*xp

where the βj* are called the estimated parameters of the model.

The n values taken by y* for the various sets of values of the {xj} are called the adjusted values of the model. So, the ith adjusted value yi* is:

yi* = β0* + β1*xi1 + β2*xi2 + ... + βp*xip

In the above illustration, these are the (signed) heights of the points of the LS plane that lie on the verticals of the data points.
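The Least Squares fit itself can be sketched numerically. On simulated data (the figures are illustrative), the estimated parameters solve the normal equations β* = (XᵀX)⁻¹Xᵀy, computed below with a numerically stable solver:

```python
import numpy as np

# Simulated data (illustrative), following the data-generation step.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.uniform(0, 10, size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1.0, size=n)

# Prepend a column of ones so that beta0 plays the role of an intercept.
Xd = np.column_stack([np.ones(n), X])

# Least Squares solution of y ~ Xd beta; equivalent to (X'X)^{-1} X'y.
beta_star, *_ = np.linalg.lstsq(Xd, y, rcond=None)

adjusted = Xd @ beta_star        # the adjusted values y_i*
residuals = y - adjusted         # the (signed) residuals
print(beta_star)
```

By construction, the residuals are orthogonal to the columns of the design matrix, a geometric fact that will be developed at length later on.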

## Predictions

After being built on the available data, the model will probably be used for predicting the values taken by the response variable for new data points {x} that are not in the original data set (predictive modeling). The value of y predicted for the new data point xn+1 = {xn+1,1, ..., xn+1,p} will be:

y*n+1 = β0* + β1*xn+1,1 + β2*xn+1,2 + ... + βp*xn+1,p

and y*n+1 will be called the prediction of the model for xn+1.
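A prediction is then a simple dot product. In this sketch, the values of the estimated parameters and of the new data point are hypothetical, standing in for a fitted model:

```python
import numpy as np

# Prediction for a new data point x_{n+1}. The estimated parameters
# below are hypothetical values, standing in for a fitted model.
beta_star = np.array([1.1, 1.98, -0.47])   # beta0*, beta1*, beta2* (illustrative)
x_new = np.array([3.0, 7.5])               # x_{n+1,1}, x_{n+1,2}

# y*_{n+1} = beta0* + beta1* x_{n+1,1} + beta2* x_{n+1,2}
y_pred = beta_star[0] + beta_star[1:] @ x_new
print(y_pred)
```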

# Similarities between Simple and Multiple Linear Regressions

We will see that the similarity between the two problems translates into similarities between the solutions. We will calculate the estimators βj* of the coefficients βj, show that these estimators are unbiased, and calculate their covariance matrix.

We will then assume that ε is normally distributed, and deduce that the estimators βj* are also normally distributed. This result will lead to building tests and confidence intervals on the values of these estimated parameters.

We will then address the issue of the predictions of the model on new data, and calculate confidence intervals on these predictions.

In other words, the path followed by Multiple Linear Regression is quite similar to that of Simple Linear Regression, with similar results.

So why a special entry on Multiple Linear Regression?

# Differences between Simple and Multiple Linear Regressions

There are at least three reasons that plead in favor of a special treatment of Multiple Linear Regression.

## Matrix calculations, linear algebra

We studied Simple Linear Regression using "ordinary" equations and indeed, the same could be done with Multiple Linear Regression. But because there are now p independent variables instead of just one, the calculations become exceedingly cumbersome. Fortunately, Linear Algebra is here to give us a helping hand, and provides extremely compact and elegant calculations. As a matter of fact, Linear Algebra will be our almost exclusive tool throughout the Tutorials, and Multiple Linear Regression is an excellent pedagogical vehicle for a soft introduction to several important aspects of Linear Algebra.

## Variable selection

This point is more important for the analyst, and Multiple Linear Regression may very well be the first example the analyst meets of the bias-variance tradeoff, which we now summarize in a few words.

Any model (whether predictive or descriptive) built from a data sample should not incorporate "too many" parameters (here, the βj*). Beyond a certain limit:

* Its performance will keep improving if measured on the available data set, but will degrade if measured on new data, because of an increased variance of its predictions.

* The values of its parameters will become unstable (very sensitive to small changes in the data), and therefore meaningless, again because of too large a variance of the estimated parameters.

These issues are of great practical importance, and the analyst will therefore have to spend considerable effort to select among the numerous candidate independent variables those that will ultimately be retained in the final model.

## Collinearity of the predictors

Another source of variance of both the parameters and the predictions of a Multiple Linear Regression model is the possible collinearity of the predictors.

Even a thorough selection of the independent variables cannot completely eradicate collinearity. Consequently, Multiple Linear Regression has developed specific techniques, like Ridge Regression, for circumventing this problem.
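A minimal sketch of Ridge Regression on deliberately collinear data: when the regressors are nearly collinear, XᵀX is ill-conditioned, and adding λI before inverting stabilizes the estimates. The data and the value of λ are illustrative; for brevity, the intercept is penalized along with the slopes, which real implementations usually avoid:

```python
import numpy as np

# Two nearly collinear regressors (illustrative data): x2 is x1 plus a
# tiny perturbation, so X'X is very ill-conditioned.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(0, 0.5, size=n)

k = X.shape[1]
lam = 1.0                                   # illustrative penalty
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: beta* = (X'X + lambda I)^{-1} X'y (intercept penalized too,
# for brevity; real implementations usually leave it unpenalized).
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
print(beta_ols, beta_ridge)
```

The ridge solution always has a norm no larger than the OLS solution, which is exactly the shrinkage that tames the variance caused by collinearity.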

# Generalized Least Squares

A key assumption of the basic Multiple Linear Regression model is that the measurement errors ε:

* All have the same variance (homoskedasticity),

* And are uncorrelated,

which translates into :

Σ = σ²In

where Σ is the covariance matrix of the errors, and In is the order n unit matrix.

This assumption is very strong indeed, and it is rarely the case that it is fully satisfied. It is therefore natural to ask what can be salvaged from MLR when the deterministic part of the data is still linear, but when the covariance matrix of the errors is not proportional to In anymore.

Actually, a great deal can be salvaged as explained in the entry about Generalized Least Squares.
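As a sketch, when Σ is known, the Generalized Least Squares estimator is β* = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y. Here Σ is a hypothetical diagonal (heteroskedastic) covariance, assumed known for the sake of illustration:

```python
import numpy as np

# Illustrative heteroskedastic data: the error variance grows with x,
# so Sigma is diagonal but not proportional to the unit matrix.
rng = np.random.default_rng(2)
n = 60
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])
sigma2 = 0.1 * x**2                          # Var(eps_i), assumed known here
y = 2.0 + 0.5 * x + rng.normal(0, np.sqrt(sigma2))

Sigma_inv = np.diag(1.0 / sigma2)
# GLS estimator: beta* = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(beta_gls)
```

With a diagonal Σ this reduces to Weighted Least Squares: observations with large error variance simply count for less in the fit.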

# Polynomial regression

In the expression "linear regression", the term "linear" does not make reference to the fact that the model is linear in each of the regressors, but to the fact that it is linear in the parameters. The model

y = β0 + β1 log(x1) + β2 exp(x2)

which is definitely not linear in the variables, is linear in the parameters, and all the results pertaining to MLR are applicable to the model.

Introducing non linear functions of the variables in a MLR model is motivated by the desire to make the model cling to the true regression function more closely when it is anticipated that this function is not just a hyperplane, but is curved. It can therefore be said that a "linear model" can accommodate "non linear" regression functions.

When no information about the nature of the non linearity of the true regression function is available, it is customary to simply add monomials in the regressors to the basic model. For example, one might consider the model

y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²

which is certainly more flexible than the basic model linear in the regressors, as the latter is just a special case of the former with the last three parameters forced to take the value 0.

Such a model is called a polynomial model for obvious reasons.

One must refrain from adding terms of higher and higher degrees to the model, because of the bias-variance tradeoff that we already mentioned in the context of variable selection. As a matter of fact, building a polynomial model must go along with a thorough campaign of variable selection, as the regressors are by their very nature highly correlated.
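As a sketch, fitting such a polynomial model amounts to building a design matrix whose columns are the monomials, then applying ordinary Least Squares. The data below is simulated for illustration:

```python
import numpy as np

# Simulated data whose true regression function is curved (illustrative).
rng = np.random.default_rng(3)
n = 200
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 1 + x1 - 0.5 * x2 + 0.3 * x1 * x2 + 0.2 * x1**2 + rng.normal(0, 0.1, n)

# Design matrix with the monomial columns of the degree-2 model above:
# 1, x1, x2, x1*x2, x1^2, x2^2. The model stays linear in the parameters.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1**2, x2**2])
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_star)
```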

________________________________________________________

 Tutorial 1

In this Tutorial, we first establish the definitive notations we'll use throughout the sequel, and enunciate the assumption of non collinearity of the data.

We then give a geometric description of the problem in the space of variables that we will use many times later on. Most of the calculations that we will develop will be initiated by preliminary considerations about this geometric representation.

We then calculate the estimated parameters βj* of the model by the Least Squares method. Because this result is so important, we give several proofs, and in the process become familiar with the geometric representation of Multiple Linear Regression as well as with some basic notions of Linear Algebra.

This part is of a purely geometric nature, and contains no reference to Statistics.

FITTING THE MODEL BY THE LEAST SQUARES METHOD

* The data matrix
* Intercept and final notations
* The data matrix
* Assumptions about the data matrix
* Minimizing the quadratic error
* The space of variables
* The space of solutions
* The space of residuals
* Least Squares and orthogonal projection
* Invariance property of the adjusted values
* Calculating the Least Squares estimator
* Matrix calculus
* Position of the extremum
* The extremum is a minimum
* Geometric methods
* By the properties of a projection operator
* By decomposition of a vector on two orthogonal spaces
* By orthogonality of the solutions and residuals spaces
* Hat matrix
* Definition of the hat matrix
* A first word on leverages
* Invariance property of the hat matrix
* Orthogonal projections and components of the vector of adjusted values

_______________________________________________________________________

 Tutorial 2

The first Tutorial was purely geometric. We now introduce some elements of Statistics by considering that the measurement errors ε(x)i are random variables. The measured values yi are therefore also random, and so are the estimated parameters βj*.

We first formulate some assumptions about the statistical properties of the errors, and then calculate the basic statistical properties of the vector of estimated parameters:

* Its mean,

* And its covariance matrix.

These results are established using Linear Algebra exclusively, and enunciated in matrix form. But results about individual parameters (e.g. the correlation coefficient between two estimated parameters) can easily be derived from the matrix form.

-----

The Least Squares estimator β* of the vector of parameters β is the best of all linear unbiased estimators of β. For this reason, it is sometimes referred to as the BLUE (Best Linear Unbiased Estimator) of β. This result is known as the "Gauss-Markov theorem", and is demonstrated here.

STATISTICAL PROPERTIES OF THE ESTIMATED PARAMETERS

* Errors are random variables
* The vector of estimated parameters is random
* Centering the errors
* Homoskedasticity
* Uncorrelated errors
* Covariance matrix of the error vector
* Normality assumption?
* The vector of estimated parameters is unbiased
* Covariance matrix of the estimated parameters
* The general case
* Special case: orthogonal variables

_______________________________________________________________

 Tutorial 3

The preceding Tutorial addressed the statistical properties of β*, the vector of the estimated parameters.

We now move on to establishing the statistical properties of:

* The residuals,

* And the predictions

of the model.

The vector of residuals will play a particularly important role, and we first describe some of its geometric properties before going over its statistical properties.

-----

We will conclude with a question of prime importance for the analyst: how can the variance of the errors ε be estimated? We will exhibit an unbiased estimator of this variance.

Using this result, we will finally find estimators for the variances (and therefore the standard deviations) of the parameters βj*, a result of great importance for a further evaluation of the quality of the model.

RESIDUALS, ADJUSTED VALUES AND PREDICTION ERRORS

ESTIMATED VARIANCE OF THE ERRORS AND OF THE PARAMETERS

* Residuals
* Definition of the residuals
* Properties of the vector of residuals
* Projection of the vector of measurements
* Invariance property of the residuals
* Projection of the vector of errors
* Orthogonality of the residuals and the adjusted values
* Expectation of the vector of residuals
* Covariance matrix of the residuals
* Adjusted values
* Expectations of the adjusted values
* Covariance matrix of the adjusted values
* Covariance of the residuals and the adjusted values
* Properties of the prediction errors
* Expectations of the prediction errors
* Variances of the prediction errors
* First form
* Second form
* Unbiased estimation of the variance of the errors
* Estimation of the variances of the parameters

_______________________________________________________

 Tutorial 4

The linear model has now been fitted to the data by the Ordinary Least Squares (OLS) method. The result is the best possible model in the sense of Least Squares. Yet, "best possible" does not mean "good", and the need is felt for a measure of the quality of the fit.

The most popular criterion of fit for a linear regression model is the coefficient of determination R², as described in this Tutorial. It comes in several forms whose properties need to be understood in order to interpret the result of a Linear Regression correctly.
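As a sketch, the centered R² compares the sum of squared residuals to the total sum of squares around the mean. The data is simulated, and the model contains a constant term, as the centered form requires:

```python
import numpy as np

# Simulated data and a model with a constant term (illustrative).
rng = np.random.default_rng(4)
n = 80
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_star

# Centered R^2 = 1 - SSR / SST, with SST taken around the mean of y.
ss_res = residuals @ residuals
ss_tot = np.sum((y - y.mean())**2)
r2 = 1.0 - ss_res / ss_tot
print(r2)
```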

COEFFICIENT OF DETERMINATION

NON CENTERED R², CENTERED R², ADJUSTED R²

* Non centered coefficient of determination
* Non centered R²
* Interpretation of the non centered R²
* Drawback of the non centered R²
* Centered coefficient of determination
* Model with a constant
* Definition of the centered R²
* Multiple correlation coefficient
* Geometric interpretation
* Model without a constant
* Adjusted coefficient of determination
* The bias-variance tradeoff
* Adjusted R²

____________________________________________

 Tutorial 5

We have so far formulated the following assumptions about the data:

* There is a linear relation between the independent variables {xj} and the response variable y.

* The measurement errors εi have zero mean and the same variance σ² (homoskedasticity).

We did not assume any specific probability distribution for the measurement errors and yet, the Least Squares method proved powerful enough to yield some important results about the statistical properties of the estimators.

We now keep the same assumptions, but add a new one: the εi are normal (or "Gaussian") random variables. In vector form, this can be summarized by:

ε ~ N(0, σ²In)

where In is the unit matrix of order n.

This assumption opens new avenues for the study of Multiple Linear Regression:

1. The parameters can now be estimated by a method, namely the method of Maximum Likelihood, that is more general than the method of Least Squares.
2. The probability distributions of the parameters, estimated variance and model predictions will be fully determined. So we will be able to elaborate confidence intervals on the values of these parameters, as well as on the values predicted by the model.
3. We will also be able to devise tests on the values of these parameters, and so start studying the roles of individual variables in the behavior and performance of the model, as well as assess the overall credibility of the regression model.

_________

In this Tutorial, we examine the consequences of the normality assumption on the distributions of the various estimators we have already encountered. Confidence intervals and tests will be described in the next Tutorials.

THE NORMALITY ASSUMPTION

* Estimation of the parameters by the Method of Maximum Likelihood
* The log-likelihood
* Estimation of the parameters
* Estimation of the error variance
* Probability distribution of the estimated parameters (error variance is known)
* Distribution of the estimated parameters
* Distribution of the estimated error variance
* Independence of the estimated parameters and the estimated error variance
* Distributions of the estimated parameters (error variance is unknown)
* Distributions of the prediction errors
* Error variance is known
* Error variance is unknown

________________________________________________________________________

 Tutorial 6

Under the normality assumption, we have now calculated the probability distributions of the estimated parameters, of the estimated variance of the errors, and of the model predictions. It is therefore straightforward to build confidence intervals on these quantities, and this is what we do in this short Tutorial.
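As a sketch, a 95% confidence interval on each βj* is βj* ± t·s(βj*), where t is the appropriate Student quantile and s(βj*) the estimated standard deviation of βj*. The data is simulated, and the quantile below is a hardcoded tabulated value rather than one computed from a distribution function:

```python
import numpy as np

# Simulated data with one regressor (illustrative).
rng = np.random.default_rng(5)
n = 40
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, n)

beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_star
dof = n - X.shape[1]                 # n - (p + 1) degrees of freedom
s2 = resid @ resid / dof             # unbiased estimate of sigma^2
cov = s2 * np.linalg.inv(X.T @ X)    # estimated covariance matrix of beta*
se = np.sqrt(np.diag(cov))           # standard errors of beta0*, beta1*

t = 2.024   # Student quantile t_{0.975, 38}, taken from tables (approximate)
lower, upper = beta_star - t * se, beta_star + t * se
print(np.column_stack([lower, upper]))
```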

The notion of "Confidence Region" for several parameters considered simultaneously, although conceptually simple, is a bit difficult from a theoretical standpoint, and will only be explained briefly in a note. However, as some software packages display confidence ellipses for pairs of parameters, we describe some simple rules of interpretation.

CONFIDENCE INTERVALS

* Confidence intervals on estimated parameters
* Confidence intervals
* Note on confidence regions
* Confidence intervals on the estimated variance
* Confidence intervals on predictions

_______________________________________________

A first MLR model has been built. That's the easiest part of the analyst's job: all it takes is clicking the "Go" button. Now begins the long, hard and uncertain part: improving the model until it is believed that no additional effort can possibly make the final model better. Then, and only then, can the model be delivered to its end user, together with estimates of its ultimate performance and all the information about the data that was collected during the improvement phase.

We somewhat arbitrarily divide this improvement phase into three parts:

* Variable selection.

* Checking the assumptions behind the theory of MLR, and implementing corrective actions if they are not satisfied.

* Detecting observations that have a particularly strong influence on the structure and performance of the model (this question is addressed here).

The distinction between these three phases is indeed arbitrary, as the parts somewhat overlap and will have to be implemented repeatedly (and not necessarily in this order) each time a new model is built, until the final model is obtained. A general and imperative rule to keep in mind is that nothing can be said about a model by just analyzing other models, bigger or smaller: a model has to be fitted before any conclusion can be drawn about it.

The next three Tutorials are dedicated to the problem of variable selection.

* Tutorial 7 describes the fundamental test on nested models, whose purpose is to determine whether adding variables to a model (or deleting variables from a model) causes a significant change in the performance of the model. Although the test is not by itself a method of variable selection, it is the main tool used by one of the most popular strategies of variable selection.

* Tutorial 8 describes various quality criteria (adjusted R², Mallows' Cp, AIC, BIC) used for assessing the quality of a MLR model. These criteria will later be used by various strategies of variable selection.

* Tutorial 9 addresses the question of variable selection per se. It shows how the various tools developed in the preceding two Tutorials are used for selecting a good set of regressors.

 Tutorial 7

This Tutorial addresses the following question: "Build a first model with a certain set of independent variables. Then remove some variables from this set and build a new, reduced model (the two models are then said to be nested). Are the two models significantly different?".

In order to answer this question, we must give a precise meaning to the expression "significantly different". This will be achieved by identifying a statistic whose distribution will be known when the two models are actually identical. A test will then be deduced from the distribution of this statistic.

This test on nested models can be built as a Likelihood Ratio Test (LRT). As is often the case with LRTs, calculating the test statistic leads to rather cumbersome calculations that we will not go into. Fortunately, this same statistic can also be obtained by simpler and more intuitive geometric arguments that we'll develop in detail.

Identifying the distribution of the statistic will rely on Cochran's theorem.

-----

A special case of this situation is when just one variable is removed from the complete set of variables. We can then either use the test on nested models, or else use the more familiar Student's test on the coefficient of the removed variable : the two tests will turn out to be equivalent.

Another special case considers removing all the independent variables of the model. The corresponding test, called Fisher's global test, or the "R² test", determines whether the regression model just built has any significance at all.
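A minimal sketch of Fisher's global test statistic, written in its "R² form" F = (R²/p) / ((1 − R²)/(n − p − 1)). The data is simulated; under the null hypothesis of no linear relation, this statistic follows a Fisher distribution with (p, n − p − 1) degrees of freedom, so large values lead to rejecting the null:

```python
import numpy as np

# Simulated data: only the first of the two regressors matters (illustrative).
rng = np.random.default_rng(6)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = 0.5 + 1.5 * X[:, 0] + rng.normal(0, 1.0, n)

Xd = np.column_stack([np.ones(n), X])
beta_star, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta_star
r2 = 1.0 - resid @ resid / np.sum((y - y.mean())**2)

# Global F statistic: large values reject "all slope parameters are 0".
F = (r2 / p) / ((1.0 - r2) / (n - p - 1))
print(F)
```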

TEST ON NESTED MODELS

* Comparing models: preliminary considerations
* Variable selection and bias-variance tradeoff
* The three ways of comparing models
* Quality criteria
* Simulation
* Test on nested models
* Nested models
* The test hypothesis
* The test is a Likelihood Ratio test
* The new projection space
* Construction of the test statistic
* Distribution of the test statistic
* Orthogonality of the numerator and the denominator
* Independence of the numerator and the denominator
* Distribution of the numerator
* Distribution of the denominator
* Distribution of the statistic
* The test
* Student's t test on a single variable
* The nested models test
* Student's t test
* The two tests are equivalent
* Fisher's global test on all variables

__________________________________________________________

 Tutorial 8

A given set of design data will always lead to the construction of several models, which will differ in their:

* Sets of independent variables,

* And possibly by the methods used for calculating the model parameters.

Naturally, the question of identifying the "best" model will then arise.

This Tutorial is devoted to the description of some of the techniques used for assessing the "quality" of a MLR model.
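As a sketch of two of these criteria, here are the Gaussian AIC and BIC (up to additive constants) computed for two models on the same simulated data, one of which contains a pure-noise regressor. With k estimated parameters, AIC = n·log(SSR/n) + 2k and BIC = n·log(SSR/n) + k·log(n):

```python
import numpy as np

# Simulated data: x2 is a pure-noise regressor (illustrative).
rng = np.random.default_rng(8)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + x1 + rng.normal(0, 1.0, n)

def fit_ssr(Xd):
    """Sum of squared residuals of the Least Squares fit of y on Xd."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

# Gaussian AIC/BIC up to additive constants, k = number of parameters.
results = {}
for name, cols in [("x1 only", [x1]), ("x1 + x2", [x1, x2])]:
    Xd = np.column_stack([np.ones(n)] + cols)
    k = Xd.shape[1]
    ssr = fit_ssr(Xd)
    results[name] = (n * np.log(ssr / n) + 2 * k,
                     n * np.log(ssr / n) + k * np.log(n))
print(results)
```

Note that BIC penalizes model size more heavily than AIC as soon as log(n) > 2, i.e. for n > 8 or so.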

COMPARING MODELS IN MULTIPLE LINEAR REGRESSION

* Three definitions of the "quality" of a model
* The ideal solution: data is plentiful
* A more realistic situation: data is scarce
* Cross-validation
* Test between nested models
* Adjusted R²
* R² always increases when variables are added
* Adjusted R²
* Mallows' Cp
* Theory
* Interpreting and using Mallows' Cp
* Penalized likelihood
* General principle
* Akaike's Information Criterion (AIC)
* Bayesian Information Criterion (BIC)
* Comparing criteria

____________________________________________________

 Tutorial 9

We refer again to the bias-variance tradeoff to insist that, for a given level of performance on the design data, small models (i.e. models with few variables) should be preferred to large ones. Selecting the variables (among the plethora of available variables) that will be incorporated in the final model is a central issue of any modeling effort.

In this Tutorial, we show how the tests and quality criteria that we described in the previous Tutorials can be used for selecting a reduced, but effective set of independent variables.
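A minimal forward-selection sketch: at each step, add the candidate regressor that most reduces the sum of squared residuals. Real procedures stop with a test on nested models or a quality criterion; here we simply select a fixed number of variables, and the data is simulated:

```python
import numpy as np

# Simulated data: only regressors 0 and 3 truly enter the model (illustrative).
rng = np.random.default_rng(7)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(0, 0.5, n)

def ssr(cols):
    """Sum of squared residuals of the model using the given columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

selected = []
for _ in range(2):                       # greedily select two variables
    best = min((j for j in range(p) if j not in selected),
               key=lambda j: ssr(selected + [j]))
    selected.append(best)
print(selected)
```

On this data the procedure recovers the two truly relevant regressors, but greedy selection offers no such guarantee in general, which is precisely the "suboptimal solution" caveat discussed in this Tutorial.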

-----

Another role of variable selection is to combat collinearity by removing variables that are strongly correlated with other variables. But variable selection alone may not be enough to circumvent this serious problem. One then has to resort to techniques specifically designed to address the issue of near or complete collinearity, like Ridge Regression.

VARIABLE SELECTION IN MULTIPLE LINEAR REGRESSION

* Variable selection as an optimization problem
* Using quality criteria
* Using tests on nested models
* Exhaustive search
* Greedy search
* A naive idea
* Greedy algorithms
* Forward selection
* Using the test on nested models
* Using quality criteria
* Backward selection
* Stepwise selection
* Caveat
* Suboptimal solution
* Different selection procedures
* Different starting points
* Stability of the selection
* Interpretability
* Analyst's judgement

______________________________________________

 Tutorial 10

The validity of a MLR model, including the tests on parameters and the various variable selection procedures, relies on a set of rather stringent assumptions:

1) The process that generated the data is indeed linear.

2) The design matrix X is full rank.

3) The errors are uncorrelated and all have the same variance (homoskedasticity).

4) Additionally, these errors may be assumed to be normally distributed.

Although software will always produce estimated parameters and adjusted values (provided that the design matrix is full rank), the statistical properties of these quantities (bias, covariance matrix, probability distribution leading to confidence intervals and tests) can be inferred from theory only if the above assumptions are indeed justified.

A fitted model provides means of checking these assumptions to a certain extent. But because a model provides only partial information about the process that generated the data, every new model should be used, among other things, to provide additional circumstantial evidence about this process.

_____

The most widely used quantities for checking the MLR assumptions are the residuals. The residuals are the essential ingredients of many:

* Diagrams (or "plots"), that allow visual identification of various sorts of departures from the standard assumptions.

* "Coefficients", or "indicators", whose numerical values may point to a problem with the model.

* Tests, when the distributions of these indicators are known, at least approximately.

The Tutorial is organized by themes, not by validation techniques. This explains why the same technique is sometimes mentioned in several places, each time with a different problem in mind.

VALIDATION OF A MLR MODEL:

CHECKING THE ASSUMPTIONS

* Foreword: a vicious circle
* Linearity
* Plots
* Residuals vs adjusted values
* Adjusted values vs observed values
* Adjusted values vs regressors
* "Residual plus component" plot (or "partial residuals plot")
* Regressor transformations
* Polynomial regression
* Tukey's power ladder
* Box-Cox transformations
* Independence of the errors
* Runs test (Wald-Wolfowitz test)
* Durbin-Watson test
* Normality of the errors
* Tests
* Caveat
* Normality tests
* Skewness test
* Jarque-Bera test
* Histogram of the residuals
* Q-Q plot
* Another caveat: supernormality of the residuals
* Homoskedasticity
* Detecting heteroskedasticity
* Plots
* Residuals vs regressors
* Residuals vs adjusted values
* Adjusted values vs observed values
* Indicators
* Tests
* Goldfeld-Quandt test
* Breusch-Pagan test
* Variance stabilization
* Variance stabilization via transformation of the response variable
* A priori transformation
* Box-Cox transformation
* The Likelihood method
* The Atkinson method

____________________________________________________________

Related entries: Simple Linear Regression, Least Squares estimation, Bias-variance tradeoff, Generalized Least Squares.