Multiple linear regression (MLR)
If you're not familiar with Linear Regression, we suggest that you first read the entry on Simple Linear Regression (SLR).
Simple Linear Regression tried to "explain" the values of a variable y using the values of another variable x, the two variables being assumed to be linearly related:
y = ax + b + e(x)
where e is a random "noise" that depends a priori on x.
Multiple Linear Regression (MLR) addresses much the same problem, except that the "response variable" y is now supposed to be explained not by a single variable x, but by several variables {xj}. With a slight change of notation, we assume that the linear relation between y and the {xj} is:
y = b0 + b1x1 + b2x2 + ... + bpxp + e(x)
where:
* p is the number of the so-called "independent" variables, or "regressors",
* e(x) is a random noise (e.g. measurement errors) whose nature will be made more explicit later, but whose properties depend a priori on the point of the data space defined by the values of the xj.
The data consist of n measurements yi, i = 1, 2, ..., n, taken for n sets of values {xij} of the independent variables:
yi = b0 + b1xi1 + b2xi2 + ... + bpxip + e(x)i
In this expression:
* The bj (j = 0, 1, ..., p) are fixed but unknown numbers.
* The e(x)i are n realizations of the noise e(x).
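As a minimal sketch of the data model above, the following NumPy snippet simulates n measurements from known coefficients. The sample size, the coefficient values, the seed and the choice of a Gaussian noise with constant variance are assumptions made only for illustration; at this point the text assumes nothing specific about e(x).

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only

n, p = 50, 3                              # n measurements, p regressors (illustrative)
b = np.array([1.0, 2.0, -0.5, 0.3])       # b0, b1, ..., bp: unknown in a real problem

x = rng.uniform(-1.0, 1.0, size=(n, p))   # x[i, j-1] plays the role of x_ij
e = rng.normal(0.0, 0.1, size=n)          # one realization e_i of the noise per measurement

# y_i = b0 + b1*x_i1 + ... + bp*x_ip + e_i
y = b[0] + x @ b[1:] + e
```

In a real analysis only x and y are observed; b and e are exactly what the regression tries to separate.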
Simple Linear Regression drew the straight line that best fits the data (in the Least Squares sense) in the (x, y) plane. Multiple Linear Regression does just the same, but visual representation is now impossible, except, just barely, when there are only two independent variables x1 and x2: Multiple Linear Regression then draws the plane that best fits the cloud of data points in the (x1, x2, y) space.

In this illustration, the Least Squares plane minimizes the sum of the squares of the lengths of the blue segments parallel to the y axis. These (signed) lengths are called the residuals of the model. The Least Squares plane is then the plane that minimizes the sum of the squared residuals.
In higher dimensions, we have to be satisfied with saying that Multiple Linear Regression determines the hyperplane of dimension p that minimizes the sum of the squares of the distances (measured parallel to the y axis) between the hyperplane and the data points.
The equation of this hyperplane is:
y* = b0* + b1*x1 + b2*x2 + ... + bp*xp
where the bj* are called the estimated parameters of the model.
The n values taken by y* for the various sets of values of the {xj} are called the adjusted values of the model. So, the ith adjusted value yi* is:
yi* = b0* + b1*xi1 + b2*xi2 + ... + bp*xip
In the above illustration, these are the (signed) heights of the points of the LS plane that lie vertically above or below the data points.
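The estimated parameters, adjusted values and residuals can be sketched numerically as follows. This is only an illustration on simulated data: the variable names, the seed and the coefficient values are assumptions, and NumPy's general-purpose `np.linalg.lstsq` solver stands in for the derivations given in the Tutorials.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
n, p = 40, 2
x = rng.normal(size=(n, p))
y = 0.5 + x @ np.array([1.5, -2.0]) + rng.normal(scale=0.2, size=n)

# Prepend a column of ones so that the first coefficient is the intercept b0*.
X = np.column_stack([np.ones(n), x])

# Least Squares estimate: minimizes the sum of the squared residuals.
b_star, *_ = np.linalg.lstsq(X, y, rcond=None)

y_star = X @ b_star                  # adjusted values y_i*
residuals = y - y_star               # signed distances to the LS plane, parallel to the y axis
rss = float(residuals @ residuals)   # sum of the squared residuals
```

Moving `b_star` away from the Least Squares solution, in any direction, can only increase `rss`.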
After being built on available data, the model will probably be used for predicting the values taken by the response variable for new data points {x} that are not in the original data set (predictive modeling). The value of y predicted for the new data point xn+1 = {xn+1,1, ..., xn+1,p} will be :
y*n+1 = b0* + b1*xn+1,1 + b2*xn+1,2 + ... + bp*xn+1,p
and y*n+1 will be called the prediction of the model for xn+1.
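A prediction is then a single dot product. In the sketch below, the helper names `fit_least_squares` and `predict` are hypothetical, and the toy data is noise-free and lies exactly on the plane y = 1 + 2 x1 - 3 x2, so that the result can be checked by hand.

```python
import numpy as np

def fit_least_squares(x, y):
    """Return (b0*, b1*, ..., bp*) for regressors x (n x p) and response y (n)."""
    X = np.column_stack([np.ones(len(y)), x])
    b_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b_star

def predict(b_star, x_new):
    """Prediction(s) for one new point (shape (p,)) or several (shape (m, p))."""
    x_new = np.atleast_2d(x_new)
    return b_star[0] + x_new @ b_star[1:]

# Noise-free toy data lying exactly on the plane y = 1 + 2*x1 - 3*x2
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 1]

b_star = fit_least_squares(x, y)
y_pred = predict(b_star, [3.0, 2.0])   # new point, not in the data set
```

Because the data is noise-free, the fit recovers the plane and the prediction equals 1 + 2·3 - 3·2 = 1 up to rounding errors.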
We will see that the similarity between the two problems translates into similarities between the solutions. We will calculate the estimators bj* of the coefficients bj, show that these estimators are unbiased, and calculate their covariance matrix.
We will then assume that e is normally distributed, and deduce that the estimators bj* are also normally distributed. This result will lead to tests and confidence intervals on the values of these estimated parameters.
We will then address the issue of the predictions of the model on new data, and calculate confidence intervals on these predictions.
In other words, the path followed by Multiple Linear Regression is quite similar to that of Simple Linear Regression, with similar results.
So why a special entry on Multiple Linear Regression?
There are at least three reasons that plead in favor of a special treatment of Multiple Linear Regression.
We studied Simple Linear Regression using "ordinary" equations, and the same could indeed be done with Multiple Linear Regression. But because there are now p independent variables instead of just one, the calculations become exceedingly cumbersome. Fortunately, Linear Algebra gives us a helping hand and allows for extremely compact and elegant calculations. As a matter of fact, Linear Algebra will be our almost exclusive tool throughout the Tutorials, and Multiple Linear Regression is an excellent pedagogic vehicle for a gentle introduction to several important aspects of Linear Algebra.
The second point matters more to the analyst: Multiple Linear Regression may very well be the first example they meet of the bias-variance tradeoff, which we now summarize in a few words.
Any model (whether predictive or descriptive) built from a data sample should not incorporate "too many" parameters (here, the bj*). Beyond a certain limit, additional parameters inflate the variance of both the estimated parameters and the predictions.
These issues are of great practical importance, and the analyst will therefore have to spend considerable effort to select among the numerous candidate independent variables those that will ultimately be retained in the final model.
Another source of variance of both the parameters and the predictions of a Multiple Linear Regression model is the possible collinearity of the independent variables.
Even a thorough selection of the independent variables cannot completely eradicate collinearity. Consequently, specific techniques, like Ridge Regression, have been developed within Multiple Linear Regression to circumvent this problem.
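The effect of collinearity can be sketched on two almost identical regressors. Everything in the snippet is an illustrative assumption: the simulated data, the omission of the intercept (to keep the example short), the penalty value `lam`, and the use of the textbook ridge solution of (X'X + lam·I) b = X'y, which is not necessarily presented in this form in the Tutorials.

```python
import numpy as np

rng = np.random.default_rng(2)             # arbitrary seed
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])              # no intercept, to keep the sketch short
y = x1 + x2 + rng.normal(scale=0.1, size=n)

gram = X.T @ X
cond = np.linalg.cond(gram)                # very large: X'X is nearly singular

# Ordinary Least Squares: solves (X'X) b = X'y
b_ols = np.linalg.solve(gram, X.T @ y)

# Ridge Regression: solves (X'X + lam * I) b = X'y, with an arbitrary illustrative lam
lam = 1.0
b_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
```

The sum of the two coefficients is well determined in both cases, but only the ridge solution keeps the individual coefficients from drifting far apart in opposite directions.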
___________________________________________
Tutorial 1
In this Tutorial, we first establish the definitive notations that we will use throughout the sequel, and state the assumption of non-collinearity of the data.
We then give a geometric description of the problem in the space of variables that we will use many times later on. Most of the calculations that we will develop will be initiated by preliminary considerations about this geometric representation.
We then calculate the estimated parameters bj* of the model by the Least Squares method. Because this result is so important, we give several demonstrations, and in the process become familiar with the geometric representation of Multiple Linear Regression as well as with some basic notions of Linear Algebra.
This part is of a purely geometric nature, and contains no reference to Statistics.
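The geometric picture (Least Squares as an orthogonal projection) can be checked numerically. The sketch below assumes the standard normal equations (X'X) b* = X'y and the usual "hat matrix" H = X (X'X)^-1 X'; the variable names and the random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)             # arbitrary seed
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # data matrix with intercept column
y = rng.normal(size=n)                                       # any vector of measurements

# Normal equations (X'X) b* = X'y, solvable because the columns of X are not collinear
b_star = np.linalg.solve(X.T @ X, X.T @ y)

# "Hat matrix": orthogonal projector onto the space spanned by the columns of X
H = X @ np.linalg.solve(X.T @ X, X.T)

y_star = H @ y            # adjusted values = orthogonal projection of y
residuals = y - y_star    # component of y orthogonal to the columns of X
leverages = np.diag(H)    # diagonal terms of H
```

H is symmetric and idempotent (H·H = H), as any orthogonal projector must be, and its trace equals the number of columns of X, here p + 1.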
-----
Note that we call on several important results of Linear Algebra that we state without demonstration. The interested reader will find these demonstrations in any of the many excellent specialized textbooks.
FITTING THE MODEL BY THE LEAST SQUARES METHOD
* The data matrix
* Intercept and final notations
* The data matrix
* Assumptions about the data matrix
* Minimizing the quadratic error
* The space of variables
* The space of solutions
* The space of residuals
* Least Squares and orthogonal projection
* Calculating the Least Squares estimator
* Matrix calculus
* Position of the extremum
* The extremum is a minimum
* Geometric methods
* By the properties of a projection operator
* By decomposition of a vector on two orthogonal spaces
* By orthogonality of the solutions and residuals spaces
* "Hat matrix" and leverage
* Orthogonal projections and components of the vector of adjusted values
_______________________________________________________________________
Tutorial 2
The first Tutorial was purely geometric. We now introduce some elements of Statistics by considering that the measurement errors e(x)i are random variables. The measured values yi are therefore also random, and so are the estimated parameters bj*.
We first formulate some assumptions about the statistical properties of the errors, and then calculate the basic statistical properties of the vector of estimated parameters :
* Its mean,
* And its covariance matrix.
These results are established using Linear Algebra exclusively, and stated in matrix form. But results about individual parameters (e.g. the correlation coefficient between two estimated parameters) can easily be derived from the matrix form.
-----
We conclude with the Gauss-Markov theorem that states that the Least Squares estimator is the best among the linear unbiased estimators of the model parameters.
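The mean and covariance matrix of the estimator can be illustrated by simulation. The sketch assumes the classical results for centered, homoskedastic and uncorrelated errors, namely E[b*] = b and Cov(b*) = s² (X'X)^-1; the design matrix, the true coefficients, the noise level and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)             # arbitrary seed
n, sigma = 25, 0.5
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])  # fixed design
b_true = np.array([1.0, -2.0, 0.5])

# Theoretical covariance matrix of the estimator: sigma^2 (X'X)^-1
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

# Draw many error vectors (one per column) and refit the model for each of them
n_rep = 20000
E = rng.normal(scale=sigma, size=(n, n_rep))
Y = (X @ b_true)[:, None] + E
B = np.linalg.solve(X.T @ X, X.T @ Y)      # shape (3, n_rep): one estimate per column

mean_emp = B.mean(axis=1)   # should be close to b_true (unbiasedness)
cov_emp = np.cov(B)         # should be close to cov_theory
```

The errors are drawn from a normal distribution only for convenience; the two results being checked do not rely on normality.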
STATISTICAL PROPERTIES OF THE ESTIMATED PARAMETERS
* Errors are random variables
* The vector of estimated parameters is random
* Centering the errors
* Homoskedasticity
* Uncorrelated errors
* Covariance matrix of the error vector
* Normality assumption?
* The vector of estimated parameters is unbiased
* Covariance matrix of the estimated parameters
* The general case
* Special case: orthogonal variables
* The Gauss-Markov theorem
_______________________________________________________________
Tutorial 3
The preceding Tutorial addressed the statistical properties of b*, the vector of the estimated parameters.
We now move on to establish the statistical properties of :
* The residuals,
* The adjusted values,
* and the predictions
of the model.
The vector of residuals will play a particularly important role, and we first describe some of its geometric properties before going over its statistical properties.
-----
We will conclude with a question of prime importance for the analyst : how can the variance of the errors e be estimated ? We will discover an unbiased estimator of this variance.
Using this result, we will finally find estimators for the variances (and therefore standard deviations) of the parameters bj*, a result of great importance for a further evaluation of the quality of the model.
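The estimation of the error variance and of the standard deviations of the parameters can be sketched as follows. The code assumes the usual unbiased estimator, the sum of the squared residuals divided by n - p - 1, and the estimated covariance matrix s*² (X'X)^-1; the function name `ls_with_standard_errors` and the simulated data are hypothetical.

```python
import numpy as np

def ls_with_standard_errors(X, y):
    """X: (n, p+1) data matrix including the column of ones; y: (n,) measurements."""
    n, k = X.shape                               # k = p + 1 estimated parameters
    xtx_inv = np.linalg.inv(X.T @ X)
    b_star = xtx_inv @ X.T @ y
    residuals = y - X @ b_star
    s2_star = residuals @ residuals / (n - k)    # unbiased estimate of the error variance
    cov_b = s2_star * xtx_inv                    # estimated covariance matrix of b*
    std_err = np.sqrt(np.diag(cov_b))            # estimated standard deviations of the bj*
    return b_star, s2_star, std_err

rng = np.random.default_rng(5)                   # arbitrary seed
n, sigma = 200, 0.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(scale=sigma, size=n)

b_star, s2_star, std_err = ls_with_standard_errors(X, y)
```

Dividing by n instead of n - p - 1 would systematically underestimate the variance, because the residuals live in a space of dimension n - p - 1 only.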
RESIDUALS, ADJUSTED VALUES AND PREDICTION ERRORS
ESTIMATED VARIANCE OF THE ERRORS AND OF THE PARAMETERS
* Residuals
* Definition of the residuals
* Properties of the vector of residuals
* Projection of the vector of measurements
* Projection of the vector of errors
* Orthogonality of the residuals and the adjusted values
* Expectation of the vector of residuals
* Covariance matrix of the residuals
* Adjusted values
* Expectations of the adjusted values
* Covariance matrix of the adjusted values
* Covariance of the residuals and the adjusted values
* Properties of the prediction errors
* Expectations of the prediction errors
* Variances of the prediction errors
* First form
* Second form
* Unbiased estimation of the variance of the errors
* Estimation of the variances of the parameters
_____________________________________________________________
Tutorial 4
The model is now built, and the question is : "Does the model account for the data satisfactorily ?". Intuitively, this will be the case if the residuals are small. The Coefficient of Determination, denoted by R², brings an answer that can be easily interpreted in geometric terms.
In this short Tutorial, we describe the geometric origin of the R², and we give its improved version, the "adjusted R²".
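Both coefficients can be computed from two sums of squares. The sketch below assumes the usual definitions R² = 1 - RSS/TSS and adjusted R² = 1 - (RSS/(n - p - 1)) / (TSS/(n - 1)) for a model with an intercept; the function name `r_squared` and the simulated data are illustrative.

```python
import numpy as np

def r_squared(X, y):
    """Return (R2, adjusted R2) for a model with intercept; X includes the column of ones."""
    n, k = X.shape                        # k = p + 1
    b_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b_star) ** 2)   # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(9)            # arbitrary seed
n = 80
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=n)])   # add a pure-noise variable

r2_small, r2_adj_small = r_squared(X_small, y)
r2_big, r2_adj_big = r_squared(X_big, y)
```

Adding the irrelevant variable cannot decrease R², which is why the adjusted version, which penalizes the number of parameters, is preferred for comparing models of different sizes.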
COEFFICIENT OF DETERMINATION R²
ADJUSTED R²
* Coefficient of determination R²
* Origin of the variability of the measurements
* Geometric interpretation
* Analysis of variance
* Quality of fit of the model to the data
* Adjusted R²
________________________________________
Tutorial 5
We have so far formulated the following assumptions on data :
* There is a linear relation between the independent variables {xj} and the response variable y.
* The measurement errors ei have zero mean and uniform variance s² (homoskedasticity).
We did not assume any specific probability distribution for the measurement errors and yet, the Least Squares method proved powerful enough to yield some important results about the statistical properties of the estimators.
We now keep the same assumptions, but add a new one : the ei are normal (or "gaussian") random variables. In vector form, this can be summarized by :
e~N(0, s²In)
where In is the unit matrix of order n.
This assumption opens new avenues for the study of Multiple Linear Regression.
_________
In this Tutorial, we examine the consequences of the normality assumption on the distributions of the various estimators we have already encountered. Confidence intervals and tests will be described in the next Tutorials.
-----
The reader is warned that the fundamental result on the distribution of the estimated variance will be given without demonstration. This result relies on Cochran's Theorem, which we state, but whose demonstration is beyond the scope of this Glossary.
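Under the normality assumption, the parameters can also be estimated by Maximum Likelihood. The sketch below relies on two standard facts: the likelihood is maximized in b by the Least Squares estimate, and in s² by RSS/n, which is smaller than the unbiased estimate RSS/(n - p - 1). The function name `log_likelihood` and the simulated data are assumptions.

```python
import numpy as np

def log_likelihood(b, s2, X, y):
    """Gaussian log-likelihood of (b, s2) when e ~ N(0, s2 * I_n)."""
    n = len(y)
    r = y - X @ b
    return -0.5 * n * np.log(2.0 * np.pi * s2) - (r @ r) / (2.0 * s2)

rng = np.random.default_rng(6)            # arbitrary seed
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.4, size=n)

b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ b_ls) ** 2)

s2_ml = rss / n                           # Maximum Likelihood estimate (biased)
s2_unbiased = rss / (n - X.shape[1])      # unbiased estimate, n - p - 1 degrees of freedom

ll_max = log_likelihood(b_ls, s2_ml, X, y)
```

Any other pair (b, s²) gives a log-likelihood no larger than `ll_max`.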
THE NORMALITY ASSUMPTION
* Estimation of the parameters by the Method of Maximum Likelihood
* The log-likelihood
* Estimation of the parameters
* Estimation of the error variance
* Probability distribution of the estimated parameters (error variance is known)
* Distribution of the estimated parameters
* Distribution of the estimated error variance (no demonstration)
* Independence of the estimated parameters and the estimated error variance
* Distributions of the estimated parameters (error variance is unknown)
* Distributions of the prediction errors
* Error variance is known
* Error variance is unknown
_____________________________________________________________________
Tutorial 6
Under the normality assumption, we have now calculated the probability distributions of the estimated parameters, of the estimated variance of the errors, and of the model predictions. It is therefore straightforward to build confidence intervals on these quantities, and this is what we do in this short Tutorial.
The notion of "Confidence Region" for several parameters considered simultaneously, although conceptually simple, is a bit difficult from a theoretical standpoint, and will only be briefly explained in a note. However, as some software packages can display confidence ellipses for pairs of parameters, we describe some simple rules of interpretation.
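A confidence interval on a single parameter can be sketched as bj* ± t · (estimated standard deviation of bj*), where t is a Student quantile with n - p - 1 degrees of freedom. In the snippet, the function name `parameter_confidence_intervals` is hypothetical, the data is simulated, and the quantile 2.086 (97.5 % level, 20 degrees of freedom) is simply read from a standard Student table; a statistics library would normally supply it.

```python
import numpy as np

def parameter_confidence_intervals(X, y, t_quantile):
    """Two-sided intervals b_j* +/- t_quantile * (estimated standard deviation of b_j*)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b_star = xtx_inv @ X.T @ y
    residuals = y - X @ b_star
    s2_star = residuals @ residuals / (n - k)    # unbiased estimate of the error variance
    half_width = t_quantile * np.sqrt(s2_star * np.diag(xtx_inv))
    return b_star - half_width, b_star + half_width

rng = np.random.default_rng(7)                   # arbitrary seed
n, p = 24, 3                                     # n - p - 1 = 20 degrees of freedom
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)

T_975_DF20 = 2.086      # Student quantile, 97.5 %, 20 degrees of freedom (table value)
lower, upper = parameter_confidence_intervals(X, y, T_975_DF20)
```

Each interval is a 95 % interval for its own parameter only; the intervals taken together do not form a 95 % confidence region for the whole vector of parameters.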
CONFIDENCE INTERVALS
* Confidence intervals on estimated parameters
* Confidence intervals
* Note on confidence regions
* Confidence intervals on the estimated variance
* Confidence intervals on predictions
_________________________________________________
Tutorial 7
Once the model is built, one may wonder about how relevant the independent variables in the model actually are. For every independent variable, one may reasonably wonder whether or not it is necessary to incorporate the variable in the model. For suppose that we build a new model without this variable, and that this new (reduced) model turns out to be just about as good as the complete model on the available data : then the bias-variance tradeoff suggests that we should use the reduced model instead of the complete one.
So this Tutorial addresses the following question: "Build a first model with a certain set of independent variables. Then remove some variables from this set and build a new, reduced model (the two models are then said to be nested). Are the two models significantly different?".
In order to answer this question, we must give a precise meaning to the expression "significantly different". This will be achieved by identifying a statistic whose distribution will be known when the two models are actually identical. A test will then be deduced from the distribution of this statistic.
The statistic of the nested models test may be constructed by the method of the "maximum likelihood ratio". This approach is quite technical, and we will instead resort to intuitive geometric considerations. Some rather heuristic but simple arguments will lead us to the definition of the appropriate test statistic. Identifying the distribution of the statistic will unfortunately rely again on Cochran's theorem, which we state again without demonstration.
-----
A special case of this situation is when just one variable is removed from the complete set of variables. We can then either use the test on nested models, or else use the more familiar Student test on the coefficient of the removed variable : the two tests will turn out to be equivalent.
Another special case considers removing all the independent variables of the model. The corresponding test, called Fisher's global test, or the "R² test", determines whether the regression model just built has any significance at all.
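The equivalence of the two tests in the single-variable case can be verified numerically. The sketch assumes the usual form of the nested models statistic, F = ((RSS_reduced - RSS_complete) / q) / (RSS_complete / (n - p - 1)), where q is the number of removed variables; the function names and the simulated data are hypothetical.

```python
import numpy as np

def rss_of(X, y):
    """Residual sum of squares of the Least Squares fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def nested_f_statistic(X_full, X_reduced, y):
    """F statistic comparing a complete model with a reduced (nested) one."""
    n, k_full = X_full.shape
    q = k_full - X_reduced.shape[1]          # number of removed variables
    rss_full, rss_red = rss_of(X_full, y), rss_of(X_reduced, y)
    return ((rss_red - rss_full) / q) / (rss_full / (n - k_full))

def t_statistic(X, y, j):
    """Student statistic for the coefficient of column j."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    r = y - X @ b
    s2 = r @ r / (n - k)
    return b[j] / np.sqrt(s2 * xtx_inv[j, j])

rng = np.random.default_rng(8)               # arbitrary seed
n = 50
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=n)

X_reduced = X_full[:, :3]                    # drop the last variable
F = nested_f_statistic(X_full, X_reduced, y)
t = t_statistic(X_full, y, 3)
```

Here F and t² coincide up to rounding errors. Passing `X_full[:, :1]` (the column of ones alone) as the reduced model gives the statistic of Fisher's global test.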
TESTS ON NESTED MODELS
* Irrelevant variables
* Testing the significance of a variable
* Why detect irrelevant variables?
* Nested models
* Special cases of nested models
* Deleting a single variable
* Deleting all variables: Fisher's global test
* Test between nested models
* The new projection space
* Construction of the test statistic
* Distribution of the test statistic
* Orthogonality of the numerator and the denominator
* Independence of the numerator and the denominator
* Distribution of the test statistic
* Distribution of the numerator
* Distribution of the denominator
* Distribution of the statistic
* The test
* Student's test on a single variable
* The nested models test
* Student's t test
* The two tests are equivalent
* Fisher's global test on all variables
___________________________________________________________
Tutorial 8
A given set of design data will always lead to constructing several models, that will differ by their :
* Sets of independent variables,
* And possibly by the methods used for calculating the model parameters.
Naturally, the question then arises of identifying the "best" model.
This Tutorial is devoted to the description of some of the techniques used for assessing the "quality" of a MLR model.
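Two of the penalized likelihood criteria can be sketched in a few lines. The code assumes a Gaussian linear model and one common convention in which additive constants are dropped: AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with k the number of estimated coefficients. Other conventions differ by constants that do not change the ranking of models fitted on the same data. The function name and the simulated data are illustrative.

```python
import numpy as np

def information_criteria(X, y):
    """Return (AIC, BIC) for a Gaussian linear model, additive constants dropped."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    fit_term = n * np.log(rss / n)        # -2 * maximized log-likelihood, up to a constant
    aic = fit_term + 2 * k
    bic = fit_term + k * np.log(n)
    return aic, bic

rng = np.random.default_rng(11)           # arbitrary seed
n = 100
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)   # x[:, 1] is irrelevant

ones = np.ones(n)
candidates = {
    "intercept only": np.column_stack([ones]),
    "x1": np.column_stack([ones, x[:, 0]]),
    "x1 + x2": np.column_stack([ones, x]),
}
scores = {name: information_criteria(X, y) for name, X in candidates.items()}
```

For both criteria, smaller is better. BIC penalizes each additional parameter by ln(n) instead of 2, and therefore tends to retain smaller models than AIC as soon as n exceeds 7.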
COMPARING MODELS IN MULTIPLE LINEAR REGRESSION
* Three definitions of the "quality" of a model
* The ideal solution: data is plentiful
* A more realistic situation: data is scarce
* Cross-validation
* Test between nested models
* Adjusted R²
* R² always increases when variables are added
* Adjusted R²
* Mallows' Cp
* Theory
* Interpreting and using Mallows' Cp
* Penalized likelihood
* General principle
* Akaike's Information Criterion (AIC)
* Bayesian Information Criterion (BIC)
* Comparing criteria
____________________________________________________
Tutorial 9
We refer again to the bias-variance tradeoff to insist that, for a given level of performance on the design data, small models (e.g. with few variables) should be preferred to large ones. Selecting the variables (among the plethora of available variables) that will be incorporated in the final model is a central issue of any modeling effort.
In this Tutorial, we show how the tests and quality criteria that we described in the previous Tutorials can be used for selecting a reduced, but effective set of independent variables.
-----
Another role of variable selection is to combat collinearity by removing variables that are strongly correlated with other variables. But variable selection alone may not be enough to circumvent this serious problem. One then has to resort to techniques specifically designed to address the issue of near or complete collinearity, like Ridge Regression.
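A greedy forward selection driven by a quality criterion can be sketched as follows. The choice of BIC (in the convention n ln(RSS/n) + k ln(n)) as the criterion, the stopping rule, the function names and the simulated data are all assumptions made for illustration; a test on nested models could drive the same loop.

```python
import numpy as np

def bic(X, y):
    """BIC of a Gaussian linear model, additive constants dropped."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def forward_selection(x, y):
    """Greedy forward selection of columns of x (n x p), scored by BIC."""
    n, p = x.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best = bic(ones, y)                       # start from the intercept-only model
    while remaining:
        # Score every model obtained by adding one of the remaining variables
        scores = [(bic(np.hstack([ones, x[:, selected + [j]]]), y), j) for j in remaining]
        score, j = min(scores)
        if score >= best:                     # no candidate improves the criterion
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected, best

rng = np.random.default_rng(10)               # arbitrary seed
n = 200
x = rng.normal(size=(n, 6))
y = 1.0 + 3.0 * x[:, 0] - 2.0 * x[:, 3] + rng.normal(scale=0.5, size=n)

selected, best = forward_selection(x, y)      # column indices, in order of entry
```

The procedure examines at most p(p + 1)/2 models instead of the 2^p models of an exhaustive search, at the price of a possibly suboptimal final set of variables.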
VARIABLE SELECTION IN MULTIPLE LINEAR REGRESSION
* Variable selection as an optimization problem
* Using quality criteria
* Using tests on nested models
* Exhaustive search
* Greedy search
* A naive idea
* Greedy algorithms
* Forward selection
* Forward selection
* Using the test on nested models
* Using quality criteria
* Backward selection
* Stepwise selection
* Caveat
* Suboptimal solution
* Different selection procedures
* Different starting points
* Stability of the selection
* Interpretability
* Analyst's judgement
_______________________________________________