Ridge regression

# Multiple Linear Regression and collinearity

Multiple Linear Regression is very sensitive to predictors that are nearly collinear: when this happens, the model parameters become unstable (large variances) and can no longer be interpreted. From a mathematical standpoint, near-collinearity makes the X'X matrix ill-conditioned (with X the data matrix): its determinant is nearly 0, and attempts to invert the matrix run into numerical difficulties that yield uncertain final values.
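As a quick numerical illustration (a NumPy sketch, not part of the original text), two nearly identical predictors make the condition number of X'X explode while its determinant collapses toward 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])         # near-collinear data matrix
XtX = X.T @ X

cond = np.linalg.cond(XtX)            # enormous condition number
det = np.linalg.det(XtX)              # determinant nearly 0
```

A condition number this large means that tiny perturbations of the data produce wildly different "inverses", which is exactly the numerical snag described above.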

Exact collinearity occurs when at least one of the predictors is a linear combination of the other predictors. X is then no longer a full rank matrix, the determinant of X'X is exactly 0, and inverting X'X is not just difficult but downright impossible, because the inverse matrix simply does not exist.
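A minimal check (NumPy, illustrative data): with a predictor that is an exact multiple of another, X loses full column rank and X'X becomes singular:

```python
import numpy as np

x1 = np.arange(10.0)
x2 = 2.0 * x1                                # exact linear combination of x1
X = np.column_stack([np.ones(10), x1, x2])   # 3 columns, but only rank 2
rank = np.linalg.matrix_rank(X)              # 2, not 3: rank deficient
# Attempting np.linalg.inv(X.T @ X) would fail: the inverse does not exist
```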

-----

The same happens when there are fewer observations than parameters to be estimated, a not uncommon situation. For example, a spectrum may be described by the light intensities measured at a few hundred different wavelengths (the predictors), whereas only a few dozen spectra (the observations) describing some phenomenon are available.
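The rank deficiency in this "more predictors than observations" setting can be seen directly (NumPy sketch; the dimensions are made up to mimic the spectra example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 300                       # e.g. 20 spectra, 300 wavelengths
X = rng.normal(size=(n, p))
rank = np.linalg.matrix_rank(X)      # at most n, far below p
singular = rank < p                  # so the p x p matrix X'X is singular
```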

-----

For the analyst, near-collinearity of some predictors causes a large variance (uncertainty) of the model predictions, making these predictions highly unreliable.

# Ridge Regression

Ridge Regression is a variant of ordinary Multiple Linear Regression whose goal is to circumvent the problem of predictor collinearity. It gives up Least Squares (LS) as the method for estimating the parameters of the model, and focuses instead on the X'X matrix, which is artificially modified so as to make its determinant appreciably different from 0.

By doing so, it makes the new model parameters somewhat biased (whereas the parameters calculated by the LS method are unbiased estimators of the true parameters). But the variances of these new parameters are smaller than those of the LS parameters, and in fact so much smaller that their Mean Square Errors (MSE) may also be smaller than those of the LS parameters. This illustrates the fact that a biased estimator may outperform an unbiased estimator, provided its variance is small enough.
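A small Monte Carlo sketch of this MSE comparison (NumPy; the data-generating model, the near-collinear design, and the ridge value λ = 1 are all assumptions chosen for illustration). Ridge estimates are biased but far less variable, and their MSE beats that of LS under near-collinearity:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, reps = 30, 1.0, 500
beta = np.array([1.0, 1.0])                 # "true" parameters (assumed)
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])  # near-collinear
I = np.eye(2)

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)       # fresh noise each replication
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # X'X + lam*I
    mse_ols += np.sum((b_ols - beta) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta) ** 2) / reps
```

Under this setup `mse_ridge` comes out orders of magnitude below `mse_ols`: the small bias introduced by the λI term is more than repaid by the drop in variance.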

Moreover, the predictions of the Ridge model also turn out to be more accurate than those of the LS regression model when the predictors exhibit near-collinearity. The idea behind Ridge Regression is therefore at the heart of the "bias-variance tradeoff" issue.

# Ridge parameter

These improvements do not come free.

• An extra parameter has to be introduced into the model, the "ridge parameter". Its value is assigned by the analyst, and determines how far Ridge Regression departs from LS Regression. If this value is too small, Ridge Regression cannot fight collinearity efficiently. If it is too large, the bias of the parameters becomes too large, and so do the parameter and prediction MSEs.
There is therefore an optimal value of the ridge parameter, one that theory alone cannot calculate accurately from the data. It has to be estimated by trial and error, usually resorting to cross-validation.
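A sketch of that trial-and-error search (NumPy; the grid of candidate values and the leave-one-out scheme are illustrative choices, not prescriptions from the text). Each candidate ridge value is scored by its leave-one-out prediction error, and the best scorer is kept:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])  # near-collinear
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

def loo_press(lam):
    """Sum of squared leave-one-out prediction errors for one ridge value."""
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        Xt, yt = X[mask], y[mask]
        b = np.linalg.solve(Xt.T @ Xt + lam * np.eye(2), Xt.T @ yt)
        press += (y[i] - X[i] @ b) ** 2
    return press

grid = [0.01, 0.1, 1.0, 10.0, 100.0]       # candidate ridge values (assumed)
best_lam = min(grid, key=loo_press)        # cross-validated choice
```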
• The nice results about confidence intervals and tests in LS regression are lost, and have to be replaced by complex and approximate results.

Yet, Ridge Regression is more than a "last resort" attempt to salvage LS linear regression in case of near or exact collinearity of the predictors. It should be considered a major linear regression technique in its own right, one that proves its usefulness whenever collinearity is a problem, an all-too-common circumstance.

_________________________________________________________________

 Tutorial 1

We first go over the problem of collinearity (or "multicollinearity") of the predictors, which is the curse of Multiple Linear Regression.

We then show how a simple but effective change in the method used for calculating the parameters can circumvent this problem.

We further study the statistical properties of the parameters of the Ridge Regression model, and discover that these parameters outperform the usual Least Squares parameters in a situation of near-collinearity of the predictors.

We then address the difficult problem of choosing an optimal value for the "ridge parameter".

RIDGE REGRESSION

• Collinearity
  • Interpretation of the values of the parameters
    • Geometric interpretation
    • Analytic interpretation
• Ridge Regression
  • Standardization of the variables
  • Three equivalent definitions of Ridge Regression
    • Reconditioning the X'X matrix
    • Penalizing the Sum of Squared Residuals (SSR)
    • Constraint on the length of the vector of parameters
  • Analytic solution
  • Geometric interpretation
• Statistical properties of the ridge estimator
  • Relation with the Least Squares estimator
  • Bias
  • Variance
  • Mean Square Error (MSE)
• Choosing the value of the Ridge parameter
  • Analytical solutions
    • Hoerl's solution
    • Ridge variant of Mallows' Cp
  • Graphical solutions
  • Validation
• TUTORIAL

_________________________________________________________________

 Tutorial 2

There is an unexpected and quite illuminating link between Ridge Regression and Principal Components Analysis. When the Principal Components (PCs) of the data set are used as predictors instead of the original variables, the Ridge Regression model turns out to result from a simple modification of the LS model built in the same basis: every parameter of the LS regression is "shrunk" toward 0, only slightly for the first PCs and more strongly for the last PCs. So Ridge Regression allows large variance Principal Components to have a larger influence on the final model than low variance Principal Components.
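This shrinkage can be verified numerically via the Singular Value Decomposition (a NumPy sketch on made-up centered data; λ = 2 is an arbitrary illustrative value). In the PC basis, each ridge coordinate equals the LS coordinate multiplied by d_i² / (d_i² + λ), a factor close to 1 for the leading components and smaller for the trailing ones:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
X -= X.mean(axis=0)                       # center, as ridge assumes standardized data
y = rng.normal(size=30)
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)              # one shrinkage factor per component

b_ols = Vt.T @ ((U.T @ y) / d)            # LS solution written via the SVD
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# In the PC basis, ridge = shrunken LS, coordinate by coordinate:
ok = np.allclose(Vt @ b_ridge, shrink * (Vt @ b_ols))
```

Since the singular values d are sorted in decreasing order, the factors in `shrink` decrease too: large-variance components are shrunk least.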

We finally introduce the important concept of "effective number of parameters", which is a more realistic measure of the true "flexibility" of the model than the raw number of parameters.
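A short numerical check of this concept (NumPy sketch, same illustrative setup as above): the effective number of parameters of a ridge fit is the trace of its hat matrix, which equals the sum of the shrinkage factors d_i² / (d_i² + λ); it falls below the raw number of predictors as soon as λ > 0:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
X -= X.mean(axis=0)                  # centered design, p = 4 predictors
lam = 2.0

d = np.linalg.svd(X, compute_uv=False)
df = np.sum(d**2 / (d**2 + lam))     # effective number of parameters

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(4), X.T)  # ridge hat matrix
trace_match = np.isclose(df, np.trace(H))
```

At λ = 0 the formula recovers df = p (here 4); as λ grows, df shrinks toward 0, reflecting an ever less flexible model.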

RIDGE REGRESSION AND PRINCIPAL COMPONENTS ANALYSIS

• Singular Value Decomposition (SVD)
• Ridge Regression in the singular form
  • Least Squares model
  • Ridge model: shrinkage
• Ridge Regression and Principal Components
• MSE of the parameters
• Effective number of parameters (or degrees of freedom)
• TUTORIAL

_________________________________________________________________

• Multiple Linear Regression
• Principal Components Analysis
• Singular Value Decomposition
• Bias-variance tradeoff