Multiple Linear Regression is very sensitive to near-collinearity of the predictors: when it occurs, the model parameters become unstable (they have large variances) and can therefore no longer be interpreted. From a mathematical standpoint, near-collinearity makes the X'X matrix ill-conditioned (X being the data matrix): its determinant is nearly 0, and attempts to compute the inverse of the matrix run into numerical difficulties that yield uncertain final values.
Exact collinearity occurs when at least one of the predictors is a linear combination of the other predictors. X then no longer has full rank, the determinant of X'X is exactly 0, and inverting X'X is not just difficult but downright impossible: the inverse matrix simply does not exist.
The same happens when there are fewer observations than there are parameters to be estimated, a not uncommon situation. For example, a spectrum may be described by the light intensities measured at a few hundred different wavelengths (the predictors), whereas only a few tens of spectra (the observations) describing the phenomenon of interest are available.
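As a minimal numerical sketch of this ill-conditioning (the data and names here are our own illustration, using NumPy, not an example from the chapter): two nearly collinear predictors make the determinant of X'X collapse toward 0 and its condition number explode.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two nearly collinear predictors: x2 is x1 plus a tiny perturbation.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])

XtX = X.T @ X
print(np.linalg.det(XtX))   # nearly 0
print(np.linalg.cond(XtX))  # huge condition number: X'X is ill-conditioned
```

With the perturbation set exactly to 0, x2 becomes a linear combination of x1 (exact collinearity) and the inverse of X'X no longer exists at all.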
For the analyst, near-collinearity of some predictors inflates the variance (uncertainty) of the model predictions, making these predictions highly unreliable.
Ridge Regression circumvents this problem by slightly altering the way the parameters are calculated. In doing so, it makes the new model parameters somewhat biased (whereas the parameters calculated by the Least Squares (LS) method are unbiased estimators of the true parameters). But the variances of these new parameters are smaller than those of the LS parameters, and in fact so much smaller that their Mean Square Errors (MSE) may also be smaller than those of the LS parameters. This illustrates the fact that a biased estimator may outperform an unbiased estimator, provided its variance is small enough.
Moreover, the prediction errors of the Ridge model also turn out to be smaller than those of the LS regression model when the predictors exhibit near-collinearity. The idea behind Ridge Regression therefore sits at the heart of the "bias-variance tradeoff" issue.
These improvements do not come for free: a suitable value of the ridge parameter has to be chosen, which turns out to be a delicate problem.
Yet, Ridge Regression is more than a "last resort" attempt to salvage LS linear regression in case of near or full collinearity of the predictors. It is to be considered a major linear regression technique in its own right, one that proves useful whenever collinearity is a problem, an all-too-common circumstance.
We first go over the problem of collinearity (or "multicollinearity") of the predictors, which is the curse of Multiple Linear Regression.
We then show how a simple but effective change in the method used for calculating the parameters can circumvent this problem.
We further study the statistical properties of the parameters of the Ridge Regression model, and discover that these parameters outperform the usual Least Squares parameters in a situation of near-collinearity of the predictors.
We then address the difficult problem of choosing an optimal value for the "ridge parameter".
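As a hedged sketch of the "simple but effective change" mentioned above (details and the standardization of the variables are covered later in the chapter; the function name is ours), the ridge parameters are obtained by adding a constant k to the diagonal of X'X before inverting:

```python
import numpy as np

def ridge_fit(X, y, k):
    """Ridge estimator: solve (X'X + k I) b = X'y.

    Adding k > 0 to the diagonal reconditions X'X, so the system
    remains solvable even when the predictors are near-collinear.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```

With k = 0 this reduces to the ordinary Least Squares estimator; the larger k is, the more the parameters are shrunk toward 0.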
Interpretation of the values of the parameters
Standardization of the variables
Three equivalent definitions of Ridge Regression
Reconditioning the X'X matrix
Penalizing the Sum of Squared Residuals (SSR)
Constraint on the length of the vector of parameters
Statistical properties of the ridge estimator
Relation with the Least Squares estimator
Mean Square Error (MSE)
Choosing the value of the Ridge parameter
Ridge variant of Mallows' Cp
There is an unexpected and quite illuminating link between Ridge Regression and Principal Components Analysis. When the Principal Components (PCs) of the data set are used as predictors instead of the original variables, the Ridge Regression model turns out to result from a simple modification of the LS model built in the same basis: every parameter of the LS regression is multiplied by a shrinkage factor that is close to 1 for the first PCs and smaller for the last PCs. Ridge Regression therefore allows large-variance Principal Components to have a larger influence on the final model than low-variance Principal Components.
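These shrinkage factors can be read off the singular values d_i of X (a small sketch on data we made up, anticipating the SVD section below): along each principal direction, the LS coordinate is multiplied by d_i^2 / (d_i^2 + k).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=40)  # near-collinear column
k = 1.0

# Singular values d_i of X, in decreasing order; the ridge shrinkage
# factor along the i-th principal direction is d_i^2 / (d_i^2 + k).
d = np.linalg.svd(X, compute_uv=False)
shrink = d**2 / (d**2 + k)
print(shrink)  # close to 1 for large d_i, close to 0 for small d_i
```

The directions with large singular values (large-variance PCs) are left almost untouched, while the unstable low-variance directions are strongly damped.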
We finally introduce the important concept of the "effective number of parameters", which is a more realistic measure of the true flexibility of the model than the raw number of parameters.
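As a hedged preview of that concept (the chapter derives it later; the formula below is the standard trace of the ridge "hat" matrix, and the function name is ours): the effective number of parameters is the sum of the shrinkage factors, so it decreases from the number of predictors toward 0 as the ridge parameter grows.

```python
import numpy as np

def effective_df(X, k):
    """Effective number of parameters of the ridge fit.

    Computed as the trace of X (X'X + k I)^-1 X', which equals the
    sum over singular values d_i of X of d_i^2 / (d_i^2 + k).
    """
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + k))
```

With k = 0 (ordinary LS on a full-rank X) this gives exactly the number of predictors; a heavily penalized model "spends" fewer effective parameters than it nominally contains.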
RIDGE REGRESSION AND PRINCIPAL COMPONENTS ANALYSIS
Singular Value Decomposition (SVD)
Ridge Regression in the singular form
Least Squares model
Ridge model : shrinkage
Ridge Regression and Principal Components
MSE of the parameters
Effective number of parameters (or degrees of freedom)