Collinearity

A set of variables is said to exhibit collinearity if some variables are approximate, or worse, exact linear combinations of some other variables. Therefore, collinearity is a telltale of linear redundency in the data. In a set of variables exhibiting strong collinearity, it is always the case that at least some pairs of variables are strongly correlated.

Collinearity occurs as one keeps considering more and more variables to describe one same situation : after all, there is only so much information needed to describe this situation, and adding descriptors without limit ends up adding information already present in the first descriptors.

What's wrong with that ? It certainly cannot hurt to have the same information twice in a data set, right ?

In fact, collinearity is a curse of all predictive techniques that are linear in the variables. Consequences of collinearity are :

* Instability of the numerical values of the model's parameters, that makes their interpretation impossible.

* Instability of the model as a whole, whose predictions become unreliable.

* Numerical errors in the computation of various quantities, including the parameters, usually because of very small denominators.

____________________

Many techniques have been devised to fight collinearity.

* The most basic one is to reduce the number of predictors taken into account for the construction of the model (see for example Tutorial "Multiple Linear Regression").

* But once a set of predictors has been identified, further action to combat collinearity may be undertaken. For example, Ridge Regression fights collinearity in Multiple Linear Regression. All these techniques amount to reducing the effective (as opposed to "real") number of parameters of the model by constraining the range of values that the parameters of the model can take.

* New predictors may be created, that are uncorrelated by construction (e.g., by running a PCA on the original predictors).

_______________________________________________