PLS (regression)
Multiple Linear Regression (MLR) is the simplest linear model connecting:
* Numerical input predictors,
* and a response variable y.
But MLR suffers from several serious shortcomings:
1) It cannot accommodate fields with missing data. As a consequence, many observations may have to be discarded, even though they contain valuable information in the other fields.
2) It is sensitive to collinearity between predictors. Strict collinearity makes MLR impossible, while near collinearity makes it unstable and renders the numerical values of the coefficients meaningless.
3) The MLR model is not determined when the number of predictors is larger than the number of observations.
These three conditions are often met in practice.
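Shortcomings 2 and 3 can be checked numerically. The sketch below is only an illustration on made-up random data (the sizes, the seed and the variable names are assumptions, not part of the original text). It shows that the data matrix X loses full column rank in both cases, so that X'X, which has the same rank as X, cannot be inverted and the MLR coefficients are not uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustrative data only

# Case A: strict collinearity. The third predictor is an exact linear
# combination of the first two, so X has rank 2 instead of 3.
X_a = rng.normal(size=(50, 2))
X_a = np.column_stack([X_a, X_a[:, 0] + X_a[:, 1]])
rank_a = np.linalg.matrix_rank(X_a)

# Case B: more predictors (25) than observations (10).
# The rank of X cannot exceed the number of observations.
X_b = rng.normal(size=(10, 25))
rank_b = np.linalg.matrix_rank(X_b)

# rank(X'X) equals rank(X), so in both cases X'X is singular and the
# normal equations of MLR have no unique solution.
print("case A: rank", rank_a, "for", X_a.shape[1], "predictors")
print("case B: rank", rank_b, "for", X_b.shape[1], "predictors")
```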
Many techniques have been invented to cope with the problem of missing values, but they are either cumbersome, or arbitrary and ineffective, if not dangerous. Collinearity may be countered either by orthogonalizing predictors (Principal Component Regression, or PCR), or by regularization techniques (Ridge Regression, Lasso), each of these methods having its own shortcomings.
There is no direct method to circumvent the "more predictors than observations" problem.
PLS (Partial Least Squares) regression is a technique that successfully deals with the three problems mentioned above. It may be perceived as a generalization of MLR (but also of PCR and of Canonical Analysis).
PLS regression replaces the initial space of the (many) regressors by a low-dimensional space spanned by a small number of variables called "factors", or "latent variables". Factors are built iteratively. They then become the new regressors of a classical linear regression model.
Factors are orthogonal (uncorrelated), and are linear combinations of the original regressors. In this respect, they are similar to the Principal Components of PCR. But while PCs are determined by the regressors only (with no reference to the response variable y), identifying the factors of PLS involves taking into consideration each factor's individual usefulness in predicting y, by maximizing its covariance with y while maintaining the constraint of being orthogonal to the previously determined factors.
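The iterative construction of the factors can be sketched in a few lines. The code below is a minimal NIPALS-style sketch for a single response variable, written for illustration only: the function name `pls1_factors`, the simulated data and the number of factors are assumptions, and no claim is made that this matches any particular software implementation. Each factor is a linear combination of the (deflated) regressors whose weights point in the direction of maximal covariance with y; deflating X after each step is what keeps the later factors orthogonal to the earlier ones.

```python
import numpy as np

def pls1_factors(X, y, n_factors):
    """Build PLS factors one at a time for a single response (hypothetical helper)."""
    X = X - X.mean(axis=0)           # center the regressors
    y = y - y.mean()                 # center the response
    scores = []
    for _ in range(n_factors):
        w = X.T @ y                  # weights: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = X @ w                    # factor = linear combination of the regressors
        p = X.T @ t / (t @ t)        # loadings of X on this factor
        X = X - np.outer(t, p)       # deflation: remove what the factor explains
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(1)       # arbitrary seed, illustrative data only
X = rng.normal(size=(30, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=30)

T = pls1_factors(X, y, 3)

# The factors are mutually orthogonal: T'T is diagonal up to rounding error.
gram = T.T @ T
off_diag = gram - np.diag(np.diag(gram))
print("largest off-diagonal entry of T'T:", np.abs(off_diag).max())

# The factors then serve as regressors of an ordinary least squares model.
coef, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
print("regression coefficients on the factors:", coef)
```

Because the factors are orthogonal, the final regression is numerically well behaved even when the original regressors are strongly correlated.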
At this point, it may help to refer to the (rather involved) mathematics behind PLS. First note that PLS regression can take into account several response variables yi, represented by a matrix Y. Then, if X is the (standardized) data matrix:
* PCs of PCR are the eigenvectors of the X'X matrix,
* while PLS's regression factors are eigenvectors of the Y'XX'Y matrix, where predictors and response variables are considered simultaneously.
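The two eigenvector problems above can be compared on simulated data. The sketch below is an illustration under stated assumptions: the data, sizes and variable names are invented, and only the first direction of each method is computed. It also relies on a standard linear algebra fact that is not spelled out in the text: since Y'XX'Y = (X'Y)'(X'Y), its leading eigenvector is the leading right singular vector of X'Y, and the matching weights on the predictor side are obtained by multiplying that vector by X'Y.

```python
import numpy as np

rng = np.random.default_rng(2)       # arbitrary seed, illustrative data only
n, p, q = 40, 6, 2
X = rng.normal(size=(n, p))
Y = X[:, :2] @ rng.normal(size=(2, q)) + 0.1 * rng.normal(size=(n, q))

# Standardize both blocks (zero mean, unit variance per column).
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# PCR: the first principal direction is the leading eigenvector of X'X.
# Only X is involved; Y plays no role.
_, evecs_x = np.linalg.eigh(X.T @ X)        # eigenvalues in ascending order
pc1 = evecs_x[:, -1]

# PLS: the leading eigenvector of Y'XX'Y involves X and Y simultaneously.
M = Y.T @ X @ X.T @ Y
evals_m, evecs_m = np.linalg.eigh(M)
c1 = evecs_m[:, -1]                         # weights on the response side

# Matching weights on the predictor side, normalized to unit length.
w1 = X.T @ Y @ c1
w1 = w1 / np.linalg.norm(w1)

# Cross-check with the singular value decomposition of X'Y.
U, s, Vt = np.linalg.svd(X.T @ Y)
print("|cos| between w1 and first left singular vector:", abs(w1 @ U[:, 0]))
print("largest eigenvalue vs squared singular value:", evals_m[-1], s[0] ** 2)
```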
______
PLS regression may be used even on categorical variables (predictors or response variables) through dummy binary coding into class indicators.
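As an illustration of dummy binary coding, the sketch below turns one categorical variable into class indicators, one column per category. The variable `colors` and its values are invented for the example.

```python
import numpy as np

# A hypothetical categorical variable observed on five individuals.
colors = np.array(["red", "green", "blue", "green", "red"])

# levels: sorted distinct categories; codes: index of each observation's category.
levels, codes = np.unique(colors, return_inverse=True)

# One indicator column per category: 1 if the observation belongs to it, else 0.
indicators = np.eye(len(levels), dtype=int)[codes]

print(levels)
print(indicators)
```

Each row contains exactly one 1, so the indicator columns always sum to 1. This exact collinearity is a problem for MLR but not for PLS regression.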
______
PLS regression is the main predictive modeling technique when there are more predictors than observations, when predictors are strongly correlated, or when there is a lot of missing data. It was born in the field of chemometrics, but the technique is very general, and is now widely used in many other fields (economics, medicine, psychology, etc.).