Principal Components Analysis
Principal Components Analysis (PCA) is one of the best-known and most widely used Multivariate Exploratory Analysis techniques.
Given a data set described by a set of numerical variables {x1, x2 , ..., xp}, the goal of Principal Components Analysis is to describe this data set with a smaller set of new, synthetic variables. These variables will be linear combinations of the original variables, and are called Principal Components.
Quite generally, reducing the number of variables used to describe data will lead to some loss of information. PCA operates in a way that makes this loss minimal, in a sense that will be given a precise meaning.
Therefore, PCA may be regarded as a dimensionality reduction technique.
Although the ultimate goal is to use only a small number of Principal Components, PCA first identifies p such components, that is, the same number as the number of original variables. Only later will the analyst decide on the number of Components to be retained. "Retaining k Principal Components" means "Replacing the observations by their orthogonal projections in the k-dimensional subspace spanned by the first k Principal Components".
The Principal Components define orthogonal directions in the space of observations. In other words, PCA just makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.
It will turn out that the Principal Components are pairwise uncorrelated.
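The points made so far can be checked numerically. The following is a minimal sketch, assuming NumPy, made-up random data, and the 1/n convention for the covariance matrix (none of which come from this Glossary); the function name `pca` is purely illustrative. It computes the p Principal Components, verifies that the new reference frame is orthonormal and that the components are pairwise uncorrelated, and then "retains k = 2 Principal Components" by orthogonal projection.

```python
import numpy as np

def pca(X):
    """Return (eigenvalues, directions, scores) of a PCA on the rows of X.

    Eigenvalues are sorted in decreasing order; the columns of the
    direction matrix are the corresponding orthonormal axes.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = Xc.T @ Xc / X.shape[0]            # p x p covariance matrix (1/n convention)
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # coordinates of the observations on the PCs
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of p = 3 correlated variables.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
eigvals, eigvecs, scores = pca(X)

# The new reference frame is orthonormal ...
frame_is_orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
# ... and the Principal Components are pairwise uncorrelated: their
# covariance matrix is diagonal, with the eigenvalues on the diagonal.
score_cov = scores.T @ scores / X.shape[0]
pcs_are_uncorrelated = np.allclose(score_cov, np.diag(eigvals))

# "Retaining k = 2 Principal Components": orthogonal projection of the
# observations on the subspace spanned by the first two directions.
k = 2
X_proj = scores[:, :k] @ eigvecs[:, :k].T + X.mean(axis=0)
```

Each column of `scores` is a linear combination of the centered original variables, with the weights given by the matching column of `eigvecs`.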
The fundamental property of the Principal Components is that they can be ordered by decreasing order of "importance" in the following sense :
So the fundamental property of the PCs is that the best k-dimensional projection subspace is spanned by the first k Principal Components. In other words, the optimal subspaces are nested, a strong, useful and not at all obvious property.
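As an illustration of this ordering, here is a small sketch assuming NumPy and made-up data, with "importance" measured by the variance of the projected data (an assumption consistent with the inertia discussed in Tutorial 2). It computes the share of variance carried by the first k components and compares the plane spanned by the first two components with an arbitrary random plane. The helper `projected_variance` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 300 observations of 4 variables with unequal spreads.
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order

# Share of the total variance carried by the first k Principal Components.
explained = np.cumsum(eigvals) / eigvals.sum()

def projected_variance(basis):
    """Total variance of the data projected on an orthonormal basis (columns)."""
    return np.sum((Xc @ basis) ** 2) / len(Xc)

k = 2
best = projected_variance(eigvecs[:, :k])   # equals eigvals[:k].sum()

# Any other k-dimensional subspace captures no more variance, for
# instance a randomly oriented one (orthonormalized with a QR step).
random_basis, _ = np.linalg.qr(rng.normal(size=(4, k)))
other = projected_variance(random_basis)
```

Because the same eigenvectors are reused for every k, the best plane automatically contains the best line, the best 3-dimensional subspace contains the best plane, and so on.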
PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones).
From these plots, one will try to extract information about the data structure, such as :
All multivariate modeling techniques are subject to the bias-variance tradeoff, which implies that the number of variables entering a model should be kept small. Data is often described by many more variables than necessary for building the best model. Sometimes, specific techniques exist for selecting a "good" subset of variables (see for instance Multiple Linear Regression), but dimensionality reduction techniques such as PCA may also be considered for feeding the model with a reduced number of variables. For example, Multiple Linear Regression may be replaced by a model using only a reduced number of Principal Components as regressors (Principal Components Regression).
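Principal Components Regression can be sketched in a few lines. The example below is only an illustration, assuming NumPy and simulated data in which six regressors are driven by two hidden factors; the names `latent`, `gamma`, `beta` and `predict` are hypothetical. The response is regressed on the first k component scores rather than on the p original variables.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 6, 2
# Hypothetical regressors: 6 variables driven by only 2 latent factors, plus noise.
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean

# Directions of the Principal Components (rows of Vt), by decreasing variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T                      # p x k: the first k directions
scores = Xc @ V_k                   # n x k: the k regressors actually used

# Ordinary least squares of y on the k scores instead of the p variables.
gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)

# Equivalent coefficients expressed on the original variables.
beta = V_k @ gamma

def predict(X_new):
    """Predict the response for new rows described by the original variables."""
    return y_mean + (X_new - x_mean) @ beta

# Share of the variance of y explained on the training data.
r2 = 1 - np.sum((y - predict(X)) ** 2) / np.sum((y - y_mean) ** 2)
```

The choice of k is left to the analyst; here it is set to the number of simulated factors, which would not be known in practice.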
The table describing the data with coordinates on the first k Principal Components is smaller than the original data table. Therefore, Principal Components Analysis may be used as a (lossy) data compression technique.
Although PCA was created from a purely applied perspective (data visualization), its mathematical machinery is quite general and is at the heart of other important modeling techniques. Let's mention :
PCA is just a change of orthogonal reference frame. Therefore, it relies on simple Linear Algebra as its main mathematical engine, and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a more economical data description.
In fact, PCA can be generalized in many ways, mostly based on nonlinear transforms of the original variables. This issue is not addressed in this Glossary, but the interested reader may seek information on :
Besides, Kohonen Maps may be regarded as a nonlinear dimensionality reduction technique.
______________________________________________________________
Tutorial 1
In this first Tutorial we review the main ideas behind Principal Components Analysis with no mathematics. We describe the three main steps of PCA :
* Identification of the axes on which observations should be projected to obtain a representation of the data in a low-dimensional space that is as faithful as possible.
* The same approach, but in the space of variables.
* Interpreting the projections. This phase is hard to formalize, and relies mostly on the analyst's know-how and experience.
OVERVIEW OF PRINCIPAL COMPONENTS ANALYSIS
* What does PCA do ?
* An academic case
* A barely more realistic case
* What is a "faithful" representation ?
* The "best" projection plane
* The Principal Components
* The Principal Plane
* A dual approach : PCA on variables
* Interpreting a PCA
* Other applications of PCA
___________________________________________
Tutorial 2
We then explain why not all observations are equally well represented in a low-dimensional projection subspace, and identify :
* The observations whose projections can be trusted,
* And the observations that are particularly influential in defining the projection subspaces.
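These two notions correspond to the "Squared Cosine" and the "Contribution" covered in this Tutorial. The sketch below assumes NumPy, made-up data, equal weights 1/n for all observations, and the usual definitions of these quantities, which are stated here as assumptions rather than taken from this page: the contribution of observation i to component k is its squared coordinate divided by n times the eigenvalue, and its squared cosine is its squared coordinate divided by its squared distance to the barycenter.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 50 observations of 3 variables.
X = rng.normal(size=(50, 3)) * np.array([2.0, 1.0, 0.5])
n = X.shape[0]
Xc = X - X.mean(axis=0)                 # move the origin to the barycenter

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order
scores = Xc @ eigvecs                   # coordinates on the Principal Components

# Inertia of the cloud: mean squared distance to the barycenter.
sq_dist = np.sum(Xc ** 2, axis=1)
inertia = sq_dist.mean()                # equals eigvals.sum()

# Contribution of observation i to component k (each column sums to 1).
contrib = scores ** 2 / (n * eigvals)

# Squared cosine of observation i on component k (each row sums to 1).
cos2 = scores ** 2 / sq_dist[:, None]

# Quality of representation of each observation on the first factorial plane.
quality_plane = cos2[:, :2].sum(axis=1)
```

An observation far from the barycenter may contribute a lot to an axis while still having a modest squared cosine on it, so the two indicators are worth reading side by side.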
INERTIA AND PROJECTION OF OBSERVATIONS
* The concept of "inertia"
* Inertia of a point
* Inertia of a cloud of points
* Decomposition of inertia
* Maximizing the projected inertia
* Minimizing the spread around the best plane
* The Principal Components
* The First Two Principal Components
* All the Principal Components
* What have we gained ?
* Projection of the observations
* The barycenter
* Contribution of an observation to a Principal Component
* Quality of representation, "Squared Cosine"
* Are "high Contribution" and "high Squared Cosine" equivalent ?
_________________________________________________________
Tutorial 3
The analyst is just as interested in variables as in observations. In particular, it is expected that analyzing data should allow the easy discovery of groups of variables that are strongly pairwise correlated. Such groupings may be detected by a cautious and tedious examination of the data correlation matrix, but Principal Components Analysis allows the detection of such groupings visually.
For this purpose, the same search that was conducted in the space of observations is now conducted in the space of variables, which is, in a sense, the dual of the space of observations. Variables will then be represented as points in projection planes, and, provided that the quality of these projections is satisfactory, close "variable points" will represent strongly correlated variables. Anti-correlated and uncorrelated pairs of variables may also be visualized, and therefore easily detected.
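The representation of variables as points can be sketched as follows. This is an illustration only, assuming NumPy, simulated data, and a PCA on standardized variables (that is, on the correlation matrix), in which case the coordinate of a variable on an axis is its correlation with the corresponding Principal Component. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
a, b = rng.normal(size=n), rng.normal(size=n)
# Hypothetical variables: x0 and x1 strongly correlated, x2 anti-correlated
# with both, x3 unrelated to the others.
X = np.column_stack([a,
                     a + 0.2 * rng.normal(size=n),
                     -a + 0.2 * rng.normal(size=n),
                     b])

Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized variables
R = Z.T @ Z / n                              # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order

# Coordinates of the variables on the axes: row j holds the correlations
# between variable j and each Principal Component.
coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Cross-check against correlations computed directly from the scores.
scores = Z @ eigvecs
direct = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                    for k in range(4)] for j in range(4)])

# Quality of representation of each variable on the first plane: squared
# distance to the origin, at most 1 (the Correlation Circle).
quality = np.sum(coords[:, :2] ** 2, axis=1)
```

In the full space, the scalar product of two rows of `coords` is exactly the correlation of the two variables, which is why well-represented variables that plot close together are strongly correlated and those that plot opposite each other are anti-correlated.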
PRINCIPAL COMPONENTS ANALYSIS ON VARIABLES
* The space of variables
* Why the space of variables ?
* "Distance" between variables
* PCA on variables
* The Principal Components for variables (or "axes")
* The first two axes
* The other axes
* Axes and PCs
* Coordinates of the variables, loadings
* Contribution of a variable to an axis
* Plot of variables
* The Correlation Circle
* Quality of the representation of a variable
* On the projection plane
* On an axis
* Correlation of variables
* Contribution of a variable to an axis
* Simultaneous projections ?
_________________________________________________________
Tutorial 4
The goal of Exploratory Analysis is to allow the analyst to understand the underlying structure of data just as if they could "see" the data directly in its natural high-dimensional space. As this is not possible, PCA will project the data (observations or variables) on factorial planes, each plane being defined by two factors as determined by PCA.
The best projection planes have been identified by PCA : they are spanned by the low order factors.
A good deal of experience and know-how is needed to extract valuable information from the numbers and projection diagrams produced by PCA : this is the interpretation phase.
INTERPRETING PRINCIPAL COMPONENTS ANALYSIS
* The data
* Quality of the Principal Components
* Interpretation of the Principal Components
* The plot of observations
* Origin of the plot
* Moving along the Principal Components
* Half plots
* General distribution of the observations
* Higher order Principal Components
* Quality of the PCA
* Eigenvalues
* Communalities
___________________________________________________________
Tutorial 5
We finally touch upon two other applications of Principal Components Analysis :
* Because it reduces the number of variables needed to describe data, PCA can be regarded as a (lossy) data compression technique. Data may be summarized by the coordinates of the observations on the first few factors, and therefore be "compressed". It is possible to reconstruct an approximation of the complete data from this partial description. The reconstruction is of course not perfect.
* All models are sensitive to the bias-variance tradeoff, which requires the model to incorporate as few variables as possible (for a given amount of information injected into the model). Because PCA reduces the number of variables with minimum loss of information, it can be used as a data pre-processing tool before building another model.
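The compression and reconstruction described in the first item can be sketched as follows, assuming NumPy and simulated data in which ten variables lie close to a 3-dimensional subspace (an assumption chosen so that the compression is visibly effective). The sketch keeps the coordinates on the first k components, rebuilds the data from them, and measures what was lost.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 500, 10, 3
# Hypothetical data: 10 variables that mostly live in a 3-dimensional subspace.
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))

mean = X.mean(axis=0)
Xc = X - mean
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order

# Compression: keep the n x k coordinates, plus the k directions and the mean.
V_k = eigvecs[:, :k]
compressed = Xc @ V_k                      # n x k instead of n x p

# Reconstruction: map the k coordinates back to the original p variables.
X_hat = compressed @ V_k.T + mean

# The mean squared reconstruction error per observation equals the
# variance carried by the discarded components.
mse = np.sum((X - X_hat) ** 2) / n
lost = eigvals[k:].sum()

# Count of numbers stored after compression, to compare with X.size.
stored = compressed.size + V_k.size + mean.size
```

The identity between `mse` and `lost` is what makes the eigenvalues a natural guide when deciding how many Principal Components to keep.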
OTHER APPLICATIONS OF PCA
* Data Compression
* Why can PCA compress data ?
* How many Principal Components should be kept ?
* Data Reconstruction
* Data Pre-processing
* Dimensionality reduction
* PCs are uncorrelated
* How many Principal Components should be kept ? (revisited)
__________________________________
Related readings