
Principal Components Analysis

Principal Components Analysis (PCA) is one of the best known and most widely used Multivariate Exploratory Analysis techniques.

# Goal of Principal Components Analysis

Given a data set described by a set of numerical variables {x1, x2, ..., xp}, the goal of Principal Components Analysis is to describe this data set with a smaller set of new, synthetic variables. These variables will be linear combinations of the original variables, and are called Principal Components.

Quite generally, reducing the number of variables used to describe data will lead to some loss of information. PCA operates so as to make this loss as small as possible, in a sense that will later be made precise.

Therefore, PCA may be regarded as a dimensionality reduction technique.

# Properties of the Principal Components

## Number

Although the ultimate goal is to use only a small number of Principal Components, PCA first identifies p such components, that is, the same number as the number of original variables. Only later will the analyst decide on the number of Components to be retained. "Retaining k Principal Components" means "Replacing the observations by their orthogonal projections in the k-dimensional subspace spanned by the first k Principal Components".
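The Glossary itself contains no code, but the operation "retaining k Principal Components" can be sketched in a few lines of NumPy (the toy data and variable names below are ours, not part of the Glossary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 observations, p = 4 variables
Xc = X - X.mean(axis=0)                # center the data

# The eigendecomposition of the covariance matrix yields all p components at once
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# "Retaining k Principal Components": project the observations onto the
# subspace spanned by the first k components
k = 2
scores = Xc @ eigvecs[:, :k]           # coordinates in the k-dimensional subspace
print(scores.shape)                    # (100, 2)
```

The analyst chooses k only after inspecting all p components; the projection itself is a single matrix product.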

## Orthogonality of the Principal Components

The Principal Components define orthogonal directions in the space of observations. In other words, PCA just performs a change of orthogonal reference frame, the original variables being replaced by the Principal Components.

## Uncorrelatedness of the Principal Components

It will turn out that the Principal Components are pairwise uncorrelated.
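This property is easy to check numerically; a minimal NumPy illustration (the synthetic correlated data is our own example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix independent columns to obtain correlated original variables
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs                  # coordinates on all Principal Components

# The covariance matrix of the scores is diagonal: the components are
# pairwise uncorrelated even though the original variables were not
cov_scores = np.cov(scores, rowvar=False)
off_diag = cov_scores - np.diag(np.diag(cov_scores))
print(np.allclose(off_diag, 0))        # True
```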

## Ordering the Principal Components, optimal projection subspaces

The fundamental property of the Principal Components is that they can be ordered in decreasing order of "importance", in the following sense:

• If the analyst decides to describe the data with only k (k < p) linear combinations of the original variables and yet discard as little information as possible in the process, then these k linear combinations have to be the first k Principal Components.

So the fundamental property of the PCs is that the best k-dimensional projection subspace is spanned by the first k Principal Components. In other words, the optimal subspaces are nested, a strong, useful and not at all obvious property.

# Applications of Principal Components Analysis

## Exploratory data analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones).

From these plots, one will try to extract information about the structure of the data, such as:

• The detection of outliers (observations that are very different from the bulk of the data).
• The identification of clusters that suggest that several subpopulations might coexist within the data set.
• Interpretation of the Principal Components. Whereas the original variables have a "native" interpretation, Principal Components derive from a mathematical definition. A successful PCA will allow interpreting the Principal Components in terms of realistic, if not measured, properties of the observations. When this is possible, it is sometimes said that PCA has revealed the existence of "latent variables".

## Data preprocessing, dimensionality reduction

All multivariate modeling techniques are subject to the bias-variance tradeoff, which states that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. Sometimes, specific techniques exist for selecting a "good" subset of variables (see for instance Multiple Linear Regression), but dimensionality reduction techniques such as PCA may also be considered for feeding the model with a reduced number of variables. For example, Multiple Linear Regression may be replaced by a model using only a reduced number of Principal Components as regressors (Principal Components Regression).
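Principal Components Regression can be sketched as follows; this is a hedged NumPy toy example of the mechanism (the data, the choice k = 3 and all names are ours), not a recommended modeling recipe:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 10, 3                   # many variables, few components retained
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1]][:, :k]   # first k components

# The k PC scores replace the p raw regressors in an ordinary least-squares fit
Z = Xc @ V
beta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
y_hat = y.mean() + Z @ beta
```

The regression now estimates only k coefficients instead of p, which is the point of the preprocessing; whether the first k components actually carry the information relevant to y is a separate question.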

## Data compression and data reconstruction

The table describing the data with coordinates on the first k Principal Components is smaller than the original data table. Therefore, Principal Components Analysis may be used as a (lossy) data compression technique.
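As an illustration of the compression/reconstruction round trip, here is a small NumPy sketch (toy data and the choice k = 3 are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1]]

k = 3
scores = Xc @ V[:, :k]                 # compressed table: 50 x 3 instead of 50 x 8
X_rec = X.mean(axis=0) + scores @ V[:, :k].T   # lossy reconstruction

err_k = np.mean((X - X_rec) ** 2)      # > 0: some information was discarded
err_p = np.mean((X - (X.mean(axis=0) + (Xc @ V) @ V.T)) ** 2)
print(err_p < 1e-10)                   # True: keeping all p components is lossless
```

Only the k-column score table and the k retained directions need to be stored; the reconstruction error grows as k decreases.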

## PCA incorporated in other techniques

Although created from a downright applicative perspective (data visualization), the mathematical machinery of PCA is quite general and lies at the heart of other important modeling techniques.

# Generalizations of Principal Components Analysis

PCA is just a change of orthogonal reference frame. Therefore, it relies on simple Linear Algebra as its main mathematical engine, and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a more economical data description.

In fact, PCA can be generalized in many ways, mostly based on non linear transforms of the original variables. This issue is not addressed in this Glossary, but the interested reader may seek information on:

• Independent Components Analysis (ICA), which generates new variables that are not just uncorrelated (as the Principal Components are), but genuinely independent.
• Curvilinear Components Analysis, a non linear projection technique whose priority is to respect the distances between observations.
• PCA on latent variables, which describes the data by linear combinations of a small number of latent (unobserved) variables.
• Kernel-based PCA, that sends data into a high-dimensional space by an appropriate non-linear projection, and then performs an ordinary PCA in this high-dimensional space.

Besides, Kohonen Maps may be regarded as a non linear dimensionality reduction technique.

______________________________________________________________

 Tutorial 1

In this first Tutorial we review the main ideas behind Principal Components Analysis with no mathematics. We describe the three main steps of PCA:

* Identification of the axes on which observations should be projected so as to obtain as faithful a representation of the data as possible in a low-dimension space.

* The same approach, but in the space of variables.

* Interpreting the projections. This phase is hard to formalize, and relies mostly on the analyst's know-how and experience.

OVERVIEW OF PRINCIPAL COMPONENTS ANALYSIS

* What does PCA do?
* An academic case
* A barely more realistic case
* What is a "faithful" representation?
* The "best" projection plane
* The Principal Components
* The Principal Plane
* A dual approach: PCA on variables
* Interpreting a PCA
* Other applications of PCA

___________________________________________

 Tutorial 2

We now describe how the "best" projection subspaces are identified for projecting the cloud of individuals. These subspaces are nested: the best k-dimensional subspace is contained in the best subspace of dimension k' for any k' > k.

We then explain why all observations are not equally well represented in a low-dimension projection subspace, and identify:

* The observations whose projections can be trusted,

* And the observations that are particularly influential in defining the projection subspaces.

INERTIA AND PROJECTION OF OBSERVATIONS

* The concept of "inertia"
* Inertia of a point
* Inertia of a cloud of points
* Decomposition of inertia
* Maximizing the projected inertia
* Minimizing the spread around the best plane
* The Principal Components
* The First Two Principal Components
* All the Principal Components
* What have we gained?
* Projection of the observations
* The barycenter
* Contribution of an observation to a Principal Component
* Quality of representation, "Squared Cosine"
* Are "high Contribution" and "high Squared Cosine" equivalent?
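The quantities introduced in this Tutorial (total inertia, contributions, squared cosines) all have short numerical expressions; a NumPy sketch on toy data of our own making:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
Xc = X - X.mean(axis=0)                # the barycenter becomes the origin
n = len(Xc)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]
scores = Xc @ V

# Decomposition of inertia: the total inertia of the cloud equals the sum
# of the eigenvalues (the variances carried by the Principal Components)
total_inertia = np.trace(np.cov(Xc, rowvar=False))
print(np.isclose(total_inertia, eigvals.sum()))   # True

# Contribution of observation i to component j (each column sums to 1)
contrib = scores ** 2 / ((n - 1) * eigvals)

# Squared cosine: quality of representation of each observation on each axis;
# over all p axes, the squared cosines of an observation sum to 1
sq_cos = scores ** 2 / (Xc ** 2).sum(axis=1, keepdims=True)
print(np.allclose(sq_cos.sum(axis=1), 1))         # True
```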

_________________________________________________________

 Tutorial 3

The analyst is just as interested in variables as in observations. In particular, it is expected that analyzing data should allow the easy discovery of groups of variables that are strongly pairwise correlated. Such groupings may be detected by a cautious and tedious examination of the data correlation matrix, but Principal Components Analysis allows the detection of such groupings visually.

For this purpose, the same search that was conducted in the space of observations is now conducted in the space of variables, which is in a sense dual to the space of observations. Variables will then be represented as points in projection planes, and, provided that the quality of these projections is satisfactory, close "variable points" will represent strongly correlated variables. Anti-correlated and uncorrelated pairs of variables may also be visualized, and therefore easily detected.
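The coordinates of the "variable points" on the first two axes are their correlations with the corresponding Principal Components, so they fall inside the unit circle (the Correlation Circle). A minimal NumPy sketch, in which two of our three toy variables are built from the same underlying factor and therefore land close together:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = rng.normal(size=n)
# Two variables driven by the same factor t are strongly correlated; the
# third is independent noise
X = np.column_stack([t + 0.1 * rng.normal(size=n),
                     t + 0.1 * rng.normal(size=n),
                     rng.normal(size=n)])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized PCA

eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Xs @ eigvecs[:, order]

# Coordinates of each variable on the first two axes: its correlations
# with the first two Principal Components
coords = np.array([[np.corrcoef(X[:, j], scores[:, a])[0, 1]
                    for a in (0, 1)] for j in range(3)])
print(np.all((coords ** 2).sum(axis=1) <= 1 + 1e-9))   # True: inside the circle
```

On this plane, the first two variable points nearly coincide, which is exactly the visual signature of a strongly correlated pair.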

PRINCIPAL COMPONENTS ANALYSIS ON VARIABLES

* The space of variables
* Why the space of variables?
* "Distance" between variables
* PCA on variables
* The Principal Components for variables (or "axes")
* The first two axes
* The other axes
* Axes and PCs
* Coordinates of the variables, loadings
* Contribution of a variable to an axis
* Plot of variables
* The Correlation Circle
* Quality of the representation of a variable
* On the projection plane
* On an axis
* Correlation of variables
* Simultaneous projections?

_________________________________________________________

 Tutorial 4

The goal of Exploratory Analysis is to allow the analyst to understand the underlying structure of the data just as if it could be "seen" directly in its natural high-dimension space. As this is not possible, PCA projects the data (observations or variables) on factorial planes, each plane being defined by two factors as determined by PCA.

The best projection planes have been identified by PCA: they are spanned by the low-order factors.


A good deal of experience and know-how are needed to extract valuable information from numbers and projection diagrams as determined by PCA : this is the interpretation phase.

INTERPRETING PRINCIPAL COMPONENTS ANALYSIS

* The data
* Quality of the Principal Components
* Interpretation of the Principal Components
* The plot of observations
* Origin of the plot
* Moving along the Principal Components
* Half plots
* General distribution of the observations
* Higher order Principal Components
* Quality of the PCA
* Eigenvalues
* Communalities

___________________________________________________________

 Tutorial 5

We finally touch upon two other applications of Principal Components Analysis:

* Because it reduces the number of variables needed to describe data, PCA can be regarded as a (lossy) data compression technique. Data may be summarized by the coordinates of the observations on the first few factors, and therefore be "compressed". It is possible to reconstruct the complete data from this partial description. The reconstruction is of course not perfect.

* All models are sensitive to the bias-variance tradeoff, which requires the model to incorporate as few variables as possible (for a given amount of information injected into the model). Because PCA reduces the number of variables with minimum loss of information, it can be used as a data pre-processing tool before building another model.

OTHER APPLICATIONS OF PCA

* Data Compression
* Why can PCA compress data?
* How many Principal Components should be kept?
* Data Reconstruction
* Data Pre-processing
* Dimensionality reduction
* PCs are uncorrelated
* How many Principal Components should be kept? (revisited)

_________________________________________________
