Tutorials

Principal Components Analysis

Principal Components Analysis (PCA) is one of the best known and most widely used Multivariate Exploratory Analysis techniques.

Goal of Principal Components Analysis

Given a data set described by a set of numerical variables {x1, x2, ..., xp}, the goal of Principal Components Analysis is to describe this data set with a smaller set of new, synthetic variables. These variables are linear combinations of the original variables, and are called Principal Components.

Quite generally, reducing the number of variables used to describe data will lead to some loss of information. PCA operates in a way that makes this loss minimal, in a sense that will be given a precise meaning.

Therefore, PCA may be regarded as a dimensionality reduction technique.
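
As an illustration of the paragraphs above, here is a minimal sketch of how Principal Components can be computed with NumPy, assuming the usual construction from the eigenvectors of the covariance matrix. The function name `principal_components` and the simulated data are hypothetical and serve only this example.

```python
import numpy as np

def principal_components(X):
    """Return (eigenvalues, directions, scores) for the rows of X."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # each column is a linear combination of the x_j
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations, 3 variables, two of them strongly related
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

eigvals, directions, scores = principal_components(X)
explained = eigvals / eigvals.sum()          # share of the total variance per component
print(explained)
```

Each column of `scores` is one Principal Component evaluated on the observations, and `explained` gives a first idea of how few components might be enough.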

Properties of the Principal Components

Number

Although the ultimate goal is to use only a small number of Principal Components, PCA first identifies p such components, that is, the same number as the number of original variables. Only later will the analyst decide on the number of Components to be retained. "Retaining k Principal Components" means "Replacing the observations by their orthogonal projections in the k-dimensional subspace spanned by the first k Principal Components".

Orthogonality of the Principal Components

The Principal Components define orthogonal directions in the space of observations. In other words, PCA simply performs a change of orthogonal reference frame, the original variables being replaced by the Principal Components.

Uncorrelatedness of the Principal Components

It will turn out that the Principal Components are pairwise uncorrelated.
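
Both properties (orthogonal directions, uncorrelated components) can be checked numerically. The sketch below assumes that the principal directions are obtained as the right singular vectors of the centered data table; the simulated data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 300 observations, 4 variables
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Right singular vectors of the centered data are the principal directions
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T
scores = Xc @ V                              # coordinates in the new reference frame

gram = V.T @ V                               # identity if the directions are orthonormal
score_cov = np.cov(scores, rowvar=False)     # diagonal if the components are uncorrelated
off_diagonal = score_cov - np.diag(np.diag(score_cov))

print(np.allclose(gram, np.eye(4)))
print(np.allclose(off_diagonal, 0.0))
```

Note that the two checks are distinct: the first one concerns the directions in the space of observations, the second one concerns the new variables themselves.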

Ordering the Principal Components, optimal projection subspaces

The fundamental property of the Principal Components is that they can be ordered by decreasing "importance" in the following sense : for every k, the best k-dimensional projection subspace is spanned by the first k Principal Components. In other words, the optimal subspaces are nested, a strong, useful and not at all obvious property.
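
The optimality of the subspace spanned by the first k Principal Components can be probed empirically. The following sketch, which assumes that "best" means "maximizing the sum of squared lengths of the projections", compares that subspace with randomly drawn k-dimensional subspaces. The helper `projected_inertia` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 150 observations, 5 variables with unequal spreads
X = rng.normal(size=(150, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def projected_inertia(Xc, basis):
    """Sum of squared lengths of the projections onto an orthonormal basis (columns)."""
    return np.sum((Xc @ basis) ** 2)

k = 2
pca_inertia = projected_inertia(Xc, Vt[:k].T)

# Compare with random k-dimensional subspaces (orthonormalized with QR)
random_inertias = []
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
    random_inertias.append(projected_inertia(Xc, Q))

print(pca_inertia >= max(random_inertias))
```

A random search is of course no proof, but it shows what the property claims: no other plane retains more of the spread of the cloud than the one spanned by the first two Principal Components.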

Applications of Principal Components Analysis

Exploratory data analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones).

From these plots, one will try to extract information about the data structure, such as :

Data preprocessing, dimensionality reduction

All multivariate modeling techniques are subject to the bias-variance tradeoff, which implies that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. Sometimes, specific techniques exist for selecting a "good" subset of variables (see for instance Multiple Linear Regression), but dimensionality reduction techniques such as PCA may also be considered for feeding the model with a reduced number of variables. For example, Multiple Linear Regression may be replaced by a model using only a reduced number of Principal Components as regressors (Principal Components Regression).
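
A minimal sketch of Principal Components Regression follows, assuming ordinary least squares on the coordinates of the observations on the first k components. The simulated data (6 observed variables driven by 2 latent factors) and the choice k = 2 are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 6, 2
# Hypothetical data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = latent @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=n)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T                       # first k principal directions
scores = Xc @ V_k                    # k regressors instead of p

# Ordinary least squares on the scores
gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
beta = V_k @ gamma                   # same model expressed on the original variables

y_hat = y_mean + Xc @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_mean) ** 2)
print(round(r2, 3))
```

The regression estimates only k coefficients (`gamma`), yet the model can still be applied to the original variables through `beta`.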

Data compression and data reconstruction

The table describing the data with coordinates on the first k Principal Components is smaller than the original data table. Therefore, Principal Components Analysis may be used as a (lossy) data compression technique.
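
The compression idea can be sketched as follows, assuming that the compressed representation consists of the coordinates on the first k components, the k principal directions and the mean of each variable. The data and the value k = 3 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 100, 8, 3
# Hypothetical data: 8 variables that are nearly determined by 3 underlying factors
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, p)) + 0.01 * rng.normal(size=(n, p))

mean = X.mean(axis=0)
Xc = X - mean
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T

compressed = Xc @ V_k                      # n x k table of coordinates
X_rec = mean + compressed @ V_k.T          # approximate reconstruction of the n x p table

stored = compressed.size + V_k.size + mean.size   # numbers needed to rebuild the data
error = np.sum((X - X_rec) ** 2)                  # total squared reconstruction error
discarded = np.sum(s[k:] ** 2)                    # inertia carried by the dropped components

print(stored, X.size)
print(error, discarded)
```

The reconstruction error equals the inertia carried by the discarded components, which is why the compression is lossy but the loss is as small as possible for a given k.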

PCA incorporated in other techniques

Although PCA was created for a purely practical purpose (data visualization), its mathematical machinery is quite general and is at the heart of other important modeling techniques. Let's mention :

Generalizations of Principal Components Analysis

PCA is just a change of orthogonal reference frame. Therefore, it relies on simple Linear Algebra as its main mathematical engine, and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a more economical data description.

In fact, PCA can be generalized in many ways, mostly based on nonlinear transforms of the original variables. This issue is not addressed in this Glossary, but the interested reader may seek information on :

Besides, Kohonen Maps may be regarded as a nonlinear dimensionality reduction technique.

______________________________________________________________

 

 

Tutorial 1

 

In this first Tutorial we review the main ideas behind Principal Components Analysis with no mathematics. We describe the three main steps of PCA :

    * Identification of the axes on which observations should be projected to obtain as faithful a representation of the data as possible in a low-dimensional space.

    * The same approach, but in the space of variables.

    * Interpreting the projections. This phase is hard to formalize, and relies mostly on the analyst's know-how and experience.

 

  

OVERVIEW OF PRINCIPAL COMPONENTS ANALYSIS

What does PCA do ?

An academic case

A barely more realistic case

What is a "faithful" representation ?

The "best" projection plane

The Principal Components

The Principal Plane

A dual approach : PCA on variables

Interpreting a PCA

Other applications of PCA

TUTORIAL

___________________________________________

 

Tutorial 2

 

We now describe how the "best" projection subspaces are identified for projecting the cloud of individuals. These subspaces are nested : the best k-dimensional subspace is inside the best subspace of dimension k' for any k' > k.

We then explain why not all observations are equally well represented in a low-dimensional projection subspace, and identify :

    * The observations whose projections can be trusted,

    * And the observations that are particularly influential in defining the projection subspaces.
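
The two diagnostics listed above can be sketched numerically. The code below assumes equally weighted observations and the common definitions: the Contribution of an observation to a component is its share of the inertia of that component, and its Squared Cosine is the share of its own squared distance to the barycenter carried by that component. The data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: 50 observations, 4 variables with unequal spreads
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)            # origin moved to the barycenter
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                 # coordinates on all Principal Components

# Contribution of observation i to component j (each column sums to 1)
contributions = scores ** 2 / np.sum(scores ** 2, axis=0)

# Squared cosine of observation i on component j (each row sums to 1)
squared_cosines = scores ** 2 / np.sum(Xc ** 2, axis=1, keepdims=True)

# Quality of representation of each observation on the first principal plane
quality_plane = squared_cosines[:, :2].sum(axis=1)

print(contributions[:, 0].argmax())   # most influential observation for the first component
print(quality_plane.min())            # worst represented observation on the plane
```

An observation close to the barycenter can have a high Squared Cosine and a low Contribution, so the two indicators answer different questions.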

 

 

INERTIA AND PROJECTION OF OBSERVATIONS

The concept of "inertia"

Inertia of a point

Inertia of a cloud of points

Decomposition of inertia

Maximizing the projected inertia

Minimizing the spread around the best plane

The Principal Components

The First Two Principal Components

All the Principal Components

What have we gained ?

 Projection of the observations

The barycenter

Contribution of an observation to a Principal Component

Quality of representation, "Squared Cosine"

Are "high Contribution" and "high Squared Cosine"  equivalent ?

 

TUTORIAL

_________________________________________________________

 

 

Tutorial 3

 

The analyst is just as interested in variables as in observations. In particular, it is expected that analyzing data should allow the easy discovery of groups of variables that are strongly pairwise correlated. Such groupings may be detected by a cautious and tedious examination of the data correlation matrix, but Principal Components Analysis allows the detection of such groupings visually.

For this purpose, the same search that was conducted in the space of observations is now conducted in the space of variables, which is in a sense the dual of the space of observations. Variables will then be represented as points in projection planes, and, provided that the quality of these projections is satisfactory, close "variable points" will represent strongly correlated variables. Anti-correlated and uncorrelated pairs of variables may also be visualized, and therefore easily detected.
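
The coordinates of the "variable points" can be sketched as follows, assuming a PCA on standardized variables, in which case the coordinate of a variable on an axis is its correlation with the corresponding component. The four simulated variables are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
a, b = rng.normal(size=n), rng.normal(size=n)
# Hypothetical variables: v0 and v1 strongly correlated,
# v2 anti-correlated with v0, v3 unrelated to the others
X = np.column_stack([a, a + 0.2 * rng.normal(size=n), -a + 0.2 * rng.normal(size=n), b])

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardized variables
R = Z.T @ Z / n                               # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Coordinate of variable j on axis k = correlation between variable j and component k
coords = eigvecs * np.sqrt(eigvals)
plane = coords[:, :2]                         # points to draw inside the unit circle
quality = np.sum(plane ** 2, axis=1)          # squared distance to the origin, at most 1

print(np.round(plane, 2))
print(np.round(quality, 2))
```

Variables whose points lie close to the unit circle are well represented on the plane; only for those can proximity be read as correlation.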

 

 

PRINCIPAL COMPONENTS ANALYSIS ON VARIABLES

The space of variables

Why the space of variables ?

"Distance" between variables

PCA on variables

The Principal Components for variables (or "axes")

The first two axes

The other axes

Axes and PCs

Coordinates of the variables, loadings

Contribution of a variable to an axis

Plot of variables

The Correlation Circle

Quality of the representation of a variable

On the projection plane

On an axis

Correlation of variables

Contribution of a variable to an axis

Simultaneous projections ?

 

TUTORIAL

 _________________________________________________________

 

Tutorial 4

 

The goal of Exploratory Analysis is to allow the analyst to understand the underlying structure of data just as if they could directly "see" the data in its natural high-dimensional space. As this is not possible, PCA will project the data (observations or variables) on factorial planes, each plane being defined by two factors as determined by PCA.

The best projection planes have been identified by PCA : they are spanned by the low order factors.

-----

A good deal of experience and know-how are needed to extract valuable information from numbers and projection diagrams as determined by PCA : this is the interpretation phase.

 

 

INTERPRETING PRINCIPAL COMPONENTS ANALYSIS

The data

Quality of the Principal Components

Interpretation of the Principal Components

The plot of observations

Origin of the plot

Moving along the Principal Components

Half plots

General distribution of the observations

Higher order Principal Components

Quality of the PCA

Eigenvalues

Communalities

TUTORIAL

 ___________________________________________________________

 

 

Tutorial 5

 

 We finally touch upon two other applications of Principal Components Analysis :

    * Because it reduces the number of variables needed to describe data, PCA can be regarded as a (lossy) data compression technique. Data may be summarized by the coordinates of the observations on the first few factors, and therefore be "compressed". It is possible to reconstruct the complete data from this partial description. The reconstruction is of course not perfect.

    * All models are sensitive to the bias-variance tradeoff, which requires the model to incorporate as few variables as possible (for a given amount of information injected into the model). Because PCA reduces the number of variables with minimum loss of information, it can be used as a data pre-processing tool before building another model.
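
Both applications require deciding how many components to keep. One common heuristic, shown here only as a sketch, is to keep the smallest number of components whose cumulative share of the total variance reaches a given threshold. The function name `n_components_for`, the 90% threshold and the simulated data are assumptions made for the example, not a rule stated in these Tutorials.

```python
import numpy as np

def n_components_for(X, threshold=0.90):
    """Smallest k whose first k components carry at least `threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, in decreasing order
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Small tolerance so that a threshold of 1.0 is not missed because of rounding
    k = int(np.searchsorted(cumulative, threshold - 1e-12) + 1)
    return min(k, len(s)), cumulative

rng = np.random.default_rng(7)
# Hypothetical data: 10 variables that are nearly determined by 2 underlying factors
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(300, 10))

k, cumulative = n_components_for(X, threshold=0.90)
print(k, cumulative[:3])
```

The threshold is a judgment call: a higher value gives a more faithful description, at the price of more variables in the compressed table or in the downstream model.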

 

 

OTHER APPLICATIONS OF PCA

Data Compression

Why can PCA compress data ?

How many Principal Components should be kept ?

Data Reconstruction

Data Pre-processing

Dimensionality reduction

PCs are uncorrelated

How many Principal Components should be kept ? (revisited) 

TUTORIAL

 

_________________________________________________

 

Related readings

Covariance matrix
