Correspondence Analysis

The goal of Correspondence Analysis

Visualizing the coupling between two numerical variables is easy: a simple scatter plot provides a wealth of information about the interaction between the two variables. Unfortunately, scatter plots do not carry over to categorical variables.

Correspondence Analysis is a technique that generates graphical representations of the interactions between modalities (or "categories") of two categorical variables. It allows the visual discovery and interpretation of these interactions, that is, of the departure from independence of the two variables.

The mechanics of Correspondence Analysis

The approach of Correspondence Analysis is not unlike that of Principal Components Analysis: "factors" are defined that allow 2D representations of the modalities as points in "factor planes" with as little distortion as possible.

In these factor planes, it is expected that pairwise distances are indicative of the tendency of modalities to "attract" or "repel" each other. For example, if the 2 variables are:

    * V1, with 3 modalities M1, M2, M3,

    * and V2, with 2 modalities N1 and N2,

 

then M1 being close to N2 would be an indication that more observations chose the pair (M1, N2) than the hypothesis of independence between V1 and V2 would lead us to expect.
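To make the comparison with independence concrete, here is a minimal Python sketch. The counts are hypothetical and chosen only for illustration. Under independence, the expected count of a pair is (row total x column total) / grand total, and a pair "attracts" when its observed count exceeds that value.

```python
# Hypothetical 3x2 contingency table: rows are M1, M2, M3; columns are N1, N2.
observed = [
    [10, 30],
    [20, 20],
    [30, 10],
]


def expected_counts(table):
    """Counts expected in each cell if the two variables were independent."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)

# Ratio observed / expected: above 1 suggests attraction, below 1 repulsion.
ratios = [
    [o / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

print(ratios[0][1])  # ratio for the pair (M1, N2)
```

With these made-up counts, the pair (M1, N2) is observed more often than independence would predict, which is the situation that a short distance between M1 and N2 on the plot is meant to suggest.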

Correspondence Analysis also interprets distances between two modalities of the same variable in terms of the similarity of their compositions across the modalities of the other variable.

 

CA's mathematical machinery is a bit cumbersome. Although somewhat similar to that of Principal Components Analysis, it requires some preprocessing of the data (the contingency table), and redefining the notion of "distance" from the usual "Euclidean" sense to that of "Chi-square distance".

 

Interpreting a CA diagram is also something of a black art, although it is supported by a well-defined methodology. As it is, CA is an irreplaceable tool for a quick and relatively safe interpretation of a large contingency table.

Generalization of Correspondence Analysis

CA generalizes to more than two nominal variables. It is then called "Multiple Correspondence Analysis" (MCA). Another, equivalent approach to the visual analysis of the interactions of modalities of several categorical variables is called "Homogeneity Analysis", or HOMALS.

_________________________________________________________

 

 

Tutorial 1

 

This first Tutorial is an overview of Correspondence Analysis. We show how contingency tables may be regarded as a numerical coding of the interaction between two categorical variables through frequencies of pairs of modalities.
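As a rough illustration of this coding, the sketch below builds a contingency table from a list of observed pairs of modalities. The observations and the helper name contingency_table are hypothetical and are not taken from the tutorial.

```python
from collections import Counter

# Hypothetical raw data: one (modality of V1, modality of V2) pair per observation.
observations = [
    ("M1", "N2"), ("M1", "N2"), ("M1", "N1"),
    ("M2", "N1"),
    ("M3", "N1"), ("M3", "N2"),
]


def contingency_table(pairs):
    """Count how many observations fall in each pair of modalities."""
    counts = Counter(pairs)
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    # Counter returns 0 for pairs that were never observed.
    table = [[counts[(a, b)] for b in cols] for a in rows]
    return rows, cols, table


rows, cols, table = contingency_table(observations)
for label, row in zip(rows, table):
    print(label, row)
```

Each cell of the resulting table is the number of observations that chose the corresponding pair of modalities.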

A PCA-like transformation then allows the modalities of the variables to be represented as points in factor planes. Visual analysis of these plots, and in particular of the proximities between modalities, will then give us a visual clue about whether the frequency profiles of two modalities across the modalities of the other variable are similar or not.

 

 

OVERVIEW OF CORRESPONDENCE ANALYSIS

Interaction between categorical variables

Independent categorical variables

Interaction between categorical variables

The mechanism of CA

Contingency tables

PCA on rows and on columns

Simultaneous representation

What is expected from a graphical representation?

Axes

Distance to the origin

Two modalities belonging to the same variable

Two modalities belonging to different variables

TUTORIAL

______________________________________________________

 

 

Tutorial 2

 

Correspondence Analysis does not work on raw contingency tables. It first normalizes them so that cell counts are replaced by frequencies, and modalities of one variable are described by normalized "frequency profiles" across the modalities of the other variable.
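The normalization can be sketched in a few lines of Python. This is only an illustration on a hypothetical 3x2 table; the function names are ours, and "masses" refers to the weights of the modalities (their share of the grand total).

```python
# Hypothetical contingency table: 3 row modalities, 2 column modalities.
table = [
    [10, 30],
    [20, 20],
    [30, 10],
]


def frequencies(t):
    """Cell counts divided by the grand total."""
    n = sum(sum(row) for row in t)
    return [[cell / n for cell in row] for row in t]


def row_profiles(t):
    """Each row divided by its own total, so that every profile sums to 1."""
    return [[cell / sum(row) for cell in row] for row in t]


def column_profiles(t):
    """Each column divided by its own total (one list per column)."""
    return [[cell / sum(col) for cell in col] for col in zip(*t)]


def row_masses(t):
    """Weight of each row modality: its share of the grand total."""
    n = sum(sum(row) for row in t)
    return [sum(row) / n for row in t]


freq = frequencies(table)
r_prof = row_profiles(table)
c_prof = column_profiles(table)
masses = row_masses(table)
print(r_prof[0])  # profile of the first row modality
```

Two row modalities with very different totals can still have identical profiles, which is why profiles rather than raw counts are compared.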

We then show why the traditional Euclidean distance is not appropriate in this setting for the purpose of measuring the similarity between modalities, and has to be replaced by the so-called "Chi-square distance". The upcoming PCAs will be performed with this newly defined distance.
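For reference, the Chi-square distance between two row profiles is usually defined like the Euclidean distance, except that each squared difference is divided by the mass of the corresponding column (that column's share of the grand total). The Python sketch below applies this definition to a hypothetical table; the function names are ours.

```python
import math

# Hypothetical contingency table: 3 row modalities, 2 column modalities.
table = [
    [10, 30],
    [20, 20],
    [30, 10],
]


def _profile(row):
    total = sum(row)
    return [cell / total for cell in row]


def chi_square_distance(t, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    n = sum(sum(row) for row in t)
    col_masses = [sum(col) / n for col in zip(*t)]
    p_i, p_k = _profile(t[i]), _profile(t[k])
    # Each squared difference is weighted by 1 / (column mass).
    return math.sqrt(
        sum((a - b) ** 2 / m for a, b, m in zip(p_i, p_k, col_masses))
    )


def euclidean_distance(t, i, k):
    """Plain Euclidean distance between the same two profiles, for comparison."""
    p_i, p_k = _profile(t[i]), _profile(t[k])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_i, p_k)))


d_chi2 = chi_square_distance(table, 0, 2)
d_euclid = euclidean_distance(table, 0, 2)
print(d_chi2, d_euclid)
```

Dividing by the column mass gives a difference on a rare column more influence than the same difference on a frequent column.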

 

 

THE MECHANISM OF CORRESPONDENCE ANALYSIS

Reformatting data

Contingency tables

Frequencies

Profiles

Weighting

The Chi-square distance

Definition of the Chi-square distance

Why the Chi-square distance?

The 2 PCAs

How many dimensions?

The barycenters

Chi-square and total inertia

TUTORIAL

________________________________________

 

 

Tutorial 3

 

At this stage, we have performed two PCAs:

    1) One on row profiles,

    2) One on column profiles.

 

We are ready to proceed with the interpretation of the results. This interpretation follows the interpretation procedure of regular PCA, with some changes due to the specifics of Correspondence Analysis: weighting of the modalities, Chi-square distance and the ensuing changes in interpreting inertias.

We review here the elements that will be needed for interpreting a CA. Later on, we will interpret a simple, but realistic example of CA, and we will need to keep the elements below in mind.
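Several of these elements can be checked numerically. The sketch below is not the tutorial's own procedure: instead of two separate PCAs, it uses a common equivalent formulation, the singular value decomposition of the standardized residuals, and assumes NumPy is available. The table and all names are hypothetical.

```python
import numpy as np

# Hypothetical 3x3 contingency table.
table = [
    [20, 10, 5],
    [10, 20, 10],
    [5, 10, 30],
]


def correspondence_analysis(t):
    """CA via SVD of standardized residuals (a sketch, not a full implementation)."""
    N = np.asarray(t, dtype=float)
    P = N / N.sum()                      # frequencies
    r = P.sum(axis=1)                    # row masses (weights)
    c = P.sum(axis=0)                    # column masses (weights)
    # Departure from independence, scaled cell by cell.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    eigenvalues = sigma ** 2             # inertia carried by each factor
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return eigenvalues, row_coords, col_coords, r, c


eigenvalues, row_coords, col_coords, r, c = correspondence_analysis(table)

# Total inertia equals the Chi-square statistic divided by the grand total.
N = np.asarray(table, dtype=float)
n = N.sum()
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - E) ** 2 / E).sum()
total_inertia = eigenvalues.sum()

# Contribution of each row modality to the first factor (sums to 1).
contrib_axis1 = r * row_coords[:, 0] ** 2 / eigenvalues[0]

# The weighted barycenter of the row modalities sits at the origin.
barycenter = r @ row_coords

print(total_inertia, chi2 / n)
```

The last singular value is numerically zero, since a table with 3 rows and 3 columns has at most 2 non-trivial factors.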

 

 

INTERPRETATION OF CORRESPONDENCE ANALYSIS

Plots

Interpretation of the total inertia

Eigenvalues

Inertia of the modalities

Weights of the modalities

Coordinates, weight and inertia

Barycenters and origin

Contribution of modality to a factor

Quality of representation of the modalities

Inertia of the factors

TUTORIAL

_____________________________________________

 

 

Tutorial 4

 

We now treat a simple but realistic example. Although real-life problems are usually quite a bit more complex, the step-by-step interpretation procedure that we demonstrate here would be very much the same. The treatment of this example covers the next three sections.

-----

The first section covers the interpretation of the factors.

 

 

EXAMPLE (Part 1): INTERPRETATION OF THE FACTORS

The data

The contingency table

The Chi-square

The inertia

Total inertia

How many factors?

Interpretation of the factors

The basic principle

Which modalities determine the first factor?

Interpretation of the first factor

The second factor

Other factors

Summary of the interpretation of the factors

TUTORIAL

________________________________________________ 

 

 

Tutorial 5

 

The role of the plots of modalities is to suggest associations between pairs of modalities belonging:

    * either to the same variable,

    * or to different variables.

 

In this section, we interpret each variable individually. Each of the two variables is described by a plot, and we examine whether it is justified to overlay the two plots into a single combined plot.

 

 

EXAMPLE (Part 2): INTERPRETING THE MODALITIES

"Quality" or "Square Cosines"

Distance to the origin

"Near center" modalities

"Remote" modalities

Heavy modalities

Neighboring modalities

TUTORIAL

____________________________________

 

 

Tutorial 6

 

In the previous section, we interpreted each variable individually. We could thus discover some properties of the modalities that could certainly have been dug out of the contingency table, but that the plot of modalities made a lot easier to identify.

We now come to interpreting the combined plot of modalities in order to analyze the interactions between the two variables. For this purpose, we display again the same "combined" plot as we did in the previous section, but this time we will consider both variables simultaneously.

 

 

EXAMPLE (Part 3): THE COMBINED PLOT

The basic idea

Neighboring modalities

Confirming with the contingency table

Expected populations

An association is not symmetrical

Summary of the analysis

Analysis of the cloud

Interpretation of the factors

Interpretation of individual variables

Interpretation of the combined plot

TUTORIAL

_________________________________________

 

 

Tutorial 7

 

We finally address some additional questions pertaining to the interpretation of the plots:

    * Supplementary variables, which are variables that were not taken into account when building the model, but that are displayed on the plots and may facilitate their interpretation.

    * Ordinal variables, which are categorical variables whose modalities are naturally ordered. In particular, we show how nonlinear interactions between variables may then be detected by a fundamentally linear technique.
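One common way to display a supplementary modality is to project its profile onto factors computed from the active table only. The sketch below illustrates this for a supplementary row of counts across the active column modalities. It assumes NumPy and an SVD-based formulation of CA, which is not necessarily the procedure used in the tutorial; the table and function name are hypothetical.

```python
import numpy as np

# Hypothetical active table used to build the factors.
table = [
    [20, 10, 5],
    [10, 20, 10],
    [5, 10, 30],
]


def ca_with_supplementary_row(active_table, supplementary_row, tol=1e-12):
    """Return coordinates of the active rows and of one supplementary row."""
    N = np.asarray(active_table, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Keep only the non-trivial factors (non-zero singular values).
    keep = sigma > tol
    U, sigma, V = U[:, keep], sigma[keep], Vt.T[:, keep]
    active_rows = (U * sigma) / np.sqrt(r)[:, None]
    # Standard coordinates of the columns, used to project any row profile.
    col_standard = V / np.sqrt(c)[:, None]
    profile = np.asarray(supplementary_row, dtype=float)
    profile = profile / profile.sum()
    return active_rows, profile @ col_standard


# A supplementary row that did not take part in building the factors.
active_rows, sup_coords = ca_with_supplementary_row(table, [8, 8, 4])

# Sanity check: an active row projected this way lands on its own point.
_, same_as_first = ca_with_supplementary_row(table, table[0])

print(sup_coords)
```

Because the supplementary row has no mass in the analysis, it does not influence the factors; it is only positioned on the plot.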

 

 

CORRESPONDENCE ANALYSIS: COMPLEMENTS

Supplementary variables

Ordinal variables

Interpreting the factors

The Guttman effect

TUTORIAL

 

____________________________________________

 

Related readings

Contingency table

Principal Components Analysis
