Correspondence Analysis
Visualizing the coupling between two numerical variables is easy: a simple scatter plot provides a wealth of information about the interaction between the two variables. Unfortunately, scatter plots do not translate into the world of categorical variables.
Correspondence Analysis is a technique that generates graphical representations of the interactions between modalities (or "categories") of two categorical variables. It allows the visual discovery and interpretation of these interactions, that is, of the departure from independence of the two variables.
Correspondence Analysis's approach is not unlike that of Principal Components Analysis: "factors" are defined that allow 2D representations of the modalities as points in "factor planes" with as little distortion as possible.
In these factor planes, it is expected that pairwise distances are indicative of the tendency of modalities to "attract" or "repel" each other. For example, if the two variables are:
* V1, with 3 modalities M1, M2, M3,
* and V2, with 2 modalities N1 and N2,
then M1 being close to N2 would be an indication that more observations chose the pair (M1, N2) than the hypothesis of independence between V1 and V2 would lead us to expect.
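The comparison with "what independence would lead us to expect" can be sketched numerically. The 3 x 2 table of counts below is made up purely for illustration (it is not taken from the tutorials), and the variable names are our own. The counts expected under independence come from the row and column totals, and a ratio above 1 marks a pair that is over-represented:

```python
import numpy as np

# Hypothetical counts: rows are M1, M2, M3 (V1); columns are N1, N2 (V2).
observed = np.array([[10, 30],
                     [20, 20],
                     [30, 10]], dtype=float)

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Counts expected if V1 and V2 were independent.
expected = row_totals @ col_totals / n

# Ratio > 1: the pair is over-represented ("attraction").
# Ratio < 1: the pair is under-represented ("repulsion").
ratio = observed / expected
print(ratio)
```

In this made-up table the pair (M1, N2) has a ratio above 1, which is the kind of situation in which M1 and N2 would be expected to lie close to each other on a CA plot.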
Correspondence Analysis also interprets distances between two modalities of the same variable in terms of the similarity of their compositions across the modalities of the other variable.
CA's mathematical machinery is a bit cumbersome. Although somewhat similar to that of Principal Components Analysis, it requires some preprocessing of the data (the contingency table), and redefining the notion of "distance" from the usual Euclidean sense to that of the "Chi-square distance".
Interpreting a CA diagram is also something of a black art, although it is supported by a well-defined methodology. As it is, CA is an irreplaceable tool for a quick and relatively safe interpretation of a large contingency table.
CA generalizes to more than two nominal variables. It is then called "Multiple Correspondence Analysis" (MCA). An equivalent approach to the visual analysis of the interactions between modalities of several categorical variables is called "Homogeneity Analysis", or HOMALS.
_________________________________________________________
Tutorial 1
This first Tutorial is an overview of Correspondence Analysis. We show how contingency tables may be regarded as a numerical coding of the interaction between two categorical variables through frequencies of pairs of modalities.
A PCA-like transformation then allows the modalities of the variables to be represented as points in factorial planes. Visual analysis of these plots, and in particular of the proximities between modalities, will then give us a visual clue about whether the frequency profiles of two modalities across the modalities of the other variable are similar or not.
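One common way to carry out such a PCA-like transformation is a singular value decomposition of the standardized residuals of the contingency table. The sketch below follows that route; it is an assumption about the computation, not necessarily the exact procedure used in the tutorial, and the function name `correspondence_analysis` and the table of counts are hypothetical:

```python
import numpy as np

def correspondence_analysis(counts):
    """Return row coordinates, column coordinates and eigenvalues of a CA."""
    P = counts / counts.sum()          # table of relative frequencies
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals: departure from independence,
    # scaled so that ordinary SVD works in the Chi-square metric.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sing) / np.sqrt(r)[:, None]      # row principal coordinates
    cols = (Vt.T * sing) / np.sqrt(c)[:, None]   # column principal coordinates
    return rows, cols, sing ** 2

# Made-up counts: rows M1, M2, M3; columns N1, N2.
counts = np.array([[10, 30], [20, 20], [30, 10]], dtype=float)
rows, cols, eigenvalues = correspondence_analysis(counts)
print(rows)
print(cols)
```

The rows of `rows` and `cols` are the points that would be drawn on the factorial planes. The sign of each axis is arbitrary, so only relative positions matter.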
OVERVIEW OF CORRESPONDENCE ANALYSIS
* Interaction between categorical variables
* Independent categorical variables
* Interaction between categorical variables
* The mechanism of CA
* Contingency tables
* PCA on rows and on columns
* Simultaneous representation
* What is expected from a graphical representation?
* Axes
* Distance to the origin
* Two modalities belonging to the same variable
* Two modalities belonging to different variables
______________________________________________________
Tutorial 2
Correspondence Analysis does not work on raw contingency tables. It first normalizes them so that cell counts are replaced by frequencies, and modalities of one variable are described by normalized "frequency profiles" across the modalities of the other variable.
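As a minimal sketch of this normalization, assuming a made-up table of counts (the numbers and variable names are illustrative only), frequencies and profiles can be computed as follows:

```python
import numpy as np

# Made-up counts: rows are modalities of one variable, columns of the other.
counts = np.array([[10, 30], [20, 20], [30, 10]], dtype=float)

freq = counts / counts.sum()                 # relative frequencies, sum to 1
row_masses = freq.sum(axis=1)                # weight of each row modality
col_masses = freq.sum(axis=0)                # weight of each column modality
row_profiles = freq / row_masses[:, None]    # each row sums to 1
col_profiles = freq / col_masses[None, :]    # each column sums to 1
print(row_profiles)
```

The weighted average of the row profiles, with the row masses as weights, is the vector of column masses: this "average profile" is what individual profiles are compared against.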
We then show why the traditional Euclidean distance is not appropriate in this setting for the purpose of measuring the similarity between modalities, and has to be replaced by the so-called "Chi-square distance". The upcoming PCAs will be performed with this newly defined distance.
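The usual definition of the Chi-square distance between two row profiles divides each squared difference by the mass of the corresponding column, so that differences on rare columns count for more. The sketch below assumes that definition; the helper name `chi2_distance` and the counts are hypothetical:

```python
import numpy as np

def chi2_distance(profile_a, profile_b, col_masses):
    """Chi-square distance between two row profiles."""
    diff = profile_a - profile_b
    return np.sqrt(np.sum(diff ** 2 / col_masses))

# Made-up counts in which the first column is much rarer than the second.
counts = np.array([[10, 40], [5, 45], [5, 15]], dtype=float)
freq = counts / counts.sum()
col_masses = freq.sum(axis=0)
row_profiles = freq / freq.sum(axis=1, keepdims=True)

d_chi2 = chi2_distance(row_profiles[0], row_profiles[1], col_masses)
d_euclid = np.linalg.norm(row_profiles[0] - row_profiles[1])
print(d_chi2, d_euclid)
```

Because column masses are smaller than 1, the Chi-square distance is larger than the Euclidean one, and the inflation is strongest for differences located on low-frequency columns.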
THE MECHANISM OF CORRESPONDENCE ANALYSIS
* Reformatting data
* Contingency tables
* Frequencies
* Profiles
* Weighting
* The Chi-square distance
* Definition of the Chi-square distance
* Why the Chi-square distance?
* The 2 PCAs
* How many dimensions?
* The barycenters
* Chi-square and total inertia
________________________________________
Tutorial 3
At this stage, we have performed two PCAs:
1) One on row profiles,
2) One on column profiles.
We are ready to proceed with the interpretation of the results. This interpretation will be inspired by the interpretation procedure of regular PCA, with some changes because of the specifics of Correspondence Analysis: weighting of the modalities, Chi-square distance and the ensuing changes in interpreting inertias.
We review here the elements that will be needed for interpreting a CA. Later on, we will interpret a simple, but realistic example of CA, and we will need to keep the elements below in mind.
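Several of these elements can be computed directly from the decomposition. The sketch below assumes the SVD-based formulation of CA and uses standard definitions: total inertia as the sum of the eigenvalues, contribution of a modality as its share of a factor's inertia, and quality as a squared cosine. The counts and variable names are made up for illustration:

```python
import numpy as np

# Made-up 3 x 3 table of counts.
counts = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
n = counts.sum()
P = counts / n
r, c = P.sum(axis=1), P.sum(axis=0)          # row and column masses (weights)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
eigenvalues = sing ** 2                      # inertia carried by each factor
rows = (U * sing) / np.sqrt(r)[:, None]      # row principal coordinates

total_inertia = eigenvalues.sum()
chi2 = n * total_inertia                     # total inertia is Chi-square / n

keep = eigenvalues > 1e-12                   # drop numerically null factors
share = eigenvalues[keep] / total_inertia    # proportion of inertia per factor

# Contribution of each row modality to each factor (each column sums to 1).
contrib = r[:, None] * rows[:, keep] ** 2 / eigenvalues[keep]

# Quality of representation (squared cosine) of each row on each factor.
sq_dist = (rows ** 2).sum(axis=1)            # squared distance to the origin
cos2 = rows[:, keep] ** 2 / sq_dist[:, None]
print(share, contrib, cos2, sep="\n")
```

A modality sitting exactly at the origin would have a zero `sq_dist`, so its squared cosines are undefined; the made-up table above avoids that case.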
INTERPRETATION OF CORRESPONDENCE ANALYSIS
* Plots
* Interpretation of the total inertia
* Eigenvalues
* Inertia of the modalities
* Weights of the modalities
* Coordinates, weight and inertia
* Barycenters and origin
* Contribution of modality to a factor
* Quality of representation of the modalities
* Inertia of the factors
_____________________________________________
Tutorial 4
We now treat a simple but realistic example. Although real life problems are usually quite a bit more complex, the step-by-step interpretation procedure that we demonstrate here would be very much the same. The treatment of this example covers the next three sections.
-----
The first section covers the interpretation of the factors.
EXAMPLE (Part 1): INTERPRETATION OF THE FACTORS
* The data
* The contingency table
* The Chi-square
* The inertia
* Total inertia
* How many factors?
* Interpretation of the factors
* The basic principle
* Which modalities determine the first factor?
* Interpretation of the first factor
* The second factor
* Other factors
* Summary of the interpretation of the factors
________________________________________________
Tutorial 5
The role of the plots of modalities is to suggest associations between pairs of modalities belonging:
* either to the same variable,
* or to different variables.
In this section, we address the issue of interpreting each variable individually. Each of the two variables is described by a plot, and we address the issue of whether it is justified to overlay the two plots into a single combined plot.
EXAMPLE (Part 2): INTERPRETING THE MODALITIES
* "Quality" or "Square Cosines"
* Distance to the origin
* "Near center" modalities
* "Remote" modalities
* Heavy modalities
* Neighboring modalities
____________________________________
Tutorial 6
In the previous section, we interpreted each variable individually. We could thus discover some properties of the modalities that could certainly have been dug out of the contingency table, but that the plot of modalities made much easier to identify.
We now come to interpreting the combined plot of modalities in order to analyze the interactions between the two variables. For this purpose, we display again the same "combined" plot as we did in the previous section, but this time we will consider both variables simultaneously.
EXAMPLE (Part 3): THE COMBINED PLOT
* The basic idea
* Neighboring modalities
* Confirming with the contingency table
* Expected populations
* An association is not symmetrical
* Summary of the analysis
* Analysis of the cloud
* Interpretation of the factors
* Interpretation of individual variables
* Interpretation of the combined plot
_________________________________________
Tutorial 7
We finally address some additional questions pertaining to the interpretation of the plots:
* Supplementary variables, which are variables that were not taken into account for building the model, but that are displayed on the plots and may facilitate their interpretation.
* Ordinal variables, which are categorical variables whose modalities are naturally ordered. In particular, we show how nonlinear interactions between variables may then be detected by a fundamentally linear technique.
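Placing a supplementary modality on an existing plot can be sketched with the usual transition formula: its profile is projected onto the standard coordinates of the active columns, without recomputing the factors. This assumes the SVD-based formulation of CA; the counts, the supplementary row `sup_counts` and the variable names are all hypothetical:

```python
import numpy as np

# Made-up active table used to build the factors.
counts = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # number of non-trivial factors for a 3 x 3 table
rows = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]   # active rows, principal coordinates
cols_std = Vt.T[:, :k] / np.sqrt(c)[:, None]          # columns, standard coordinates

# Hypothetical supplementary modality: counted over the same columns,
# but left out when the factors were computed.
sup_counts = np.array([12, 3, 15], dtype=float)
sup_profile = sup_counts / sup_counts.sum()
sup_coords = sup_profile @ cols_std                   # position on the factor plane
print(sup_coords)
```

Applying the same projection to the profile of an active row gives back that row's own coordinates, which is a convenient sanity check.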
CORRESPONDENCE ANALYSIS: COMPLEMENTS
* Supplementary variables
* Ordinal variables
* Interpreting the factors
* The Guttman effect
____________________________________________
Related readings