Contingency table

A company sells 5 products in 4 countries. At the end of each quarter, it summarizes its sales figures (for instance, in thousands of units sold) in a table that looks something like this:

      P1   P2   P3   P4   P5
C1    28   14   45   33   12
C2    36   21   25   64   23
C3    21   64   38   11    7
C4    79   42   67    9   41

Such a table is called a "contingency table". Not every rectangular table of numbers is a contingency table: the contents of the cells have to be interpretable as counts of something, so that adding up the cells of a row, or of a column, makes sense.
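As a small illustration of why row and column sums must make sense, the sketch below stores the example table as a plain list of lists and adds up its margins. The variable names are chosen for this illustration only.

```python
# Sales table from the text: rows are countries C1..C4, columns are products P1..P5.
countries = ["C1", "C2", "C3", "C4"]
products = ["P1", "P2", "P3", "P4", "P5"]
table = [
    [28, 14, 45, 33, 12],
    [36, 21, 25, 64, 23],
    [21, 64, 38, 11, 7],
    [79, 42, 67, 9, 41],
]

# Row totals: units sold per country, all products together.
row_totals = [sum(row) for row in table]
# Column totals: units sold per product, all countries together.
col_totals = [sum(col) for col in zip(*table)]
# Grand total: both sets of margins must add up to the same number.
grand_total = sum(row_totals)

for name, total in zip(countries, row_totals):
    print(name, total)
for name, total in zip(products, col_totals):
    print(name, total)
print("all", grand_total)
```

The row totals and the column totals describe the same sales from two points of view, which is why they share the same grand total.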

Contingency tables are of great practical importance. They are the starting point for analyzing possible interactions between nominal (or "categorical") variables. In this case, we have two such nominal variables:

* "Country", with 4 modalities (or "categories").

* "Product", with 5 modalities.

Here, "interaction" means "departure from independence". What would the following sentence mean?

"The two variables "Country" and "Product" are independent."

It would mean that any two countries sell the products in exactly the same proportions across the product line. For example, country C1 sells twice as many P1 (28) as P2 (14). If "Country" and "Product" were independent, we would expect country C2 to sell twice as many P1 as P2 too. C2 sold 36 P1, so we would expect it to have sold 36/2 = 18 P2. In reality it sold 21 P2, so a simple visual inspection already suggests that the two variables "Country" and "Product" are not independent.
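The comparison made above can be sketched in a few lines: compute the P1/P2 ratio in C1, apply it to C2, and look at each country's "row profile" (the share of each product in that country's sales). The helper name `profile` is an assumption made for this illustration.

```python
# Sales per country, in product order P1..P5 (figures from the table in the text).
table = {
    "C1": [28, 14, 45, 33, 12],
    "C2": [36, 21, 25, 64, 23],
    "C3": [21, 64, 38, 11, 7],
    "C4": [79, 42, 67, 9, 41],
}

def profile(counts):
    """Share of each product in one country's total sales."""
    total = sum(counts)
    return [c / total for c in counts]

# Under independence, every country would have the same profile.
profiles = {country: profile(counts) for country, counts in table.items()}

ratio_c1 = table["C1"][0] / table["C1"][1]      # P1/P2 in C1
ratio_c2 = table["C2"][0] / table["C2"][1]      # P1/P2 in C2
expected_p2_in_c2 = table["C2"][0] / ratio_c1   # P2 sales in C2 if it followed C1's ratio

print("P1/P2 ratio in C1:", ratio_c1)
print("P1/P2 ratio in C2:", round(ratio_c2, 3))
print("P2 expected in C2 from C1's ratio:", expected_p2_in_c2, "observed:", table["C2"][1])
for country, shares in profiles.items():
    print(country, [round(s, 3) for s in shares])
```

If the printed profiles were identical from one country to the next, the table would show exact independence; here they differ.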

Note that exactly the same conclusion would have been drawn by examining columns instead of rows. We would then have said: the two variables are independent if any two products sell in exactly the same proportions in each and every country. The two definitions are equivalent.
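The equivalence of the two definitions can be checked numerically. The sketch below builds a hypothetical table that has the same margins as the example but is exactly independent (each cell is row total times column total divided by the grand total), then verifies that its row profiles are all equal and that its column profiles are all equal. The function names are assumptions made for this illustration.

```python
import math

# Observed table from the text (rows C1..C4, columns P1..P5).
observed = [
    [28, 14, 45, 33, 12],
    [36, 21, 25, 64, 23],
    [21, 64, 38, 11, 7],
    [79, 42, 67, 9, 41],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Hypothetical table with the same margins but exact independence.
independent = [[r * c / grand_total for c in col_totals] for r in row_totals]

def profiles(rows):
    """Divide every row by its own total."""
    return [[x / sum(row) for x in row] for row in rows]

def all_equal(vectors):
    """True when every vector matches the first one, up to rounding error."""
    first = vectors[0]
    return all(
        math.isclose(a, b, rel_tol=1e-9)
        for v in vectors[1:]
        for a, b in zip(first, v)
    )

def transpose(rows):
    return [list(col) for col in zip(*rows)]

rows_ok = all_equal(profiles(independent))             # row definition
cols_ok = all_equal(profiles(transpose(independent)))  # column definition
observed_rows_ok = all_equal(profiles(observed))
observed_cols_ok = all_equal(profiles(transpose(observed)))

print("independent table:", rows_ok, cols_ok)
print("observed table:   ", observed_rows_ok, observed_cols_ok)
```

Both checks pass together on the independent table and fail together on the observed one, which is what the equivalence of the two definitions leads us to expect.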

The analysis of departures from independence has important practical implications. Here, we saw that C2 sold proportionally more P2 than C1 did. Why? What are the characteristics of C2 that make this country particularly receptive to P2? Could this receptivity be extended to other countries through appropriate promotional efforts?

Visual examination of contingency tables quickly reaches its limits, and analyzing large tables requires specific methods. Here are three classical methods for analyzing contingency tables:

1) The "Chi-Square test"

We noted that C2 sold proportionally more P2 than C1 did. Isn't that a hasty conclusion? Is "21" all that different from "18"? Couldn't the natural fluctuations of ordinary business life account for this difference? More globally, how sure can we be that the two variables are not independent when the numbers seem to indicate that they are not?

This kind of question calls for a test. In the standard vocabulary of tests, the null hypothesis H0 is: "The two variables are independent". The Chi-Square test constructs a quantity, named "Chi-square":

* that is "0" when the numbers point to independence,

* positive otherwise,

* and that gets larger and larger as the contingency table departs more and more from the independence structure.

The test ultimately delivers a number, the p-value, which is the probability that this quantity would be at least as large as the value observed if the two variables were actually independent. A low p-value (say, below 0.05) means that such a large departure would be unlikely under independence, and it is therefore taken as evidence that the variables are not independent.
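A minimal sketch of this computation on the example table follows, assuming the usual Pearson form of the statistic: the sum over all cells of (observed - expected)^2 / expected, where "expected" is row total times column total divided by the grand total. Two further assumptions are made for the illustration: the cell values are treated as raw counts (the statistic would change if the figures were rescaled, for instance from thousands of units to units), and the p-value uses a closed-form tail formula that is valid only for an even number of degrees of freedom, which holds here since (4 - 1) x (5 - 1) = 12. The function name `chi2_sf_even_df` is hypothetical.

```python
import math

# Observed table from the text (rows C1..C4, columns P1..P5).
observed = [
    [28, 14, 45, 33, 12],
    [36, 21, 25, 64, 23],
    [21, 64, 38, 11, 7],
    [79, 42, 67, 9, 41],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Counts expected in each cell if the two variables were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson Chi-square: 0 under exact independence, larger as the table departs from it.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) of a Chi-square law with an even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term = 1.0   # k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(chi2, df)
print("Chi-square:", round(chi2, 2), "df:", df, "p-value:", p_value)
```

For tables whose degrees of freedom are odd, a statistics library would be needed for the tail probability; the rest of the computation is unchanged.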

2) Correspondence Analysis (CA)

3) Loglinear models

Loglinear models attempt to decompose the cell counts of a contingency table into several contributions:

* A constant term,

* A part that is due to "Variable 1" only,

* A part that is due to "Variable 2" only,

* and, by difference, "the rest", which is, by definition, caused by the interaction between the two variables. This part vanishes when the two variables are independent.
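The decomposition listed above can be sketched on the logarithms of the cell counts, where the contributions add up. The code below is only an illustration under stated assumptions: it uses the saturated two-way model with sum-to-zero effects (one common parameterization among several), computes the terms by simple averaging rather than by the maximum-likelihood fitting used in practice, and requires every cell to be strictly positive. The function name `decompose` is hypothetical.

```python
import math

def decompose(table):
    """Split log(count) into constant + row effect + column effect + interaction."""
    logs = [[math.log(x) for x in row] for row in table]
    n_rows, n_cols = len(logs), len(logs[0])
    # Constant term: overall mean of the log counts.
    constant = sum(sum(row) for row in logs) / (n_rows * n_cols)
    # Part due to the row variable only ("Country" in the example).
    row_effect = [sum(row) / n_cols - constant for row in logs]
    # Part due to the column variable only ("Product" in the example).
    col_effect = [sum(col) / n_rows - constant for col in zip(*logs)]
    # "The rest": what the constant and the two main effects do not explain.
    interaction = [
        [logs[i][j] - constant - row_effect[i] - col_effect[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return constant, row_effect, col_effect, interaction

observed = [
    [28, 14, 45, 33, 12],
    [36, 21, 25, 64, 23],
    [21, 64, 38, 11, 7],
    [79, 42, 67, 9, 41],
]
constant, row_effect, col_effect, interaction = decompose(observed)
largest = max(abs(g) for row in interaction for g in row)
print("largest interaction term (observed table):", round(largest, 3))

# Hypothetical table with the same margins but exact independence.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
independent = [[r * c / n for c in col_totals] for r in row_totals]
largest_indep = max(abs(g) for row in decompose(independent)[3] for g in row)
print("largest interaction term (independent table):", largest_indep)
```

On the independent table the interaction terms are zero up to rounding error, while on the observed table they are not, in line with the last item of the list.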

In practice, loglinear models are usually applied to generalized contingency tables, where more than two nominal variables are analyzed simultaneously. Loglinear models can also be used to analyze interactions involving continuous, numerical variables.

______________________________________

 Correspondence analysis