Chi-square tests

The generic name of an important family of non-parametric tests.

Although Chi-square tests can take many different forms, they are all different guises of the goodness-of-fit test for the multinomial distribution, based on the so-called "Pearson's Chi-square statistic" detailed below.

The family gets its name from the fact that the distribution of the Chi-square statistic under H0, although generally unknown, always converges to a χ² distribution for large samples (asymptotic distribution). The number of degrees of freedom of this limiting distribution depends on the particular configuration being tested (see the Tutorials below).

Chi-square tests are therefore always approximate tests, and more precisely asymptotic tests.

# The basic goodness-of-fit Chi-square test

## Reminder : the multinomial distribution

Let Mult(n; p1, p2, ..., pk ) be the multinomial distribution defined by :

* The k probabilities p1, p2, ..., pk  with  Σi pi = 1,

* And the number n of observations drawn from the distribution.

We'll denote ni the number of observations that "landed" in cell #i (cell count). We therefore have Σi ni = n.

Mult(n; {pi }) is the probability distribution of the k-dimensional vector {n1, n2, ..., nk }.

## Goodness-of-fit test for the multinomial distribution

Suppose that the pi are unknown, but that it is considered a definite possibility that pi = πi for all i, for some set of k known numbers πi. In order to determine how plausible this assumption is, a goodness-of-fit test needs to be built. The test will assess the plausibility of the null hypothesis :

* H0 : p1 = π1, p2 = π2, ..., pk = πk

against that of the alternative hypothesis

* H1 : at least one of the above equalities is false.

The test needs a test statistic whose value on the sample may be considered a fair indicator of the plausibility of the null assumption. The most popular statistic for this test is "Pearson's Chi-square statistic", which we now describe.

## Pearson's Chi-square statistic

Pearson's Chi-square statistic Q is defined by :

Q = Σi (ni − nπi )² / (nπi )

where the πi are the cell probabilities specified by the null hypothesis, so that nπi is the expected count of cell #i under H0.

Note that the Chi-square statistic is not the only statistic available : in particular, a Likelihood Ratio Test of goodness-of-fit can be built,
the test statistic then being "Wilks' G²".

The Chi-2 statistic is therefore the sum of k similar terms, one for each cell :

* The numerator of each term is natural : it is the squared difference between the observed cell count ni and the expected cell count nπi under H0 (recall that ni /n  is the Maximum Likelihood Estimator of pi).

* A justification for the somewhat unexpected denominator nπi is given here.

The distribution of Q is unknown, but it can be shown that it converges to the χ² distribution with k − 1 degrees of freedom when the sample size grows without limit (Pearson's theorem). This fundamental result is the basis of all Chi-square tests.
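As a minimal sketch of the computation (plain Python, with hypothetical counts and H0 probabilities) :

```python
def pearson_q(observed, probs):
    """Pearson's Chi-square statistic : sum over cells of
    (observed - expected)**2 / expected, with expected = n * pi_i under H0."""
    n = sum(observed)
    q = 0.0
    for n_i, pi_i in zip(observed, probs):
        expected = n * pi_i          # expected cell count under H0
        q += (n_i - expected) ** 2 / expected
    return q

# Hypothetical example : 100 observations, H0 : (p1, p2, p3) = (0.5, 0.3, 0.2)
q = pearson_q([55, 25, 20], [0.5, 0.3, 0.2])
df = 3 - 1   # k - 1 degrees of freedom for the asymptotic chi-square (Pearson's theorem)
```

Here q ≈ 1.33 would then be compared to the quantiles of the χ² distribution with 2 degrees of freedom.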

-----

If we denote :

* Oi the observed cell counts,

* And Ei the expected cell counts,

then Pearson's Chi-square may be written :

Q = Σi (Oi − Ei )² / Ei

## The Chi-square test

Once the sample is drawn, the value of Q is calculated.

* Large values of Q can be obtained only if at least one of the terms is large, that is only if at least one cell count is very different from its expected value, a circumstance that leads to rejecting the null hypothesis.

* Small values of Q (lower image of the illustration below) can be obtained only when each term of the sum is small, that is when each cell count is close to its expected value. The null hypothesis is then certainly not to be rejected.

Note that if all cell counts ni are multiplied by m (and therefore n as well), then Q is also multiplied by m. So although all cell count ratios remain unchanged, and the asymptotic distribution of Q also remains unchanged, the null hypothesis is now more difficult to accept. This reflects the fact that larger samples contain more information about the distribution than smaller samples.
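A quick numerical check of this scaling property, with hypothetical counts (the helper simply re-implements Pearson's statistic) :

```python
def pearson_q(observed, probs):
    # Pearson's statistic : sum of (observed - expected)^2 / expected
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

probs = [0.5, 0.3, 0.2]
q_small = pearson_q([55, 25, 20], probs)      # n = 100
q_large = pearson_q([550, 250, 200], probs)   # same cell-count ratios, n = 1000
# q_large == 10 * q_small : the larger sample weighs more heavily against H0
```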

# Extension of the basic Chi-2 test to continuous distributions

A variant of the above basic Chi-square test leads to a goodness-of-fit test that can be used on continuous distributions. Let p(x) be a continuous distribution, and x = {x1, x2, ..., xn } a sample drawn from some unknown continuous distribution. Could it be that this distribution is p(x) ?

Suppose that we cut p(x) into k slices (lower image of the above illustration). The probability for an observation to fall into slice #i is :

pi = ∫[ai , bi ] p(x) dx

where ai and bi are the left and right ends of the slice.

The situation is then identical to that of the goodness-of-fit of the multinomial distribution Mult(n; p1, p2, ..., pk ) where k is the number of slices.

-----

The discretization of p(x) into the set of probabilities {pi} causes a loss of information : many different continuous distributions can generate the same set {pi}. The choice of k by the analyst is therefore an important issue :

* If k is "too small" and therefore the slices too wide, there is too much information lost about the detailed structure of p(x) by the discretization process : too many continuous distributions can lead to the same set {pi} and the test cannot discriminate efficiently between p(x) and other distributions. The power of the test is therefore low.

* But if k is too large, the average number of observations in each slice is small, a circumstance that is known to make the χ² approximation, and therefore the test, inaccurate (a common rule of thumb requires an expected count of at least 5 in every slice).

These points are just another example of the bias-variance tradeoff (see here) and are efficiently illustrated by the animation studying the behavior of the Chi-square statistic under various experimental conditions.

The same animation provides some experimental evidence in favor of positioning the boundaries of the slices so that all slices define approximately equal probabilities.
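A sketch of this equal-probability slicing, assuming a standard normal reference distribution and the standard library's NormalDist :

```python
from statistics import NormalDist

ref = NormalDist(mu=0.0, sigma=1.0)   # hypothetical reference distribution p(x)
k = 5                                 # number of slices, chosen by the analyst

# Slice boundaries at the quantiles 1/k, 2/k, ..., (k-1)/k of p(x)
cuts = [ref.inv_cdf(i / k) for i in range(1, k)]
edges = [float("-inf")] + cuts + [float("inf")]

def cdf(x):
    # Extend the cdf to the infinite endpoints of the first and last slices
    if x == float("-inf"):
        return 0.0
    if x == float("inf"):
        return 1.0
    return ref.cdf(x)

# p_i = F(b_i) - F(a_i) : by construction every slice has probability ~ 1/k
probs = [cdf(b) - cdf(a) for a, b in zip(edges[:-1], edges[1:])]
```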

-----

Several goodness-of-fit tests, designed more specifically for continuous distributions, serve exactly the same purpose as the "continuous distribution" version of the Chi-2 goodness-of-fit test, and usually perform better (greater power).

# Estimating parameters

Consider the Chi-square test meant to test the null hypothesis : the sample was drawn from the normal distribution N(µ, σ²), where µ and σ² are known. If the normal distribution is cut into k slices, the asymptotic distribution of the Chi-square statistic will be χ² with k − 1 degrees of freedom.

Consider now another test meant to test this other null hypothesis : the sample was drawn from some normal distribution, where µ and σ² are not specified. It is still possible to build a Chi-square test for testing this null hypothesis, but this will first require building a reference normal distribution. The null hypothesis will then be : "The sample was drawn from this particular reference normal distribution".

The reference normal distribution is obtained by defining its mean and variance from the sample, generally by Maximum Likelihood Estimation.

It can then be shown that the asymptotic distribution of the Chi-square statistic is still χ², but now with k − 3 degrees of freedom instead of k − 1. In other words, estimating the two parameters of the reference distribution causes a loss of two degrees of freedom in the asymptotic χ² distribution.

-----

Under some regularity conditions, this result is quite general and extends well beyond the above example. It can be stated as follows : each time one of the parameters encountered while building a Chi-square test is unknown and therefore has to be estimated (e.g. by Maximum Likelihood Estimation) :

* The asymptotic distribution of the Chi-2 statistic remains χ2,

* But its number of degrees of freedom has to be reduced by one unit.

This important result is difficult to prove, and is not demonstrated in this Glossary.
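A sketch of the resulting bookkeeping for the normal example above, with a hypothetical sample and both parameters estimated by Maximum Likelihood :

```python
import math
from statistics import NormalDist, fmean

# Hypothetical sample
x = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7,
     5.0, 4.1, 5.6, 4.8, 5.3, 4.6, 5.9, 4.3, 5.4, 5.0]
n = len(x)

# Maximum Likelihood estimates (note : the MLE of the variance divides by n, not n - 1)
mu = fmean(x)
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
ref = NormalDist(mu, sigma)            # the reference normal distribution

# Cut the reference distribution into k equal-probability slices
k = 5
edges = [ref.inv_cdf(i / k) for i in range(1, k)]

counts = [0] * k
for v in x:
    counts[sum(v > e for e in edges)] += 1   # index of the slice containing v

expected = n / k                       # n * p_i with p_i = 1/k
q = sum((c - expected) ** 2 / expected for c in counts)
df = k - 1 - 2                         # two estimated parameters -> two lost degrees of freedom
```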

# Tests on contingency tables

The most popular use of Chi-square tests is to be found in conjunction with contingency tables (or "frequency tables").

## Chi-square test of independence

Let X1 and X2 be two discrete random variables. The joint probability distribution of the pair (X1, X2 ) is defined by a set of (unknown) probabilities pij.

A sample is drawn from the joint distribution of {X1, X2}. The outcome is a set of counts {nij}, where i denotes the ith modality of X1, and j denotes the jth modality of X2. These counts are usually displayed as a table called a contingency table :

Is the content of the table an argument in favor of the assumption that X1 and X2 are not independent random variables ?

We'll see that a variant of the basic Chi-square test can test :

* H0 : X1 and X2 are independent

against

* H1 : X1 and X2 are not independent.

____________________

When the test rejects H0, it provides no information about how precisely the two variables depart from independence. A more detailed study of the pair (X1, X2 ) is then provided by Correspondence Analysis.
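The mechanics of the test can be sketched as follows : expected counts are built from the margins of the table, and (a standard result, derived in Tutorial 1 below) the asymptotic χ² has (I − 1)(J − 1) degrees of freedom, where I and J are the numbers of modalities. A sketch with a hypothetical table :

```python
def chi2_independence(table):
    """Pearson's statistic and degrees of freedom for a test of
    independence on a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_sums[i] * col_sums[j] / n   # expected count under independence
            q += (n_ij - e_ij) ** 2 / e_ij
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return q, df

# Hypothetical 2x2 contingency table
q, df = chi2_independence([[10, 20],
                           [30, 40]])
```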

## Chi-square test of symmetry

When the two variables have the same number of modalities, the contingency table is square. Another question then arises naturally : is this table compatible with the assumption that the underlying joint probability distribution is symmetric ? In other words, do we have pij = pji for every pair (i, j ) ?

We'll see that another variant of the Chi-square test can address this question.
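One classical statistic for this variant is Bowker's statistic (McNemar's test generalized to k modalities, not named explicitly in this Glossary), which compares each off-diagonal pair of cells ; a sketch with a hypothetical symmetric table :

```python
def bowker(table):
    """Bowker's Chi-square statistic for the symmetry of a square
    contingency table, with k(k-1)/2 degrees of freedom."""
    k = len(table)
    q = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = table[i][j] + table[j][i]
            if pair_total:   # a pair of empty symmetric cells contributes nothing
                q += (table[i][j] - table[j][i]) ** 2 / pair_total
    return q, k * (k - 1) // 2

# A perfectly symmetric table : Q = 0, the null hypothesis is not challenged
q, df = bowker([[5, 2, 7],
                [2, 9, 4],
                [7, 4, 1]])
```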

## Chi-square test of identity of the marginals

The fundamental quantity behind a contingency table is the joint probability distribution pij = P{X1 = x1i, X2 = x2j }. From this joint distribution can be derived the marginal probability distributions of X1 and of X2, which are just the "ordinary" probability distributions of X1 and of X2.

When the two variables have the same number of modalities, a natural question is : "Do the two variables have identical marginal distributions ?".

Again, another variant of the Chi-square test can address this question.

# Chi-square test of identity (or "of homogeneity")

A somewhat different problem is as follows : we have I independent discrete variables {X1, X2, ..., XI } that all take their values within a set of J values {v1, v2, ..., vJ }. Samples of sizes {n1, n2, ..., nI } are drawn from the distributions of these variables. For example, the n1 draws from X1 will lead to the following counts :

n11, n12, ..., n1J     with    Σj n1j  = n1

with similar notations for the other variables.

The question is : do these I variables have identical probability distributions ?

A variant of the Chi-square test will successfully address this question.

-----

The identity Chi-square test may be adapted to continuous distributions along the same lines as the basic goodness-of-fit Chi-square test. It then becomes an alternative :

* To the Mann-Whitney test when comparing two distributions,

* And to the Kruskal-Wallis test when comparing more than two distributions.

_______________________________________________

 Tutorial 1

This test is concerned with whether two categorical variables are independent. If they are, the contingency table describing the sample should not depart appreciably from a certain canonical structure. The departure of the contingency table from this canonical structure is measured by the value of the "Chi-square statistic", which is approximately Chi-square distributed for large samples. We'll calculate the number of degrees of freedom of this χ² distribution.

THE CHI-2 TEST OF INDEPENDENCE

* The Chi-square test of independence
* Contingency table
* Marginal counts
* The null and alternative hypothesis
* Marginal probabilities
* Independence and joint probability distribution
* The contingency table is multinomially distributed
* The Chi-2 test statistic
* Estimating the probabilities
* Estimating the marginal probabilities
* Estimating the joint probabilities
* Number of degrees of freedom
* Number of estimated parameters
* Number of degrees of freedom
* Functional relationship between two variables
* Largest values of the Chi-square statistic
* Functional relationship
* Contributions to the Chi-square statistic
* The Chi-square statistic wastes information
* Contributions to the Chi-square statistic

_________________________________________________

 Tutorial 2

We now describe three additional classical Chi-square tests :

* The "Symmetry test", that bears on the symmetry of the joint probability distribution behind a square contingency table. When estimating the parameters, we'll have to be a bit careful with the Lagrangian optimization which can be messy if not approached properly.

* The "Marginal test", that questions the identity of the marginal probability distributions of a contingency table.

* The "Identity test", which tries to determine whether several discrete populations are identically distributed.

OTHER CLASSICAL CHI-2 TESTS

* Testing the symmetry of the joint distribution
* Estimating the probabilities by Lagrangian optimization
* The test statistic
* Number of degrees of freedom
* Testing the identity of the marginal distributions
* The hypothesis
* Estimating the probabilities
* Number of degrees of freedom
* Symmetry implies identity of the marginals
* Testing the identity of several probability distributions
* Identical populations
* Identifying a test statistic
* Estimating the probabilities
* Number of degrees of freedom
* Comparison with the independence test

_______________________________________________________

 Tutorial 3

In practical applications, discrete variables are often dichotomous (2 modalities) : "Gender" is "Male/Female", "Smoking" is "Yes/No", etc. Also, when a nominal variable has more than 2 modalities, one modality may be opposed to all the other modalities, which are then collapsed into a single new modality, thus making the variable dichotomous.

2x2 tables, crossing two dichotomous variables, have therefore received special attention.
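For a 2x2 table with cells a, b (first row) and c, d (second row), Pearson's statistic collapses to a well-known closed form ; a sketch with hypothetical counts :

```python
def chi2_2x2(a, b, c, d):
    """Closed-form Pearson Chi-square statistic for the 2x2 table
    [[a, b], [c, d]] ; the asymptotic chi-square has 1 degree of freedom."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

q = chi2_2x2(10, 20, 30, 40)
```

This closed form agrees term by term with the general statistic computed from the margins of the same table.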

This Tutorial is dedicated to 2x2 tables. We'll show that :

1) The Chi-2 statistic takes a particularly simple form for 2x2 tables.

2) Whereas Chi-square tests are approximate tests, exact tests can sometimes be developed, which are not based on the Chi-2 statistic.

This is the case of :

* The Fisher-Irwin test meant to decide whether two dichotomous variables have identical distributions. In practical terms, the test considers two populations, both split into two subpopulations, the question being whether the two splits define equal proportions in the two populations.

* Fisher's exact test, which is an independence test for two dichotomous variables. We'll first give an intuitive argument in favor of the hypergeometric nature of the distribution of the test statistic, and then confirm this result by a more rigorous approach.
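The hypergeometric probabilities that Fisher's exact test manipulates can be sketched with math.comb, using hypothetical margins : the conditional probability of the table [[a, b], [c, d]] given its margins is C(a+b, a)·C(c+d, c)/C(n, a+c).

```python
from math import comb

def table_prob(a, b, c, d):
    """Conditional probability of the 2x2 table [[a, b], [c, d]] given its
    margins (hypergeometric distribution)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Hypothetical margins : row sums 7 and 5, first column sum 6
r1, r2, c1 = 7, 5, 6

# Enumerate every table compatible with these margins
support = range(max(0, c1 - r2), min(r1, c1) + 1)
probs = [table_prob(a, r1 - a, c1 - a, r2 - c1 + a) for a in support]
# The probabilities of all compatible tables sum to 1
```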

-----

Although these two tests were developed around 1930, their true usefulness is still debated, a subject that we will not cover. Only the mathematical aspects of the tests (distributions of the test statistics) are developed.

3) The McNemar test addresses the issue of the symmetry of a 2x2 joint probability distribution.
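A sketch of McNemar's statistic (without the continuity correction), which only involves the two discordant cells b and c of the table [[a, b], [c, d]] ; the counts below are hypothetical :

```python
def mcnemar_q(b, c):
    """McNemar's Chi-square statistic for the symmetry of a 2x2 table :
    only the discordant cells b and c matter ; 1 degree of freedom."""
    return (b - c) ** 2 / (b + c)

q = mcnemar_q(15, 5)   # (15 - 5)**2 / 20 = 5.0
```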

SPECIAL CASE : 2x2 TABLES

* The Chi-2 statistic for 2x2 tables
* Fisher-Irwin test
* Fisher's exact test
* Intuitive argument
* Calculating the distribution of the statistic
* Numerator
* Denominator
* The conditional distribution of the statistic is hypergeometric
* One-sided Fisher's exact test
* McNemar test

______________________________________________________