Chi-square

Pronounced "Ky-square".

 

This term is ubiquitous in Statistics. It covers several related but distinct concepts. We briefly describe:

    1) The Chi-square probability distribution.

    2) The Chi-square tests.

    3) Chi-square based Decision Trees.

    4) The Chi-square distance.

_______________________________

The Chi-square distribution

First, "Chi-square" is the name of a fundamental family of probability distributions. A typical example of a random variable following a Chi-square distribution is n times the variance of an n-point sample drawn from a standard normal distribution N(0, 1), the variance being computed with divisor n; this variable has n - 1 degrees of freedom. Many variables encountered in Data Modeling follow some Chi-square distribution.
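The example above can be checked by simulation. The sketch below is only an illustration: the function names, the sample size n = 10, the number of replications and the seed are arbitrary choices of ours. It relies on the fact that a Chi-square variable with k degrees of freedom has mean k and variance 2k.

```python
import random

def n_times_variance(sample):
    """n times the divisor-n variance, i.e. the sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)

def simulate(n=10, replications=20000, seed=0):
    """Draw many n-point samples from N(0, 1) and return n times the variance of each."""
    rng = random.Random(seed)
    return [
        n_times_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(replications)
    ]

values = simulate()
mean_value = sum(values) / len(values)
var_value = sum((v - mean_value) ** 2 for v in values) / len(values)

# With n = 10, theory predicts a mean near 9 and a variance near 18.
print(mean_value, var_value)
```

The empirical mean and variance should land close to n - 1 and 2(n - 1), which is what a Chi-square distribution with n - 1 degrees of freedom predicts.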

 

For more on the Chi-square distribution, please see here.

Chi-square tests

Some statistics computed from categorical variables are approximately Chi-square distributed for large samples. As a consequence, it is possible to devise some important tests whose generic name is "Chi-square tests". The most important ones are:

The basic Chi-square test  ("Goodness-of-fit test")

This test asks whether a given probability distribution is a plausible candidate to explain the distribution of the observations in a given sample. It is therefore a "goodness-of-fit" test. It has the same goal as another important goodness-of-fit test, the Kolmogorov test.

This test is described in the Tutorial below.
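As a minimal sketch of the idea, consider 120 throws of a die and the candidate distribution "fair die". The counts below are invented for illustration, and the function name is ours. The statistic compares observed counts with the counts expected under the candidate distribution; the 5% critical value for 5 degrees of freedom (about 11.07) is taken from standard Chi-square tables.

```python
def chi_square_statistic(observed, probabilities):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts of faces 1..6 in 120 throws.
observed = [22, 17, 20, 26, 22, 13]
fair_die = [1 / 6] * 6

statistic = chi_square_statistic(observed, fair_die)

# 6 categories and no estimated parameter: 5 degrees of freedom.
CRITICAL_5_PERCENT_DF5 = 11.070
reject = statistic > CRITICAL_5_PERCENT_DF5
print(statistic, reject)
```

Here the statistic is well below the critical value, so the fair-die hypothesis is not rejected at the 5% level.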

The identity Chi-square test

This test asks whether two samples were drawn from identical distributions (without specifying these distributions). It is therefore an identity test.

It has the same goal as two other popular identity tests, the Mann-Whitney test and the Kolmogorov-Smirnov test.

This test is described in the Tutorial below.
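A minimal sketch of this test for two samples observed over the same categories follows. Since the common distribution is not specified, its probabilities are estimated by pooling the two samples; this pooling, the function name and the counts are assumptions of the sketch, not a transcription of the Tutorial. Every category is assumed to be observed at least once.

```python
import math

def identity_chi_square(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two samples over the same categories."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    statistic = 0.0
    for a, b in zip(counts_a, counts_b):
        # Common category probability estimated from both samples together.
        pooled = (a + b) / (n_a + n_b)
        for observed, size in ((a, n_a), (b, n_b)):
            expected = size * pooled
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(counts_a) - 1

# Hypothetical counts of three categories in two samples of 100 observations.
statistic, df = identity_chi_square([30, 50, 20], [20, 50, 30])

# For 2 degrees of freedom the upper tail probability has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(statistic, df, p_value)
```

With these counts the p-value is about 0.135, so the hypothesis of identical distributions is not rejected at the 5% level.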

The Chi-square test of independence

This test asks whether two categorical variables are independent or not.

This test is described in the Tutorial below.

CHAID Decision Trees

The acronym "CHAID" stands for "CHi-square Automatic Interaction Detection".

 

Decision Trees are predictive models that must repeatedly decide which independent categorical variable Vi is most closely coupled to the dependent variable Y for a certain subset of the sample. There are several ways to measure the strength of this coupling. When the dependent variable is categorical (classification problems), one of them is to conduct a Chi-square test of independence on all pairs (Vi, Y), and to select the variable Vj with the lowest p-value resulting from the test.

Decision Trees relying on this kind of choice are called "CHAID" Trees.
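The selection step described above can be sketched as follows. This is only the variable-selection idea, on an invented toy data set; full CHAID implementations also merge categories and adjust p-values, which is not shown here. The helper names are ours, and the p-value is computed with the closed-form tail probability of a Chi-square variable with an integer number of degrees of freedom.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper tail probability of a Chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # tail for df = 1
    while a < df / 2.0:
        # Each step raises the degrees of freedom by 2.
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def independence_p_value(xs, ys):
    """p-value of the Chi-square test of independence between two categorical lists."""
    n = len(xs)
    cells, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    statistic = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            statistic += (cells[(r, c)] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2_sf(statistic, df)

# Toy data: predictor "A" determines y, predictor "B" is unrelated to it.
y = [1, 1, 1, 1, 0, 0, 0, 0]
predictors = {
    "A": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "B": ["x", "y", "x", "y", "x", "y", "x", "y"],
}

p_values = {name: independence_p_value(v, y) for name, v in predictors.items()}
best = min(p_values, key=p_values.get)
print(p_values, best)
```

The predictor with the lowest p-value ("A" in this toy example) is the one that would be used to split the node.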

Chi-square distance

Many types of models take into consideration the "distance" between points in a certain space (e.g., Nearest Neighbors, Hierarchical Clustering, Principal Components Analysis ...). By default, one usually uses the ordinary Euclidean distance to measure the distance between points. Yet, there are circumstances in which other definitions of the "distance" are more appropriate. For example, the "Mahalanobis distance" has a quite natural and useful interpretation in Discriminant Analysis.

    * Correspondence Analysis constructs a space in which the natural distance between points is not Euclidean, but rather the so-called "Chi-square distance". The name comes from the fact that the mathematical expression defining this distance is identical to that encountered in the construction of the "Goodness-of-fit Chi-square test".

    * One may also define the "distance" between two multinomial distributions with a Chi-square type of mathematical expression. Two such distributions with a "0" Chi-square distance are identical, and as they grow more and more different from each other, their Chi-square distance also grows.
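As an illustration, here is a minimal sketch of the Chi-square distance between two rows of a contingency table, using the definition commonly given in Correspondence Analysis: the rows are first converted to profiles (row proportions), and each squared difference is divided by the mass of the corresponding column. This glossary entry does not state the formula, so this definition, the function name and the table are assumptions of the sketch.

```python
def chi_square_distance_squared(table, i, k):
    """Squared Chi-square distance between the profiles of rows i and k of a contingency table."""
    n = sum(sum(row) for row in table)
    column_masses = [sum(col) / n for col in zip(*table)]
    total_i, total_k = sum(table[i]), sum(table[k])
    return sum(
        (table[i][j] / total_i - table[k][j] / total_k) ** 2 / mass
        for j, mass in enumerate(column_masses)
    )

# Hypothetical table: rows 0 and 1 have the same profile, row 2 differs.
table = [[10, 20], [20, 40], [30, 10]]

d2_same = chi_square_distance_squared(table, 0, 1)
d2_diff = chi_square_distance_squared(table, 0, 2)
print(d2_same, d2_diff)
```

Rows with proportional counts are at distance 0 whatever their totals, and the distance grows as the profiles diverge.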

_______________________________________________________________________________________

THE CHI-SQUARE DISTRIBUTION

PLEASE SEE HERE

_________________________________

 

The following Tutorials describe three fundamental tests where the test statistic is approximately distributed as a Chi-square variable, hence the names "Chi-square tests".

In the three tests, the statistic is essentially the same; it is simply applied to three different settings.
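In the usual notation, this common statistic compares observed counts with the counts expected under the hypothesis being tested. The symbols O_i and E_i below are our notation, and we assume that the Z² of the Tutorials is this classical statistic:

```latex
Z^2 \;=\; \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```

Here O_i is the observed count in cell i and E_i the expected count in the same cell. What changes from one test to the next is how the expected counts are obtained and how many degrees of freedom the approximate Chi-square distribution has.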

The proof that the test statistic is approximately Chi-square distributed is quite difficult, and this result is stated without proof.

_______________________________________________

 

Tutorial 1

 

A common task in Statistics is testing a hypothesis about the nature of the probability distribution that generated the sample at hand. This hypothesis is usually formulated not from statistical considerations, but rather from domain expertise. For example, one may wonder how likely it is that the sample was generated by a given candidate normal distribution, whose mean and variance were predicted by some theory in physics.

In other words, the question is to assess the quality of the fit between the candidate distribution and the sample (hence the expression "goodness-of-fit").
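For a continuous candidate such as a normal distribution, the sample can be grouped into blocks so that the problem becomes one of comparing counts. The sketch below is one possible way of doing this, not the Tutorial's procedure: it uses blocks of equal probability under the candidate, and the number of blocks, the sample size and the seed are arbitrary choices of ours.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

def binned_chi_square(sample, candidate, blocks=10):
    """Chi-square statistic of a sample against a fully specified normal candidate."""
    # Block boundaries are the candidate's quantiles, so each block has probability 1 / blocks.
    edges = [candidate.inv_cdf(k / blocks) for k in range(1, blocks)]
    observed = [0] * blocks
    for x in sample:
        observed[bisect_right(edges, x)] += 1
    expected = len(sample) / blocks
    return sum((o - expected) ** 2 / expected for o in observed)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Candidate equal to the generating distribution, then a candidate with a wrong mean.
stat_true = binned_chi_square(sample, NormalDist(0.0, 1.0))
stat_wrong = binned_chi_square(sample, NormalDist(2.0, 1.0))
print(stat_true, stat_wrong)
```

With 10 blocks and no parameter estimated from the sample, the statistic is compared with a Chi-square distribution with 9 degrees of freedom. The wrong candidate produces a far larger statistic than the correct one.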

-----

One of the most important goodness-of-fit tests is the Chi-square test, which we now describe.

 

 

THE GOODNESS-OF-FIT CHI-SQUARE TEST

What are we testing ?

An example

General formulation

The binomial case

The binomial distribution

Approximate Chi-square

A step towards the general case

The general multinomial case

Each of the modalities follows a binomial distribution

Generalization of the binomial case

The test for the multinomial case

Influence of sample size

Unknown parameters in the reference distribution

An academic example

Estimating the parameters

Degrees of freedom

More realistic examples

Testing a continuous distribution

Adequation of a distribution to a sample. Likelihood.

Blocks and multinomial distribution

How many blocks ?

Estimating parameters

Influence of sample size

TUTORIAL

_______________________________________________________

 

Tutorial 2

 

The problem is now to decide whether two independent samples were drawn from identical distributions (without specifying the nature of these distributions). This assumption is considered likely if the two samples :

-----

We now describe this identity Chi-square test.

 

THE CHI-SQUARE TEST OF IDENTITY

The problem

The identity Chi-square test

Adding the Z  statistics

Estimating the probabilities

Generalization to p variables

Adding the Z  statistics

Estimating the probabilities

TUTORIAL

______________________________________

 

 

Tutorial 3

 

This third test asks whether two categorical variables are independent. If they are, the contingency table describing the sample should not depart appreciably from a certain canonical structure. The departure of the contingency table from this canonical structure is measured by a statistic that is approximately Chi-square distributed for large samples.
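A minimal sketch of this computation follows. We assume that the "canonical structure" is the table of expected counts under independence (row total times column total, divided by the sample size), and that Phi-square denotes the statistic divided by the sample size. Both are the usual definitions, but they are assumptions here, as are the function name and the 2x2 table.

```python
def independence_statistic(table):
    """Return (statistic, degrees of freedom, phi-square) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count of cell (i, j) if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, statistic / n

# Hypothetical 2x2 table of counts for two binary variables.
table = [[40, 10], [20, 30]]

statistic, df, phi_square = independence_statistic(table)
print(statistic, df, phi_square)
```

For this table the statistic is about 16.7 with 1 degree of freedom, well above the 5% critical value of about 3.84, so independence would be rejected.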

 

 

THE CHI-SQUARE TEST OF INDEPENDENCE

The problem

The concept of independence

Contingency tables

Expected values

The test

The H0 hypothesis

The general idea

Estimating the probabilities

The Z² statistic

Phi-square

Number of estimated parameters

The distribution of Z²

Largest value

An upper bound for Z²

When is the upper bound reached ?

Alternate "coefficients"

Contributions to Z²

The special case of 2x2 tables

TUTORIAL

_________________________________________________________________
