Chi-square
Pronounced "Ky-square".
This term is ubiquitous in Statistics. It covers several related but distinct concepts. We briefly describe here:
1) The Chi-square probability distribution.
2) The Chi-square tests.
3) Chi-square based Decision Trees.
4) The Chi-square distance.
_______________________________
First, "Chi-square" is the name of a fundamental family of probability distributions. A typical example of a random variable following a Chi-square distribution is n times the variance of an n-point sample drawn from a standard normal distribution N(0, 1). With the variance computed using the 1/n convention, this variable follows the Chi-square distribution with n - 1 degrees of freedom. Many variables encountered in Data Modeling follow some Chi-square distribution.
For more on the Chi-square distribution, please see here.
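The example above can be checked by simulation. The sketch below is only an illustration: the sample size n = 10, the number of repetitions and the seed are arbitrary choices made here, and the function name is hypothetical. It relies on the fact that a Chi-square variable with k degrees of freedom has mean k and variance 2k.

```python
import random

def n_times_variance(n, rng):
    """Draw n points from N(0, 1) and return n times their variance (1/n convention)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / n
    return n * variance

rng = random.Random(0)          # fixed seed so that the run is reproducible
n = 10
draws = [n_times_variance(n, rng) for _ in range(20000)]

# With n = 10 the variable should behave like a Chi-square with 9 degrees
# of freedom, so the two numbers below should be close to 9 and 18.
simulated_mean = sum(draws) / len(draws)
simulated_variance = sum((d - simulated_mean) ** 2 for d in draws) / len(draws)
print(simulated_mean, simulated_variance)
```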
Some quantities related to categorical variables are approximately Chi-square distributed for large samples. As a consequence, it is possible to devise some important tests whose generic name is "Chi-square tests". The most important ones are:
* The goodness-of-fit Chi-square test is concerned with whether a given probability distribution is a plausible candidate to explain the distribution of the observations in a given sample. It has the same goal as another important goodness-of-fit test, the Kolmogorov test. This test is described in Tutorial 1 below.
* The Chi-square test of identity is concerned with whether two samples were drawn from identical distributions (without specifying these distributions). It has the same goal as two other popular identity tests, the Mann-Whitney test and the Kolmogorov-Smirnov test. This test is described in Tutorial 2 below.
* The Chi-square test of independence is concerned with whether two categorical variables are independent or not. This test is described in Tutorial 3 below.
The acronym "CHAID" stands for "CHi-square Automatic Interaction Detection".
Decision Trees are predictive models that must repeatedly decide which independent categorical variable Vi is most strongly associated with the dependent variable Y on a given subset of the sample. There are several ways to measure the strength of this association. When the dependent variable is categorical (classification problems), one of them is to conduct an independence Chi-square test on every pair (Vi, Y), and to select the variable Vj with the lowest p-value.
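The selection step just described can be sketched as follows. This is a minimal illustration, not a CHAID implementation: it covers only the choice of the variable with the lowest p-value and leaves out everything else a full CHAID procedure does. The function names and the toy data are hypothetical, and the tail probability is computed with a small helper written for this sketch that is valid for integer degrees of freedom.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper tail probability P(X > x) of a Chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    # Start from the closed form for df = 2 (even case) or df = 1 (odd case) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    # ... then add two degrees of freedom at a time with the recurrence
    # Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
        a += 1.0
    return q

def independence_p_value(xs, ys):
    """p-value of the independence Chi-square test between two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n      # count expected under independence
            stat += (joint[(r, c)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return chi2_sf(stat, df)

def best_split_variable(candidates, y):
    """Return the name of the candidate with the lowest p-value, plus all p-values."""
    p_values = {name: independence_p_value(col, y) for name, col in candidates.items()}
    return min(p_values, key=p_values.get), p_values

# Hypothetical toy data: 'weather' is strongly associated with y, 'weekday' is not.
y       = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"] * 5
weather = ["sun", "sun", "sun", "rain", "rain", "rain", "rain", "sun"] * 5
weekday = ["mon", "tue", "mon", "tue", "mon", "tue", "mon", "tue"] * 5
best, p_values = best_split_variable({"weather": weather, "weekday": weekday}, y)
print(best, p_values)
```

Comparing p-values rather than raw statistics matters here because candidate variables with different numbers of categories lead to different degrees of freedom.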
Many types of models take into consideration the "distance" between points in a certain space (e.g., Nearest Neighbors, Hierarchical Clustering, Principal Components Analysis ...). By default, one will usually use the ordinary Euclidean distance to measure the distance between points. Yet, there are circumstances in which other definitions of the "distance" are more appropriate. For example, the "Mahalanobis distance" has a quite natural and useful interpretation in Discriminant Analysis.
* Correspondence Analysis constructs a space in which the natural distance between points is not Euclidean, but rather the so-called "Chi-square distance". The name comes from the fact that the mathematical expression defining this distance is identical to that encountered in the elaboration of the "Goodness-of-fit Chi-square test".
* One may also define the "distance" between two multinomial distributions with a Chi-square type of mathematical expression. Two such distributions with a "0" Chi-square distance are identical, and as they grow more and more different from each other, their Chi-square distance also grows.
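The second point can be illustrated with a short sketch. The exact expression is not given above, so the form used here is an assumption: the sum over categories of (p_i - q_i)² / q_i, with q playing the role of a reference distribution, as in the goodness-of-fit statistic. Other Chi-square type expressions exist (some are symmetric in p and q). The function name is hypothetical.

```python
def chi_square_distance(p, q):
    """Chi-square type distance of distribution p from reference distribution q.

    Assumed form: sum over categories of (p_i - q_i)**2 / q_i.
    """
    if len(p) != len(q):
        raise ValueError("p and q must list the same categories")
    if any(qi <= 0 for qi in q):
        raise ValueError("every reference probability must be positive")
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

reference = [0.5, 0.3, 0.2]
d_same = chi_square_distance([0.5, 0.3, 0.2], reference)   # identical distributions
d_near = chi_square_distance([0.4, 0.3, 0.3], reference)   # slightly different
d_far = chi_square_distance([0.1, 0.2, 0.7], reference)    # very different
print(d_same, d_near, d_far)
```

The three values increase from the first to the last, which is the behaviour described above: zero for identical distributions, and larger as the distributions move apart.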
_______________________________________________________________________________________
THE CHI-SQUARE DISTRIBUTION
_________________________________
The following Tutorials describe three fundamental tests in which the test statistic is approximately distributed as a Chi-square variable, hence the name "Chi-square tests".
The statistic is essentially the same in all three tests; it is simply applied to three different settings.
Proving that the test statistic is approximately Chi-square distributed is quite difficult, so this result is stated without proof.
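For reference, the shared statistic is usually written as a sum over the cells of a table of counts. The notation below is an assumption made for this note: O_i is the observed count in cell i, E_i the count expected under the hypothesis being tested, k the number of cells, and Z² is the symbol that appears in the tutorial outlines.

```latex
Z^2 \;=\; \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Only the way the expected counts E_i are obtained changes from one test to the next.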
_______________________________________________
Tutorial 1
A common endeavor in Statistics is testing a hypothesis about the nature of the probability distribution that generated the sample at hand. This hypothesis is usually formulated not from statistical considerations, but rather from expertise. For example, one may wonder how likely it is that the sample was generated by a given candidate normal distribution, whose mean and variance were calculated by some theory in physics.
In other words, the question is to assess the quality of the fit between the candidate distribution and the sample (hence the expression "goodness-of-fit").
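To make the question concrete, here is a minimal sketch of a Chi-square goodness-of-fit check against a fully specified normal candidate. Everything specific in it is an assumption chosen for illustration: the candidate N(10, 2²), the use of 5 equal-probability blocks, the simulated samples and the function names. Because no parameter is estimated from the data, the statistic is compared with the 5% critical value of the Chi-square distribution with 5 - 1 = 4 degrees of freedom (9.488).

```python
import random
from statistics import NormalDist

def block_counts(sample, inner_edges):
    """Count the observations falling in each block delimited by sorted inner edges."""
    counts = [0] * (len(inner_edges) + 1)
    for x in sample:
        # The block index is the number of edges lying below x.
        counts[sum(x > edge for edge in inner_edges)] += 1
    return counts

def goodness_of_fit_statistic(observed, expected):
    """Sum over blocks of (observed - expected)**2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical candidate distribution, fully specified in advance.
candidate = NormalDist(mu=10.0, sigma=2.0)
k = 5                                                       # number of blocks
inner_edges = [candidate.inv_cdf(i / k) for i in range(1, k)]

rng = random.Random(1)                                      # fixed seed
good_sample = [rng.gauss(10.0, 2.0) for _ in range(500)]    # drawn from the candidate
bad_sample = [x + 1.0 for x in good_sample]                 # same points shifted by +1

expected = [len(good_sample) / k] * k                       # 100 points per block
CRITICAL_5_PERCENT_DF4 = 9.488

stat_good = goodness_of_fit_statistic(block_counts(good_sample, inner_edges), expected)
stat_bad = goodness_of_fit_statistic(block_counts(bad_sample, inner_edges), expected)
print("good fit:", stat_good, "rejected:", stat_good > CRITICAL_5_PERCENT_DF4)
print("bad fit: ", stat_bad, "rejected:", stat_bad > CRITICAL_5_PERCENT_DF4)
```

A sample really drawn from the candidate should usually stay below the critical value, while the shifted sample should exceed it by a wide margin.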
-----
One of the most important goodness-of-fit tests is the Chi-square test, which we now describe.
THE GOODNESS-OF-FIT CHI-SQUARE TEST
Contents: What are we testing?; An example; General formulation; The binomial case; The binomial distribution; Approximate Chi-square; A step towards the general case; The general multinomial case; Each of the modalities follows a binomial distribution; Generalization of the binomial case; The test for the multinomial case; Influence of sample size; Unknown parameters in the reference distribution; An academic example; Estimating the parameters; Degrees of freedom; More realistic examples; Testing a continuous distribution; Adequation of a distribution to a sample. Likelihood.; Blocks and multinomial distribution; How many blocks?; Estimating parameters; Influence of sample size
_______________________________________________________
Tutorial 2
The problem is now to decide whether two independent samples were drawn from identical distributions (without specifying the nature of these distributions). This assumption is considered likely if the two samples :
-----
We now describe this identity Chi-square test.
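As a rough preview of the computation, the sketch below applies the statistic to two samples summarized as counts over the same categories. It assumes that the common category probabilities are estimated by pooling the two samples, and that every category is observed at least once. The counts and the function name are hypothetical. With 3 categories and 2 samples the statistic has (3 - 1) x (2 - 1) = 2 degrees of freedom, whose 5% critical value is 5.991.

```python
def identity_statistic(counts_a, counts_b):
    """Chi-square statistic comparing two samples given as per-category counts."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        pooled = (a + b) / n        # common probability estimated from both samples
        for observed, size in ((a, n_a), (b, n_b)):
            expected = size * pooled
            stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_5_PERCENT_DF2 = 5.991

# Hypothetical counts over three categories.
sample_1 = [30, 50, 20]       # 100 observations
sample_2 = [60, 90, 50]       # 200 observations, similar proportions
sample_3 = [120, 60, 20]      # 200 observations, clearly different proportions

stat_similar = identity_statistic(sample_1, sample_2)
stat_different = identity_statistic(sample_1, sample_3)
print(stat_similar, stat_similar > CRITICAL_5_PERCENT_DF2)
print(stat_different, stat_different > CRITICAL_5_PERCENT_DF2)
```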
THE CHI-SQUARE TEST OF IDENTITY
Contents: The problem; The identity Chi-square test; Adding the Z statistics; Estimating the probabilities; Generalization to p variables; Adding the Z statistics; Estimating the probabilities
______________________________________
Tutorial 3
This third test is concerned with whether two categorical variables are independent. If they are, the contingency table describing the sample should not depart appreciably from a certain canonical structure. The departure of the contingency table from this canonical structure is measured by a statistic that is approximately Chi-square distributed for large samples.
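The departure from the canonical structure can be computed directly from a contingency table. In the sketch below, the canonical value of a cell is taken to be (row total x column total) / n, which is the count expected under independence. The tables and the function name are hypothetical. For a 2x2 table the statistic has 1 degree of freedom, whose 5% critical value is 3.841.

```python
def independence_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # canonical cell value
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

CRITICAL_5_PERCENT_DF1 = 3.841

# Hypothetical 2x2 tables.
canonical_table = [[10, 20], [20, 40]]    # rows are exactly proportional
departing_table = [[20, 30], [30, 20]]    # rows have opposite profiles

stat_0, df_0 = independence_statistic(canonical_table)
stat_1, df_1 = independence_statistic(departing_table)
print(stat_0, df_0, stat_0 > CRITICAL_5_PERCENT_DF1)
print(stat_1, df_1, stat_1 > CRITICAL_5_PERCENT_DF1)
```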
THE CHI-SQUARE TEST OF INDEPENDENCE
Contents: The problem; The concept of independence; Contingency tables; Expected values; The test; The H0 hypothesis; The general idea; Estimating the probabilities; The Z² statistic; Phi-square; Number of estimated parameters; The distribution of Z²; Largest value; An upper bound for Z²; When is the upper bound reached?; Alternate "coefficients"; Contributions to Z²; The special case of 2x2 tables