Goodness of fit tests

This illustration represents :

    * A fully determined probability distribution p0(x),

    * And a sample x = {x1, x2, ..., xn} drawn from some unknown distribution.

 

 

At first sight, it seems unlikely that the sample was drawn from p0(x): there are too few observations in regions of high density (where observations have a high probability of appearing), and too many in regions of low density. One says that the fit between the distribution and the sample is poor.

But the lower image of this same illustration proposes another probability distribution p1(x) with a much better fit to x.

How can this intuitive judgment be made quantitative?

 

This question is very important in practice, and is known as the goodness-of-fit problem. It is at the heart of Statistics, whose ultimate (and inaccessible) goal is to deduce from a mere finite sample the probability distribution that generated it.

There exist two main approaches to this problem.

    1) The first one is to choose within a given family of probability distributions the distribution that fits the sample best according to a certain criterion. This is the job of estimation techniques :

        * Either parametric estimation, like Maximum Likelihood Estimation,

        * Or nonparametric estimation, like most probability density estimation techniques.

    2) The second approach consists in considering a single candidate distribution p(x) and asking how reasonable it is to assume that p(x) generated x. This is the domain of goodness-of-fit tests.


We'll see further down that the distinction between these two approaches is not that clear-cut: goodness-of-fit tests may be applied to distributions defined only up to the values of some parameters, which then need to be estimated, generally by Maximum Likelihood Estimation (MLE).

Goodness-of-fit tests

A goodness-of-fit test tests :

    * H0 : the probability distribution that generated the sample is the given p(x).

against

    * H1 :  the probability distribution that generated the sample is not p(x).

 

According to the general idea underlying the concept of test, one must :

    * First invent a statistic (a function of the sample) whose value intuitively provides a fair indication as to whether H0 is true or not.

    * Then calculate the probability distribution of this test statistic under H0.
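
The two steps above can be sketched in Python for a deliberately simple case. This is only an illustration under stated assumptions: the helper names (monte_carlo_p_value, abs_mean, draw_std_normal) and the choice of statistic are hypothetical, and the distribution of the statistic under H0 is approximated by simulation rather than calculated.

```python
import random


def monte_carlo_p_value(sample, statistic, draw_under_h0, n_sim=2000, seed=0):
    """Estimate P(T >= t_obs | H0) by simulating samples from the H0 distribution."""
    rng = random.Random(seed)
    n = len(sample)
    t_obs = statistic(sample)
    exceed = 0
    for _ in range(n_sim):
        simulated = [draw_under_h0(rng) for _ in range(n)]
        if statistic(simulated) >= t_obs:
            exceed += 1
    # The add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)


def abs_mean(xs):
    """Illustrative statistic: large values cast doubt on a centered H0."""
    return abs(sum(xs) / len(xs))


def draw_std_normal(rng):
    """One draw from the H0 distribution, here assumed to be N(0, 1)."""
    return rng.gauss(0.0, 1.0)


centered = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, -0.7, 0.5]
shifted = [x + 2.0 for x in centered]

p_centered = monte_carlo_p_value(centered, abs_mean, draw_std_normal)
p_shifted = monte_carlo_p_value(shifted, abs_mean, draw_std_normal)
print(p_centered, p_shifted)
```

A small p-value (as for the shifted sample) says that the observed value of the statistic would be very unusual if H0 were true.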

 

The literature describes dozens of goodness-of-fit tests, but only a handful are used in practice. Some of the most important ones are described below.

Chi-square test

The Chi-square test is certainly the best-known goodness-of-fit test. It is intrinsically a test on discrete distributions, but a continuous distribution may be discretized (binned), so the Chi-square test can be used on both discrete and continuous distributions.

Owing to its practical importance, a special entry is dedicated to the Chi-square test.
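
As a rough sketch of the discretization idea, the following Python fragment bins a sample into cells that are equiprobable under the candidate distribution and computes Pearson's statistic, the sum of (observed - expected)² / expected. The function names and the choice of equiprobable cells are assumptions made for illustration; the dedicated entry should be consulted for the test itself.

```python
def chi_square_statistic(sample, cdf, n_cells):
    """Pearson statistic over n_cells cells that are equiprobable under cdf."""
    n = len(sample)
    observed = [0] * n_cells
    for x in sample:
        # cdf(x) lies in [0, 1]; cell j covers [j / n_cells, (j + 1) / n_cells).
        j = min(int(cdf(x) * n_cells), n_cells - 1)
        observed[j] += 1
    expected = n / n_cells  # same expected count in every cell under H0
    return sum((o - expected) ** 2 / expected for o in observed)


def uniform_cdf(x):
    """cdf of U[0, 1], used here as the candidate distribution."""
    return min(max(x, 0.0), 1.0)


balanced = chi_square_statistic([0.1, 0.3, 0.6, 0.9], uniform_cdf, 4)
clumped = chi_square_statistic([0.1, 0.1, 0.1, 0.1], uniform_cdf, 4)
print(balanced, clumped)
```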

Tests based on the Empirical Distribution Function (EDF-based tests)

An important class of goodness-of-fit tests relies on the following idea: unless we're very unlucky, observations appear preferentially in regions where the pdf (for continuous distributions) takes large values, and are few and far between in regions where the pdf takes small (positive) values. This, in turn, tells us that the Empirical Distribution Function (EDF) Fn(x) should, most of the time, be a good approximation of the true cumulative distribution function (cdf) F(x). This intuition is in fact a theorem (the Glivenko-Cantelli theorem), sometimes referred to as the Fundamental Theorem of Statistics.
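
The EDF is simple to compute: Fn(x) is the proportion of observations less than or equal to x. The sketch below (helper names are illustrative) builds the EDF of a U[0, 1] sample and measures its largest gap to the true cdf on a grid of points; the gap is expected to shrink as the sample grows.

```python
import bisect
import random


def make_edf(sample):
    """Return Fn, the empirical distribution function of the sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def edf(x):
        # Proportion of observations less than or equal to x.
        return bisect.bisect_right(ordered, x) / n

    return edf


def max_gap_on_grid(edf, cdf, grid):
    """Largest |Fn(x) - F(x)| over the grid (a lower bound on the supremum)."""
    return max(abs(edf(x) - cdf(x)) for x in grid)


rng = random.Random(1)
grid = [i / 200 for i in range(201)]
gaps = {}
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    gaps[n] = max_gap_on_grid(make_edf(sample), lambda x: x, grid)
print(gaps)
```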

 

 

EDF-based goodness-of-fit tests test :

    * H0 :  F = F0

against

    * H1 :  F ≠ F0

where F denotes the cdf of the distribution that generated the sample, and F0 denotes the cdf of the reference distribution.

So all we have to do is identify a statistic which is a reasonable measure of the departure of the EDF from the reference cdf, and calculate its distribution under H0. There exist many such statistics.

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test is the best known EDF-based test.

 

               The Kolmogorov statistic

The test statistic Dn is the largest absolute value of the difference between the Empirical Distribution Function and the cdf of the candidate distribution :

 

Dn = sup|Fn(x) - F(x)|

 

 

     * A low value of Dn, as in the above illustration, means that the EDF clings tightly to the reference cdf and is therefore a good approximation of this cdf, an argument in favor of H0.

    * The lower image of this illustration depicts a situation with a large value of Dn, certainly an argument against the null hypothesis F = F0.
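
Because the EDF is a step function that jumps at the observations, the supremum defining Dn is reached at (or just before) one of the sorted observations. A minimal Python sketch, assuming a continuous reference cdf and using an illustrative function name:

```python
def kolmogorov_statistic(sample, cdf):
    """Dn = sup over x of |Fn(x) - F(x)| for a continuous reference cdf F."""
    # A cdf is non-decreasing, so sorting cdf values sorts the observations.
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    d_plus = max((i + 1) / n - u[i] for i in range(n))  # EDF above the cdf
    d_minus = max(u[i] - i / n for i in range(n))       # EDF below the cdf
    return max(d_plus, d_minus)


# Three observations tested against the U[0, 1] distribution (cdf F(x) = x).
d_n = kolmogorov_statistic([0.9, 0.1, 0.5], lambda x: x)
print(d_n)
```

For this toy sample the largest gap is 1/3 - 0.1 = 7/30, reached at the smallest observation.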

               Distribution of the Kolmogorov statistic

The finite-sample distribution of Dn under H0 has no simple closed form, but it is the same whatever the (continuous) F(x), and it has been extensively simulated and tabulated for small values of n.

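The asymptotic distribution function (large samples) of Dn is known and, of course, does not depend on F(x). More precisely, for any x > 0:

$$ \lim_{n \to \infty} P\left( \sqrt{n}\, D_n \le x \right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2} $$
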

This result is extremely difficult to establish, and is not demonstrated in this Glossary.
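
Although the proof is hard, the limiting law is easy to evaluate numerically. The sketch below sums the classical series for the Kolmogorov distribution, which is the limit of P(√n Dn ≤ x); the function names, the tolerance and the cap on the number of terms are illustrative choices, and the resulting p-value is only an approximation for finite n.

```python
import math


def kolmogorov_cdf(x, tol=1e-16, max_terms=100_000):
    """K(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2), for x > 0."""
    if x <= 0.0:
        return 0.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * x * x)
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    # Clamp to [0, 1] to absorb rounding errors for very small x.
    return min(1.0, max(0.0, 1.0 - 2.0 * total))


def asymptotic_p_value(d_n, n):
    """Large-sample approximation of P(Dn >= d_n | H0)."""
    return 1.0 - kolmogorov_cdf(math.sqrt(n) * d_n)


print(kolmogorov_cdf(1.358))            # close to 0.95
print(asymptotic_p_value(0.1358, 100))  # close to 0.05
```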

Cramér-von Mises test

The surprising thing about the Kolmogorov-Smirnov test is that a reasonably powerful test can be built by considering a single observation in the sample, and a single point of the candidate distribution function.

It would seem more reasonable to quantify the difference between two cdfs by comparing these two functions over their entire range (i.e. from -∞ to +∞).

A whole family of tests can be defined by integrating the squared difference (Fn(x) - F(x))² between the empirical distribution function and the reference distribution function (L2 norm). The simplest of these statistics is

$$ \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 dx $$

which accumulates the squared gap sandwiched between the EDF and the reference cdf.

 

 

This statistic is not used because it is computationally inconvenient.

It is therefore customary to introduce a (largely arbitrary) weighting factor dK(x):

$$ \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 dK(x) $$

which may render the calculation of the statistic tractable.

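The most classical choice for K(x) is F(x) itself. The resulting statistic W² is called the Cramér-von Mises statistic, which is defined (with the customary factor n) as:

$$ W^2 = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 dF(x) $$
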

The main advantage of this choice for K(x) is that this inconvenient integral can now be calculated as a sum involving only the values of F(x) for the observations.

Denote x(i) the ith order statistic of the sample. We'll show that:

$$ W^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left( \frac{2i-1}{2n} - F(x_{(i)}) \right)^2 $$


The small-sample distribution of W² has no simple closed form, but it is the same for all continuous F(x) and has been thoroughly simulated and tabulated for small values of n.

The asymptotic distribution function of W² is known, but it is very complex and extremely difficult to establish.
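
As a numerical sanity check, the sketch below computes W² in two ways, assuming the usual normalisation W² = n ∫ (Fn - F)² dF: once with the sum over order statistics, 1/(12n) + Σ [(2i-1)/(2n) - F(x(i))]², and once by integrating exactly between consecutive observations, where the EDF is constant. Function names are illustrative.

```python
def cramer_von_mises_sum(sample, cdf):
    """W^2 = 1/(12n) + sum_i ((2i - 1)/(2n) - F(x_(i)))^2."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return 1.0 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - u[i - 1]) ** 2 for i in range(1, n + 1)
    )


def cramer_von_mises_integral(sample, cdf):
    """n * integral of (Fn - F)^2 dF, integrated exactly piece by piece."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    knots = [0.0] + u + [1.0]  # after the change of variable u = F(x)
    total = 0.0
    for i in range(n + 1):
        level = i / n  # value of the EDF between knots[i] and knots[i + 1]
        # Integral of (level - u)^2 du between the two knots.
        total += ((level - knots[i]) ** 3 - (level - knots[i + 1]) ** 3) / 3.0
    return n * total


sample = [0.9, 0.1, 0.5]  # tested against U[0, 1], whose cdf is F(x) = x
w2_sum = cramer_von_mises_sum(sample, lambda x: x)
w2_int = cramer_von_mises_integral(sample, lambda x: x)
print(w2_sum, w2_int)
```

Both computations give 11/300 for this toy sample, which illustrates why the sum is preferred in practice: it needs only the n values F(x(i)).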

Anderson-Darling test

A problem with the Cramér-von Mises statistic is that the difference between the empirical distribution function and the reference cdf tends to 0 when x → -∞ or x → +∞. Consequently, the value of W² is rather insensitive to the precise positions of the remote observations in the tails of the distribution.

This is unfortunate, for the analyst is often most interested in whether the distribution of rare events (represented by these remote observations) conforms to a preconceived idea about the nature of the distribution (think of the difference between a standard normal distribution and Student's t distribution).

A modification of the basic L2 statistic then consists in granting more importance to these remote observations by introducing a weight function into the definition of the statistic, so as to make them more influential in the outcome of the test. The most widely used weight function is

$$ \left[ F(x)\,\left( 1 - F(x) \right) \right]^{-1} $$

which is minimal at the median of the distribution, and tends to infinity when x tends to either -∞ or +∞. In fact, it is easily shown that F(x)(1 - F(x))/n is the variance of the EDF at x, so that the test statistic is now the integral of the squared standardized difference between the EDF and the reference cdf.
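
The variance of the EDF at a fixed point x follows in one line, assuming only that the n observations are independent draws from F: the number of observations not exceeding x, n Fn(x), is then a binomial count.

```latex
n F_n(x) \sim \mathrm{Binomial}\big(n, F(x)\big)
\quad\Longrightarrow\quad
\mathrm{E}\big[F_n(x)\big] = F(x), \qquad
\mathrm{Var}\big[F_n(x)\big] = \frac{F(x)\,\big(1 - F(x)\big)}{n}
```

Dividing the squared difference between the EDF and the cdf by F(x)(1 - F(x)) therefore standardizes it, up to the constant factor n.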

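The test statistic A² is then

$$ A^2 = n \int_{-\infty}^{+\infty} \frac{\left( F_n(x) - F(x) \right)^2}{F(x)\,\left( 1 - F(x) \right)} \, dF(x) $$

and it is called the Anderson-Darling statistic.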

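Again, this inconvenient integral can be replaced by a computationally convenient sum, as we'll show that

$$ A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln F(x_{(i)}) + \ln\left( 1 - F(x_{(n+1-i)}) \right) \right] $$

with the same notation as before.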
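
A minimal Python sketch of the sum form commonly used for A², namely -n - (1/n) Σ (2i-1) [ln F(x(i)) + ln(1 - F(x(n+1-i)))], is given below. The function name is illustrative, and the sketch assumes that every F(x(i)) lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def anderson_darling(sample, cdf):
    """A^2 = -n - (1/n) * sum_i (2i - 1) [ln u_(i) + ln(1 - u_(n+1-i))]."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    s = sum(
        # u[i - 1] is u_(i) and u[n - i] is u_(n+1-i) with 0-based indexing.
        (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s / n


# Three observations tested against U[0, 1], whose cdf is F(x) = x.
a2 = anderson_darling([0.9, 0.1, 0.5], lambda x: x)
print(a2)
```

The logarithms make the statistic grow quickly when an observation falls where F is very close to 0 or 1, which is how the tails gain influence.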

 

General remarks on EDF tests

Goodness-of-fit tests based on the empirical distribution function share some common features.

 

            The distribution of the test statistic is "distribution free"

Consider for example the Kolmogorov statistic. Its distribution depends on n, the sample size, but it does not depend on the fully determined continuous distribution under test. The test is then said to be "distribution free".

The same can be said of the Cramér-von Mises and of the Anderson-Darling statistics.

This is at first surprising, but can be easily explained by the fact that constructing an EDF-based test always begins by transforming the sample by the Probability Integral Transformation (PIT). Observations are then transformed into random variables uniformly distributed in [0, 1], and statements about test statistics in the original space transform into statements about sampling properties of the unique uniform distribution U[0, 1].

In case this argument is deemed too informal to be convincing, we'll make it more explicit, using the Kolmogorov statistic as an example. The argument can easily be adapted to any EDF-based statistic.
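
The argument can also be checked by simulation. The sketch below (illustrative names, arbitrary seeds and sample sizes) first verifies that Dn computed on a sample equals Dn computed on its PIT-transformed values against U[0, 1], then compares the simulated mean of Dn under H0 for an exponential and for a normal distribution; the two means should agree up to simulation noise.

```python
import math
import random


def kolmogorov_statistic(sample, cdf):
    """Dn for a continuous reference cdf."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))


def exp_cdf(x):
    return 1.0 - math.exp(-x)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_dn(draw, cdf, n, n_sim, seed):
    """Simulate Dn under H0: samples are drawn from the tested distribution."""
    rng = random.Random(seed)
    return [
        kolmogorov_statistic([draw(rng) for _ in range(n)], cdf)
        for _ in range(n_sim)
    ]


# 1) The PIT leaves Dn unchanged.
xs = [0.3, 1.7, 0.05, 2.4, 0.9]
d_original = kolmogorov_statistic(xs, exp_cdf)
d_uniform = kolmogorov_statistic([exp_cdf(x) for x in xs], lambda u: u)

# 2) Same null distribution of Dn for two very different distributions.
n, n_sim = 20, 2000
dn_exp = simulate_dn(lambda r: r.expovariate(1.0), exp_cdf, n, n_sim, seed=1)
dn_norm = simulate_dn(lambda r: r.gauss(0.0, 1.0), normal_cdf, n, n_sim, seed=2)
mean_exp = sum(dn_exp) / n_sim
mean_norm = sum(dn_norm) / n_sim
print(d_original, d_uniform, mean_exp, mean_norm)
```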

-----

This is not true anymore if the tested distribution is not fully determined, that is, if some of its parameters have to be estimated from the sample, as we now explain.

 

            Estimated parameters

One nice thing about the Chi-Square test is that the influence of estimating unknown parameters on the distribution of the test statistic is well understood. For EDF-based tests, things are messier. In particular, the distribution of the test statistic is no longer distribution free, and the statistical community has therefore had to tabulate critical values (by Monte-Carlo simulation) for many different distributions (normal, exponential, Gamma, Weibull, logistic, etc.) and sample sizes. The test first estimates the vector θ of unknown parameters, and then uses the corresponding F(x; θ) as the reference distribution function for calculating the value of the test statistic.


It is often possible to add correction terms (that depend only on the sample size) to the standard statistic so that the corrected value of the statistic can be used in conjunction with the unique table of critical values of the asymptotic distribution of the statistic.

It can be shown that if the estimated parameter is a location parameter or a scale parameter, then the distribution of the test statistic does not depend on the actual value of the parameter (although it will depend on the nature of the distribution).

Although a rigorous demonstration of this result is difficult, we can give a heuristic argument in its favor.

 

 

We mentioned that the Probability Integral Transformation transforms the original observations into r.v. uniformly distributed in [0, 1]. The above illustration (upper and lower images) clearly shows that shifting the reference distribution function and the sample by a given quantity leaves the transformed sample unchanged, as well as the distribution of any statistic built on this sample.

A similar argument applies when stretching (or contracting) both the distribution function and the sample by the same factor.

So although the reference distribution depends on the sample (because its parameters have to be estimated), it can be hoped that the distribution of the test statistic does not depend on the values of location and/or scale parameters. Such is indeed the case.
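
A small Monte-Carlo experiment illustrates both points for the normal distribution. In this sketch (illustrative names; n = 20, 2000 replications and the seed are arbitrary), the location and scale are estimated by the sample mean and the sample standard deviation, an assumption made here for simplicity. The 95% quantile of Dn is then clearly lower than in the fully specified case, and it is the same for N(0, 1) and N(10, 3²).

```python
import math
import random


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dn_from_uniforms(values):
    """Kolmogorov statistic of already PIT-transformed values against U[0, 1]."""
    u = sorted(values)
    n = len(u)
    return max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))


def simulate(mu, sigma, n, n_sim, seed, estimate):
    """Simulated values of Dn for normal samples, parameters known or estimated."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sim):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if estimate:
            m = sum(sample) / n
            s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        else:
            m, s = mu, sigma
        results.append(dn_from_uniforms([normal_cdf((x - m) / s) for x in sample]))
    return results


def quantile(values, q):
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


n, n_sim = 20, 2000
q_known = quantile(simulate(0.0, 1.0, n, n_sim, 3, estimate=False), 0.95)
q_est_a = quantile(simulate(0.0, 1.0, n, n_sim, 3, estimate=True), 0.95)
q_est_b = quantile(simulate(10.0, 3.0, n, n_sim, 3, estimate=True), 0.95)
print(q_known, q_est_a, q_est_b)
```

Using the critical values of the fully specified case when parameters are estimated would therefore make the test too conservative, which is why dedicated tables are needed.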

 

            Two samples

All EDF-based tests can be transformed into tests meant to determine whether two independent samples were drawn from identical distributions.
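
For the Kolmogorov-Smirnov case, for instance, the two-sample statistic is the largest gap between the two EDFs. Since both are step functions that jump only at observations, it is enough to compare them at the pooled observations. A minimal sketch with an illustrative function name:

```python
import bisect


def two_sample_kolmogorov(sample_x, sample_y):
    """sup over t of |Fn(t) - Gm(t)|, where Fn and Gm are the two EDFs."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    gap = 0.0
    for t in xs + ys:
        fx = bisect.bisect_right(xs, t) / len(xs)
        fy = bisect.bisect_right(ys, t) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


d_separated = two_sample_kolmogorov([1, 2, 3], [4, 5, 6])
d_interleaved = two_sample_kolmogorov([1, 3, 5], [2, 4, 6])
print(d_separated, d_interleaved)
```

Completely separated samples give the maximal value 1, while interleaved samples give a small value.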

 

            Discrete distributions

EDF-based tests can be adapted, laboriously, to discrete data, although they do not seem to have gained wide acceptance for this type of application.

 

            Censored data

EDF-based tests are also applicable to censored data.

 

            Power

Comparing the powers of these various tests is of course a complex endeavor, as the power of a test depends heavily on the nature of the alternative hypothesis. Yet, three general conclusions may be drawn with considerable caution :

    1) Overall, members of the Cramér-von Mises family are more powerful than the Kolmogorov-Smirnov test. This should not come as a surprise, as the K-S test detects only a marked difference between the empirical distribution function and the reference distribution function at only one point, whereas members of the Cramér-von Mises family probe the fit between the EDF and the reference distribution all along the range of values of x.

    2) Authors seem to agree that quite generally, the Anderson-Darling test is to be preferred to other tests when no parameter has to be estimated.

    3) Yet, parameter estimation seems to reduce the difference in power between these tests, although the Kolmogorov-Smirnov test still trails somewhat behind.

_____________________________________________________

 

 

Tutorial

 

1) We first show that the distribution of the Kolmogorov statistic does not depend on the continuous distribution being tested, provided that its cdf be continuous and strictly increasing. The demonstration can easily be adapted to any EDF-based statistic.

2) The Cramér-von Mises statistic is defined as an integral that would be hard to calculate numerically. We'll show that this integral can in fact be written as a sum involving only the values of the candidate distribution function for the (sorted) observations.

3) Expressing the Anderson-Darling statistic as a sum follows a similar but more complex path. The calculation is only outlined, but the reader should encounter no difficulties filling-in the gaps.

 

 


 

_____________________________________________________

 

Related readings :

Chi-square test

Probability Integral Transformation
