Chi-square (distribution)
The distribution of the sample mean of a N(0,1) distribution is N(0,1/n), where n is the sample size. What is the distribution of the sample variance?
The concept of variance was defined so as to adequately describe the dispersion of the observations in a sample around the mean of the distribution. It turns out that the average of the squared distances of the observations in the sample to the distribution mean has "good" mathematical properties that justify a posteriori the definition of the variance.
So one is quite naturally led, for the special case of the N(0,1) distribution, to study the distribution of the sum of the squared differences between the observations in the sample and the distribution mean 0, that is, the distribution of the sum of the squared values of these observations.
The reason why we focus on the sum of these squares rather than on their
average value is explained in the Tutorial
below, see "Additivity".
So, by definition, the Chi-square distribution is that of the sum of the squared values of the observations drawn from the N(0,1) distribution. It is denoted by χ².
More precisely, and more formally, if X1, X2, ..., Xn are independent N(0,1) r.v.:
(X1² + X2² + ... + Xn²) ~ χ²(n)
So there is not one χ² distribution, but a family of distributions, indexed by n. This parameter is called the "number of degrees of freedom" of the distribution (this same expression is found in several other distribution families, like Student's t or Fisher's F).
The "Chi-square distribution with n degrees of freedom" is therefore that of the sum of n independent squared r.v. all ~N(0,1).
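The definition can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names are illustrative, not taken from this page. It draws sums of n squared N(0,1) values and compares their empirical mean and variance with the standard values n and 2n for a Chi-square distribution with n degrees of freedom (values that this page has not derived at this point).

```python
import random
import statistics


def chi_square_draw(n, rng):
    """One draw: the sum of n squared independent N(0,1) values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))


def simulate_chi_square(n, draws=20000, seed=0):
    """Return `draws` simulated values of a Chi-square r.v. with n d.o.f."""
    rng = random.Random(seed)
    return [chi_square_draw(n, rng) for _ in range(draws)]


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        sample = simulate_chi_square(n)
        # The empirical mean should be close to n, the variance close to 2n.
        print(n, statistics.mean(sample), statistics.pvariance(sample))
```

The seed is fixed only to make the run reproducible; any other seed should give similar figures.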
Let X be any normally distributed r.v.:
X ~ N(µ, σ²)
Recall that the change of variable:
X' = (X - µ)/σ
turns any normal r.v. X into a standard normal variable X' ~ N(0,1).
So if X ~ N(µ, σ²), the sum of the squared standardized observations of an n-sample is distributed as χ²(n).
In practice, one is more interested in the distribution of the average value of the squared observations than in that of their sum. Let:
S² = (1/n) Σi (xi - µ)²
The simple change of variable:
Mean = Sum / n
shows that (σ² denoting the distribution variance):
nS²/σ² ~ χ²(n)
So far, we assumed that the distribution mean µ was known. In practice, this is rarely the case. So one is led to replace the true mean µ by its estimated value, the sample mean
x̄ = (1/n) Σi xi
in the above expression.
But x̄ is a random variable, and there is now no reason to believe that the modified nS²/σ² is still χ²(n) distributed.
We finally get to the question of interest to the analyst: "What is the distribution of the sample variance of the normal distribution?".
Fundamental result
Let:
s² = (1/(n - 1)) Σi (xi - x̄)²
which is the sample variance, an unbiased estimator of the distribution variance σ².
We will show that:
(n - 1)s²/σ² ~ χ²(n - 1)
So it appears that replacing the distribution mean by the sample mean does not change the nature of the distribution of the sample variance (it remains χ²): it simply reduces by one unit the number of degrees of freedom of the distribution.
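The loss of one degree of freedom can be observed numerically. The sketch below (standard library only; the function name and the parameter values are illustrative assumptions) computes, for many simulated normal samples, the scaled sum of squares around the true mean and around the sample mean. Since a Chi-square distribution with k degrees of freedom has mean k, the two averages should be close to n and n - 1 respectively. This only checks the means, it does not prove the result.

```python
import random
import statistics


def scaled_sums_of_squares(n, mu, sigma, reps=20000, seed=1):
    """Simulate `reps` normal samples of size n and return two lists:
    sums of squares around the true mean mu, and around the sample mean,
    both divided by sigma squared."""
    rng = random.Random(seed)
    known, estimated = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        # Mean assumed known: n S^2 / sigma^2
        known.append(sum((xi - mu) ** 2 for xi in x) / sigma ** 2)
        # Mean estimated: (n - 1) s^2 / sigma^2
        estimated.append(sum((xi - xbar) ** 2 for xi in x) / sigma ** 2)
    return known, estimated


if __name__ == "__main__":
    known, estimated = scaled_sums_of_squares(n=5, mu=3.0, sigma=2.0)
    print("mean known    :", statistics.mean(known))      # close to n = 5
    print("mean estimated:", statistics.mean(estimated))  # close to n - 1 = 4
```

Note that the sum of squares around the sample mean is never larger than the sum of squares around the true mean, which is consistent with the reduced number of degrees of freedom.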
This result is fundamental.
Note that replacing the distribution variance by the sample
variance has a deeper effect on the distribution of the standardized sample
mean : this distribution is then no longer normal, but is a t distribution
instead (see here).
The transition from "n" to "n - 1" is called "losing a degree of freedom". This phenomenon is quite general, and will be encountered in other circumstances involving χ² distributions, t distributions or F distributions. It is a consequence of having to replace an unknown parameter by its estimate.
You'll find here an interactive animation illustrating the Chi-square distribution. It compares the distributions of the sample variance depending on whether the distribution mean is assumed to be known, or else estimated.
In the course of demonstrating the above result, we'll incidentally demonstrate another important result pertaining to the normal distribution:
The sample mean x̄ and the sample variance s² of the normal distribution are independent random variables.
This is a characteristic property of the normal distribution: a distribution such that the sample mean and the sample variance are independent r.v. is necessarily normal (this converse is difficult to prove).
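A simulation can make this property plausible, with one caveat: it only measures the correlation between the sample mean and the sample variance, and a zero correlation is necessary for independence but not sufficient. The sketch below (standard library only, illustrative names) compares normal samples with samples drawn from an exponential distribution, chosen here as an example of a skewed non-normal distribution.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def mean_variance_correlation(draw, n=5, reps=20000, seed=2):
    """Correlation between sample mean and sample variance over `reps`
    samples of size n; `draw(rng)` returns one observation."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        x = [draw(rng) for _ in range(n)]
        xbar = sum(x) / n
        means.append(xbar)
        variances.append(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    return pearson(means, variances)


if __name__ == "__main__":
    # Normal samples: correlation close to 0.
    print(mean_variance_correlation(lambda rng: rng.gauss(0.0, 1.0)))
    # Exponential samples: clearly positive correlation.
    print(mean_variance_correlation(lambda rng: rng.expovariate(1.0)))
```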
Now that the distribution of the sample variance is known, it is possible to test the hypothesis H0: σ² = σ0² about the true value of the variance σ² of a normal distribution.
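As an illustration, here is a minimal sketch of such a test, under assumptions that are not spelled out on this page: the alternative is one-sided (the true variance is larger than the hypothesized value), and the p-value is approximated by Monte Carlo rather than read from a Chi-square table. It relies on the fundamental result above: under H0, the sum of squared deviations from the sample mean, divided by the hypothesized variance, follows a Chi-square distribution with n - 1 degrees of freedom. Function names and data are hypothetical.

```python
import random


def variance_test_statistic(x, sigma0_sq):
    """(n - 1) s^2 / sigma0^2 for the sample x."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / sigma0_sq


def upper_tail_p_value(stat, dof, reps=20000, seed=3):
    """Monte Carlo estimate of P(chi-square with `dof` d.o.f. >= stat),
    simulating each draw as a sum of `dof` squared N(0,1) values."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        draw = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
        if draw >= stat:
            exceed += 1
    return exceed / reps


if __name__ == "__main__":
    # Hypothetical data: 20 observations simulated with true variance 4,
    # tested against H0: variance = 1.
    rng = random.Random(42)
    data = [rng.gauss(0.0, 2.0) for _ in range(20)]
    stat = variance_test_statistic(data, sigma0_sq=1.0)
    print(stat, upper_tail_p_value(stat, dof=len(data) - 1))
```

A small p-value suggests rejecting H0 in favor of a larger variance. A two-sided test would also have to consider the lower tail.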
-----
But the importance of the χ² distribution extends beyond the issue of the variance of a normal distribution. It happens that several important statistics follow approximately χ² distributions for large samples, and that it is therefore possible to design tests about the corresponding quantities.
A common problem in Statistics is to assess the plausibility of the assertion : "This sample was generated by this candidate distribution". It is possible to test this hypothesis through a statistic that follows approximately a Chi-square distribution.
In the same spirit, it is possible to test the assertion : "These two samples were drawn from identical distributions" with a statistic that follows approximately a Chi-square distribution.
Given two discrete variables X and Y over finite ranges, it is possible to test the hypothesis "X and Y are independent" with a statistic that follows approximately a Chi-square distribution.
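The first of these tests can be sketched as follows. The statistic used below is the usual Pearson statistic, the sum over categories of (observed - expected)² / expected, which this page does not spell out and is therefore an assumption about which statistic is meant. The example simulates rolls of a fair six-sided die (hypothetical data) and checks that the average of the statistic is close to 5, the mean of a Chi-square distribution with 6 - 1 = 5 degrees of freedom.

```python
import random


def pearson_statistic(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_die_statistics(rolls=300, reps=2000, seed=4):
    """Pearson statistics for `reps` samples of `rolls` fair-die rolls,
    each tested against the uniform candidate distribution."""
    rng = random.Random(seed)
    expected = [rolls / 6] * 6
    stats = []
    for _ in range(reps):
        counts = [0] * 6
        for _ in range(rolls):
            counts[rng.randrange(6)] += 1
        stats.append(pearson_statistic(counts, expected))
    return stats


if __name__ == "__main__":
    stats = simulate_die_statistics()
    print(sum(stats) / len(stats))  # close to 5
```

In an actual test, the observed statistic would be compared with an upper quantile of the Chi-square distribution with the appropriate number of degrees of freedom.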
___________________________________
Tutorial 1
In this Tutorial, we establish the basic properties of the Chi-square distribution. We have here an excellent example of the efficacy of the moment generating function, without which calculating the probability density function of the Chi-square distribution would be difficult.
We will also use the fact that the Chi-square distribution is a special case of the Gamma distribution.
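For reference, the Gamma connection can be written down directly. The sketch below assumes the shape/scale parametrization of the Gamma distribution, in which the Chi-square distribution with n degrees of freedom is the Gamma distribution with shape n/2 and scale 2. The helper names are illustrative. A crude midpoint integration checks that the density integrates to 1 and has mean n for n = 4.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of the Gamma(shape, scale) distribution at x > 0."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))


def chi_square_pdf(x, n):
    """Chi-square density with n d.o.f., as Gamma(shape=n/2, scale=2)."""
    return gamma_pdf(x, shape=n / 2, scale=2.0)


def midpoint_integral(f, a, b, steps):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    # For n = 2 the density reduces to 0.5 * exp(-x / 2), an exponential.
    print(chi_square_pdf(1.0, 2), 0.5 * math.exp(-0.5))
    # Total probability and mean for n = 4 (upper bound 60 truncates a
    # negligible tail).
    print(midpoint_integral(lambda x: chi_square_pdf(x, 4), 0.0, 60.0, 60000))
    print(midpoint_integral(lambda x: x * chi_square_pdf(x, 4), 0.0, 60.0, 60000))
```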
-----
Knowing the explicit (and complicated) analytical form of the Chi-square distribution is not as useless as it may first seem. For example, it will come in handy for identifying a sufficient statistic for the variance of the normal distribution (see here).
BASIC PROPERTIES OF THE CHI-SQUARE DISTRIBUTION
* The χ²(1) distribution
  * Cumulative distribution function of χ²(1)
  * Probability density function of χ²(1)
  * Moment generating function of χ²(1)
* Moment generating function of χ²(n)
* Probability density function of χ²(n)
* Moments, mode
  * Mean
  * Variance
  * Mode
* Special cases
  * n = 2: exponential distribution
  * n = 1: vertical asymptote
* Additivity
___________________________________________________________________
Tutorial 2
We now demonstrate the fundamental result:
(n - 1)s²/σ² ~ χ²(n - 1)
that expresses the fact that replacing the true distribution mean by the sample mean:
* Preserves the "χ²" nature of the distribution of the sample variance,
* But causes a loss of one degree of freedom of this distribution.
We first go over the case of a 2-observation sample, because the sample itself, the demonstration and the final result can then all be represented graphically.
We then move on to the demonstration for samples of any size. We'll use an elementary demonstration that does not call on Linear Algebra.
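The 2-observation case can also be checked numerically. The sketch below is an illustration of that case, not a reproduction of the Tutorial's argument. For n = 2 the sample variance reduces to (x1 - x2)²/2, and since (x1 - x2)/(σ√2) is N(0,1), the ratio s²/σ² is the square of a single standard normal variable, that is, a Chi-square variable with 1 degree of freedom (mean 1). Function names and parameter values are assumptions.

```python
import random
import statistics


def sample_variance_two(x1, x2):
    """Sample variance (divisor n - 1 = 1) of a 2-observation sample."""
    return statistics.variance([x1, x2])


def half_squared_difference(x1, x2):
    """Closed form of the same quantity: (x1 - x2)^2 / 2."""
    return (x1 - x2) ** 2 / 2


def simulate_two_sample_ratio(mu=10.0, sigma=3.0, reps=20000, seed=5):
    """Simulated values of s^2 / sigma^2 for normal samples of size 2."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        x1, x2 = rng.gauss(mu, sigma), rng.gauss(mu, sigma)
        ratios.append(half_squared_difference(x1, x2) / sigma ** 2)
    return ratios


if __name__ == "__main__":
    print(sample_variance_two(3.0, 7.0), half_squared_difference(3.0, 7.0))
    ratios = simulate_two_sample_ratio()
    print(sum(ratios) / len(ratios))  # close to 1
```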
-----
This demonstration will incidentally establish another very important result:
The sample mean x̄ and the sample variance s² of the normal distribution are independent random variables.
DISTRIBUTION OF THE SAMPLE VARIANCE
OF THE NORMAL DISTRIBUTION
* Case n = 2
* The general case
* Another expression for the sample variance
* Changing the reference frame
* Distribution of the sample variance
* Independence of the sample mean and the sample variance
* Shapes of the distribution (number of degrees of freedom is adjustable).
________________________________________________________________
Related readings:
* Normal distribution
* Distribution of the empirical variance of a normal distribution
* Distribution of the empirical standard deviation of a normal distribution
* Chi-square tests
* Gamma distribution