Distribution function (Cumulative)
Let X be a numerical random variable. It is completely described by the probability for a realization of the variable to be less than x for any x. This probability is denoted by F(x) :
F(x) = P{X < x}
F(x) is called the (cumulative) distribution function (or c.d.f.) of the variable X. It can be regarded as the proportion of the population whose value is less than x.
The c.d.f. of a random variable is a monotonically increasing (more precisely, non-decreasing) function that goes from 0 to 1.
-----
The two events :
* X < x and
* X ≥ x
are mutually exclusive, and one of them always occurs. Therefore :
P{X < x} + P{X ≥ x} = 1
and
P{X ≥ x} = 1 - F(x)
More generally, for any two numbers a and b with a ≤ b, we have :
P{a ≤ X < b} = F(b) - F(a)
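The complement rule P{X ≥ x} = 1 - F(x) and the interval rule P{a ≤ X < b} = F(b) - F(a) can be checked by direct enumeration on a small example. The sketch below assumes a fair six-sided die, a choice made here purely for illustration; the helper names `F` and `prob` are likewise hypothetical.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, P{X = k} = 1/6 for k = 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def F(x):
    """c.d.f. with the strict convention of this page: F(x) = P{X < x}."""
    return sum(p for k, p in pmf.items() if k < x)

def prob(event):
    """Probability of an event, by direct enumeration of the outcomes."""
    return sum(p for k, p in pmf.items() if event(k))

a, b = 2, 5
complement = prob(lambda k: k >= b)       # P{X >= b}
interval = prob(lambda k: a <= k < b)     # P{a <= X < b}

print(complement == 1 - F(b))    # True
print(interval == F(b) - F(a))   # True
```

Exact fractions are used so that both sides of each identity can be compared with `==` rather than with a floating-point tolerance.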
The c.d.f. is by no means constrained to be continuous. For example, the c.d.f. of a r.v. that can take only a finite number of values is a "staircase" function :

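We then have :
F(xi) = Σj P{X = xj}    j = 1, 2, ..., i
where x1 < x2 < ... are the possible values of X. With the strict inequality used in the definition of F, this sum is the height of the step that starts at xi, that is, the value of F(x) for xi < x ≤ xi+1.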
-----
The same applies to discrete r.v. that can take an infinite number of values, like a Poisson variable. The "staircase" then has an infinite number of steps.
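As a rough illustration of such an infinite staircase, the sketch below accumulates the first few probabilities of a Poisson variable. The parameter `lam = 3.0` and the cutoff `K = 20` are arbitrary assumptions made for this example; the true staircase keeps climbing toward 1 without ever reaching it.

```python
import math
from itertools import accumulate

lam = 3.0   # hypothetical Poisson parameter, chosen only for illustration
K = 20      # number of steps computed (the true staircase has infinitely many)

# P{X = k} = exp(-lam) * lam**k / k!   for k = 0, 1, ..., K - 1
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(K)]

# heights[k] = P{X = 0} + ... + P{X = k}: level of the staircase just after k
heights = list(accumulate(pmf))

for k in (0, 1, 2, 5, 10, 19):
    print(k, round(heights[k], 6))
```

Each printed level is larger than the previous one, and the last ones are very close to 1 while remaining below it.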
Recall that a r.v. X is said to have a probability density function p(x) if, for any two numbers a and b with a ≤ b, we have :
P{a ≤ X < b} = ∫_a^b p(x) dx
The cumulative distribution function F(x) is then continuous, and moreover :
* It can be differentiated,
* and its derivative F '(x) is just p(x).
We then have :
F(x) = ∫_{-∞}^{x} p(t) dt
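The two relations between a density p and its c.d.f. F (F' = p, and the integral of p over [a, b] equals F(b) - F(a)) can be checked numerically. The sketch below assumes the standard normal distribution as an example; the helper `integrate`, the evaluation point and the step sizes are illustrative choices, not part of the original text.

```python
import math

def p(x):
    """Standard normal density (an illustrative choice of p)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def F(x):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

x, h = 0.7, 1e-5
derivative = (F(x + h) - F(x - h)) / (2.0 * h)   # central difference for F'(x)
print(abs(derivative - p(x)))                     # small: F'(x) is close to p(x)

a, b = -1.0, 2.0
print(abs(integrate(p, a, b) - (F(b) - F(a))))    # small: integral matches F(b) - F(a)
```

Both printed differences are only approximation errors (finite differences and the midpoint rule), so they are small but not exactly zero.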
The relation between the cumulative distribution function and the probability density function is illustrated by the upper and lower images of this illustration :
The following illustration represents :
* A probability density,
* And a sample of size n drawn from this probability density.
The observations in the sample are labeled in increasing order of value : x1 ≤ x2 ≤ ... ≤ xn.
The empirical distribution function Fn(x) is defined as follows (lower image of the above illustration). It is a staircase function :
* That is equal to 0 for x < x1,
* That is equal to 1 for x ≥ xn,
* Which is constant on the semi-open intervals [xi , xi+1[,
* And such that the height of each "step" is 1/n.
There are n steps. The function is monotonically non-decreasing from 0 to 1.
This function is not to be confused with the c.d.f. of a discrete probability distribution as in the foregoing paragraph.
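The definition of the empirical distribution function translates almost directly into code: Fn(x) is the number of observations less than or equal to x, divided by n. The following is a minimal sketch; the function name `make_ecdf` and the five-point sample are made up for illustration.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return Fn, the empirical distribution function of `sample`."""
    xs = sorted(sample)          # x1 <= x2 <= ... <= xn
    n = len(xs)

    def Fn(x):
        # Number of observations <= x, divided by n. This gives 0 for x < x1,
        # 1 for x >= xn, and a constant value on each interval [xi, xi+1[.
        return bisect_right(xs, x) / n

    return Fn

# Hypothetical sample of size n = 5, made up for illustration.
Fn = make_ecdf([2.3, 0.4, 1.7, 3.1, 0.9])
for x in (0.0, 0.4, 1.0, 1.7, 3.0, 3.1, 5.0):
    print(x, Fn(x))
```

Using `bisect_right` on the sorted sample makes each evaluation a binary search, and it handles tied observations naturally: k equal values produce a single step of height k/n.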
The ultimate goal of Statistics is to recover, from a sample alone, the probability distribution that generated it. This goal is of course out of reach, but the major achievement of Statistics is to provide practitioners with partial, probabilistic versions of it (mainly estimation and tests).
This achievement is made possible by the fact that the sample is an incomplete, but hopefully rather faithful image of the true probability distribution (that we now assume to be continuous) :
* Observations are more densely packed in regions of high probability density,
* But are few and far between in regions of low probability density,
this image being somewhat distorted in an unpredictable way by the random nature of population sampling.
The empirical distribution function is an excellent tool for measuring how faithful the sample is to the probability distribution. Where observations are densely packed, this function grows rapidly, which is exactly what is expected of the true distribution function : where the distribution function grows rapidly, the probability density (its derivative) is large, which favors a high concentration of observations.
-----
These intuitive remarks are justified by the Fundamental Theorem of Statistics, which states that :
The empirical distribution function Fn(x) converges to the true distribution function F(x) as the sample size grows without limit.
This convergence is of course to be understood in the sense of the convergence of random variables.
* We demonstrate here that Fn(x) converges to F(x) in probability for every x (we'll use a generalization of the Weak Law of Large Numbers).
* In fact, the convergence is stronger than that, as for every x, the convergence is almost sure (difficult).
* In fact, the convergence is even stronger than almost sure convergence for every x : it can be shown (Glivenko-Cantelli theorem) that if we denote :
Xn = sup_x |Fn(x) - F(x)|
the r.v. defined as the largest absolute difference between Fn(x) and F(x) for all x (for a given sample), then Xn converges almost surely to 0.
Note that Xn is the statistic of the Kolmogorov test.
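The behavior of Xn can be observed in a small simulation. The sketch below assumes samples drawn from the uniform distribution on [0, 1], for which F(x) = x; the function names, the seed and the sample sizes are arbitrary choices made for this example. It is an illustration of the convergence, not a proof.

```python
import random

def sup_distance(sample, F):
    """sup over x of |Fn(x) - F(x)| for a continuous c.d.f. F.

    Fn is a staircase and F is non-decreasing, so the supremum is reached
    at a jump of Fn: it suffices to compare F(x_i) with the levels of Fn
    just before ((i-1)/n) and just after (i/n) each ordered observation."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(i / n - F(x), F(x) - (i - 1) / n)
        for i, x in enumerate(xs, start=1)
    )

def uniform_cdf(x):
    """c.d.f. of the uniform distribution on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)

random.seed(0)  # fixed seed so that the run is reproducible
for n in (10, 100, 1_000, 10_000, 100_000):
    sample = [random.random() for _ in range(n)]
    print(n, sup_distance(sample, uniform_cdf))
```

The printed distances tend to shrink as n grows, although not necessarily at every step, since each line uses a fresh random sample.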
This property makes the empirical distribution function very useful in many circumstances :