Hypergeometric distribution
An urn contains N balls, of which W are white and R are red, with W + R = N.
n balls are drawn from the urn, without replacement.
That is, n balls (the sample) are randomly selected and taken out of the urn. The
number of white balls in the sample is a random variable X, whose distribution
is known as the hypergeometric distribution.
It depends on the three
parameters N, W and n and will be denoted HG(N, W, n).
This animation illustrates the hypergeometric distribution.
The urn
Positioning of the balls
First, note that in probability problems involving the famous "urn", it is implicitly assumed that the "balls" are randomly positioned in the urn. This assumption is unnecessary so long as the sample is drawn by randomly selecting balls in the urn. So, for the sake of clarity, all the white (rectangular) balls are positioned on the left-hand side of the urn, while the red balls are positioned on the right-hand side.
Controlling the numbers of balls
* Total population of the urn: the total (white + red) number of balls can be read in the "N" display. This number is not directly adjustable, but is indirectly adjusted by adjusting the numbers of white and red balls. It is constrained to be more than 2 but not to exceed 30.
* Number of white balls: the number of white balls is adjusted with the "W" controls. The smallest number of white balls is 1; the largest is such that the total number of balls (white and red) in the urn does not exceed 30.
* Number of red balls: same as for white balls, but with the "R" controls.
The sample
The sample size is adjusted with the "Sample size" control. It is constrained to be at least 1 and at most N - 1. Note that when decreasing the number of white or red balls, the total number of balls in the urn might become smaller than the displayed sample size. The sample size is then automatically decreased so as to remain 1 less than the total population size.
The sample is materialized by arrows pointing to the balls that have been randomly selected. The number of white arrows is the random variable whose distribution is being illustrated by this animation.
The distribution
The theoretical distribution of the number of white balls in the sample is materialized by blue cells in the lower frame. The number of cells is the smaller of the sample size n and the number W of white balls in the urn, as there cannot be more white balls in the sample than there are white balls in the urn to start with. So as you increase the sample size (see above), the number of cells first increases, and then stops increasing when the number of cells is equal to the number of white balls. Yet the shape of the distribution keeps changing as the sample size keeps going up.
Note that if the sample size n is larger than the number of red balls (n > R), then there are certainly at least n - R white balls in the sample. Therefore, the first (n - R) positions of the distribution graph are then empty (hollow rectangles).
Convergence to a binomial distribution
The hypergeometric distribution is a discrete, unimodal distribution with a limited range, just like the binomial distribution. As a matter of fact, it can be shown that under certain conditions, the hypergeometric distribution converges to a binomial distribution as the number of balls in the urn becomes very large. The grey cells represent the theoretical limit binomial distribution for the chosen values of the parameters. The fit is not very good, which is not surprising as the convergence is only an asymptotic result which says nothing about the quality of the fit for comparatively small numbers of balls. In particular, note that:
* The binomial distribution predicts non-zero probabilities all the way down to zero, even when the sample size is larger than the number of red balls in the urn (see previous paragraph).
The animation
Click on "Go" and observe the progressive build-up of the histogram of the hypergeometric distribution for the selected values of the parameters.
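The build-up of the histogram can be imitated offline with a small Monte Carlo simulation. The sketch below is not the animation's code: the function name `simulate_histogram`, the fixed seed and the parameter values are assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_histogram(W, R, n, trials, seed=0):
    """Draw `trials` samples of size n without replacement and count whites."""
    rng = random.Random(seed)              # fixed seed: reproducible sketch
    urn = ["white"] * W + ["red"] * R      # ball positions are irrelevant
    counts = Counter()
    for _ in range(trials):
        sample = rng.sample(urn, n)        # sampling without replacement
        counts[sample.count("white")] += 1
    return counts

counts = simulate_histogram(W=6, R=9, n=5, trials=20_000)
for w in sorted(counts):
    print(w, counts[w] / 20_000)           # empirical frequency of w whites
```

As the number of trials grows, the printed frequencies should settle close to the theoretical probabilities shown by the blue cells.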
Constraints on the composition of the sample
We'll first show that the number w of white balls in the sample is constrained by the following inequalities:
max(0, n - R) ≤ w ≤ min(W, n)
as is illustrated by the animation.
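These inequalities translate directly into the range of values that w can take. The helper below is a minimal sketch; the name `support` is ours, not the text's.

```python
def support(N, W, n):
    """Return the possible values of w for HG(N, W, n) as a range."""
    R = N - W                      # number of red balls
    low = max(0, n - R)            # whites forced into the sample when n > R
    high = min(W, n)               # cannot exceed whites in urn or sample size
    return range(low, high + 1)

print(list(support(N=10, W=4, n=8)))   # R = 6, so w runs from 2 to 4
```

With n = 8 and R = 6, the values w = 0 and w = 1 are impossible, which corresponds to the hollow rectangles at the start of the distribution graph.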
Probability mass function
Denote by P{X = w} the probability that the sample contains exactly w white balls (with w constrained as above).
* A first method will show that
P{X = w} = C(W, w) C(R, r) / C(N, n)
where r = n - w is the number of red balls in the sample, and C(a, b) denotes the binomial coefficient "a choose b".
* A second method will lead to a different-looking expression for P{X = w}.
Fortunately, we'll be able to show that these two seemingly different expressions for the value of P{X = w} are in fact identical.
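The equivalence can be checked numerically. The two expressions are not reproduced in this text, so the sketch below assumes they are the two standard closed forms: counting by colour, C(W, w) C(R, r) / C(N, n), and counting by position, C(n, w) C(N - n, W - w) / C(N, W). The function names are ours, and the second form may differ from the one the Tutorial actually derives.

```python
from fractions import Fraction
from math import comb

def pmf_by_colour(N, W, n, w):
    """Choose w of the W white balls and n - w of the R red balls."""
    R = N - W
    return Fraction(comb(W, w) * comb(R, n - w), comb(N, n))

def pmf_by_position(N, W, n, w):
    """Assumed alternative form: choose which w of the n sampled
    positions, and which W - w of the N - n unsampled ones, are white."""
    return Fraction(comb(n, w) * comb(N - n, W - w), comb(N, W))

N, W, n = 12, 5, 7
for w in range(max(0, n - (N - W)), min(W, n) + 1):
    # exact rational arithmetic, so equality is tested without rounding
    assert pmf_by_colour(N, W, n, w) == pmf_by_position(N, W, n, w)
```

Using `Fraction` keeps the comparison exact, which matters here because the claim is an identity and not an approximation.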
We'll show that the mean µ of the hypergeometric distribution is
µ = nW/N
If we denote by p = W/N the initial proportion of white balls in the urn, this expression can be written as
µ = np
In other words, the mean number of white balls in the sample is equal to the product of the number of balls in the sample and the probability that the first ball drawn from the urn is white.
This result would be obvious if the balls were drawn with replacement: the probability of drawing a white ball would then always be W/N, and the distribution of w would be the binomial distribution B(n, W/N).
But we're now considering draws without replacement, so that the probability for any draw to produce a white ball depends on the colors of the balls already drawn. Despite this fundamental difference, the two set-ups "with replacement" and "without replacement" lead to the same mean.
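The value of the mean can be confirmed by summing w P{X = w} over the admissible values of w. This is only a numerical check, not the proof given in the Tutorial; it assumes the standard counting form of the probability mass function, and the name `hg_mean` is ours.

```python
from fractions import Fraction
from math import comb

def hg_mean(N, W, n):
    """Mean of HG(N, W, n), summed term by term from the pmf."""
    R = N - W
    total = Fraction(0)
    for w in range(max(0, n - R), min(W, n) + 1):
        total += w * Fraction(comb(W, w) * comb(R, n - w), comb(N, n))
    return total

# the sum agrees exactly with n * W / N
assert hg_mean(20, 8, 5) == Fraction(5 * 8, 20)
```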
Because the draws are without replacement, calculating the variance σ² is definitely more complicated than in the "with replacement" case. We'll show that:
σ² = n (W/N)(R/N)[1 - (n - 1)/(N - 1)]
If we denote by p = W/N the initial proportion of white balls in the urn, this expression becomes
σ² = np(1 - p)[1 - (n - 1)/(N - 1)]
Note that for a given p, σ² tends to np(1 - p), the variance of the binomial distribution B(n, p) as the initial number of balls N tends to infinity. This result receives a natural interpretation in the light of the convergence of the hypergeometric distribution to the binomial distribution, as we explain now.
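The variance formula, and its limit for large N, can be checked numerically. The sketch below assumes the standard counting form of the probability mass function; the function names are ours.

```python
from fractions import Fraction
from math import comb

def hg_pmf(N, W, n, w):
    """Exact P{X = w}, using the standard counting form of the pmf."""
    return Fraction(comb(W, w) * comb(N - W, n - w), comb(N, n))

def hg_variance(N, W, n):
    """Variance computed directly from the pmf as E[X^2] - (E[X])^2."""
    ws = range(max(0, n - (N - W)), min(W, n) + 1)
    mean = sum(w * hg_pmf(N, W, n, w) for w in ws)
    second = sum(w * w * hg_pmf(N, W, n, w) for w in ws)
    return second - mean ** 2

def variance_formula(N, W, n):
    """np(1 - p)[1 - (n - 1)/(N - 1)] with p = W/N."""
    p = Fraction(W, N)
    return n * p * (1 - p) * (1 - Fraction(n - 1, N - 1))

N, W, n = 20, 8, 5
print(hg_variance(N, W, n), variance_formula(N, W, n))
# with p and n fixed, the correction factor approaches 1 as N grows
print(float(variance_formula(10**6, 4 * 10**5, 5)), 5 * 0.4 * 0.6)
```

The bracketed factor is the only difference with the binomial variance np(1 - p); it vanishes when n = N, since the whole urn is then drawn and w = W with certainty.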
There is clearly a relationship between the binomial and hypergeometric distributions: the hypergeometric distribution may be regarded as a modified binomial distribution based on sampling without replacement from a finite population.
This relationship is made clear by the asymptotic behavior of the hypergeometric distribution when N tends to infinity.
Suppose that the number N of balls in the urn tends to infinity in such a way that the proportion W/N of white balls tends to a limit p, while the sample size n is kept constant.
We'll show that under these conditions, the hypergeometric distribution HG(N, W, n) converges to the binomial distribution B(n, p).
This result is fairly intuitive since as N gets larger and larger, the difference between "with replacement" and "without replacement" becomes progressively negligible.
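The convergence can be observed numerically by measuring how far HG(N, W, n) is from B(n, W/N) while N grows with W/N and n fixed. The sketch below uses the total variation distance as the measure of the gap; this choice, the function names and the parameter values are assumptions made for illustration.

```python
from math import comb

def hg_pmf(N, W, n, w):
    """P{X = w} for HG(N, W, n); zero outside the admissible range."""
    if w < max(0, n - (N - W)) or w > min(W, n):
        return 0.0
    return comb(W, w) * comb(N - W, n - w) / comb(N, n)

def binom_pmf(n, p, w):
    """P{B(n, p) = w}."""
    return comb(n, w) * p**w * (1 - p)**(n - w)

def tv_distance(N, W, n):
    """Total variation distance between HG(N, W, n) and B(n, W/N)."""
    p = W / N
    return 0.5 * sum(abs(hg_pmf(N, W, n, w) - binom_pmf(n, p, w))
                     for w in range(n + 1))

# W/N is held at 0.4 and n at 5 while the urn grows
for N in (30, 300, 3000, 30000):
    print(N, tv_distance(N, W=2 * N // 5, n=5))
```

The first line of output corresponds to the size of urn handled by the animation, where the fit is visibly imperfect; the later lines show the gap shrinking as N increases.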
We'll also establish the following (less intuitive) result:
* Suppose that N tends to infinity, but with the number W of white balls kept constant. The number R of red balls therefore also tends to infinity, and the proportion of white balls tends to 0.
* In order to compensate for the relative scarcity of white balls, the sample size n is made to grow so that the proportion n/N of balls drawn from the urn tends to a limit p. Clearly, the proportion of white balls in the sample will tend to 0 as N tends to infinity.
But what we are interested in is the distribution of the number of white balls in the sample (not that of their proportion), which, under these conditions, converges to the binomial distribution B(W, p).
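This second limit can also be observed numerically: W is held fixed, n/N is held at a fixed proportion, and HG(N, W, n) is compared with B(W, n/N). As before, the total variation distance, the function names and the parameter values are assumptions made for this sketch.

```python
from math import comb

def hg_pmf(N, W, n, w):
    """P{X = w} for HG(N, W, n); zero outside the admissible range."""
    if w < max(0, n - (N - W)) or w > min(W, n):
        return 0.0
    return comb(W, w) * comb(N - W, n - w) / comb(N, n)

def binom_pmf(size, p, w):
    """P{B(size, p) = w}."""
    return comb(size, w) * p**w * (1 - p)**(size - w)

def gap_to_binomial(N, W, n):
    """Total variation distance between HG(N, W, n) and B(W, n/N)."""
    p = n / N                                  # proportion of the urn drawn
    return 0.5 * sum(abs(hg_pmf(N, W, n, w) - binom_pmf(W, p, w))
                     for w in range(W + 1))

# W stays at 4 and n/N at 1/4 while the urn grows
for N in (40, 400, 4000):
    print(N, gap_to_binomial(N, W=4, n=N // 4))
```

One way to read the result: each of the W white balls ends up in the sample with probability n/N, and for a very large urn these W events are almost independent.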
We just mentioned that if W/N tends to a limit p when N grows without limit (with the sample size n kept constant), the hypergeometric distribution
HG(N, W, n) converges to the binomial distribution B(n, p).
Since we know that the binomial distribution B(n, p) tends to a normal distribution when n grows without limit, it would seem like an easy business to show that the hypergeometric distribution tends to a normal distribution when N and n both tend to infinity, with W/N tending to a limit p.
Actually, things are not that simple. Imagine the following extreme case: as N tends to infinity, we keep n = N at all times. Then, however large N is, the sample always contains all of the white balls, and the "distribution" of w is stuck at w = W, hardly a normal distribution.
So we intuitively feel that :
* We must let n grow with N for B(n, p) to have a chance to tend to a normal distribution,
* But n must not grow too fast, so as to avoid a lack of diversity in the composition of the sample.
-----
We'll identify a sufficient condition on the growth of n with N which ensures that the hypergeometric distribution tends to a normal distribution.
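Both the normal limit and the extreme case n = N can be explored numerically. The sketch below compares the cumulative distribution of HG(N, W, n) with that of a normal distribution having the same mean and variance, using a continuity correction. The function names, the continuity correction and the choice n/N = 1/2 are assumptions made for illustration; the code proves nothing about the conditions under which convergence holds.

```python
from math import comb, erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_gap(N, W, n):
    """Largest gap between the HG(N, W, n) cdf and the matching normal cdf.
    Requires n < N, otherwise the variance is zero."""
    p = W / N
    mean = n * p
    sd = sqrt(n * p * (1 - p) * (1 - (n - 1) / (N - 1)))
    denom = comb(N, n)
    cdf, worst = 0.0, 0.0
    for w in range(max(0, n - (N - W)), min(W, n) + 1):
        cdf += comb(W, w) * comb(N - W, n - w) / denom
        # continuity correction: compare P{X <= w} with the normal cdf at w + 1/2
        worst = max(worst, abs(cdf - normal_cdf((w + 0.5 - mean) / sd)))
    return worst

# n/N is held at 1/2 and W/N at 0.4 while N grows
for N in (20, 200, 2000):
    print(N, normal_gap(N, W=2 * N // 5, n=N // 2))

# Extreme case n = N: the whole urn is drawn, so X = W with probability 1
N, W = 50, 20
print(comb(W, W) * comb(N - W, N - W) / comb(N, N))   # 1.0
```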
__________________________________________________________________
Tutorial 1
In this Tutorial, we establish the basic properties of the hypergeometric distribution :
* Probability mass function (two methods)
* Mean and variance.
-----
Although these results are few, obtaining them will take enough effort to justify a separate Tutorial.
THE HYPERGEOMETRIC DISTRIBUTION : BASIC PROPERTIES
* Constraints on the sample composition
* Probability mass function of the hypergeometric distribution
  * First method
  * Second method
  * Equivalence of the two results
* Mean of the hypergeometric distribution
  * Sample composition as a sum of Bernoulli variables
  * Calculating the mean
* Variance of the hypergeometric distribution
  * Variances of the auxiliary Bernoulli variables
  * Covariances of the auxiliary Bernoulli variables
  * Variance of the hypergeometric distribution
____________________________________________________________
Tutorial 2
We now show that under the conditions described above, the hypergeometric distribution converges to a binomial distribution.
HYPERGEOMETRIC DISTRIBUTION
AND BINOMIAL DISTRIBUTION
* First convergence to a binomial distribution
  * Expansion of the probability mass function
  * Three asymptotic equivalences
  * Limit of the hypergeometric distribution
* Second convergence to a binomial distribution
  * The long solution
  * The short solution
____________________________________________________
Tutorial 3
As mentioned above, the issue of the convergence of the hypergeometric distribution to a normal distribution is not as simple as it first looks. A closer scrutiny of the problem reveals that the sample size should certainly not grow too fast with N for this convergence to have a chance to happen.
In this Tutorial :
1) Using only rather simple means, we first show that the hypergeometric distribution tends to a normal distribution if n²/N tends to 0 as N and W tend to infinity with W/N tending to a limit p.
2) Encouraged by this success, we then refine the method used for establishing this result and discover that the condition is in fact too restrictive: it is sufficient for n³/N² to tend to 0 (with the same condition on W/N) for the hypergeometric distribution to converge to a normal distribution.
3) Both of the above conditions imply that n/N tends to 0 when N tends to infinity (although this was not assumed in the first place, and comes as a consequence of the proofs), but neither proof shows that n/N must tend to 0 for the hypergeometric distribution to tend to a normal distribution.
So we make a conceptual leap and explore the consequences of imposing up front that n/N tends to a limit t (which may be different from 0) as N tends to infinity with W/N tending to a limit p. Using somewhat more sophisticated tools than before, we'll show that this last (and rather weak) condition is sufficient for the hypergeometric distribution to converge to a normal distribution. If t = 0, we'll obtain the previous results again as special cases.
Some parts of the proof involve straightforward but long and bulky calculations that will be omitted.
-----
From a logical standpoint, only the third condition is needed as it is the weakest of the three : if any of the first two conditions is satisfied, so is the third one (with t = 0). So this Tutorial may be regarded as an illustration of the improvements brought to some partial solutions of a given problem by using more and more sophisticated methods. Yet, note that all three conditions are only sufficient : we are not aware of any "necessary and sufficient" (or even just "necessary") condition on the growth of n for the hypergeometric distribution to tend to a normal distribution.
HYPERGEOMETRIC DISTRIBUTION
AND NORMAL DISTRIBUTION
* The remainder
* The problem
* Expanding the remainder
* A sufficient condition for the remainder to tend to 1
* Upper bound on the remainder
* Lower bound on the remainder
* The normal limit
* A less constraining sufficient condition
* An even more permissive sufficient condition
_______________________________________________________
Tutorial 4
We then address an issue where the hypergeometric distribution turns up a bit unexpectedly :
* Let X and Y be two independent binomial random variables, with the same p but different sizes m and n. Choose an integer k, and consider the distribution of X under the condition X + Y = k. This distribution is hypergeometric, and does not depend on p.
This important property of the binomial distribution is illustrated by a quite instructive interactive animation.
It is the starting point of the Fisher-Irwin test, whose purpose is to test the hypothesis H0 that two Bernoulli populations have the same value of the parameter p.
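The property can be verified with exact arithmetic for particular values of m, n and k. In the sketch below, the conditional distribution of X given X + Y = k is compared with HG(m + n, m, k), that is, an urn of m + n balls of which m are white, sampled k times. The function names and the numerical values are ours, and the hypergeometric probabilities are assumed to be in the standard counting form.

```python
from fractions import Fraction
from math import comb

def binom_pmf(size, p, x):
    """P{B(size, p) = x}; exact when p is a Fraction."""
    return comb(size, x) * p**x * (1 - p)**(size - x)

def conditional_pmf(m, n, p, k, x):
    """P{X = x | X + Y = k} for independent X ~ B(m, p) and Y ~ B(n, p)."""
    joint = binom_pmf(m, p, x) * binom_pmf(n, p, k - x)
    return joint / binom_pmf(m + n, p, k)      # X + Y ~ B(m + n, p)

def hg_pmf(total, white, drawn, w):
    """P{HG(total, white, drawn) = w} in the standard counting form."""
    return Fraction(comb(white, w) * comb(total - white, drawn - w),
                    comb(total, drawn))

m, n, k = 6, 9, 5
for p in (Fraction(1, 3), Fraction(7, 10)):
    for x in range(max(0, k - n), min(m, k) + 1):
        # same value for both choices of p: the powers of p and 1 - p cancel
        assert conditional_pmf(m, n, p, k, x) == hg_pmf(m + n, m, k, x)
```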
DISTRIBUTION OF TWO INDEPENDENT BINOMIAL VARIABLES
CONDITIONALLY TO THEIR SUM
* Distribution of two independent binomial variables conditionally to their sum
_______________________________________________________
Related readings: