Hypergeometric distribution

An urn contains N balls, of which W are white and R are red, with W + R = N.

 

n balls are drawn from the urn, without replacement. That is, n balls (the sample) are randomly selected and taken out of the urn. The number of white balls in the sample is a random variable X, whose distribution is known as the hypergeometric distribution.
It depends on the three parameters N, W and n and will be denoted HG(N, W, n).
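A single draw of the sample can be sketched in a few lines of Python. This is only an illustration of the definition above; the function name `draw_white_count` and the list representation of the urn are assumptions made for the example.

```python
import random

def draw_white_count(W, R, n, rng=random):
    """Draw n balls without replacement from an urn holding W white and
    R red balls, and return the number of white balls in the sample."""
    urn = ["white"] * W + ["red"] * R
    sample = rng.sample(urn, n)  # sampling without replacement
    return sample.count("white")

# Ten realizations of X for an urn with W = 5, R = 7 and a sample of n = 4.
rng = random.Random(0)
print([draw_white_count(5, 7, 4, rng) for _ in range(10)])
```

Each call returns one realization of the random variable X; repeating the call many times approximates HG(N, W, n).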

Animation

 This animation illustrates the hypergeometric distribution.

 

 


 

 

 

 

 

The urn

    Positioning of the balls

First, note that in probability problems involving the famous "urn", it is implicitly assumed that the "balls" are randomly positioned in the urn. This assumption is unnecessary as long as the sample is drawn by randomly selecting balls in the urn.

So, for the sake of clarity, all the white (rectangular) balls are positioned to the left-hand side of the urn, while the red balls are positioned to the right-hand side of the urn.
 

    Controlling the numbers of balls

        * Total population of the urn

                The total (white + red) number of balls can be read in the "N" display. This number is not directly adjustable, but is indirectly adjusted by adjusting the numbers of white and red balls.

It is constrained to be more than 2 but not to exceed 30.

        * Number of white balls

                The number of white balls is adjusted with the "W" controls. The smallest number of white balls is 1, the largest is such that the total number of balls (white and red) in the urn does not exceed 30.

        * Number of red balls

            Same as for white balls, but with the "R" controls.

 

The sample

    The sample size is adjusted by the "Sample size" control. It is constrained to be at least 1 and at most N-1.

    Note that when decreasing the number of white or red balls, the total number of balls in the urn might become smaller than the displayed sample size. Consequently, the sample size is then automatically decreased so as to remain 1 less than the total population size.

 

The sample is materialized by arrows pointing to the balls that have been randomly selected. The number of white arrows is the r.v. whose distribution is being illustrated by this animation.

 

The distribution

The theoretical distribution of white balls in the sample is materialized by blue cells in the lower frame.

    The number of cells is the smaller of :

  • The sample size,
  • and the number of white balls in the urn,

as there cannot be more white balls in the sample than there are white balls in the urn to start with.

So as you increase the sample size (see above), the number of cells first increases, and then stops increasing when the number of cells is equal to the number of white balls. Yet, the shape of the distribution keeps changing as the sample size keeps going up.

 

Note that if the sample size n is larger than the number of red balls (n > R), then there are certainly at least n - R white balls in the sample. Therefore, the first (n - R)  positions of the distribution graph are then empty (hollow rectangles).


The vertical scale is automatically adjusted so that the mode always has the same height. The true value of the mode is displayed to the left of the frame.
The mean is marked by a small vertical blue tick at the bottom of the display.

 

Convergence to a binomial distribution

The hypergeometric distribution is a discrete, unimodal, limited-range distribution, just like the binomial distribution. As a matter of fact, it can be shown that under certain conditions, the hypergeometric distribution converges to a binomial distribution as the number of balls in the urn becomes very large.

The grey cells represent the theoretical limit binomial distribution for the chosen values of the parameters. The fit is not very good, which is not surprising as the convergence is only an asymptotic result which says nothing about the quality of the fit for comparatively small numbers of balls. In particular, note that : 

    * The binomial distribution predicts non-zero probabilities all the way down to zero, even when the sample size is larger than the number of red balls in the urn (see previous paragraph).
   * The binomial distribution predicts non-zero probabilities all the way up to the sample size, even when the sample size is larger than the number of white balls in the urn.


To remove the comparison with a binomial distribution, click on the "Binomial" button.

 

The animation

Click on "Go" and observe the progressive build up of the histogram of the hypergeometric distribution for the selected values of the parameters.
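The progressive build-up of the histogram can be imitated without the animation by a small Monte Carlo experiment. The sketch below is an assumption-laden illustration (the helper names, the seed and the number of trials are arbitrary choices): it compares empirical frequencies with the usual closed-form probabilities.

```python
import random
from collections import Counter
from math import comb

def hg_pmf(N, W, n, w):
    """Closed-form P{X = w} for HG(N, W, n); 0 outside the possible range."""
    R = N - W
    if w < max(0, n - R) or w > min(W, n):
        return 0.0
    return comb(W, w) * comb(R, n - w) / comb(N, n)

def empirical_frequencies(W, R, n, trials, seed=0):
    """Relative frequency of each white-ball count over repeated samples."""
    rng = random.Random(seed)
    urn = [1] * W + [0] * R  # 1 = white ball, 0 = red ball
    counts = Counter(sum(rng.sample(urn, n)) for _ in range(trials))
    return {w: counts[w] / trials for w in range(n + 1)}

W, R, n = 5, 7, 4
freq = empirical_frequencies(W, R, n, trials=20000)
for w in range(n + 1):
    # columns: w, simulated frequency, theoretical probability
    print(w, round(freq[w], 3), round(hg_pmf(W + R, W, n, w), 3))
```

With more trials the simulated column settles closer to the theoretical one, which is what the animation shows graphically.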

 

 

Properties of the hypergeometric distribution

Constraints on the composition of the sample

We'll first show that the number w of white balls in the sample is constrained by the following inequalities

 

max(0, n - R) ≤  w ≤  min(W, n)

 

as is illustrated by the animation.
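For small urns, these bounds can be checked by brute force. The sketch below (helper names are illustrative) enumerates every possible sample and collects the white-ball counts that actually occur.

```python
from itertools import combinations

def support(W, R, n):
    """Possible values of w: max(0, n - R) <= w <= min(W, n)."""
    return range(max(0, n - R), min(W, n) + 1)

def observed_counts(W, R, n):
    """White-ball counts found by enumerating all samples of size n."""
    urn = [1] * W + [0] * R  # 1 = white ball, 0 = red ball
    return {sum(sample) for sample in combinations(urn, n)}

# With W = 5, R = 3 and n = 6, at least n - R = 3 balls must be white.
print(list(support(5, 3, 6)))            # [3, 4, 5]
print(sorted(observed_counts(5, 3, 6)))  # [3, 4, 5]
```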

Probability mass function

Denote by P{X = w} the probability for the sample to contain exactly w white balls (with w constrained as above).

    * A first method will show that

P{X = w} = C(W, w) C(R, r) / C(N, n)

where C(a, b) denotes the binomial coefficient "a choose b" and r = n - w is the number of red balls in the sample.

 

    * A second method will show that

 

 

 

Fortunately, we'll be able to show that these two seemingly different expressions for the value of P{X = w} are in fact identical.
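The equality of two differently-looking expressions can be checked exactly on small cases. The sketch below assumes the two standard closed forms of the hypergeometric probability (one counting samples by colour, the other counting the positions of the white balls); whether these are precisely the two forms derived in the Tutorial is an assumption, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def pmf_by_colour(N, W, n, w):
    """C(W, w) C(R, r) / C(N, n), with R = N - W and r = n - w."""
    R, r = N - W, n - w
    return Fraction(comb(W, w) * comb(R, r), comb(N, n))

def pmf_by_position(N, W, n, w):
    """C(n, w) C(N - n, W - w) / C(N, W)."""
    return Fraction(comb(n, w) * comb(N - n, W - w), comb(N, W))

N, W, n = 12, 5, 4
for w in range(max(0, n - (N - W)), min(W, n) + 1):
    # exact rational arithmetic, so equality is not a rounding artefact
    assert pmf_by_colour(N, W, n, w) == pmf_by_position(N, W, n, w)

print(pmf_by_colour(N, W, n, 2))  # 14/33
```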

Mean

We'll show that the mean µ of the hypergeometric distribution is

µ = nW/N

If we denote by p = W/N the initial proportion of white balls in the urn, this expression can be written as

µ = np

In other words, the mean number of white balls in the sample is equal to the product of the number of balls in the sample and the probability for the first ball drawn from the urn to be a white ball.

This result would be obvious if the balls were drawn with replacement : the probability to draw a white ball would then always be W/N, and the distribution of w would be the binomial B(n, W/N) distribution.

But we're now considering draws without replacement, so that the probability for any draw to produce a white ball depends on the colors of the balls already drawn. Despite this fundamental difference, the two set-ups "with replacement" and "without replacement" lead to the same mean.
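The claim that both set-ups share the mean nW/N can be verified exactly on an example. This is a minimal sketch assuming the usual closed-form probabilities; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def hg_pmf(N, W, n):
    """Exact probabilities of HG(N, W, n) as a dict {w: P{X = w}}."""
    R = N - W
    return {w: Fraction(comb(W, w) * comb(R, n - w), comb(N, n))
            for w in range(max(0, n - R), min(W, n) + 1)}

def hg_mean(N, W, n):
    """Mean computed directly from the probabilities (without replacement)."""
    return sum(w * p for w, p in hg_pmf(N, W, n).items())

def binomial_mean(n, p):
    """Mean of B(n, p) computed from its probabilities (with replacement)."""
    return sum(w * comb(n, w) * p**w * (1 - p)**(n - w) for w in range(n + 1))

N, W, n = 12, 5, 4
print(hg_mean(N, W, n))                  # 5/3
print(binomial_mean(n, Fraction(W, N)))  # 5/3
```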

Variance

Because the draws are without replacement, calculating the variance σ² is definitely more complicated than in the "with replacement" case. We'll show that :

σ² = n(W/N)(1 - W/N)[1 - (n - 1)/(N - 1)]

If we denote by p = W/N the initial proportion of white balls in the urn, this expression becomes

σ² = np(1 - p)[1 - (n - 1)/(N - 1)]

 

Note that for a given p, σ² tends to np(1 - p), the variance of the binomial distribution B(n, p) as the initial number of balls N tends to infinity. This result receives a natural interpretation in the light of the convergence of the hypergeometric distribution to the binomial distribution, as we explain now.
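The variance formula and its limit can be checked numerically. The sketch below (illustrative names, usual closed-form probabilities assumed) compares the variance computed from the probabilities with np(1 - p)[1 - (n - 1)/(N - 1)], then lets N grow at fixed p and n.

```python
from fractions import Fraction
from math import comb

def hg_variance(N, W, n):
    """Variance of HG(N, W, n) computed directly from its probabilities."""
    R = N - W
    pmf = {w: Fraction(comb(W, w) * comb(R, n - w), comb(N, n))
           for w in range(max(0, n - R), min(W, n) + 1)}
    mean = sum(w * p for w, p in pmf.items())
    return sum((w - mean) ** 2 * p for w, p in pmf.items())

def formula_variance(N, W, n):
    """np(1 - p)[1 - (n - 1)/(N - 1)] with p = W/N."""
    p = Fraction(W, N)
    return n * p * (1 - p) * (1 - Fraction(n - 1, N - 1))

print(hg_variance(12, 5, 4), formula_variance(12, 5, 4))  # 70/99 70/99

# Fixed p = 1/4 and n = 5: the variance approaches np(1 - p) = 0.9375.
for N in (20, 200, 2000):
    print(N, float(formula_variance(N, N // 4, 5)))
```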

Convergence of the hypergeometric distribution to a binomial distribution

There is clearly a relationship between the binomial and hypergeometric distributions : the hypergeometric distribution may be regarded as a modified binomial distribution based on sampling without replacement from a finite population.

This relationship is made clear by the asymptotic behaviors (when N tends to infinity) of the hypergeometric distribution.

Fixed sample size, proportion of white balls tends to a limit

Suppose that the number N of balls in the urn tends to infinity in such a way that the proportion W/N of white balls tends to a limit p, while the sample size n is kept constant.

We'll show that under these conditions, the hypergeometric distribution HG(N, W, n) converges to the binomial distribution B(n, p).

This result is fairly intuitive since, as N gets larger and larger, the difference between "with replacement" and "without replacement" becomes progressively negligible.
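This convergence can be observed numerically. The sketch below measures the gap between HG(N, W, n) and B(n, W/N) by the total variation distance (half the sum of absolute differences between the probabilities); the choice of this distance and the helper names are assumptions made for the illustration.

```python
from math import comb

def hg_pmf(N, W, n, w):
    """P{X = w} for HG(N, W, n); 0 outside the possible range."""
    R = N - W
    if w < max(0, n - R) or w > min(W, n):
        return 0.0
    return comb(W, w) * comb(R, n - w) / comb(N, n)

def binom_pmf(n, p, w):
    """P{X = w} for the binomial distribution B(n, p)."""
    return comb(n, w) * p**w * (1 - p) ** (n - w)

def tv_distance(N, W, n):
    """Total variation distance between HG(N, W, n) and B(n, W/N)."""
    p = W / N
    return 0.5 * sum(abs(hg_pmf(N, W, n, w) - binom_pmf(n, p, w))
                     for w in range(n + 1))

# Sample size fixed at n = 5, proportion of white balls fixed at 1/4.
for N in (20, 200, 2000, 20000):
    print(N, tv_distance(N, N // 4, 5))
```

The printed distance shrinks as N grows, in line with the intuition that removing a few balls from a very large urn barely changes its composition.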

Fixed number of white balls

We'll also establish the (less intuitive) following result :

    * Suppose that N tends to infinity, but with the number W of white balls kept constant. The number R of red balls therefore also tends to infinity, and the proportion of white balls tends to 0.

    * In order to compensate for the relative scarcity of white balls, the sample size n is made to grow so that the proportion n/N of balls drawn from the urn tends to a limit p. Clearly, the proportion of white balls in the sample will tend to 0 as N tends to infinity.

But what we are interested in is the distribution of the number of white balls in the sample (not that of their proportion), which, under these conditions, converges to the binomial distribution B(W, p).
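This second limit can also be observed numerically. In the sketch below, W is held fixed while n grows with N so that n/N stays at 1/3, and the distance to B(W, n/N) is measured by the total variation distance; the distance and the helper names are assumptions made for the illustration.

```python
from math import comb

def hg_pmf(N, W, n, w):
    """P{X = w} for HG(N, W, n); 0 outside the possible range."""
    R = N - W
    if w < max(0, n - R) or w > min(W, n):
        return 0.0
    return comb(W, w) * comb(R, n - w) / comb(N, n)

def binom_pmf(size, p, w):
    """P{X = w} for the binomial distribution B(size, p)."""
    return comb(size, w) * p**w * (1 - p) ** (size - w)

def tv_to_binomial_W(N, W, n):
    """Total variation distance between HG(N, W, n) and B(W, n/N)."""
    p = n / N
    return 0.5 * sum(abs(hg_pmf(N, W, n, w) - binom_pmf(W, p, w))
                     for w in range(W + 1))

# W = 4 white balls at all times; the sample is one third of the urn.
for N in (30, 300, 3000):
    print(N, tv_to_binomial_W(N, 4, N // 3))
```

One way to read the result: each of the W white balls ends up in the sample with probability n/N, and for a very large urn these W events are almost independent.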

Convergence of the hypergeometric distribution to a normal distribution

We just mentioned that if W/N tends to a limit p when N grows without limit (with the sample size n kept constant), the hypergeometric distribution HG(N, W, n) converges to the binomial distribution B(n, p).

Since we know that the binomial distribution B(n, p) tends to a normal distribution when n grows without limit, it would seem like an easy business to show that the hypergeometric distribution tends to a normal distribution when N and n both tend to infinity, with W/N tending to a limit p.

Actually, things are not that simple. For imagine the following extreme case : as N tends to infinity, we keep n = N at all times. Then, however large N, the sample always contains all of the white balls, and the "distribution" of w is stuck at w = W, hardly a normal distribution.

So we intuitively feel that :

    * We must let n grow with N for B(n, p) to have a chance to tend to a normal distribution,

    * But n must not grow too fast in order to avoid a lack of diversity in the composition of the sample if n is too large.

-----

We'll identify a sufficient condition on the growth of n with N which ensures that the hypergeometric distribution tends to a normal distribution.

__________________________________________________________________

 

 

 

Tutorial 1

 

In this Tutorial, we establish the basic properties of the hypergeometric distribution :

    * Probability mass function (two methods)

    * Mean and variance.

-----

Although these results are few, obtaining them will take enough effort to justify a separate Tutorial.

 

 

THE HYPERGEOMETRIC DISTRIBUTION : BASIC PROPERTIES

Constraints on the sample composition

Probability mass function of the hypergeometric distribution

First method

Second method

Equivalence of the two results

Mean of the hypergeometric distribution

Sample composition as a sum of Bernoulli variables

Calculating the mean

Variance of the hypergeometric distribution

Variances of the auxiliary Bernoulli variables

Covariances of the auxiliary Bernoulli variables

Variance of the hypergeometric distribution

TUTORIAL

 ____________________________________________________________

 

 

Tutorial 2

 

 We now show that under the conditions described above, the hypergeometric distribution converges to a binomial distribution.

 

 

HYPERGEOMETRIC DISTRIBUTION

AND BINOMIAL DISTRIBUTION

First convergence to a binomial distribution

Expansion of the probability mass function

Three asymptotic equivalences

Limit of the hypergeometric distribution

Second convergence to a binomial distribution

The long solution

The short solution

TUTORIAL

____________________________________________________

 

 

 

Tutorial 3

 

As mentioned above, the issue of the convergence of the hypergeometric distribution to a normal distribution is not as simple as it first looks. A closer scrutiny of the problem reveals that the sample size should certainly not grow too fast with N for this convergence to have a chance to happen.

In this Tutorial :

    1) Using only rather simple means, we first show that the hypergeometric distribution tends to a normal distribution if n²/N tends to 0 as N and W tend to infinity with W/N tending to a limit p.

    2) Encouraged by this success, we then further refine the method used for establishing this result and discover that the condition is in fact too restrictive, and that it is sufficient for n³/N² to tend to 0 (with the same condition on W/N) for the hypergeometric distribution to converge to a normal distribution.

    3) Both of the above conditions imply that n/N tends to 0 when N tends to infinity (although this was not assumed in the first place, and comes as a consequence of the proofs), but neither one concludes that  n/N must tend to 0  for the hypergeometric to tend to a normal distribution.

So we make a conceptual leap and explore the consequences of imposing up front that n/N tends to a limit t (which may be different from 0) as N tends to infinity with W/N tending to a limit p. Using somewhat more sophisticated tools than before, we'll show that this last (and rather weak) condition is sufficient for the hypergeometric distribution to converge to a normal distribution. If t = 0, we'll obtain again the previous results as special cases.

Some parts of the proof involve straightforward but long and bulky calculations that will be omitted.
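The third condition can be explored numerically without any of the omitted calculations. The sketch below keeps W/N = 0.4 and n/N = 0.3 and measures the largest gap between the hypergeometric cumulative distribution and that of a normal distribution with the same mean and variance. The half-unit continuity correction, the chosen values of p and t, and the function names are assumptions made for the illustration.

```python
from math import comb, erf, sqrt

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cdf_gap(N, W, n):
    """Largest |F_HG(w) - Phi((w + 0.5 - mu) / sigma)| over the possible w."""
    R = N - W
    mu = n * W / N
    sigma = sqrt(n * (W / N) * (R / N) * (N - n) / (N - 1))
    total = comb(N, n)
    cumulative, gap = 0.0, 0.0
    for w in range(max(0, n - R), min(W, n) + 1):
        cumulative += comb(W, w) * comb(R, n - w) / total
        gap = max(gap, abs(cumulative - normal_cdf((w + 0.5 - mu) / sigma)))
    return gap

# W/N = 0.4 and n/N = 0.3 for every N, so t = 0.3 is different from 0.
for N in (50, 500, 2000):
    print(N, cdf_gap(N, 2 * N // 5, 3 * N // 10))
```

A shrinking gap is consistent with the convergence stated above, but a few numerical values are of course not a proof.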

-----

From a logical standpoint, only the third condition is needed as it is the weakest of the three : if any of the first two conditions is satisfied, so is the third one (with t = 0). So this Tutorial may be regarded as an illustration of the improvements brought to some partial solutions of a given problem by using more and more sophisticated methods. Yet, note that all three conditions are only sufficient : we are not aware of any "necessary and sufficient" (or even just "necessary") condition on the growth of n for the hypergeometric distribution to tend to a normal distribution.

 

 

 

 

HYPERGEOMETRIC DISTRIBUTION

AND NORMAL DISTRIBUTION

The remainder

The problem

Expanding the remainder

A sufficient condition for the remainder to tend to 1

Upper bound on the remainder

Lower bound on the remainder

The normal limit

A less constraining sufficient condition

An even more permissive sufficient condition

TUTORIAL

_______________________________________________________

 

 

 

 

Tutorial 4

 

We then address an issue where  the hypergeometric distribution turns up a bit unexpectedly :

    * Let X and Y be two independent binomial random variables, with the same p but different sizes m and n. Choose an integer k, and consider the distribution of X under the condition X + Y = k. This distribution is hypergeometric, and does not depend on p.

This important property of  the binomial distribution is illustrated by a quite instructive interactive animation.
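The property can be checked exactly on a small case. In the sketch below (illustrative names), the conditional distribution of X given X + Y = k is computed from the two binomial distributions and compared with the hypergeometric distribution of an urn holding m white and n red balls from which k balls are drawn; using exact fractions for p makes the comparison free of rounding.

```python
from fractions import Fraction
from math import comb

def binom_pmf(size, p, x):
    """P{X = x} for the binomial distribution B(size, p)."""
    return comb(size, x) * p**x * (1 - p) ** (size - x)

def conditional_pmf(m, n, p, k):
    """P{X = x | X + Y = k} for independent X ~ B(m, p) and Y ~ B(n, p)."""
    joint = {x: binom_pmf(m, p, x) * binom_pmf(n, p, k - x)
             for x in range(max(0, k - n), min(m, k) + 1)}
    total = sum(joint.values())  # this is P{X + Y = k}
    return {x: value / total for x, value in joint.items()}

def hypergeometric_pmf(m, n, k):
    """Urn with m white and n red balls, k balls drawn without replacement."""
    return {x: Fraction(comb(m, x) * comb(n, k - x), comb(m + n, k))
            for x in range(max(0, k - n), min(m, k) + 1)}

m, n, k = 6, 4, 5
for p in (Fraction(1, 3), Fraction(4, 5)):
    # same conditional distribution whatever the value of p
    assert conditional_pmf(m, n, p, k) == hypergeometric_pmf(m, n, k)
print(hypergeometric_pmf(m, n, k))
```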

 

It is the starting point of the Fisher-Irwin test, whose purpose is to test the null hypothesis H0 that two Bernoulli populations have the same value of the parameter p.
 

DISTRIBUTION OF TWO INDEPENDENT BINOMIAL VARIABLES

CONDITIONALLY TO THEIR SUM

 Distribution of two independent binomial variables conditionally to their sum

_______________________________________
 

 Interactive animation

* Parameters of the two binomials are adjustable.
* Sum is adjustable.
* Progressive build-up of the histogram.

TUTORIAL

 

_______________________________________________________

 

Related readings:

Sampling without replacement

Binomial distribution

Fisher-Irwin test
