Multinomial distribution

Definition of the multinomial distribution

The binomial distribution B(n, p) is obtained by considering n tosses of a coin (Heads and Tails), with p being the probability that the coin will land on a Head.

The multinomial distribution is a generalization of the binomial distribution : each "toss" can now produce more than just two outcomes. For example, one may imagine rolling a "die" with k faces, with pi being the probability for the die to land on face Ai.

A multinomial distribution is entirely characterized by :

    * n, the number of times the die is rolled.

    * The set of probabilities {p1, p2 , ..., pk}       with    p1p2 + ...+ pk = 1

-----

After rolling the die n times, we denote ni the number of times the die landed face Ai up. We therefore have n1n2 + ...+  nk = n. Because rolling a die is a random activity, the nis are the realizations of k random variables that we denote Xi (i = 1, 2, ..., k). These variables are not independent, as they are linked by the relation Σi Xi  = n.

The multinomial distribution Mult(n, p1, p2 , ..., pk ) is the joint distribution of the k random variables Xi. It is therefore a multivariate, discrete distribution. Its support is the set of k-uples of non negative integers {n1n2, ...,  nk} such that n1n2 + ...+ nk = n.

The multinomial probability distribution

The distribution Mult(n; p1, p2 , ..., pk ) is determined by the values of the probabilities of each of the possible k-uples. We denote these probabilities by P{X1 = n1, ..., Xk = nk }.

We'll show that :

 

 

for all the k-uples within the support n1 + n2 + ... + nk  = n of the distribution (and 0 otherwise).

-----

The binomial distribution is a special case with k = 2.

Multinomial coefficient

The term n! / (n1!.n2!...nk!)  is called the multinomial coefficient. It is the number of different "words" that can be written with an alphabet containing k letters by using the first letter n1 times, the second letter n2 times etc...

We'll justify this result two different ways.

-----

Note that the multinomial coefficient is equal to the coefficient of the monom xn1xn2 ...xnk in the development of  (x1 + x2 + ...+ xk )n , hence the name of the distribution.

Moment generating function of the multinomial distribution

The moment generating function of the multinomial distribution Mult(n, p1, p2 , ..., pk ) is :

 

Mean

We show below that for any i, Xi follows the binomial distribution B(n, pi).

It follows immediately that

 

E[Xi] = npi

 

Covariance matrix of the multinomial distribution

The multinomial distribution is the distribution of a random vector X = {X1, X2 , ..., Xk}. It is therefore appropriate to calculate its covariance matrix.

Variances

We show below that for any i, Xi follows the binomial distribution B(n, pi ).

Hence

 

Var(Xi) = npi(1 - pi)

 

Covariances

We'll give two demonstrations of the following result :

 

Cov(Xi , Xj ) = -n pi pj

 

All the covariances are negative : for a given number of draws n, any increase of ni will be conducive, on the average, to a reduction of the number of observations in any other category.


We gave here a third demonstration of this result by calling on the Theorem of iterated expectation.

-----

Because of the constraint Σi Xi  = n, the covariance matrix of the multinomial distribution is not full rank, a fact that we'll establish directly.

Correlation coefficients

By reporting to the definition of the correlation coefficient between two variables, we have :

 

 

 

Note that this expression does not contain n.

Merging categories

Suppose that the first r variables of the random vector X = {X1, X2 , ..., Xk} are replaced by the single variable Y defined as the sum of these variables :

Y = X1 + X2 + ... + Xr

The first r categories {A1, ..., Ar } are said to be merged.

 

Because the Xis are linked by X1 + X2 + ... + Xk = n, we also have Y = n - (Xr + 1 + Xr + 2 + ... + Xk ).

We'll show that

 

The vector

{Y, Xr + 1, Xr + 2, ..., Xk} = {n - (Xr + 1 + Xr + 2 + ... + Xk ), Xr + 1, Xr + 2, ..., Xk}

is distributed as Mult(n; p, pr +1, ..., pk ) with p = p1p2 + ..., pr.

 

In other words :

    1) Merge categories,

    2) Then assign to this new category a probability equal to the sum of the probabilities of the merged cargories,

and you have the multinomial distribution of the new category plus the remaining categories.

Marginal distributions of the multinomial distribution

Recall that Xi is the number of observations in the category Ai.

Single component

Consider category Ai. The outcome of each draw :

    * Is Ai with probability pi.

    * And therefore is "Not  Ai" (that is, any other category) with probability (1 - pi).

Hence Xi is binomial B(n, pi).

Several components

Things are a bit more complicated when more than one component are considered.

Given the multinomial distribution Mult(n; p1, p2 , ..., pk ), we'll calculate the joint distribution of {X1, X2, ..., Xr}, the set of the r first components and discover that this distribution is not multinomial.

Because this result is a bit awkward, litterature sometimes defines the marginal distributions of the multinomial distribution as the distribution of the augmented vector {X1, X2, ..., Xr, n - (X1 + X2 + ... + Xr)} which is multinomial (see preceding paragraph).

-----

This result generalizes straightforwardly to any group of components.

Conditional distributions of the multinomial distribution

Let the vector {X1, X2 , ..., Xk} follow the multinomial distribution Mult(n; p1, p2 , ..., pk ).

We consider the distribution of the r first components {X1, X2 , ..., Xr} of the vector conditionally to the values of the last (k - r) components. In other words, we want the distribution of the variable :

{X1, X2 , ..., Xr | Xr + 1 = nr + 1, Xr + 2 = nr + 2, ..., Xk = nk + 2

We'll show that this distribution is multinomial Mult(m; p'1, p'2 , ..., p'r ) with :

    * m = n - (nr + 1 + nr + 2 + ... + nk + 2 )

    * p'i = pi /(p1+ p2 + ... + pr )     

-----

This result generalizes straightforwardly to any group of components. 

Estimation of the parameters pi

We'll show that the Maximum Likelihood Estimator (MLEpi* of the parameter pi is :

 

Goodness-of-fit test

The Chi-square tests (goodness-of-fit, identity, independence, ...) are very important non parametric tests. They are all basically the same test, which is a goodness-of-fit test for the multinomial distribution. The test statistic is the so-called "Pearson's Chi-square", which we demonstrate below to be asymptotically χ2 distributed. This result is fundamental.

-----

Note that the basic Chi-square test is not the only goodness-of-fit test for the multinomial distribution. We build here the Likelihood Ratio Test which serves exactly the same purpose.

Link with the Poisson distribution

There is an intimate link between the multinomial distribution and the Poisson distribution. We show here that if {X1, X2, ..., Xk} are k independent (but not necessarily identically distributed) Poisson variables, then the joint distribution of {X1, X2, ..., Xk} conditionally to their sum is a multinomial distribution.

___________________________________________________
 

 

 

Tutorial 1

 

We calculate the probability mass function of the multinomial distribution. The key part is establishing the expression of the multinomial coefficient, which we do two by different methods.

We then calculate the Maximum Likelihood Estimator (MLEpi* of the parameter pi. This calculation is a simple exercise in constrained optimization by the method of Lagrangian multipliers.

 

 

PROBABILITY MASS FUNCTION OF THE MULTINOMIAL DISTRIBUTION

Probability mass function of the multinomial distribution

The multinomial coefficient

First demonstration

Second demonstration

Maximum Likelihood Estimation of the parameters pi

TUTORIAL

________________________________________________________

 

 

 

Tutorial 2

 

We now use two different methods for calculating Cov(Xi, Xj ), the covariance of the numbers of observations in two categories of the multinomial distribution.

The second method represents these (random) numbers as sums of auxiliary Bournoulli variables. This approach is often quite effective with discrete probability distributions problems (see for example the calculation of the mean of the hypergeometric distribution).

 

 

COVARIANCE MATRIX OF THE MULTINOMIAL DISTRIBUTION

Direct calculation of the covariance

Second demonstration (indicator variables)

Number of observations in a category as a sum of Bernoulli r.v.s

Calculation of the covariance

The covariance matrix is not full rank

TUTORIAL

_________________________________________________________

 

 

Tutorial 3

 

In this Tutorial :

    1) We first establish the distribution of the vector {Y, Xr + 1, Xr + 2, ..., Xk ) where Y is the sum of the r first components of X :

Y = X1X2 + ..., Xr 

We'll show that this distribution is Mult(n; p, pr +1, ..., pk ) with p = p1p2 + ..., pr.

    2) We then calculate the joint distribution of {X1, X2, ..., Xr }, the genuine marginal distribution of the multinomial distribution. This distribution will turn out not to be multinomial, so the distribution of the "augmented" vector {X1, X2, ..., Xr, n - (X1X2 + ..., Xr)}, whose distribution is multinomial (see preceding paragraph) is sometimes presented as the marginal distribution of X.

    3) We finally calculate the joint distribution of a group of categories conditionally to the values of the other categories, and show that this distribution is multinomial (see here).

 

MERGED CATEGORIES

MARGINAL AND CONDITIONAL DISTRIBUTIONS

Merging categories

Marginal distributions

Conditional distirbutions

TUTORIAL

_________________________________

 

 

Tutorial 4

 

In this Tutorial, we demonstrate the fundamental Pearson's theorem which is the base on which all Chi-square tests are built. The theorem states that the so-called "Pearson's Chi-square statistic", which is the statistic common to all Chi-square tests, is asymptotically χ2 distributed.

Although Linear Algebra can wrap-up the issue in a few compact lines, we'll develop a longer but elementary method which permits following the proof step by step.


An alternative to the Chi-square statistic is Wilk's G² statistic, whose behavior is compared here to that of the Chi-square statistic.

 

 

PEARSON'S THEOREM

Pearson's theorem

The goodness-of-fit problem for the multinomial distribution

The Chi-square statistic

Origin of the Chi-square statistic

The Chi-square statistic is "unnatural"

The two categories case

The Z vector

Definition of the Z  vector

Covariance matrix of the Z  vector

The distribution of Z  is degenerate

The covariance matrix is not full rank

Subspace of the distribution of Z

Another vector with the same covariance matrix as Z

Projection on a plane perpendicular to a unit vector

Projection of a set of standard normal variables

Covariance matrix of the set of  projected variables

Off-diagonal terms

Diagonal terms

Rotation of the coordinate system

The asymptotic distribution of the Chi-square statistic is χ2

 

TUTORIAL

 

______________________________________________________

 

Related readings :

Binomial distribution

Poisson distribution

Chi-square tests

Download this Glossary