Multinomial distribution

# Definition of the multinomial distribution

The binomial distribution B(n, p) is obtained by considering n tosses of a coin (Heads and Tails), with p being the probability that the coin will land on a Head.

The multinomial distribution is a generalization of the binomial distribution : each "toss" can now produce more than just two outcomes. For example, one may imagine rolling a "die" with k faces, with pi being the probability for the die to land on face Ai.

A multinomial distribution is entirely characterized by :

* n, the number of times the die is rolled.

* The set of probabilities {p1, p2 , ..., pk}       with    p1p2 + ...+ pk = 1

-----

After rolling the die n times, we denote ni the number of times the die landed face Ai up. We therefore have n1n2 + ...+  nk = n. Because rolling a die is a random activity, the nis are the realizations of k random variables that we denote Xi (i = 1, 2, ..., k). These variables are not independent, as they are linked by the relation Σi Xi  = n.

The multinomial distribution Mult(n, p1, p2 , ..., pk ) is the joint distribution of the k random variables Xi. It is therefore a multivariate, discrete distribution. Its support is the set of k-uples of non negative integers {n1n2, ...,  nk} such that n1n2 + ...+ nk = n.

# The multinomial probability distribution

The distribution Mult(n; p1, p2 , ..., pk ) is determined by the values of the probabilities of each of the possible k-uples. We denote these probabilities by P{X1 = n1, ..., Xk = nk }.

We'll show that :

for all the k-uples within the support n1 + n2 + ... + nk  = n of the distribution (and 0 otherwise).

-----

The binomial distribution is a special case with k = 2.

# Multinomial coefficient

The term n! / (n1!.n2!...nk!)  is called the multinomial coefficient. It is the number of different "words" that can be written with an alphabet containing k letters by using the first letter n1 times, the second letter n2 times etc...

We'll justify this result two different ways.

-----

Note that the multinomial coefficient is equal to the coefficient of the monom xn1xn2 ...xnk in the development of  (x1 + x2 + ...+ xk )n , hence the name of the distribution.

# Moment generating function of the multinomial distribution

The moment generating function of the multinomial distribution Mult(n, p1, p2 , ..., pk ) is :

# Mean

We show below that for any i, Xi follows the binomial distribution B(n, pi).

It follows immediately that

 E[Xi] = npi

# Covariance matrix of the multinomial distribution

The multinomial distribution is the distribution of a random vector X = {X1, X2 , ..., Xk}. It is therefore appropriate to calculate its covariance matrix.

## Variances

We show below that for any i, Xi follows the binomial distribution B(n, pi ).

Hence

 Var(Xi) = npi(1 - pi)

## Covariances

We'll give two demonstrations of the following result :

 Cov(Xi , Xj ) = -n pi pj

All the covariances are negative : for a given number of draws n, any increase of ni will be conducive, on the average, to a reduction of the number of observations in any other category.

We gave here a third demonstration of this result by calling on the Theorem of iterated expectation.

-----

Because of the constraint Σi Xi  = n, the covariance matrix of the multinomial distribution is not full rank, a fact that we'll establish directly.

## Correlation coefficients

By reporting to the definition of the correlation coefficient between two variables, we have :

Note that this expression does not contain n.

# Merging categories

Suppose that the first r variables of the random vector X = {X1, X2 , ..., Xk} are replaced by the single variable Y defined as the sum of these variables :

Y = X1 + X2 + ... + Xr

The first r categories {A1, ..., Ar } are said to be merged.

Because the Xis are linked by X1 + X2 + ... + Xk = n, we also have Y = n - (Xr + 1 + Xr + 2 + ... + Xk ).

We'll show that

 The vector {Y, Xr + 1, Xr + 2, ..., Xk} = {n - (Xr + 1 + Xr + 2 + ... + Xk ), Xr + 1, Xr + 2, ..., Xk} is distributed as Mult(n; p, pr +1, ..., pk ) with p = p1 + p2 + ..., pr.

In other words :

1) Merge categories,

2) Then assign to this new category a probability equal to the sum of the probabilities of the merged cargories,

and you have the multinomial distribution of the new category plus the remaining categories.

Marginal distributions of the multinomial distribution

Recall that Xi is the number of observations in the category Ai.

## Single component

Consider category Ai. The outcome of each draw :

* Is Ai with probability pi.

* And therefore is "Not  Ai" (that is, any other category) with probability (1 - pi).

Hence Xi is binomial B(n, pi).

## Several components

Things are a bit more complicated when more than one component are considered.

Given the multinomial distribution Mult(n; p1, p2 , ..., pk ), we'll calculate the joint distribution of {X1, X2, ..., Xr}, the set of the r first components and discover that this distribution is not multinomial.

Because this result is a bit awkward, litterature sometimes defines the marginal distributions of the multinomial distribution as the distribution of the augmented vector {X1, X2, ..., Xr, n - (X1 + X2 + ... + Xr)} which is multinomial (see preceding paragraph).

-----

This result generalizes straightforwardly to any group of components.

# Conditional distributions of the multinomial distribution

Let the vector {X1, X2 , ..., Xk} follow the multinomial distribution Mult(n; p1, p2 , ..., pk ).

We consider the distribution of the r first components {X1, X2 , ..., Xr} of the vector conditionally to the values of the last (k - r) components. In other words, we want the distribution of the variable :

{X1, X2 , ..., Xr | Xr + 1 = nr + 1, Xr + 2 = nr + 2, ..., Xk = nk + 2

We'll show that this distribution is multinomial Mult(m; p'1, p'2 , ..., p'r ) with :

* m = n - (nr + 1 + nr + 2 + ... + nk + 2 )

* p'i = pi /(p1+ p2 + ... + pr )

-----

This result generalizes straightforwardly to any group of components.

# Estimation of the parameters pi

We'll show that the Maximum Likelihood Estimator (MLEpi* of the parameter pi is :

Goodness-of-fit test

The Chi-square tests (goodness-of-fit, identity, independence, ...) are very important non parametric tests. They are all basically the same test, which is a goodness-of-fit test for the multinomial distribution. The test statistic is the so-called "Pearson's Chi-square", which we demonstrate below to be asymptotically χ2 distributed. This result is fundamental.

-----

Note that the basic Chi-square test is not the only goodness-of-fit test for the multinomial distribution. We build here the Likelihood Ratio Test which serves exactly the same purpose.

# Link with the Poisson distribution

There is an intimate link between the multinomial distribution and the Poisson distribution. We show here that if {X1, X2, ..., Xk} are k independent (but not necessarily identically distributed) Poisson variables, then the joint distribution of {X1, X2, ..., Xk} conditionally to their sum is a multinomial distribution.

___________________________________________________

 Tutorial 1

We calculate the probability mass function of the multinomial distribution. The key part is establishing the expression of the multinomial coefficient, which we do two by different methods.

We then calculate the Maximum Likelihood Estimator (MLEpi* of the parameter pi. This calculation is a simple exercise in constrained optimization by the method of Lagrangian multipliers.

PROBABILITY MASS FUNCTION OF THE MULTINOMIAL DISTRIBUTION

 Probability mass function of the multinomial distribution The multinomial coefficient First demonstration Second demonstration Maximum Likelihood Estimation of the parameters pi TUTORIAL

________________________________________________________

 Tutorial 2

We now use two different methods for calculating Cov(Xi, Xj ), the covariance of the numbers of observations in two categories of the multinomial distribution.

The second method represents these (random) numbers as sums of auxiliary Bournoulli variables. This approach is often quite effective with discrete probability distributions problems (see for example the calculation of the mean of the hypergeometric distribution).

COVARIANCE MATRIX OF THE MULTINOMIAL DISTRIBUTION

 Direct calculation of the covariance Second demonstration (indicator variables) Number of observations in a category as a sum of Bernoulli r.v.s Calculation of the covariance The covariance matrix is not full rank TUTORIAL

_________________________________________________________

 Tutorial 3

In this Tutorial :

1) We first establish the distribution of the vector {Y, Xr + 1, Xr + 2, ..., Xk ) where Y is the sum of the r first components of X :

Y = X1X2 + ..., Xr

We'll show that this distribution is Mult(n; p, pr +1, ..., pk ) with p = p1p2 + ..., pr.

2) We then calculate the joint distribution of {X1, X2, ..., Xr }, the genuine marginal distribution of the multinomial distribution. This distribution will turn out not to be multinomial, so the distribution of the "augmented" vector {X1, X2, ..., Xr, n - (X1X2 + ..., Xr)}, whose distribution is multinomial (see preceding paragraph) is sometimes presented as the marginal distribution of X.

3) We finally calculate the joint distribution of a group of categories conditionally to the values of the other categories, and show that this distribution is multinomial (see here).

MERGED CATEGORIES

MARGINAL AND CONDITIONAL DISTRIBUTIONS

 Merging categories Marginal distributions Conditional distirbutions TUTORIAL

_________________________________

 Tutorial 4

In this Tutorial, we demonstrate the fundamental Pearson's theorem which is the base on which all Chi-square tests are built. The theorem states that the so-called "Pearson's Chi-square statistic", which is the statistic common to all Chi-square tests, is asymptotically χ2 distributed.

Although Linear Algebra can wrap-up the issue in a few compact lines, we'll develop a longer but elementary method which permits following the proof step by step.

An alternative to the Chi-square statistic is Wilk's G² statistic, whose behavior is compared here to that of the Chi-square statistic.

PEARSON'S THEOREM

 Pearson's theorem The goodness-of-fit problem for the multinomial distribution The Chi-square statistic Origin of the Chi-square statistic The Chi-square statistic is "unnatural" The two categories case The Z vector Definition of the Z  vector Covariance matrix of the Z  vector The distribution of Z  is degenerate The covariance matrix is not full rank Subspace of the distribution of Z Another vector with the same covariance matrix as Z Projection on a plane perpendicular to a unit vector Projection of a set of standard normal variables Covariance matrix of the set of  projected variables Off-diagonal terms Diagonal terms Rotation of the coordinate system The asymptotic distribution of the Chi-square statistic is χ2 TUTORIAL

______________________________________________________

 Binomial distribution Poisson distribution Chi-square tests