Sufficient statistic

Let p(x, q) be a probability distribution known up to the value of the parameter q.

Let also X = {x1, x2, ..., xn } be an n-sample drawn from this distribution. All that will ever be known about the value of q is contained in this set of n numbers.

An estimator wastes information

To estimate the value of q, we devise a statistic (a function of the observations that does not depend on q) with some desirable properties, for example an expectation equal to q (the statistic is then an unbiased estimator of q).

Yet we may reasonably be concerned about being left with a single number (the value of the estimator as calculated from the sample) when we started with a set of n numbers (the values of the observations in the sample). Certainly, something must have been lost in the process. One says that some "information" has been thrown away when collapsing the n-sample into a single number (the estimate).

"Useful" and "Useless" information

An informal yet pervasive line of thinking in Estimation Theory holds that the sample contains both :

    * "Information" that is useful for the purpose of estimating q,

    * And "information" that is useless for the purpose of estimating q (although it might be useful for some other purpose).

 

It may therefore be feared that some of the useful information (along with some of the useless information) is thrown away when collapsing the sample into a single estimate. This is indeed generally the case.

Yet it is remarkable that under certain circumstances, creating a statistic will throw away only (some of) the useless information, while completely retaining the useful information about q. When this is the case, the statistic is called a sufficient statistic for q.

 

 

On the next page, we pursue this line of reasoning by describing a thought experiment that may help the reader gain a better intuitive understanding of the formal definition of a sufficient statistic as given in the next paragraph.

 

Definition of a sufficient statistic

Let p(x, q) be a probability distribution.

Sample distribution

All of Estimation Theory rests on the knowledge of the probability distribution of the sample X = {x1, x2, ..., xn }, which we denote Lq(X). From this distribution one derives (whenever possible) the probability distribution of any statistic T(X).

Sample distribution conditional on the value of a statistic

We denote by Lq(X | T = t0) the distribution of the sample conditional on a given value t0 of the statistic T. In loose and admittedly improper terms : keep drawing n-samples from p(x, q), and retain only those samples for which T = t0. These retained samples are distributed as Lq(X | T = t0).
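The "keep drawing and retain" procedure can be imitated on a computer. The sketch below is only an illustration under stated assumptions: the distribution is taken to be Bernoulli with parameter p, the statistic T is the number of ones in the sample, and the function names, sample size and seed are hypothetical choices, not part of the original text.

```python
import random
from collections import Counter

def conditional_samples(p, n, t0, n_draws, rng):
    """Draw n_draws Bernoulli(p) n-samples and keep those whose sum equals t0."""
    kept = []
    for _ in range(n_draws):
        sample = tuple(1 if rng.random() < p else 0 for _ in range(n))
        if sum(sample) == t0:          # T(X) = number of ones in the sample
            kept.append(sample)
    return kept

def empirical_conditional(p, n=3, t0=1, n_draws=60_000, seed=0):
    """Relative frequency of each retained sample, i.e. an estimate of L_p(X | T = t0)."""
    rng = random.Random(seed)
    kept = conditional_samples(p, n, t0, n_draws, rng)
    counts = Counter(kept)
    return {s: c / len(kept) for s, c in sorted(counts.items())}

freq_low = empirical_conditional(p=0.2)
freq_high = empirical_conditional(p=0.7)
for s in freq_low:
    # Both columns should be close to 1/3, whatever the value of p
    print(s, round(freq_low[s], 3), round(freq_high[s], 3))
```

With these choices the retained samples appear with roughly equal frequencies for both values of p, which is the behaviour the definition of sufficiency below asks for.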

Definition of a sufficient statistic

We now come to the formal definition of a sufficient statistic :

 

A statistic T is said to be sufficient for the parameter q 

if the distribution of the sample conditional on the value of the statistic T does not depend on q.

 

In other words, we can drop the index q in Lq(X |T = t0) and just write L(X |T = t0).

 

The thought experiment described on the next page explains how this rather abstract definition provides an instrumental content to the intuitive idea of a "statistic that retains all the information pertaining to q".

-----

We also give here a geometric interpretation of sufficiency that may help visualize the concept more accurately.

 

Factorization Theorem

As the examples in the first Tutorial show, identifying a sufficient statistic from the definition may be somewhat difficult because calculations involving conditional probabilities are usually cumbersome. Fortunately, one can show that if a distribution pq(x) admits a sufficient statistic T for the parameter q, then the joint probability distribution Lq(X) of an n-sample X can be written as :
 

Lq(X) = g(T(X), q).h(X)


where h(X) is a function of the sample only (a statistic).

 

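What this expression means is that Lq(X) can be factored into two terms :

    * g(T(X), q), which depends on q, and depends on the sample only through the value of the statistic T(X),

    * h(X), which depends on the sample but not on q.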

We will show (in the discrete case only) that the function g(t, q) can be taken to be the probability distribution of the sufficient statistic.
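As a small check of this factored form, consider Bernoulli observations with parameter p and T = number of ones. In that case g can be taken to be the binomial probability of T, and h the inverse of the number of samples sharing the same sum. The sketch below verifies the identity exactly on every sample of size 4. The names joint, g and h mirror the notation of the text, and the Bernoulli setting and numerical values are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def joint(sample, p):
    """L_p(X): joint probability of a Bernoulli(p) sample."""
    t = sum(sample)
    return p**t * (1 - p)**(len(sample) - t)

def g(t, p, n):
    """Binomial probability P(T = t): the only factor that involves p."""
    return comb(n, t) * p**t * (1 - p)**(n - t)

def h(sample):
    """Factor free of p: one over the number of samples with the same sum."""
    return Fraction(1, comb(len(sample), sum(sample)))

n, p = 4, Fraction(3, 10)
factorization_holds = all(
    joint(s, p) == g(sum(s), p, n) * h(s) for s in product((0, 1), repeat=n)
)
print(factorization_holds)
```

Exact rational arithmetic is used so that the equality is tested without rounding error.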

-----

The converse is also true : if a distribution pq(x) is such that the joint distribution Lq(x1, x2, ..., xn) of an n-sample X = {x1, x2, ..., xn} can be factored as above, then the statistic T is a sufficient statistic for the parameter q.

This important result is known as the Factorization Theorem, which we prove here.

 

The Factorization Theorem is the most practical way to identify sufficient statistics : given pq(x), one attempts to write the analytical expression of Lq(X)  in the factored form. If it is possible, then T(X) is a sufficient statistic.
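As a worked illustration of this recipe, take the Poisson distribution, writing its parameter as q (an assumption of notation made here for consistency with the rest of this page). The joint distribution of an n-sample can be rearranged as follows:

```latex
\begin{aligned}
L_q(X) &= \prod_{i=1}^{n} \frac{e^{-q}\, q^{x_i}}{x_i!} \\
       &= \underbrace{e^{-nq}\, q^{\sum_{i} x_i}}_{g(T(X),\,q)}
          \;\cdot\;
          \underbrace{\frac{1}{\prod_{i} x_i!}}_{h(X)},
\qquad T(X) = \sum_{i=1}^{n} x_i .
\end{aligned}
```

The first factor involves the sample only through the sum of the observations, and the second factor does not involve q, so the sum of the observations is a sufficient statistic for q.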

Exponential family

The first two characterizations of a sufficient statistic refer to the sample distribution, not to the distribution p(x, q) itself. Yet, it would certainly be convenient to be able to decide whether a distribution p(x, q) admits a sufficient statistic for q just by looking at its mathematical expression.

 

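We'll show here that, under regularity conditions (in particular, the range of x must not depend on q), p(x, q) admits a single real-valued sufficient statistic for q, whatever the sample size, if and only if it can be written as :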

 

p(x, q) = exp[A(x)B(q) + C(x) + D(q)]

 

 

This expression defines a class of distributions known as the exponential family.

 

In addition, we'll identify a particular sufficient statistic for q.
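As an illustration, the Poisson distribution can be put in this form with A(x) = x, B(q) = ln q, C(x) = -ln(x!) and D(q) = -q. The sketch below checks this numerically. The function names are hypothetical, and the Poisson parameter is written q as an assumption of notation.

```python
import math

def poisson_pmf(x, q):
    """Usual expression of the Poisson probability of x."""
    return math.exp(-q) * q**x / math.factorial(x)

def exp_family_form(x, q):
    """The same probability written as exp[A(x)B(q) + C(x) + D(q)]."""
    A = x                       # A(x)
    B = math.log(q)             # B(q)
    C = -math.lgamma(x + 1)     # C(x) = -ln(x!)
    D = -q                      # D(q)
    return math.exp(A * B + C + D)

checks = [
    math.isclose(poisson_pmf(x, q), exp_family_form(x, q), rel_tol=1e-9)
    for q in (0.5, 2.0, 7.5)
    for x in range(12)
]
print(all(checks))
```

For an n-sample, the product of such terms depends on q and the data jointly only through the sum of the A(xi), so by the Factorization Theorem this sum (here, the sum of the observations) is sufficient for q.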

Sufficient statistic and Efficient estimator

We show here that if a function g(q) of the parameter q admits an efficient estimator, then this estimator is a sufficient statistic for q.

The converse is of course not true : there is no reason why a sufficient statistic should be an efficient estimator, or even an unbiased estimator.

Sufficient statistics and functions

One-to-one functions

We'll then demonstrate two consequences of the Factorization Theorem :

    * A one-to-one transform of a statistic that is sufficient for q is also sufficient for q.


This result shows that sufficiency says nothing about whether a statistic is a good estimator. For suppose that the sufficient statistic T turns out to be an unbiased estimator of q. Then, however large the number a, T + a is also sufficient, but is of course a poor estimator of q.

    * A statistic that is sufficient for q is also sufficient for one-to-one transforms of q.

General case

In general, it is not true that a function f(T) of a sufficient statistic T is sufficient.

But if T is a sufficient statistic, and if T = f(S) where :

    * S is another statistic,  and

    * f is not necessarily one-to-one,

then S is also sufficient.

In other words, sufficiency does not necessarily flow "downstream", but it always flows "upstream".

 

We demonstrate this result here.
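The "upstream" and "downstream" behaviour can be observed on a small example. The sketch below assumes Bernoulli observations with parameter p and compares, by exact enumeration, the conditional distribution of the sample at two values of p. Agreement at two values is a numerical check, not a proof. The statistics total, split and parity are hypothetical examples chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def conditional_law(statistic, n, p):
    """Map each sample to P(X = sample | statistic(X)) for Bernoulli(p) data."""
    probs = {}
    for s in product((0, 1), repeat=n):
        t = sum(s)
        probs[s] = p**t * (1 - p)**(n - t)
    totals = {}
    for s, pr in probs.items():
        key = statistic(s)
        totals[key] = totals.get(key, 0) + pr    # P(statistic = key)
    return {s: pr / totals[statistic(s)] for s, pr in probs.items()}

def looks_sufficient(statistic, n=3):
    """True when the conditional law is identical at two values of p."""
    return (conditional_law(statistic, n, Fraction(1, 4))
            == conditional_law(statistic, n, Fraction(2, 3)))

def total(s):
    """T: the number of ones, sufficient for p."""
    return sum(s)

def split(s):
    """S: a finer statistic, with T = f(S) for f(a, b) = a + b."""
    return (s[0], sum(s[1:]))

def parity(s):
    """A function of T that is not one-to-one."""
    return sum(s) % 2

print(looks_sufficient(total), looks_sufficient(split), looks_sufficient(parity))
```

Under these assumptions the check passes for T and for the finer statistic S, and fails for the parity of T, which has discarded part of the useful information.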

-----

This result may be viewed in the light of the "useful information" paradigm. A function f(.) that is not one-to-one always causes a loss of information : knowing the output is not enough to know the input unambiguously. When applied to statistics, this remark shows that a function f that is not one-to-one may inadvertently throw away some of the useful information contained in T, thus making S = f(T) not sufficient.

 

 

On the other hand, a function f never creates information. So if T is sufficient, and T = f(S), then S must already contain all the useful information, and therefore be sufficient (lower image of the above illustration).

Minimal sufficient statistic

Given two sufficient statistics T and S such that T = f(S) for some f (not one-to-one), T may be regarded as S after shedding some extra and useless weight, but having retained all of its qualities as far as estimating q is concerned.

One may then wonder if a sufficient statistic T may be "lighter" than any other sufficient statistic. When this is the case, T is said to be a minimal sufficient statistic.

So, by definition :

 

A minimal sufficient statistic is a sufficient statistic that is a function of every other sufficient statistic.

 

Sufficiency and Maximum Likelihood

There is a close connection between the two concepts of "Sufficient statistic" and "Maximum Likelihood estimator".

We state without proof two important results :

    * If the Maximum Likelihood estimator of q is unique, then it is a function of every sufficient statistic.

    * If the Maximum Likelihood estimator of q is unique and is a sufficient statistic, then it is a minimal sufficient statistic.
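The first result can be observed on Bernoulli data, where the number of ones t is sufficient and the Maximum Likelihood estimator is t/n. The sketch below is an illustration only: it locates the maximum of the log-likelihood on a grid (an assumption made to avoid any dependence on an optimization library) for two different samples that share the same value of t.

```python
import math

def log_likelihood(sample, p):
    """Bernoulli log-likelihood; it involves the sample only through its sum."""
    t = sum(sample)
    return t * math.log(p) + (len(sample) - t) * math.log(1 - p)

def grid_mle(sample, steps=1000):
    """Value of p on a regular grid of (0, 1) that maximizes the log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(sample, p))

x = (1, 0, 0, 1, 0, 0, 0, 1, 0, 0)   # 3 ones out of 10
y = (0, 0, 1, 1, 1, 0, 0, 0, 0, 0)   # a different sample with the same sum
print(grid_mle(x), grid_mle(y), sum(x) / len(x))
```

Both samples lead to the same estimate, which coincides with t/n : the estimator "sees" the sample only through the sufficient statistic.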

Rao-Blackwell theorem

The Rao-Blackwell theorem shows how to improve an unbiased estimator of a parameter q (i.e. reduce its variance, or at least not increase it). The procedure involves conditioning the estimator on a statistic, which must be sufficient for q for the theorem to hold.
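The conditioning step can be carried out exactly on a small example. The sketch below assumes Bernoulli observations with parameter p, starts from the crude unbiased estimator "first observation", and conditions it on the sufficient statistic T = number of ones. All function names and numerical values are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

def sample_prob(s, p):
    """Joint probability of a Bernoulli(p) sample."""
    t = sum(s)
    return p**t * (1 - p)**(len(s) - t)

def rao_blackwellize(estimator, n, p):
    """Return the table t -> E[estimator(X) | sum(X) = t], by exact enumeration."""
    num, den = {}, {}
    for s in product((0, 1), repeat=n):
        t, pr = sum(s), sample_prob(s, p)
        num[t] = num.get(t, 0) + estimator(s) * pr
        den[t] = den.get(t, 0) + pr
    return {t: num[t] / den[t] for t in den}

def variance(estimator, n, p):
    """Exact variance of estimator(X) for an n-sample of Bernoulli(p)."""
    samples = list(product((0, 1), repeat=n))
    mean = sum(estimator(s) * sample_prob(s, p) for s in samples)
    return sum((estimator(s) - mean) ** 2 * sample_prob(s, p) for s in samples)

def first_obs(s):
    """Crude unbiased estimator of p: the first observation alone."""
    return s[0]

n, p = 5, Fraction(3, 10)
table = rao_blackwellize(first_obs, n, p)

def improved(s):
    """Conditioned estimator: a function of the sample through T only."""
    return table[sum(s)]

print(table)
print(variance(first_obs, n, p), variance(improved, n, p))
```

In this setting the conditioned estimator turns out to be t/n, the sample mean, and its variance is n times smaller than that of the first observation. Because T is sufficient, the table does not depend on the value of p used to compute it, which is what makes the result a legitimate estimator.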

Neyman-Pearson lemma

The Neyman-Pearson lemma identifies the Best Critical Region for a certain category of tests involving a parameter of a distribution. When this parameter admits a sufficient statistic, the lemma takes a particularly simple form because of the Factorization Theorem, and becomes a powerful tool for an easy identification of Best Critical Regions.
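The simplification can be seen on the likelihood ratio that the lemma is built on. In the sketch below, which assumes Poisson observations and two hypothetical parameter values q0 and q1, the factor h(X) cancels in the ratio, so that the ratio depends on the sample only through the sum of the observations.

```python
import math

def poisson_likelihood(sample, q):
    """Joint probability of a Poisson(q) sample."""
    return math.prod(math.exp(-q) * q**x / math.factorial(x) for x in sample)

def likelihood_ratio(sample, q0, q1):
    """Ratio of the likelihoods under q1 and under q0, from the full sample."""
    return poisson_likelihood(sample, q1) / poisson_likelihood(sample, q0)

def ratio_from_statistic(t, n, q0, q1):
    """Same ratio computed from t = sum of the observations: g(t, q1) / g(t, q0)."""
    return math.exp(-n * (q1 - q0)) * (q1 / q0) ** t

x = (2, 0, 5, 1)
y = (3, 3, 1, 1)      # a different sample with the same sum
q0, q1 = 1.0, 2.5
print(likelihood_ratio(x, q0, q1),
      likelihood_ratio(y, q0, q1),
      ratio_from_statistic(sum(x), len(x), q0, q1))
```

Since the ratio here increases with t when q1 > q0, a region of the form "likelihood ratio large" reduces to a region of the form "sum of the observations large".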

_________________________________________________________

 

 

Tutorial 1

 

In this Tutorial, we give five explicit examples of sufficient statistics that rely only on the definition of a sufficient statistic :

    * Bernoulli b(p) : the number of Heads in n tosses is a sufficient statistic for p. Not as obvious as it sounds.

    * Binomial B(n, p) : the identification of a sufficient statistic for p is a bit complex, and requires the demonstration of an important intermediary result.

    * Poisson P(l) : the sum of the observations is a sufficient statistic for the parameter l.

    * Uniform U[0, q] : we show that the "rightmost observation" (the order statistic of rank n) is a sufficient statistic for the parameter q.

    * Truncated exponential exp(q - x) : we show that the leftmost observation is sufficient for q.
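For the Poisson example of this list, the reasoning "from the definition" can be checked numerically. The sketch below relies on a standard fact, the sum of n independent Poisson variables with parameter q is Poisson with parameter nq, and computes the probability of a sample conditional on its sum for several values of q. The parameter is written q in the code, and the function names are hypothetical.

```python
import math

def poisson_pmf(x, q):
    """Poisson probability of x for parameter q."""
    return math.exp(-q) * q**x / math.factorial(x)

def conditional_prob(sample, q):
    """P(X = sample | sum(X) = t), using the fact that the sum is Poisson(n q)."""
    n, t = len(sample), sum(sample)
    joint = math.prod(poisson_pmf(x, q) for x in sample)
    return joint / poisson_pmf(t, n * q)

def multinomial_prob(sample):
    """t! / (x1! ... xn!) * n**(-t): an expression that contains no q at all."""
    n, t = len(sample), sum(sample)
    coeff = math.factorial(t) // math.prod(math.factorial(x) for x in sample)
    return coeff / n**t

sample = (2, 0, 3, 1)
values = [conditional_prob(sample, q) for q in (0.3, 1.0, 4.2)]
print(values, multinomial_prob(sample))
```

The conditional probability takes the same value for every q, and matches a multinomial expression from which q has disappeared.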

 

 

FIRST EXAMPLES OF SUFFICIENT STATISTICS

Bernoulli distribution

Binomial distribution

What is a "sample" from a binomial distribution ?

Conditional distribution of a binomial distribution

Conditional distribution of the sample

The statistic is sufficient

Poisson distribution

Uniform distribution

Truncated exponential


_______________________________________________

 

 

Tutorial 2

 

We prove here the Factorization Theorem when p(x,q) is discrete. The proof in the continuous case is substantially more difficult, and will not be presented in this Glossary.

We then prove the properties of functions of sufficient statistics stated above.

-----

We finally review some examples in which using the Factorization Theorem makes the identification of a sufficient statistic easier than the direct method.
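The Uniform U[0, q] distribution gives a simple instance. The joint density of an n-sample equals q to the power -n when all observations lie in [0, q], and 0 otherwise, which can be split into a factor g involving only the largest observation and q, and a factor h free of q. The sketch below checks this split on one sample. The names g and h follow the notation of the Factorization Theorem, and the sample values are arbitrary.

```python
def uniform_likelihood(sample, q):
    """Joint density of an n-sample from U[0, q]."""
    if min(sample) < 0 or max(sample) > q:
        return 0.0
    return q ** (-len(sample))

def g(t, q, n):
    """Factor involving q, and the sample only through t = largest observation."""
    return q ** (-n) if t <= q else 0.0

def h(sample):
    """Factor free of q: indicator that no observation is negative."""
    return 1.0 if min(sample) >= 0 else 0.0

sample = (0.4, 1.7, 0.9)
agree = all(
    uniform_likelihood(sample, q) == g(max(sample), q, len(sample)) * h(sample)
    for q in (1.0, 1.7, 2.0, 5.0)
)
print(agree)
```

The largest observation is therefore sufficient for q, a conclusion reached here with far less effort than through conditional distributions.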
 

 

 

THE FACTORIZATION THEOREM

The Factorization Theorem

Factorization is necessary

Factorization is sufficient

Distribution function of a sufficient statistic

Sufficient statistics and functions

One-to-one function of a sufficient statistic

One-to-one function of the parameter

Sufficient statistic function of a statistic

Examples of applications of the Factorization Theorem

Bernoulli distribution

Uniform distribution

Poisson distribution

Normal distribution (mean)

First method

Second method

Normal distribution (variance)

The statistic

Distribution of the statistic

Distribution of the sample

The statistic is sufficient

Gamma distribution

Exponential distribution


 

___________________________________________________

 

Related readings :

Exponential family

Maximum Likelihood estimation

Neyman-Pearson lemma

Rao-Blackwell theorem
