Sufficient statistic

Let p(x, θ) be a probability distribution known up to the value of the parameter θ.

Let also X = {x1, x2, ..., xn } be an n-sample drawn from this distribution. All that will ever be known about the value of θ is contained in this set of n numbers.

An estimator wastes information

To estimate the value of θ, we devise a statistic (a function of the observations that does not depend on θ) with some desirable properties, for example an expectation equal to θ (the statistic is then an unbiased estimator of θ).

Yet we may reasonably be concerned about having now a single number (the value of the estimator as calculated from the sample) when we started with a set of n numbers (the values of the observations in the sample). Certainly, something must have been lost in the process. One says that some "information" has been thrown away when collapsing the n-sample into a single number (the estimate).

"Useful" and "Useless" information

An informal yet pervasive line of thinking in Estimation Theory considers that the sample contains both :

    * "Information" that is useful for the purpose of estimating θ,

    * And "information" that is useless for the purpose of estimating θ (although it might be useful for some other purpose).

 

It may therefore be feared that some of the useful information (along with some of the useless information) is thrown away when collapsing the sample into a single estimate. This is indeed generally true.

Yet it is remarkable that under certain circumstances, creating a statistic will throw away only (some of) the useless information, while completely retaining the useful information about θ. When this is the case, the statistic is called a sufficient statistic for θ.

 

 

 

On the next page, we pursue this line of reasoning by describing a thought experiment that may help the reader gain a better intuitive understanding of the formal definition of a sufficient statistic as given in the next paragraph.

 

Definition of a sufficient statistic

Let p(x, θ) be a probability distribution.

Sample distribution

All of Estimation Theory rests on the knowledge of the probability distribution of the sample X = {x1, x2, ..., xn }, which we denote Lθ (X). From this distribution is derived (whenever possible) the probability distribution of any statistic T(X).

Sample distribution conditional on the value of a statistic

We denote Lθ (X |T = t ) the distribution of the sample conditional on a given value t of the statistic T. In loose and admittedly improper terms : keep drawing n-samples from p(x, θ), and retain only those samples for which T = t ; these samples are distributed as Lθ (X |T = t ).

Definition of a sufficient statistic

We now come to the formal definition of a sufficient statistic :

 

A statistic T is said to be sufficient for the parameter θ 

if the distribution of the sample conditional on the value of the statistic T does not depend on θ.

 

In other words, we can drop the index θ in Lθ (X |T = t ) and just write L(X |T = t ).
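
The definition can be checked by brute force on a small case. The sketch below is only an illustration, assuming a Bernoulli distribution with parameter p and the statistic T = number of successes ; the function name conditional_given_sum is ours. It enumerates all samples of size n, keeps those for which T = t, and computes their exact conditional probabilities for two different values of p.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(n, p, t):
    """Exact distribution of a Bernoulli(p) n-sample given T = sum(sample) = t."""
    # Joint probability of every 0/1 sample whose sum equals t.
    joint = {
        x: p ** sum(x) * (1 - p) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == t
    }
    total = sum(joint.values())  # this is P(T = t)
    return {x: prob / total for x, prob in joint.items()}

n, t = 4, 2
cond_a = conditional_given_sum(n, 0.3, t)  # parameter p = 0.3
cond_b = conditional_given_sum(n, 0.8, t)  # parameter p = 0.8

# Print each retained sample with its conditional probability under both values of p.
for x in cond_a:
    print(x, round(cond_a[x], 6), round(cond_b[x], 6))
```

Under both values of p, each of the C(n, t) retained samples receives the same conditional probability 1 / C(n, t) : the index p can indeed be dropped.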

 

The thought experiment described on the next page explains how this rather abstract definition provides an instrumental content to the intuitive idea of a "statistic that retains all the information pertaining to θ".

-----

We also give here a geometric interpretation of sufficiency that may help visualize the concept more accurately.

 


If T is sufficient, the distribution of the sample X conditional on the value of T does not depend on θ. The same therefore holds for any function f(X) of the sample, that is, for any statistic. So, if T is sufficient, the distribution of any statistic conditional on the value of T does not depend on θ. This remark is very useful in practice.

Multidimensional sufficient statistic

The parameter θ may be a vector parameter, that is, a set θ = {θ1, ..., θk } of scalar parameters. For example, one may need to estimate both the mean and the variance of a normal distribution N(µ, σ²) when the values of these two quantities are unknown. We then have θ = (µ, σ²).

In general, the individual components θi of the parameter θ have no sufficient statistic.

Yet, it is sometimes possible to identify a multidimensional sufficient statistic T = {T1, ..., Tk} for the vector parameter θ = {θ1, ..., θk }. The distribution of the sample conditional on the set of k values of {T1, ..., Tk} then does not depend on θ.

We'll illustrate the concept of multidimensional sufficient statistic by identifying a bidimensional sufficient statistic for the pair (µ, σ²) of a normal distribution whose mean and variance are both unknown.
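
As a numerical illustration (not a proof), the sketch below assumes the candidate pair T = (Σ xi, Σ xi²) for the normal distribution ; the function names are ours. It takes two different samples that share the same value of T and checks that their likelihoods coincide for several values of (µ, σ²), which is what one expects if the sample speaks about (µ, σ²) only through T. The formal justification is the Factorization Theorem presented below.

```python
from math import isclose, log, pi

def normal_log_likelihood(sample, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample."""
    n = len(sample)
    squares = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * log(2 * pi * sigma2) - squares / (2 * sigma2)

def t_stat(sample):
    """Candidate bidimensional statistic: (sum of x, sum of x squared)."""
    return (sum(sample), sum(x * x for x in sample))

x = (0.0, 3.0, 3.0)
y = (1.0, 1.0, 4.0)   # a different sample with the same value of t_stat
same_t = t_stat(x) == t_stat(y)

# A few arbitrary (mu, sigma2) pairs, chosen only for illustration.
params = [(0.0, 1.0), (2.0, 0.5), (-1.5, 4.0)]
same_likelihood = all(
    isclose(normal_log_likelihood(x, mu, s2), normal_log_likelihood(y, mu, s2))
    for mu, s2 in params
)
print(same_t, same_likelihood)
```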

Factorization Theorem

As the examples in the first Tutorial show, identifying a sufficient statistic from the definition may be difficult because calculations involving conditional probabilities are usually cumbersome. Fortunately, one can show that if a distribution pθ (x) admits a sufficient statistic T for the parameter θ, then the joint probability distribution Lθ (X) of an n-sample X can be written as :
 

Lθ (X) = g(T(X), θ) · h(X)

 

What this expression means is that Lθ (X) can be factored into two terms :

    1) h(X), a non-negative function that depends only on the sample, but not on the parameter.

    2) g(T(X), θ), a non-negative function that depends :

            * On the parameter θ,

            * and on the observations, but only through the value of the statistic T(X).

 

We will show that the function g(t, θ) can in fact be chosen to be the probability distribution of the sufficient statistic T.

-----

The converse is also true : if a distribution pθ (x) is such that the joint distribution Lθ (x1, x2, ..., xn) of an n-sample X = {x1, x2, ..., xn} can be factored as above, then the statistic T is a sufficient statistic for the parameter θ.

This important result is known as the Factorization Theorem.

 

As we'll see, the Factorization Theorem is the most practical way to identify sufficient statistics : given pθ (x), one attempts to write the analytical expression of Lθ (X)  in the factored form. If it is possible, then T(X) is a sufficient statistic.
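
To make the factored form concrete, here is a minimal sketch for the Poisson distribution P(λ), with T(X) = sum of the observations. The choice g(t, λ) = exp(-nλ) λ^t and h(X) = 1 / (x1! x2! ... xn!) is one possible factorization, and the function names are ours. The code checks that g · h reproduces the likelihood, and that for two samples sharing the same value of T the likelihood ratio reduces to h(x) / h(y), whatever the value of λ.

```python
from math import exp, factorial, isclose, prod

def poisson_likelihood(sample, lam):
    """Joint probability of an i.i.d. Poisson(lam) sample."""
    return prod(exp(-lam) * lam ** x / factorial(x) for x in sample)

def g(t, lam, n):
    """Factor depending on the sample only through t = sum(sample)."""
    return exp(-n * lam) * lam ** t

def h(sample):
    """Factor that does not involve the parameter."""
    return 1 / prod(factorial(x) for x in sample)

x = (0, 2, 4)   # sum = 6
y = (1, 2, 3)   # same sum, different sample
ratios = []
for lam in (0.5, 1.0, 3.7):
    # The factored form reproduces the likelihood.
    assert isclose(poisson_likelihood(x, lam), g(sum(x), lam, len(x)) * h(x))
    ratios.append(poisson_likelihood(x, lam) / poisson_likelihood(y, lam))

print(ratios)  # one ratio per value of lam; all equal to h(x) / h(y)
```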


The Factorization Theorem applies to multidimensional sufficient statistics as well (see above).

Exponential family

The first two characterizations of a sufficient statistic refer to the sample distribution, not to the distribution p(x, θ) itself. Yet, it would certainly be convenient to be able to decide whether a distribution p(x, θ) admits a sufficient statistic for θ just by looking at its mathematical expression.

 

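We'll show here that, provided the support of p(x, θ) does not depend on θ (the uniform distribution U[0, θ], for instance, has a sufficient statistic without satisfying this condition), p(x, θ) admits a sufficient statistic for θ whose dimension does not grow with the sample size if and only if it can be written as :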

 

p(x, θ) = exp[A(x)B(θ) + C(x) + D(θ)]

 

 

This expression defines a class of distributions known as the exponential family.

 

In addition, we'll identify a particular sufficient statistic for θ when p(x, θ) can be written as above.
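
As a worked illustration, take the Poisson distribution P(λ). The identification of A, B, C and D below is one possible choice, and the sufficient statistic it yields, the sum of the A(xi), is the standard one for this form.

```latex
% Poisson distribution written in the exponential-family form
p(x, \lambda) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}
              = \exp\bigl[\, x \ln\lambda \;-\; \ln x! \;-\; \lambda \,\bigr]
% so that
A(x) = x, \qquad B(\lambda) = \ln\lambda, \qquad C(x) = -\ln x!, \qquad D(\lambda) = -\lambda .

% Joint distribution of an n-sample, in factored form g(T(X), \lambda) \cdot h(X)
L_\lambda(X) = \exp\Bigl[\, B(\lambda) \sum_{i=1}^{n} A(x_i) + n\,D(\lambda) \Bigr]
               \cdot \exp\Bigl[\, \sum_{i=1}^{n} C(x_i) \Bigr]
```

By the Factorization Theorem, T(X) = Σ A(xi) = Σ xi is then sufficient for λ.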

Sufficient statistic and Efficient estimator

We show here that if a function g(θ) of the parameter θ admits an efficient estimator, then this estimator is a sufficient statistic for θ.

The converse is of course not true : there is no reason why a sufficient statistic should be an efficient estimator, or even an unbiased estimator.

Sufficient statistics and functions

One-to-one functions

We'll then demonstrate two consequences of the Factorization Theorem :

    * A one-to-one transform of a statistic that is sufficient for θ is also sufficient for θ.


This result shows that sufficiency says nothing about whether a statistic is a good estimator. For suppose that the sufficient statistic T turns out to be an unbiased estimator of θ. Then, however large the number a, T + a is also sufficient, but is of course a poor estimator of θ.

    * A statistic that is sufficient for θ is also sufficient for one-to-one transforms of θ.

General case

In general, it is not true that a function f(T ) of a sufficient statistic T is sufficient.

But if T is a sufficient statistic, and if T = f(S) where :

    * S is another statistic,  and

    * f is not necessarily one-to-one,

then S is also sufficient.

In other words, sufficiency does not necessarily flow "downstream", but it always flows "upstream".
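
Both directions can be observed on a small exact computation. The sketch below is an illustration only, assuming a Bernoulli sample of size 3 ; the statistics parity and split_count, as well as the helper names, are ours. T is the number of successes, parity is the function T mod 2 (not one-to-one), and split_count = (x1, x2 + x3) is a statistic S of which T is a function. For each statistic, the code tests whether the conditional distribution of the sample changes between two values of p.

```python
from itertools import product
from math import isclose

def conditional_given(stat, n, p, value):
    """Exact distribution of a Bernoulli(p) n-sample given stat(sample) == value."""
    joint = {
        x: p ** sum(x) * (1 - p) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if stat(x) == value
    }
    total = sum(joint.values())
    return {x: prob / total for x, prob in joint.items()}

def total_heads(x):   # T: number of successes
    return sum(x)

def parity(x):        # f(T) = T mod 2, with f not one-to-one ("downstream")
    return sum(x) % 2

def split_count(x):   # S = (x1, x2 + x3), so that T = S[0] + S[1] ("upstream")
    return (x[0], sum(x[1:]))

def depends_on_p(stat, value, n=3, p1=0.2, p2=0.7):
    """True if the conditional law of the sample given stat differs between p1 and p2."""
    a = conditional_given(stat, n, p1, value)
    b = conditional_given(stat, n, p2, value)
    return not all(isclose(a[x], b[x]) for x in a)

print(depends_on_p(total_heads, 1))       # T itself
print(depends_on_p(parity, 1))            # downstream of T
print(depends_on_p(split_count, (1, 1)))  # upstream of T
```

Only the conditional distribution given the parity changes with p : the parity has lost some of the useful information, whereas T and S have not.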

 

We demonstrate this result here.

-----

This result may be viewed in the light of the "useful information" paradigm. A function f(.) that is not one-to-one always causes a loss of information : knowing the output is not enough to know the input unambiguously. When applied to statistics, this remark shows that a function f that is not one-to-one may inadvertently throw away some of the useful information contained in T, so that S = f(T) may fail to be sufficient.

 

 

On the other hand, a function f never creates information. So if T is sufficient, and T = f(S), then S must already contain all the useful information, and therefore be sufficient (lower image of the above illustration).

Minimal sufficient statistic

Given two sufficient statistics T and S such that T = f(S) for some f (not one-to-one), T may be regarded as S after shedding some extra and useless weight, but having retained all of its qualities as far as estimating θ is concerned.

One may then wonder if a sufficient statistic T may be "lighter" than any other sufficient statistic. When this is the case, T is said to be a minimal sufficient statistic.

So, by definition :

 

A minimal sufficient statistic is a sufficient statistic that is a function of every other sufficient statistic.

 

 

The important question of minimal sufficient statistics is further developed here.

Sufficiency and Maximum Likelihood

There is a close connection between the two concepts of "Sufficient statistic" and "Maximum Likelihood estimator".

We state without proof two important results :

    * If the Maximum Likelihood estimator of θ is unique, then it is a function of a sufficient statistic.

    * If the Maximum Likelihood estimator of θ is unique and is a sufficient statistic, then it is a minimal sufficient statistic.

Rao-Blackwell theorem

The Rao-Blackwell theorem shows how to improve an unbiased estimator of a parameter θ, that is, how to build from it another unbiased estimator whose variance is no larger. The procedure consists in conditioning the estimator on a statistic, which must be sufficient for θ for the theorem to be valid.
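
A minimal sketch of the conditioning step, assuming a Bernoulli sample and the deliberately crude unbiased estimator "first observation" (the function names are ours). By symmetry, the expectation of x1 conditional on T = Σ xi is T / n, so the improved estimator is the sample mean. The code computes the exact mean and variance of both estimators by enumerating all samples.

```python
from itertools import product
from math import isclose

def exact_mean_var(estimator, n, p):
    """Exact mean and variance of estimator(sample) for a Bernoulli(p) n-sample."""
    mean = second = 0.0
    for x in product((0, 1), repeat=n):
        prob = p ** sum(x) * (1 - p) ** (n - sum(x))
        value = estimator(x)
        mean += prob * value
        second += prob * value ** 2
    return mean, second - mean ** 2

def crude(x):
    """Unbiased but wasteful estimator of p: the first observation only."""
    return x[0]

def rao_blackwellized(x):
    """E[x1 | T = sum(x)] = T / n, since all observations play symmetric roles."""
    return sum(x) / len(x)

n, p = 5, 0.3
mean_crude, var_crude = exact_mean_var(crude, n, p)
mean_rb, var_rb = exact_mean_var(rao_blackwellized, n, p)
print(mean_crude, var_crude)  # mean and variance of the crude estimator
print(mean_rb, var_rb)        # mean and variance after conditioning on T
```

Both estimators have expectation p, but the variance drops from p(1 - p) to p(1 - p) / n.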

Neyman-Pearson lemma

The Neyman-Pearson lemma identifies the Best Critical Region for a certain category of tests involving a parameter of a distribution. When this parameter admits a sufficient statistic, the lemma takes a particularly simple form because of the Factorization Theorem, and becomes a powerful tool for an easy identification of Best Critical Regions.

_________________________________________________________

 

 

Tutorial 1

 

In this Tutorial, we give five explicit examples of sufficient statistics that rely only on the definition of a sufficient statistic :

    * Bernoulli b(p) : the number of Heads in n tosses is a sufficient statistic for p. Not as obvious as it sounds.

    * Binomial B(n, p) : the identification of a sufficient statistic for p is a bit complex, and requires the demonstration of an important intermediary result.

    * We show that for the Poisson distribution P(λ), the sum of the observations is a sufficient statistic for the parameter λ.

    * Uniform U[0, θ] : we show that the "rightmost observation" (the order statistic of rank n) is a sufficient statistic for the parameter θ.

    * Truncated exponential exp(θ - x) : we show that the leftmost observation is sufficient for θ.

 

 

FIRST EXAMPLES OF SUFFICIENT STATISTICS

Bernoulli distribution

Binomial distribution

What is a "sample" from a binomial distribution ?

Conditional distribution of a binomial distribution

Conditional distribution of the sample

The statistic is sufficient

Poisson distribution

Uniform distribution

Truncated exponential

TUTORIAL

_______________________________________________

 

 

Tutorial 2

 

We demonstrate here the Factorization Theorem :

    1) First in the case of a discrete distribution.

    2) Then in the case of a distribution with a density. The general demonstration is difficult and beyond the bounds of this Glossary, so we'll have to make some simplifying assumptions that fortunately cover most of the situations encountered in practice.

-----

We then demonstrate two consequences of the Factorization Theorem :

    * A one-to-one transform of a statistic that is sufficient for θ is also sufficient for θ.
    * A statistic that is sufficient for θ is also sufficient for one-to-one transforms of θ.

We conclude by showing that if a sufficient statistic is a function of another statistic, then this other statistic is also sufficient.
 

 

 

THE FACTORIZATION THEOREM

The Factorization Theorem (discrete case)

Factorization is necessary

Factorization is sufficient

Distribution function of a sufficient statistic

The Factorization Theorem (densities)

Restrictive conditions

Factorization is necessary

Factorization is sufficient

Sufficient statistics and functions

One-to-one function of a sufficient statistic

One-to-one function of the parameter

Sufficient statistic function of a statistic

 

TUTORIAL

______________________________________________

 

 

Tutorial 3

 

We now use the Factorization Theorem for identifying sufficient statistics for some classical distributions. We'll discover that, more often than not, using the Factorization Theorem makes this identification easier than using the mere definition of a sufficient statistic.

We'll encounter examples of bidimensional statistics that are sufficient for a pair of scalar parameters, even though these parameters have individually no sufficient statistic.

 

 

FACTORIZATION THEOREM :

EXAMPLES OF APPLICATIONS

Bernoulli distribution

Uniform distribution [0, θ]

Uniform distribution [θ, θ + 1]

Poisson distribution

Normal distribution

Mean

First method

Second method

Variance

The statistic

Distribution of the statistic

Distribution of the sample

The statistic is sufficient

Mean and variance

Exponential distribution

Gamma distribution

Shape parameter

Dispersion parameter

Shape and dispersion parameters

Beta distribution

TUTORIAL

 

_______________________________________________________

 

Related readings :

Exponential family

Maximum Likelihood estimation

Neyman-Pearson lemma

Rao-Blackwell theorem

Minimal sufficient statistic

Complete statistic
