Sufficient statistic

Let p(x; θ) be a probability distribution known up to the value of the parameter θ.

Let also x = {x1, x2, ..., xn } be an n-sample drawn from this distribution. All that will ever be known about the value of θ is contained in this set of n numbers.

# An estimator wastes information

To estimate the value of θ, we devise a statistic (a function of the observations that does not depend on θ) with some desirable properties, for example an expectation equal to θ (we have then devised an unbiased estimator of θ).

Yet we may reasonably be concerned about having now a single number (the value of the estimator as calculated from the sample) when we started with a set of n numbers (the values of the observations in the sample). Certainly, something must have been lost in the process. One says that some "information" has been thrown away when collapsing the n-sample into a single number (the estimate).

# "Useful" and "Useless" information

An informal yet pervasive line of thinking in Estimation Theory considers that the sample contains both :

* "Information" that is useful for the purpose of estimating θ,

* And "information" that is useless for the purpose of estimating θ (although it might be useful for some other purpose).

It may therefore be feared that some of the useful information (as well as some of the useless information) is thrown away when collapsing the sample into a single estimate. This is indeed generally true.

Yet it is remarkable that under certain circumstances, creating a statistic will throw away only (some of) the useless information, while completely retaining the useful information about θ. When this is the case, the statistic is called a sufficient statistic for θ.

On the next page, we pursue this line of reasoning by describing a thought experiment that may help the reader gain a better intuitive understanding of the formal definition of a sufficient statistic as given in the next paragraph.

# Definition of a sufficient statistic

Let p(x; θ) be a probability distribution.

## Sample distribution

All of Estimation Theory rests on the knowledge of the probability distribution of the sample X = {X1, X2, ..., Xn }, that we denote L(X; θ). From this distribution is derived (whenever possible) the probability distribution of any statistic T(X).

## Sample distribution conditional on the value of a statistic

We denote Lθ(X | T = t) the distribution of the sample conditional on a given value t of the statistic T. In loose and absolutely improper terms : keep drawing n-samples from p(x; θ), and retain only those samples for which T = t ; these samples are distributed as Lθ(X | T = t).

## Definition of a sufficient statistic

We now come to the formal definition of a sufficient statistic :

A statistic T is said to be sufficient for the parameter θ if the distribution of the sample conditional on the value of the statistic T does not depend on θ.

In other words, we can drop the index θ in Lθ(X | T = t) and just write L(X | T = t).
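As a concrete check of this definition, consider the Bernoulli case with T = the number of successes (this example is treated in full in Tutorial 2). The following Python sketch computes the conditional probability of a given sample arrangement exactly, and shows that it equals 1/C(n, t) whatever the value of the parameter p :

```python
from math import comb

def cond_prob(x, p):
    """P{X = x | T = t} for a Bernoulli(p) n-sample, with T = number of successes."""
    n, t = len(x), sum(x)
    joint = p**t * (1 - p)**(n - t)              # P{X = x}
    p_t = comb(n, t) * p**t * (1 - p)**(n - t)   # P{T = t} (binomial)
    return joint / p_t

x = (1, 0, 1, 1, 0)            # a 5-sample with t = 3
for p in (0.2, 0.5, 0.9):
    print(p, cond_prob(x, p))  # always 1/C(5,3) = 0.1, whatever p
```

The conditional distribution is uniform over the C(n, t) arrangements with t successes : p has vanished, which is precisely the definition of sufficiency.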

The thought experiment described on the next page explains how this rather abstract definition provides an instrumental content to the intuitive idea of a "statistic that retains all the information pertaining to θ".

-----

We also give here a geometric interpretation of sufficiency that may help visualize the concept more accurately.

1) If T is sufficient, the distribution of the sample X conditional on the value of T does not depend on θ. The same then holds for any function f(X) of the sample, that is, for any statistic. So, if T is sufficient, the distribution of any statistic conditional on the value of T does not depend on θ. This remark is very useful in practice.
2) We'll show that if a statistic T is sufficient for θ, then it is also sufficient for one-to-one functions of θ.

# A necessary and sufficient condition for sufficiency

The definition of a sufficient statistic uses the conditional distribution of the sample.

There exists a necessary and sufficient condition for sufficiency of a statistic T that uses :

* The (unconditional) distribution of the sample,

* And the distribution of the statistic T.

-----

Consider a probability distribution p(x; θ), and the joint probability distribution L(X; θ) of the sample.

Let also T be a statistic whose probability distribution function is q(x; θ).

We'll show that T is sufficient for θ if and only if the ratio :

L(X; θ) / q(T(X); θ)

which seems to depend on θ, does not, in fact, depend on θ.
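As an illustration (the Poisson case is also worked out in the Tutorials), the following Python sketch evaluates this ratio for a Poisson sample with T = sum of the observations, and shows numerically that it does not depend on λ :

```python
from math import exp, factorial, prod

def sample_likelihood(x, lam):
    """Joint Poisson(lam) likelihood of the sample x."""
    return prod(exp(-lam) * lam**xi / factorial(xi) for xi in x)

def statistic_pdf(t, n, lam):
    """Distribution of T = sum of n Poisson(lam) observations : Poisson(n*lam)."""
    return exp(-n * lam) * (n * lam)**t / factorial(t)

x = [2, 0, 3, 1]
t, n = sum(x), len(x)
for lam in (0.5, 1.0, 4.0):
    ratio = sample_likelihood(x, lam) / statistic_pdf(t, n, lam)
    print(lam, ratio)  # same value for every lam : t! / (n^t * prod of the x_i!)
```

The common value t!/(n^t Πxᵢ!) depends on the sample but not on λ, so T is sufficient for λ by this criterion.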

# Sufficient statistic and unbiased estimation

Sufficient statistics are at the heart of the quest for good unbiased estimators (low variance) of the parameter θ, up to the discovery of the best of these estimators : the unique Uniformly Minimum Variance Unbiased Estimator (UMVUE), when it exists. This quest follows a long path that can be summarized as follows :

## Rao-Blackwell Theorem

A sufficient statistic is not an estimator, but it can improve an already available unbiased estimator (e.g. reduce its variance).

Let :

* θ* be an unbiased estimator of θ.

* T be a sufficient statistic for θ.

The Rao-Blackwell Theorem states that

E[θ* | T]

is also an unbiased estimator of θ, and is at least as good as θ* (its variance is smaller than or equal to that of θ*).

Creating an "improved" unbiased estimator by applying the Rao-Blackwell Theorem is called "blackwellizing" the original estimator by a sufficient statistic.
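A small simulation makes the improvement concrete. In the Bernoulli case (chosen here only as an illustration), θ* = X1 is unbiased for p, T = ΣXi is sufficient, and E[θ* | T] = T/n ; blackwellizing divides the variance by n. A Python sketch (sample size, p, repetition count and seed are arbitrary choices) :

```python
import random

random.seed(0)
n, p, reps = 10, 0.3, 20000

naive, blackwellized = [], []
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    naive.append(x[0])                # theta* = X1 : unbiased, but crude
    blackwellized.append(sum(x) / n)  # E[theta* | T] = T/n

def var(v):
    m = sum(v) / len(v)
    return sum((u - m)**2 for u in v) / len(v)

print(var(naive), var(blackwellized))  # roughly p(1-p) = 0.21 vs p(1-p)/n = 0.021
```

Both estimators have expectation p, but the blackwellized one is far more concentrated around it.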

## Functions of a sufficient statistic

The image of a sufficient statistic by a one-to-one function is still sufficient. But in general, the image of a sufficient statistic by a function that is not one-to-one is not sufficient.

Yet it may happen that the image S = f(T) of a sufficient statistic T by a function f(.) that is not one-to-one is still sufficient.

In fact, we'll show that if a sufficient statistic S is a function of another statistic T, then T is certainly sufficient. In other words, a sufficient statistic can only be a function of another sufficient statistic.

We'll show that S = f(T) is then better than T in the following sense :

* Let θ* be an unbiased estimator of θ.

* Then blackwellizing θ* by S  produces an unbiased estimator of θ whose variance is smaller than the variance of the unbiased estimator obtained by blackwellizing the same θ* by T.

So the function f(.) operates as a "filter" for T  :

* The filter lets all of the "useful" information for estimating θ flow through,

* But it blocks some of the "useless" information, thus making a subsequent blackwellization procedure more effective.

## Minimal sufficient statistic

It may happen that a sufficient statistic T cannot be improved : no image of T by a function that is not one-to-one is still sufficient. T is then said to be minimal sufficient for θ.

## Complete statistic

A minimal sufficient statistic contains as little useless information as possible while still being sufficient. The amount of residual useless information need not be 0, though (see here). Fortunately, in many instances, a minimal sufficient statistic contains only useful information, and no useless information. When this is the case, the statistic is said to be complete.

Blackwellizing any unbiased estimator by a complete statistic produces the unique Uniformly Minimum Variance Unbiased Estimator (UMVUE), and our quest for the best possible unbiased estimator has come to a favorable end. This result is known as the Lehmann-Scheffé Theorem.

# Building a sufficient statistic

The definition of a sufficient statistic, as well as the necessary and sufficient condition described above, allows us to check (rather laboriously, see Tutorial) whether a given statistic is sufficient or not. But neither can be used for building a sufficient statistic from a given parametric distribution.

Fortunately, there exist two major methods for building a sufficient statistic.

## Factorization Theorem

Identifying a sufficient statistic most often relies on the Factorization Theorem, which states that if a distribution p(x; θ) is such that the joint distribution of the sample L(X; θ) can be factored as

L(X; θ) = g(T(X); θ) · h(X)

where :

* g(T(X); θ) is a non-negative function that depends on the parameter θ, and on the observations only through the statistic T(X),

* h(X) is a non-negative function that depends on the observations, but not on the parameter θ,

then T(X) is a sufficient statistic for θ.

The converse is also true : if a statistic T is sufficient, then the sample distribution can be factored as above.
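As a numerical illustration, take the N(µ, 1) distribution with T = sum of the observations. The factorization can be checked directly in Python ; the split g(t; µ) = exp(µt - nµ²/2), h(X) = (2π)^(-n/2) exp(-Σxᵢ²/2) used below is one possible choice, since g and h are only defined up to a constant factor :

```python
from math import exp, pi, isclose

def L(x, mu):
    """Joint N(mu, 1) likelihood of the sample x."""
    out = 1.0
    for xi in x:
        out *= exp(-(xi - mu)**2 / 2) / (2 * pi)**0.5
    return out

def g(t, n, mu):
    """Depends on mu, and on the data only through T = sum of the observations."""
    return exp(mu * t - n * mu**2 / 2)

def h(x):
    """Depends on the data, but not on mu."""
    return (2 * pi)**(-len(x) / 2) * exp(-sum(xi**2 for xi in x) / 2)

x = [0.4, -1.2, 2.0, 0.1]
t = sum(x)
for mu in (-1.0, 0.0, 2.5):
    assert isclose(L(x, mu), g(t, len(x), mu) * h(x))
print("L(X; mu) = g(T(X); mu) * h(X) for every mu : T is sufficient for mu")
```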

-----

The Factorization Theorem is a powerful method for building sufficient statistics :

1) From p(x; θ), one calculates the joint distribution of the sample L(X; θ).

2) One then attempts to write L(X; θ) in the above factored form.

3) If the attempt is successful, then the sufficient statistic T(X) is extracted from g(T(X); θ).

Note, though, that failing to factor L(X; θ) may only reflect the lack of perspicacity of the analyst. Yet, in some cases, it is clear that L(X; θ) cannot be factored as expected, and one can then safely assert that θ has no sufficient statistic.

The Factorization Theorem is demonstrated below.

-----

The factored form of L(X; θ) shows the separation between :

* g(T(X); θ), the part of the sample distribution useful for the estimation of θ, and that depends on the sample only through the value of the sufficient statistic T,

* And h(X), a function of the observations which does not depend on θ and plays no role in the estimation.

## Exponential family

The Factorization Theorem refers to the sample distribution, not to the distribution p(x; θ) itself. Yet, it would certainly be convenient to be able to decide whether a distribution p(x; θ) admits a sufficient statistic for θ just by looking at its mathematical expression.

We'll show here that, under regularity conditions (in particular, a support that does not depend on θ), p(x; θ) admits a sufficient statistic for θ if and only if it belongs to the exponential family, and can therefore be written :

 p(x; θ) = exp[A(x)B(θ) + C(x) + D(θ)]

In addition, we'll also show that T = Σi A(xi) is sufficient for θ.
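For instance, the Bernoulli distribution p(x; p) = p^x (1 - p)^(1 - x) fits this form with A(x) = x, B(p) = log(p/(1 - p)), C(x) = 0 and D(p) = log(1 - p), so that T = Σ xᵢ (the number of successes) is sufficient for p. A quick Python check :

```python
from math import exp, log, isclose

def bernoulli_pmf(x, p):
    """Direct form of the Bernoulli probability mass function."""
    return p**x * (1 - p)**(1 - x)

def exp_family_form(x, p):
    """Same pmf written as exp[A(x)B(p) + C(x) + D(p)]."""
    A, B, C, D = x, log(p / (1 - p)), 0.0, log(1 - p)
    return exp(A * B + C + D)

for p in (0.2, 0.7):
    for x in (0, 1):
        assert isclose(bernoulli_pmf(x, p), exp_family_form(x, p))
print("Bernoulli fits the exponential family ; T = sum of the x_i is sufficient for p")
```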

# Multidimensional sufficient statistic

The parameter θ may be a vector parameter, that is, a set θ = {θ1, ..., θk } of scalar parameters. For example, one may need to estimate both the mean and the variance of a normal distribution N(µ, σ²) when the values of these two quantities are unknown. We then have θ = (µ, σ²).

In general, the individual components θi of the parameter θ have no sufficient statistic.

Yet, it is sometimes possible to identify a multidimensional sufficient statistic T = {T1, ..., Tk} for the vector parameter θ = {θ1, ..., θk }. The sample distribution conditional on the set of k values of {T1, ..., Tk} then does not depend on θ.

We'll illustrate the concept of multidimensional sufficient statistic by identifying a bidimensional sufficient statistic for the pair (µ, σ²) of a normal distribution whose mean and variance are both unknown.
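Anticipating that result, the following Python sketch checks that the N(µ, σ²) likelihood of a sample can be computed from the pair T = (Σxᵢ, Σxᵢ²) alone, for any values of µ and σ² ; the expansion Σ(xᵢ - µ)² = Σxᵢ² - 2µΣxᵢ + nµ² does all the work :

```python
from math import exp, pi, isclose

def likelihood_from_data(x, mu, s2):
    """N(mu, s2) likelihood computed from the raw observations."""
    n = len(x)
    return (2 * pi * s2)**(-n / 2) * exp(-sum((xi - mu)**2 for xi in x) / (2 * s2))

def likelihood_from_T(t1, t2, n, mu, s2):
    """Same likelihood, using only T = (sum x_i, sum x_i^2)."""
    return (2 * pi * s2)**(-n / 2) * exp(-(t2 - 2 * mu * t1 + n * mu**2) / (2 * s2))

x = [1.3, -0.5, 2.2, 0.7, 1.1]
t1, t2 = sum(x), sum(xi**2 for xi in x)
for mu, s2 in ((0.0, 1.0), (1.5, 0.5), (-2.0, 3.0)):
    assert isclose(likelihood_from_data(x, mu, s2),
                   likelihood_from_T(t1, t2, len(x), mu, s2))
print("the pair (sum, sum of squares) carries all the information about (mu, sigma^2)")
```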

# Sufficient statistic and Efficient estimator

We show here that if a function g(θ) of the parameter θ admits an efficient estimator, then this estimator is a sufficient statistic for θ.

The converse is of course not true : there is no reason why a sufficient statistic should be an efficient estimator, or even an unbiased estimator.

# Sufficiency and Maximum Likelihood

The definition of a sufficient statistic relies on a property of the sample distribution, that is, of its likelihood. Also, the Factorization Theorem identifies a particular analytical form of the sample likelihood when a sufficient statistic exists. It should therefore be anticipated that some connection exists between the concepts of "Sufficiency" and "Maximum Likelihood".

* There is no reason why a Maximum Likelihood Estimator should be a sufficient statistic. But we'll show that if a MLE is unique, then it is a function of any sufficient statistic.

* In addition, if the MLE is itself a sufficient statistic, then it is minimal sufficient.

# Neyman-Pearson lemma

The Neyman-Pearson lemma identifies the Best Critical Region for a certain category of tests involving a parameter of a distribution. When this parameter admits a sufficient statistic, the lemma takes a particularly simple form because of the Factorization Theorem, and becomes a powerful tool for an easy identification of Best Critical Regions.
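The following Python sketch illustrates this simplification for testing the mean of a N(µ, 1) distribution : two samples sharing the same value of the sufficient statistic T = Σxᵢ necessarily have the same likelihood ratio, which also matches the closed form exp((µ1 - µ0)T - n(µ1² - µ0²)/2) obtained from the Factorization Theorem :

```python
from math import exp, isclose

def likelihood_ratio(x, mu0, mu1):
    """N(mu, 1) likelihood ratio for H0 : mu = mu0 against H1 : mu = mu1."""
    def L(mu):
        # The h(x) parts of the factorization cancel in the ratio
        return exp(-sum((xi - mu)**2 for xi in x) / 2)
    return L(mu1) / L(mu0)

x_a = [0.0, 2.0, 1.0]    # two different samples...
x_b = [1.0, 1.0, 1.0]    # ...with the same value of T = sum(x) = 3
lr_a = likelihood_ratio(x_a, 0.0, 1.0)
lr_b = likelihood_ratio(x_b, 0.0, 1.0)
assert isclose(lr_a, lr_b)                  # same T, same likelihood ratio
assert isclose(lr_a, exp(1.0 * 3 - 3 / 2))  # closed form in T only
print("the Best Critical Region is defined by a threshold on T alone")
```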

_________________________________________________________

 Tutorial 1

This Tutorial has two independent sections.

-----

In the first section, we justify the definition of a sufficient statistic (discrete case only, but the result is true generally).

* Let p(x; θ) be a probability distribution, and L(X; θ) be the joint distribution of an n-sample X.

* Let also T be a sufficient statistic for θ.

Let x = {x1, x2, ..., xn} be a sample drawn from p(x; θ), and let P{X = x} = L(x; θ) be the probability of this sample.

Denote t = T(x) the value of T on this sample.

We then draw samples Y = {Y1, Y2 , ..., Yn} from the conditional distribution

L(X |T = t )

which does not depend on θ by definition of a sufficient statistic.

We'll show that the probability for Y to be identical to x is equal to P{X = x}.

In other words, X and Y have identical unconditional probability distributions.
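For a small Bernoulli sample, this identity can be verified exactly in Python : drawing T first, then Y from the θ-free conditional distribution (uniform over the arrangements with t successes), reproduces the unconditional probability of every possible sample :

```python
from itertools import product
from math import comb, isclose

n, p = 3, 0.4

def p_direct(x):
    """Unconditional probability P{X = x} of a Bernoulli(p) n-sample."""
    t = sum(x)
    return p**t * (1 - p)**(n - t)

def p_two_stage(x):
    """Draw T first, then Y from the (theta-free) conditional distribution."""
    t = sum(x)
    p_T = comb(n, t) * p**t * (1 - p)**(n - t)  # distribution of T
    p_cond = 1 / comb(n, t)                     # uniform over arrangements, no p
    return p_T * p_cond

for x in product((0, 1), repeat=n):
    assert isclose(p_direct(x), p_two_stage(x))
print("X and Y have the same unconditional distribution")
```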

-----

In the second section, we establish a necessary and sufficient condition for a statistic to be sufficient. The demonstration is given only for the discrete case, but the result is true in general.

* Let L(X; θ) be the joint probability distribution of a sample X.

* Let also T be a statistic with probability distribution q(x; θ).

We'll show that T is sufficient for θ if and only if the ratio

L(X; θ) / q(T(X); θ)

which seems to depend on θ, does not, in fact, depend on θ.

We then give two examples of application of this theorem.


_________________________________________________________________

 Tutorial 2

In this Tutorial, we give six explicit examples of sufficient statistics that rely only on the definition of a sufficient statistic :

* Bernoulli b(p) : the number of Heads in n tosses is a sufficient statistic for p. Not as obvious as it sounds.

* Binomial B(n, p) : the identification of a sufficient statistic for p is a bit complex, and requires the demonstration of an important intermediate result.

* We show that for the Poisson distribution P(λ), the sum of the observations is a sufficient statistic for the parameter λ.

* Uniform U[0, θ] : we show that the rightmost observation (the order statistic of rank n) is a sufficient statistic for the parameter θ.

* Normal distribution N(µ, σ²) : the sample mean is sufficient for the distribution mean µ.

* Shifted exponential (or "location exponential") exp(θ - x) : we show that the leftmost observation is sufficient for θ.
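The uniform case lends itself to a quick simulation. Given the maximum m of a U[0, θ] sample, the remaining observations are distributed as i.i.d. U[0, m], independently of θ ; dividing them by the maximum should therefore give values that look like U[0, 1] whatever θ. A Python sketch (the sample size, number of repetitions and seed are arbitrary choices) :

```python
import random

random.seed(1)

def rescaled_others(theta, n=5, reps=4000):
    """Draw U[0, theta] n-samples ; divide the non-maximal observations by the max."""
    out = []
    for _ in range(reps):
        x = [random.uniform(0, theta) for _ in range(n)]
        m = max(x)
        out.extend(xi / m for xi in x if xi != m)
    return out

# Given the max, the other observations are i.i.d. U[0, max], whatever theta :
for theta in (1.0, 10.0):
    v = rescaled_others(theta)
    print(theta, sum(v) / len(v))  # close to 0.5 in both cases
```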


The unusual length of this Tutorial testifies to the difficulties encountered when attempting to show that a statistic is sufficient by referring only to the definition of a sufficient statistic. For example, the case of the mean of the normal distribution calls on advanced results about the multivariate normal distribution (see here), and requires calculating not so easy conditional expectations, conditional variances and conditional covariances.

Upon reaching the end of the Tutorial, the reader will feel the need for a more expedient way of identifying sufficient statistics, such as the Factorization Theorem, that we now address.

_______________________________________________

 Tutorial 3

We demonstrate here the Factorization Theorem :

1) First in the case of a discrete distribution.

2) Then in the case of a distribution with a density. The general demonstration is difficult and beyond the bounds of this Glossary, so we'll have to make some simplifying assumptions that fortunately cover most of the situations encountered in practice.

-----

We then demonstrate two consequences of the Factorization Theorem :

* A one-to-one transform of a statistic that is sufficient for θ is also sufficient for θ.
* A statistic that is sufficient for θ is also sufficient for one-to-one transforms of θ.

-----

We conclude by showing that if a sufficient statistic S is a function of another statistic T, then this other statistic is also sufficient. We pursue by outlining a proof of the fact that blackwellizing an unbiased estimator θ* by S yields a new unbiased estimator whose variance is smaller than that of the estimator obtained by blackwellizing θ* by T. Unfortunately, although the demonstration is formally simple, it calls on difficult properties of conditional expectations that will be stated without proof.


______________________________________________

 Tutorial 4

We now use the Factorization Theorem for identifying sufficient statistics for some classical distributions. We'll discover that, more often than not, using the Factorization Theorem makes this identification easier than using the mere definition of a sufficient statistic.

We'll encounter examples of bidimensional statistics that are sufficient for a pair of scalar parameters, even though these parameters individually have no sufficient statistic.


____________________________________

 Tutorial 5

In this short Tutorial, we show that :

* If a Maximum Likelihood Estimator is unique, then it is a function of any sufficient statistic.

* If a Maximum Likelihood Estimator is unique and is a sufficient statistic, then it is a minimal sufficient statistic.
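The first point can be illustrated in the Bernoulli case, where the (unique) MLE of p is T/n : two samples with the same value of the sufficient statistic T = Σxᵢ necessarily yield the same MLE. A Python sketch using a simple grid search :

```python
from math import log

def log_lik(x, p):
    """Bernoulli log-likelihood ; depends on the data only through t = sum(x)."""
    t, n = sum(x), len(x)
    return t * log(p) + (n - t) * log(1 - p)

def mle(x):
    """Grid-search MLE of p for a Bernoulli sample."""
    grid = [i / 1000 for i in range(1, 1000)]
    return max(grid, key=lambda p: log_lik(x, p))

# Two samples with the same value of the sufficient statistic T = sum(x) = 2 :
print(mle([1, 1, 0, 0, 0]), mle([0, 0, 1, 0, 1]))  # same MLE : 0.4 for both
```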


_______________________________________________________

Related readings :

* Exponential family
* Maximum Likelihood estimation
* Neyman-Pearson lemma
* Rao-Blackwell theorem
* Minimal sufficient statistic
* Complete statistic
* Lehmann-Scheffé Theorem