Sufficient statistic
Let p(x, θ) be a probability distribution known up to the value of the parameter θ.
Let also X = {x1, x2, ..., xn } be an n-sample drawn from this distribution. All that will ever be known about the value of θ is contained in this set of n numbers.
To estimate the value of θ, we devise a statistic (a function of the observations that does not depend on θ) with some nice properties, such as its expectation being equal to θ (we have then devised an unbiased estimator of θ).
Yet we may reasonably be concerned about now having a single number (the value of the estimator as calculated from the sample) when we started with a set of n numbers (the values of the observations in the sample). Certainly, something must have been lost in the process. One says that some "information" has been thrown away when collapsing the n-sample into a single number (the estimate).
An informal yet pervasive line of thinking in Estimation Theory considers that the sample contains both :
* "Information" that is useful for the purpose of estimating θ,
* And "information" that is useless for the purpose of estimating θ (although it might be useful for some other purpose).
It may therefore be feared that some of the useful information (as well as some of the useless information) is thrown away when collapsing the sample into a single estimate. This is indeed generally true.
Yet it is remarkable that under certain circumstances, creating a statistic will throw away only (some of) the useless information, while completely retaining the useful information about θ. When this is the case, the statistic is called a sufficient statistic for θ.

On the next page, we pursue this line of reasoning by describing a thought experiment that may help the reader gain a better intuitive understanding of the formal definition of a sufficient statistic as given in the next paragraph.
Let p(x, θ) be a probability distribution.
All of Estimation Theory rests on the knowledge of the probability distribution of the sample X = {x1, x2, ..., xn }, which we denote Lθ (X). From this distribution is derived (whenever possible) the probability distribution of any statistic T(X).
We denote Lθ (X |T = t ) the distribution of the sample conditional on a given value t of the statistic T. In loose and absolutely improper terms, keep drawing n-samples from p(x, θ), and retain only those samples for which T = t : these samples are distributed as Lθ (X |T = t ).
We now come to the formal definition of a sufficient statistic :
A statistic T is said to be sufficient for the parameter θ if the distribution of the sample conditional on the value of the statistic T does not depend on θ.
In other words, we can drop the index θ in Lθ (X |T = t ) and just write L(X |T = t ).
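The definition can be checked by brute force on a small case. The sketch below is only an illustration under our own assumptions: a Bernoulli n-sample with n = 4, the statistic T = sum of the observations, and two arbitrarily chosen values of p. The function name `conditional_given_sum` is ours. It computes the exact distribution of the sample conditional on T = t and compares the results obtained for the two values of p.

```python
from fractions import Fraction
from itertools import product

def conditional_given_sum(n, p, t):
    """Exact distribution of a Bernoulli(p) n-sample conditional on sum(X) == t."""
    joint = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            # Joint probability of this particular sequence of 0s and 1s.
            joint[x] = p**t * (1 - p)**(n - t)
    total = sum(joint.values())  # this is P(T = t)
    return {x: prob / total for x, prob in joint.items()}

n, t = 4, 2
cond_a = conditional_given_sum(n, Fraction(1, 5), t)    # p = 0.2 (arbitrary)
cond_b = conditional_given_sum(n, Fraction(7, 10), t)   # p = 0.7 (arbitrary)
same = cond_a == cond_b
```

Exact fractions are used rather than floats so that the comparison between the two conditional distributions is an equality test and not an approximation.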
The thought experiment described on the next page explains how this rather abstract definition provides an instrumental content to the intuitive idea of a "statistic that retains all the information pertaining to θ".
-----
We also give here a geometric interpretation of sufficiency that may help visualize the concept more accurately.
If T is sufficient, the distribution of the sample X conditional on the value of T does not depend on θ. The same is therefore true of any function f(X) of the sample, that is, of any statistic. So, if T is sufficient, the distribution of any statistic conditional on the value of T does not depend on θ. This remark is very useful in practice.
The parameter θ may be a vector parameter, that is, a set θ = {θ1, ..., θk } of scalar parameters. For example, one may need to estimate both the mean and the variance of a normal distribution N(µ, σ²) when the values of these two quantities are unknown. We then have θ = (µ, σ²).
In general, the individual components θi of the parameter θ have no sufficient statistic.
Yet, it is sometimes possible to identify a multidimensional sufficient statistic T = {T1, ..., Tk} for the vector parameter θ = {θ1, ..., θk }. The distribution of the sample conditional on the set of k values of {T1, ..., Tk} then does not depend on θ.
We'll illustrate the concept of multidimensional sufficient statistic by identifying a bidimensional sufficient statistic for the pair (µ, σ²) of a normal distribution whose mean and variance are both unknown.
As the examples in the first Tutorial show, identifying a sufficient statistic from the definition may be a bit difficult because calculations involving conditional probabilities are usually cumbersome. Fortunately, one can show that if a distribution pθ (x) admits a sufficient statistic T for the parameter θ, then the joint probability distribution Lθ (X) of an n-sample X can be written as :
Lθ (X) = g(T(X), θ) · h(X)
What this expression means is that Lθ (X) can be factored into two terms :
1) h(X), a non-negative function that depends only on the sample, but not on the parameter.
2) g(T(X), θ), a non-negative function that depends :
* On the parameter θ,
* and on the observations, but only through the value of the statistic T(X).
We will show that the function g(t, θ) can in fact be taken to be the probability distribution of the sufficient statistic T.
-----
The converse is also true : if a distribution pθ (x) is such that the joint distribution Lθ (x1, x2, ..., xn) of an n-sample X = {x1, x2, ..., xn} can be factored as above, then the statistic T is a sufficient statistic for the parameter θ.
This important result is known as the Factorization Theorem.
As we'll see, the Factorization Theorem is the most practical way to identify sufficient statistics : given pθ (x), one attempts to write the analytical expression of Lθ (X) in the factored form. If it is possible, then T(X) is a sufficient statistic.
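As an illustration, the Poisson joint distribution exp(-nλ) λ^(x1 + ... + xn) / (x1! ... xn!) is already in factored form, with T(X) the sum of the observations and h(X) = 1 / (x1! ... xn!). One consequence is that, for two samples sharing the same value of T, the ratio of their joint probabilities equals h(X) / h(Y) and so cannot depend on λ. The sketch below checks this numerically ; the two samples and the values of λ are arbitrary choices of ours, and `poisson_likelihood` is a hypothetical helper name.

```python
import math

def poisson_likelihood(sample, lam):
    """Joint probability of an i.i.d. Poisson(lam) sample."""
    return math.prod(math.exp(-lam) * lam**x / math.factorial(x) for x in sample)

x = [0, 3, 1, 4]   # T(x) = 8
y = [2, 2, 2, 2]   # T(y) = 8 as well

# The g(T, lam) terms cancel in the ratio, leaving h(x) / h(y).
ratios = [poisson_likelihood(x, lam) / poisson_likelihood(y, lam)
          for lam in (0.5, 1.0, 2.0, 5.0)]

# h(x) / h(y) computed directly from the factorials.
h_ratio = (math.prod(math.factorial(v) for v in y)
           / math.prod(math.factorial(v) for v in x))
```

A constant ratio for a handful of values of λ is of course not a proof of sufficiency, only a quick sanity check of a proposed factorization.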
The Factorization Theorem applies to multidimensional sufficient statistics as well (see above).
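For the normal distribution N(µ, σ²) with both parameters unknown, the factored form can be sketched as follows. The notation is ours ; the computation only expands the square in the exponent of the joint density of an n-sample.

```latex
\[
\begin{aligned}
L_{\mu,\sigma^2}(X)
  &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right] \\
  &= (2\pi\sigma^2)^{-n/2}
     \exp\!\left[-\frac{1}{2\sigma^2}
       \left(\sum_{i=1}^{n} x_i^2 \;-\; 2\mu\sum_{i=1}^{n} x_i \;+\; n\mu^2\right)\right].
\end{aligned}
\]
```

The sample enters this expression only through the pair T(X) = (Σ xi, Σ xi²), so one may take g equal to the whole expression and h(X) = 1. This suggests that (Σ xi, Σ xi²) is a bidimensional sufficient statistic for (µ, σ²).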
The first two characterizations of a sufficient statistic refer to the sample distribution, not to the distribution p(x, θ) itself. Yet, it would certainly be convenient to be able to decide whether a distribution p(x, θ) admits a sufficient statistic for θ just by looking at its mathematical expression.
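We'll show here that, under regularity conditions (in particular, the support of p(x, θ) must not depend on θ, which excludes cases such as U[0, θ]), p(x, θ) admits a sufficient statistic for θ if and only if it can be written as :
p(x, θ) = exp[A(x)B(θ) + C(x) + D(θ)]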
This expression defines a class of distributions known as the exponential family.
In addition, we'll identify a particular sufficient statistic for θ when p(x, θ) can be written as above.
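The "if" direction can be sketched directly from the Factorization Theorem. Taking the product of the n individual terms of an n-sample (our notation) :

```latex
\[
\begin{aligned}
L_\theta(X)
  &= \prod_{i=1}^{n} \exp\!\left[A(x_i)B(\theta) + C(x_i) + D(\theta)\right] \\
  &= \underbrace{\exp\!\left[B(\theta)\sum_{i=1}^{n} A(x_i) + nD(\theta)\right]}_{g(T(X),\,\theta)}
     \;\cdot\;
     \underbrace{\exp\!\left[\sum_{i=1}^{n} C(x_i)\right]}_{h(X)},
  \qquad T(X) = \sum_{i=1}^{n} A(x_i).
\end{aligned}
\]
```

For example, the Poisson distribution can be written exp[x ln λ - ln x! - λ], that is A(x) = x, B(λ) = ln λ, C(x) = -ln x! and D(λ) = -λ, which gives back the sum of the observations as a sufficient statistic for λ.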
We show here that if a function g(θ) of the parameter θ admits an efficient estimator, then this estimator is a sufficient statistic for θ.
The converse is of course not true : there is no reason why a sufficient statistic should be an efficient estimator, or even an unbiased estimator.
We'll then demonstrate two consequences of the Factorization Theorem :
* A one-to-one transform of a statistic that is sufficient for θ is also sufficient for θ.
This result shows that being sufficient has no bearing on whether the statistic is a good estimator or not. For suppose that the sufficient statistic T turns out to be an unbiased estimator of θ. Then, however large the number a, T + a is also sufficient, but is of course a poor estimator of θ.
* A statistic that is sufficient for θ is also sufficient for one-to-one transforms of θ.
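Both consequences can be read off the factored form. In the sketch below, φ and ψ denote one-to-one functions (our notation), S = φ(T) is the transformed statistic and η = ψ(θ) the transformed parameter.

```latex
\[
\begin{aligned}
L_\theta(X) &= g\big(T(X),\theta\big)\,h(X)
             = g\big(\varphi^{-1}(S(X)),\theta\big)\,h(X)
             = \tilde g\big(S(X),\theta\big)\,h(X), \\
L_\theta(X) &= g\big(T(X),\psi^{-1}(\eta)\big)\,h(X)
             = \bar g\big(T(X),\eta\big)\,h(X).
\end{aligned}
\]
```

In the first line the sample still enters the first factor only through S(X), and in the second line only through T(X), so the Factorization Theorem applies in both cases.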
In general, it is not true that a function f(T ) of a sufficient statistic T is sufficient.
But if T is a sufficient statistic, and if T = f(S) where :
* S is another statistic, and
* f is not necessarily one-to-one,
then S is also sufficient.
In other words, sufficiency does not necessarily flow "downstream", but it always flows "upstream".
We demonstrate this result here.
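The "upstream / downstream" behaviour can be observed on a small case. The sketch below is an illustration under our own assumptions : a Bernoulli 4-sample, three arbitrary values of p, and a hypothetical helper `is_sufficient` that compares the conditional distributions of the sample for these values only (so a positive answer is a consistency check, not a proof).

```python
from fractions import Fraction
from itertools import product

def is_sufficient(statistic, n, params):
    """Brute-force check for a Bernoulli n-sample: is P(X = x | statistic(X))
    identical for every p in params?"""
    tables = []
    for p in params:
        joint = {x: p**sum(x) * (1 - p)**(n - sum(x))
                 for x in product((0, 1), repeat=n)}
        marginal = {}
        for x, prob in joint.items():
            key = statistic(x)
            marginal[key] = marginal.get(key, 0) + prob
        tables.append({x: prob / marginal[statistic(x)]
                       for x, prob in joint.items()})
    return all(table == tables[0] for table in tables)

def total(x):
    return sum(x)                 # T, the number of successes

def split(x):
    return (x[0], sum(x[1:]))     # S, with T = f(S) = S[0] + S[1]

def parity(x):
    return sum(x) % 2             # f(T) with f not one-to-one

params = [Fraction(1, 4), Fraction(1, 2), Fraction(4, 5)]
results = {name: is_sufficient(stat, 4, params)
           for name, stat in (("T", total), ("S", split), ("f(T)", parity))}
```

With these choices the check passes for T and for the "upstream" statistic S, and fails for the parity of T, which has thrown away useful information.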
-----
This result may be viewed in the light of the "useful information" paradigm. A function f(.) that is not one-to-one always causes a loss of information : knowing the output is not enough to know the input unambiguously. When applied to statistics, this remark shows that a function f that is not one-to-one may inadvertently throw away some of the useful information contained in T, thus making S = f(T) not sufficient.
On the other hand, a function f never creates information. So if T is sufficient, and T = f(S), then S must already contain all the useful information, and therefore be sufficient (lower image of the above illustration).
Given two sufficient statistics T and S such that T = f(S) for some f (not one-to-one), T may be regarded as S after shedding some extra and useless weight, but having retained all of its qualities as far as estimating θ is concerned.
One may then wonder if a sufficient statistic T may be "lighter" than any other sufficient statistic. When this is the case, T is said to be a minimal sufficient statistic.
So, by definition :
A minimal sufficient statistic is a sufficient statistic that is a function of every other sufficient statistic.
The important question of minimal sufficient statistics is further developed here.
There is a close connection between the two concepts of "Sufficient statistic" and "Maximum Likelihood estimator".
We state without proof two important results :
* If the Maximum Likelihood estimator of θ is unique, then it is a function of a sufficient statistic.
* If the Maximum Likelihood estimator of θ is unique and is a sufficient statistic, then it is a minimal sufficient statistic.
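Although no proof is given here, the first result can be made plausible with the Factorization Theorem. The following is only a heuristic sketch in our notation, assuming h(X) > 0 :

```latex
\[
\hat\theta(X)
  = \arg\max_{\theta} L_\theta(X)
  = \arg\max_{\theta}\; g\big(T(X),\theta\big)\,h(X)
  = \arg\max_{\theta}\; g\big(T(X),\theta\big).
\]
```

The factor h(X) does not involve θ and so plays no role in the maximization. When the maximizer is unique, it is therefore determined by the value of T(X) alone.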
The Rao-Blackwell theorem shows how to improve an unbiased estimator of a parameter θ (i.e. reduce its variance). The procedure goes through one step of conditioning the estimator on a statistic that needs to be sufficient for θ for the theorem to be valid.
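The effect of the conditioning step can be observed by simulation. The sketch below rests on assumptions of ours : a Poisson(λ) n-sample, the target quantity exp(-λ) = P(X = 0), and the crude unbiased estimator equal to 1 when the first observation is 0. Conditioning it on the sufficient statistic T = sum of the observations gives ((n - 1)/n)^T, because the first observation is Binomial(t, 1/n) conditional on T = t. The helper names are hypothetical, and the Poisson draws use Knuth's multiplication method to stay within the standard library.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate(lam=1.0, n=10, reps=5000, seed=0):
    rng = random.Random(seed)
    crude, improved = [], []
    for _ in range(reps):
        sample = [poisson_draw(rng, lam) for _ in range(n)]
        t = sum(sample)                                  # sufficient statistic
        crude.append(1.0 if sample[0] == 0 else 0.0)     # unbiased for exp(-lam)
        improved.append(((n - 1) / n) ** t)              # E[crude | T = t]
    return crude, improved

crude, improved = simulate()
target = math.exp(-1.0)
var_crude = statistics.variance(crude)
var_improved = statistics.variance(improved)
```

Both lists should average close to exp(-1), while the variance of the conditioned estimator should be markedly smaller than that of the crude one.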
The Neyman-Pearson lemma identifies the Best Critical Region for a certain category of tests involving a parameter of a distribution. When this parameter admits a sufficient statistic, the lemma takes a particularly simple form because of the Factorization Theorem, and becomes a powerful tool for an easy identification of Best Critical Regions.
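The simplification can be sketched as follows, assuming the test opposes two simple hypotheses θ = θ0 and θ = θ1 (our notation) and that h(X) > 0 :

```latex
\[
\frac{L_{\theta_1}(X)}{L_{\theta_0}(X)}
  = \frac{g\big(T(X),\theta_1\big)\,h(X)}{g\big(T(X),\theta_0\big)\,h(X)}
  = \frac{g\big(T(X),\theta_1\big)}{g\big(T(X),\theta_0\big)}.
\]
```

The likelihood ratio then depends on the sample only through T(X), so a critical region defined by a threshold on this ratio can be described as a region for the value of T alone.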
_________________________________________________________
Tutorial 1
In this Tutorial, we give five explicit examples of sufficient statistics that rely only on the definition of a sufficient statistic :
* Bernoulli b(p) : the number of Heads in n tosses is a sufficient statistic for p. Not as obvious as it sounds.
* Binomial B(n, p) : the identification of a sufficient statistic for p is a bit complex, and requires the demonstration of an important intermediary result.
* Poisson P(λ) : we show that the sum of the observations is a sufficient statistic for the parameter λ.
* Uniform U[0, θ] : we show that the "rightmost observation" (the order statistic of rank n) is a sufficient statistic for the parameter θ.
* Truncated exponential exp(θ - x) : we show that the leftmost observation is sufficient for θ.
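The Poisson case lends itself to a direct numerical check of the definition. The sketch below is an illustration with an arbitrary sample and arbitrary values of λ chosen by us ; `conditional_prob` is a hypothetical helper. It uses the fact that the sum of n independent Poisson(λ) variables is Poisson(nλ), and compares the conditional probability of the sample with the multinomial probability t! / (x1! ... xn!) / n^t, in which λ does not appear.

```python
import math

def conditional_prob(sample, lam):
    """P(X = sample | sum(X) = sum(sample)) for an i.i.d. Poisson(lam) n-sample."""
    n, t = len(sample), sum(sample)
    joint = math.prod(math.exp(-lam) * lam**x / math.factorial(x) for x in sample)
    # The sum of n independent Poisson(lam) variables is Poisson(n * lam).
    p_t = math.exp(-n * lam) * (n * lam)**t / math.factorial(t)
    return joint / p_t

sample = [2, 0, 1, 3]
values = [conditional_prob(sample, lam) for lam in (0.3, 1.0, 4.0)]

# Multinomial probability with equal cell probabilities 1/n: no lam involved.
n, t = len(sample), sum(sample)
multinomial = (math.factorial(t)
               / math.prod(math.factorial(x) for x in sample)
               / n**t)
```

The three conditional probabilities should agree with each other and with the multinomial value, up to floating-point rounding.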
FIRST EXAMPLES OF SUFFICIENT STATISTICS
Bernoulli distribution; Binomial distribution; What is a "sample" from a binomial distribution ?; Conditional distribution of a binomial distribution; Conditional distribution of the sample; The statistic is sufficient; Poisson distribution; Uniform distribution; Truncated exponential
_______________________________________________
Tutorial 2
We demonstrate here the Factorization Theorem :
1) First in the case of a discrete distribution.
2) Then in the case of a distribution with a density. The general demonstration is difficult and beyond the scope of this Glossary, so we'll have to make some simplifying assumptions that fortunately cover most of the situations encountered in practice.
-----
We then demonstrate two consequences of the Factorization Theorem :
* A one-to-one transform of a statistic that is sufficient for θ is also sufficient for θ.
* A statistic that is sufficient for θ is also sufficient for one-to-one transforms of θ.
We conclude by showing that if a sufficient statistic is a function of another statistic, then this other statistic is also sufficient.
THE FACTORIZATION THEOREM
The Factorization Theorem (discrete case); Factorization is necessary; Factorization is sufficient; Distribution function of a sufficient statistic; The Factorization Theorem (densities); Restrictive conditions; Factorization is necessary; Factorization is sufficient; Sufficient statistics and functions; One-to-one function of a sufficient statistic; One-to-one function of the parameter; Sufficient statistic function of a statistic
______________________________________________
Tutorial 3
We now use the Factorization Theorem for identifying sufficient statistics for some classical distributions. We'll discover that, more often than not, using the Factorization Theorem makes this identification easier than using the mere definition of a sufficient statistic.
We'll encounter examples of bidimensional statistics that are sufficient for a pair of scalar parameters, even though these parameters have individually no sufficient statistic.
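As a foretaste, the uniform distribution U[0, θ] can be handled in one line once the support is written with indicator functions. This is a sketch in our notation, where x(1) and x(n) denote the smallest and largest observations and 1{.} is the indicator function :

```latex
\[
L_\theta(X)
  = \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbf{1}\{0 \le x_i \le \theta\}
  = \underbrace{\theta^{-n}\,\mathbf{1}\{x_{(n)} \le \theta\}}_{g(T(X),\,\theta)}
    \;\cdot\;
    \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(X)},
  \qquad T(X) = x_{(n)}.
\]
```

This agrees with the result of the first Tutorial, where the rightmost observation is shown to be sufficient for θ from the definition alone.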
FACTORIZATION THEOREM : EXAMPLES OF APPLICATIONS
Bernoulli distribution; Uniform distribution [0, θ]; Uniform distribution [θ, θ + 1]; Poisson distribution; Normal distribution; Mean; First method; Second method; Variance; The statistic; Distribution of the statistic; Distribution of the sample; The statistic is sufficient; Mean and variance; Exponential distribution; Gamma distribution; Shape parameter; Dispersion parameter; Shape and dispersion parameters; Beta distribution