JUSTIFICATION OF THE DEFINITION OF A "SUFFICIENT STATISTIC"

The definition of a sufficient statistic is often perceived as abstract and offering little to intuition. We describe here a "thought experiment" that is intended to make this important concept easier to grasp.

# The experimental setting

Let p(x; θ) be a probability distribution that is known except for the value of one of its parameters θ.

Let also x = {x1, x2, ..., xn} be an n-sample drawn from this distribution.

Finally, let T be a statistic, and t the value of this statistic on the sample. We denote by gθ(t) the distribution of T. Whether gθ(t) is known or not is of no importance for the remainder of this section.

-----

We have two statisticians, call them S1 and S2. Both know p(x; θ), except for the value of θ.

• The complete sample x is known to statistician S1.
• But statistician S2 receives only the limited information: "The value of T on the sample (which is unknown to you) is t".

# The "rich" statistician and the "poor" statistician

Statistician S1 thinks:

"I'm in a good position because I have the detailed description of x, which contains all the information one will ever get about the value of θ."

Statistician S2 thinks:

"I'm at a disadvantage because the information I received is only a small part of the information contained in the sample. In particular, I do not know the values of the observations, so I will never know the value of any statistic other than T (or a function of T) on the sample. So, whatever question is asked about θ, there is nothing I can do except hope that T = t has something to do with this question.

My colleague S1 is, of course, in a much better position because he can do all sorts of calculations on the sample and will therefore very likely be able to provide a better answer to the question than I can".

-----

In general, all of the above is true, but there is a special and very important circumstance where statistician S2 is in just as good a position as his apparently luckier colleague S1.

S2 knows the analytical form of p(x; θ), just as S1 does (he just doesn't know the sample x). So he can calculate the theoretical distribution of the sample conditional on the value of the statistic T:

Lθ(X | T = t)

Usually, this distribution will depend on θ (this is why we indexed L with θ), which is unknown, so S2 cannot do anything practical with this theoretical distribution.

# The breakthrough

But suppose that S2 discovers, to his surprise, that θ does not show up in the mathematical expression of Lθ(X | T = t), that is, that Lθ(X | T = t) does not, in fact, depend on θ, and can therefore just be written L(X | T), with the "θ" index removed.

L(X | T = t) is then a completely defined distribution for any value t

and S2 can use it as he wishes.
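A concrete instance of this "breakthrough" (a minimal sketch, assuming an i.i.d. Bernoulli(θ) sample with T = x1 + ... + xn, a standard textbook case not taken from the text above): the conditional probability of any particular arrangement of the observations, given their sum t, is θ-free, because θ cancels between the joint probability and the probability that T = t.

```python
from math import comb

def cond_prob(x, theta):
    """P(X = x | T = t) for an i.i.d. Bernoulli(theta) sample x,
    with T = sum(x).

    The joint probability theta^t (1-theta)^(n-t) is divided by
    P(T = t) = C(n, t) theta^t (1-theta)^(n-t): theta cancels,
    leaving 1 / C(n, t), which does not depend on theta.
    """
    n, t = len(x), sum(x)
    joint = theta**t * (1 - theta)**(n - t)
    marginal = comb(n, t) * theta**t * (1 - theta)**(n - t)
    return joint / marginal

x = [1, 0, 1, 1, 0]
p1 = cond_prob(x, 0.3)
p2 = cond_prob(x, 0.8)
print(p1, p2)  # both ≈ 0.1 = 1 / C(5, 3), whatever theta is
```

Here S2, told only that t = 3, knows that every arrangement of three ones and two zeros is equally likely, without knowing θ.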

# The thought experiment

Then we can imagine the following thought experiment:

1. S1 draws a sample x = {x1, x2, ..., xn} from p(x; θ).
2. He calculates t, the value of T on this sample.
3. He communicates t to S2.
4. S2 draws a sample y = {y1, y2, ..., yn} from L(X | T = t).

x is a realization of a random vector X, and y is a realization of another random vector Y.

We'll show that:

 X and Y have identical probability distributions.

Actually, this result is true whether or not the statistic T is sufficient for θ. The sufficiency of T plays no role in the demonstration: it is there only to ensure that statistician S2 can use the conditional distribution L(X | T = t).
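The demonstration itself is one line (a sketch, written for a discrete T; for a continuous T the sum becomes an integral). By construction, y is obtained by first drawing a value t of T from gθ(t) (implicitly, through S1's sample) and then drawing from the conditional distribution, so for any event A the law of total probability gives:

```latex
P(Y \in A) \;=\; \sum_t g_\theta(t)\, P(X \in A \mid T = t) \;=\; P(X \in A).
```

Note that this identity uses gθ(t) and the conditional distribution as they are, θ-dependent or not, which is why sufficiency plays no role in it.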

Now let θ* be an estimator of θ. The two values θ*(x) and θ*(y) are different, but they follow the same probability distribution and are therefore, on average, equally good (or poor!) estimates of θ.
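This can be checked numerically (a simulation sketch, again assuming a Bernoulli(θ) sample with T = the sum of the observations, so that a draw from L(X | T = t) is simply a uniformly random arrangement of t ones and n − t zeros; the deliberately crude estimator θ* = "first observation" is our own choice, made so that θ*(x) and θ*(y) are not trivially equal):

```python
import random

random.seed(0)

theta, n, trials = 0.3, 10, 100_000
est_x, est_y = [], []
for _ in range(trials):
    # Step 1-2: S1 draws the sample and computes t = T(x) = sum(x)
    x = [1 if random.random() < theta else 0 for _ in range(n)]
    t = sum(x)
    # Step 4: S2, knowing only t, draws from L(X | T = t), i.e. a
    # uniformly random arrangement of t ones and (n - t) zeros
    y = [1] * t + [0] * (n - t)
    random.shuffle(y)
    # The crude estimator theta* = first observation, on both samples
    est_x.append(x[0])
    est_y.append(y[0])

mx = sum(est_x) / trials
my = sum(est_y) / trials
print(mx, my)  # both ≈ theta = 0.3
```

Trial by trial, θ*(x) and θ*(y) disagree, yet their empirical distributions (here, their means) coincide, as claimed.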

-----

Drawing a sample can always be decomposed into two stages: first draw a value t of T from gθ(t), then draw the sample from the conditional distribution given T = t. If we apply this scheme to our thought experiment:

* S1 draws the sample x directly from Lθ(X), by drawing n observations from p(x; θ).

* S2:

- First "draws" a realization of T from gθ(t). In fact, he doesn't actually do it: he just uses the value t given to him by S1. Note that S1 does not have to know gθ(t), because drawing a sample x and then calculating t is just the same as drawing a realization of T from gθ(t).

- Then S2 draws a sample y from L(X | T = t).

In summary, because we have been able to identify a statistic T such that the distribution of the sample conditional on the value t of this statistic does not depend on the parameter θ, we can dispense with knowing the details of the sample: using only the value t of the statistic T, we do as good a job as a statistician who knows the complete sample.

Such a statistic, when it exists, is called a sufficient statistic for θ. The term "sufficient" here means: "Don't bother to give me the complete description of the sample; it is sufficient for me to know the value t of T on the sample to do as good a job as if I knew the sample itself".
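The thought experiment can also be run for a normal sample with T = the sample mean (a simulation sketch; the sampling trick is our own addition, relying on the classical fact that, for a normal sample, the residuals xi − x̄ are independent of x̄, so an exact draw from L(X | mean = t) is obtained by recentering an auxiliary N(0, σ²) sample on t — note that this recipe never uses µ, which is precisely the θ-freeness of the conditional):

```python
import random
import statistics

random.seed(2)

mu, sigma, n, trials = 1.5, 1.0, 8, 50_000

med_x, med_y = [], []
for _ in range(trials):
    # S1's sample and its mean t = T(x)
    x = [random.gauss(mu, sigma) for _ in range(n)]
    t = statistics.mean(x)
    # S2's exact draw from L(X | mean = t): take an auxiliary
    # N(0, sigma^2) sample, remove its own mean, and shift it to t.
    # No knowledge of mu is needed.
    z = [random.gauss(0.0, sigma) for _ in range(n)]
    zbar = statistics.mean(z)
    y = [t + zi - zbar for zi in z]
    # Compare a statistic other than the mean, e.g. the sample median
    med_x.append(statistics.median(x))
    med_y.append(statistics.median(y))

mx = statistics.mean(med_x)
my = statistics.mean(med_y)
print(mx, my)  # both ≈ mu = 1.5
```

The sample medians computed by S1 (from x) and by S2 (from y) follow the same distribution, even though S2 was told only the mean t.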

________________________________________________________

 GEOMETRIC INTERPRETATION OF A SUFFICIENT STATISTIC

The concept of "sufficient statistic" has a geometric interpretation, which we illustrate with the normal distribution N(µ, σ²). We assume that σ² is fixed. The parameter θ of the preceding section is therefore the mean µ.

We consider 2-samples, whose distribution in the (x1, x2) plane is bivariate normal with circular symmetry, denoted Lµ(x1, x2). We index L with µ as a reminder that this distribution depends on the value of µ.

The above illustration represents Lµ(x1, x2) for µ = 0.

It can be shown that the sample mean m = (x1 + x2)/2 is a sufficient statistic for µ. For a given number t, the samples satisfying m = t lie on a line at 45° to the axes (blue line). The conditional distribution Lµ=0(x1, x2 | m = t) is the distribution encountered while moving along this blue line. Its values are the values of Lµ=0(x1, x2) on the line, normalized so that their integral is 1.

This conditional distribution is normal.

Now consider another value of µ. The distribution Lµ(x1, x2) is different (lower image of the above illustration). The equation m = t is represented by another blue line just above the first one. Because m is a sufficient statistic, the distributions on these two blue lines are identical (but of course, this classical result can also be demonstrated directly).
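This last claim can be checked numerically (a sketch under our own assumptions: σ = 1, t = 0.5, and conditioning on a thin band |m − t| < ε rather than on the exact line; the coordinate along the 45° line is measured by x1 − x2):

```python
import random
import statistics

random.seed(1)

def conditional_slice(mu, sigma=1.0, t=0.5, eps=0.05, draws=200_000):
    """Draw 2-samples from N(mu, sigma^2) and, for those whose mean m
    falls within eps of t, record x1 - x2, the position along the
    45-degree line m = t."""
    kept = []
    for _ in range(draws):
        x1 = random.gauss(mu, sigma)
        x2 = random.gauss(mu, sigma)
        if abs((x1 + x2) / 2 - t) < eps:
            kept.append(x1 - x2)
    return kept

a = conditional_slice(mu=0.0)
b = conditional_slice(mu=1.0)

# The two conditional distributions coincide: both ≈ N(0, 2 sigma^2)
print(statistics.mean(a), statistics.pvariance(a))
print(statistics.mean(b), statistics.pvariance(b))
```

The distributions along the two blue lines agree because, for a normal sample, x1 − x2 is independent of m = (x1 + x2)/2 and distributed as N(0, 2σ²) regardless of µ.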