Test
Tests are, together with Estimation, one of the main activities of ordinary Statistics.
We first go over a simple test in some detail as a means to introduce some basic ideas. We'll then develop and formalize these ideas.
A certain coin is reputed "fair". That is, when tossed, the probability for the coin to land on "Heads" is claimed to be p = .5 (and therefore the probability for the coin to land on "Tails" is claimed to be q = (1 - p) = .5).
We are going to describe a procedure that will allow us to assess the plausibility of this claim. This procedure will be called a test. This test will bear on the hypothesis according to which the probability for the coin to land on Heads is indeed .5.
We now do the only thing we can do in the face of an unknown probability distribution : we draw a sample from this distribution. Here, the distribution is the Bernoulli b(p) distribution, with p as the sole parameter.
So we toss the coin, say, 10 times. The result of this series of 10 tosses is as follows :
0 1 0 0 1 0 0 0 1 0
where :
* "1" stands for Heads
* "0" stands for Tails.
We now have a 10-sample drawn from the above-mentioned distribution. This sample contains all the information we'll ever get about the distribution.
We could imagine calculating the probability to have drawn this particular sample if it is true that p = .5, and use this value as an indication of the plausibility of the hypothesis of fairness of the coin.
A very low probability could then be considered as a clue that the hypothesis is wrong, and that the coin is not fair.
As it turns out, if p = .5 indeed, then all samples have exactly the same probability (which is equal to (.5)^10, or about .001), and so the probability of the sample under the hypothesis that p = .5 is of no use in helping us form an opinion about the fairness of the coin.
But we can extract from the sample some information that will guide our reflection : the number of Heads in the sample. This number is 3, and we can calculate the probability that tossing a fair coin 10 times will generate exactly 3 Heads (see binomial distribution). This probability is about .117.
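The two probabilities just mentioned can be checked with a few lines of Python. This is only a sketch : the variable names are ours, and the sample is the one listed above.

```python
from math import comb

# Observed sample : 1 = Heads, 0 = Tails
sample = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
n = len(sample)
heads = sum(sample)
p = 0.5  # probability of Heads under the hypothesis of fairness

# Probability of this exact sequence of tosses.
# When p = .5 this equals (.5)^10 whatever the sequence.
prob_sequence = p ** heads * (1 - p) ** (n - heads)

# Probability of observing exactly `heads` Heads in n tosses (binomial distribution)
prob_heads = comb(n, heads) * p ** heads * (1 - p) ** (n - heads)

print(heads)
print(prob_sequence)
print(round(prob_heads, 3))
```

The first probability is the same for every possible sample, whereas the second one varies with the number of Heads, which is what makes it informative.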
The binomial distribution tells us that :
* The most probable outcomes are those for which the numbers of Heads and Tails are approximately equal (see illustration below).
* Whereas the most unlikely outcomes are those for which there are very few Heads, or very few Tails (lower image of the illustration).
So we see that the number of Heads is a good indicator of the credibility of the hypothesis p = .5.
This number is a function of the observations in the sample only : it is a statistic. The important fact about this statistic is that we know its probability distribution under the hypothesis p = .5 : it is the binomial distribution B(10, .5). This will allow us to use the value of the statistic to quantify the credibility of the hypothesis under test.
From now on, we'll reason on this statistic, which is promoted to the rank of statistic of the test.
Rejecting the hypothesis p = .5 on the basis of an improbable number of Heads involves a risk, for even a perfectly honest coin can produce 10 Heads in a row : this event is unlikely, but not impossible. So we know that rejecting the hypothesis p = .5 is a decision prone to errors.
The risk of committing such an error can be quantified. Suppose we decide beforehand that we'll reject the hypothesis p = .5 if we draw 0, 1, 9, or 10 Heads, on the basis of the fact that these outcomes are just too unlikely for a fair coin. The probability for this to happen is :
P{Nb Heads = 0} + P{Nb Heads = 1} + P{Nb Heads = 9} + P{Nb Heads = 10}
which is about .02.
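As a quick check, the sum above can be evaluated directly from the binomial distribution B(10, .5). The helper name binom_pmf is ours, chosen for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P{Nb Heads = k} for n independent tosses with P(Heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Outcomes for which we decided to reject the hypothesis p = .5
rejection_region = [0, 1, 9, 10]

# Probability of landing in this region when the coin is fair
alpha = sum(binom_pmf(k, 10, 0.5) for k in rejection_region)
print(round(alpha, 4))
```

The exact value is 22/1024, which is indeed about .02.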
This probability is called the significance level of the test. It is the probability that we will wrongfully reject the hypothesis p = .5 on the basis of our criterion.
We decided to reject the hypothesis p = .5 if we observe :
* 1 Head or less,
* 9 Heads, or more.
The values "1" and "9" are called the critical values of the test statistic. If the observed value of the test statistic is beyond these critical values, the hypothesis will be rejected. The region beyond the critical values is called the critical region (or "region of rejection").
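The critical region depends on the chosen significance level. Had we decided on a significance level of, say, .15 (instead of .02), then a simple calculation shows that the critical values would have been 2 and 8, instead of 1 and 9, and we would have had to reject the hypothesis had we observed 0, 1, 2, 8, 9 or 10 Heads. Because the number of Heads is a discrete quantity, the probability of this region for a fair coin is about .11, the largest attainable value that does not exceed .15.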
The risk of wrongfully rejecting the hypothesis p = .5 would then have been higher than before.
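The dependence of the critical values on the significance level can be sketched as follows. The function name and the convention used are assumptions of ours : since the number of Heads is discrete, a given level usually cannot be matched exactly, so the sketch returns the largest symmetric region whose probability under B(n, .5) does not exceed the requested level.

```python
from math import comb

def symmetric_critical_region(n, alpha):
    """Largest region {0..c} U {n-c..n} whose probability under B(n, .5)
    does not exceed alpha.

    Returns (left critical value, right critical value, actual level),
    or None if even the region {0, n} is too probable.
    """
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    best = None
    for c in range(n // 2):  # keeps the two tails disjoint
        level = sum(pmf[: c + 1]) + sum(pmf[n - c:])
        if level > alpha:
            break
        best = (c, n - c, level)
    return best

print(symmetric_critical_region(10, 0.05))
print(symmetric_critical_region(10, 0.15))
```

With n = 10, a requested level of .05 leads to the critical values 1 and 9 (actual level about .02), and a requested level of .15 leads to the critical values 2 and 8 (actual level about .11).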
________________________________________
We'll stop the description of our little test here. It allowed us to introduce some of the concepts that are to be found in all tests.
We now go over these concepts in more detail and formalize them so that they can be used in many different contexts. This formalization is necessary for introducing the important notion of power of a test.
A test always starts with a hypothesis about the probability distribution that generated a given sample. This hypothesis bears the generic name of null hypothesis, and is denoted H0. Dozens of classical null hypotheses can be formulated (see some of them on the next page). In the above test, the null hypothesis is that the coin is fair. In the standard terminology of tests, we will write :
H0 : p = .5
The null hypothesis can bear on more than just one distribution.
For example, it can state that two distributions are identical.
The only source of information about the distribution (and therefore about the plausibility of the null hypothesis) is the sample x = {x1, x2, ..., xn} drawn from the distribution. We'll assume that the observations are drawn independently from one another.
The art of devising a test lies in identifying what makes us believe that the sample is contradicting the null hypothesis. In the foregoing example, the clue was the number of Heads.
More generally, devising a test requires identifying a test statistic, that is, a quantity that depends on the sample only, and whose value is deemed to be a good indicator of the credibility of the null hypothesis.
-----
Devising a test statistic is no easy business. The statistic of the t test may be quite intuitive, but most classical tests rely on statistics that took considerable effort and imagination to invent and to prove useful.
Besides, more than one candidate statistic may be identified, and choosing the "best" one is a difficult problem.
The foregoing example showed us that the test relies on knowing the distribution of the test statistic when the null hypothesis is true. Identifying this distribution is usually difficult too.
For example, the t distribution of the T statistic is a bit difficult to establish, and this is even more so for the F statistic of ANOVA.
The exact distributions of many important test statistics (Chi-2, Kolmogorov-Smirnov, Cramér-von Mises, etc.) are unknown or intractable for finite samples. One then has to resort :
* Either to the asymptotic distribution of the statistic (which is usually known) used as an approximation of the exact distribution (Chi-2 statistic).
* Or to tables of critical values obtained by Monte-Carlo simulations.
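The Monte-Carlo route can be sketched as follows for the Kolmogorov-Smirnov statistic. The setting is an assumption chosen for simplicity : the null hypothesis is that the sample comes from the uniform distribution on [0, 1], and the function names, number of simulations and seed are ours.

```python
import random

def ks_statistic_uniform(xs):
    """Kolmogorov-Smirnov distance between the empirical distribution
    function of xs and the uniform distribution function on [0, 1]."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def monte_carlo_critical_value(n, alpha, n_sim=20000, seed=0):
    """Approximate the (1 - alpha) quantile of the statistic under H0
    by simulating n_sim samples of size n drawn under H0."""
    rng = random.Random(seed)
    stats = sorted(
        ks_statistic_uniform([rng.random() for _ in range(n)])
        for _ in range(n_sim)
    )
    return stats[int((1 - alpha) * n_sim)]

print(monte_carlo_critical_value(n=20, alpha=0.05))
```

The value obtained is only an approximation, whose accuracy improves with the number of simulated samples.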
The same is usually true for statistics elaborated by Likelihood Ratio Tests.
Once the test statistic is identified, and its distribution calculated, the test proper may start.
The first step is to choose a significance level. It is a number, between 0 and 1, which is the tolerated probability of wrongfully rejecting the null hypothesis when it is in fact true. This number is denoted α.
The decision to reject the null hypothesis will be made on the basis of the value of the test statistic. We'll go over this point in more detail later.
The significance level of a test is not a statistical quantity. It is an arbitrary number chosen by the analyst. This choice is based on the perception of the seriousness of the consequences of wrongfully rejecting the null hypothesis. The most commonly used significance levels are α = .05 and α = .01. These numbers mean that we tolerate a 5% (resp. 1%) probability of wrongfully rejecting the null hypothesis.
If the analyst wants to increase his protection against the risk of wrongfully rejecting the null hypothesis, he will assign a lower value to the significance level. For instance, if the test bears on a critical issue, like recommending a potentially dangerous surgical operation, the significance level of a test might be decreased to 0.001 or less.
-----
Proper methodology demands that the value of the significance level be chosen before any measurement is made. This avoids the all-too-natural tendency to adjust the significance level to the data a posteriori, so that the decision to reject or not the null hypothesis ends up being based on some preconceived idea rather than on the data.
Once the distribution of the test statistic is known when H0 is true, it is easy to define a condition for rejecting the null hypothesis with a probability of being wrong equal to α.
In the foregoing example, rejection was decided on the basis of too small, or too large a number of Heads. These two conditions identify the tails of the distribution of the statistic as the region of rejection of the null hypothesis.
The region of the real line R that leads to rejecting the null hypothesis is called the critical region (or "region of rejection"), and the limits of this region are called the critical values (of the test statistic).
In the t test, the distribution of the test statistic is Student's t distribution, and the critical region looks like this :
For the chosen significance level α, the critical values are cl and cr such that the sum of the areas under the density curve to the left of cl and to the right of cr is equal to α.
* If the value of the T statistic is inside the critical region, the null hypothesis H0 is rejected as too unlikely.
* But if this value is outside the critical region (lower image of the above illustration), the hypothesis will not be rejected.
"Not rejecting the null hypothesis" does not mean "Accepting the hypothesis as true". It only means that the data is not in blatant contradiction with the hypothesis.
The rationale behind this decision is simple. If the probability associated with the value of the test statistic is very small, it is either :
* Because the null hypothesis is true, and this value is then just a very rare event,
* Or because the null hypothesis is false,
and we decide in favor of the second explanation because we don't believe in "rare events".
We just met a first difficulty with the concept of critical region. The above illustration shows that cl = - cr , but we did not explain what motivated this choice. From our definition of the significance level, any pair of values (cl , cr ) defining an area under the density curve equal to α also defines a region such that the probability for the value of the statistic to be in this region is also α. Why choose the displayed region ?
We'll return shortly to this important question.
If the value of the test statistic is beyond the limits defined by the critical values (or more generally, if the value of the test statistic is in the critical region), the null hypothesis will be rejected as too unlikely. Yet, if we get back to our first test about the fairness of a coin, even a fair coin can produce 10 Heads in a row. By rejecting the null hypothesis, we would then make an error that is said to be a Type I error.
So, by definition :
A Type I error is the error made by rejecting the null hypothesis when it is in fact true.
So
P{Type I error} = α
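This equality can be illustrated by simulation with the coin test described at the top of the page. The sketch below repeatedly tosses a simulated fair coin 10 times and counts how often the rule "reject if 1 Head or less, or 9 Heads or more" rejects H0 although it is true. The function name, number of experiments and seed are assumptions of ours.

```python
import random

def simulate_type_i_rate(n_experiments=100_000, n_tosses=10, seed=1):
    """Fraction of experiments in which a fair coin is wrongfully rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Number of Heads in one series of tosses of a fair coin (H0 is true)
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        # Decision rule : the critical region is {0, 1, 9, 10}
        if heads <= 1 or heads >= 9:
            rejections += 1
    return rejections / n_experiments

print(simulate_type_i_rate())
```

The observed frequency of wrongful rejections should be close to the significance level of that test, about .02.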
By symmetry, one considers the following situation : the value of the test statistic is not in the critical region, and therefore we do not reject the null hypothesis. Yet, the null hypothesis is false (of course, we don't know it), and we therefore make an error. This other type of error is called a Type II error.
So, by definition :
A Type II error is the error made by failing to reject the null hypothesis when it is in fact false.
By symmetry with the Type I error, one might be tempted to introduce the concept of "Probability of a Type II error". Yet, it turns out that this concept cannot be defined unambiguously at this time. An essential ingredient is missing : the alternative hypothesis.
In practice, a researcher never works on the null hypothesis alone. In fact, he usually hopes that the null hypothesis is false because he favors another hypothesis called the "alternative", or "research" hypothesis, denoted H1.
For example, if the results of a new treatment against high blood pressure are submitted to a test, the null hypothesis is :
H0 : The new treatment has no effect whatever.
The researcher certainly hopes that the data will not just disprove the null hypothesis, but will also suggest a significant reduction in the patients' blood pressure. Consequently, he will consider the alternative hypothesis :
H1 : The new treatment reduces the patients' blood pressure.
rather than the more general but less informative :
H1 : The new treatment has some effect on the patients' blood pressure.
In view of the alternative hypothesis, we will replace our earlier definition of a Type II error :
A Type II error is the error made by failing to reject the null hypothesis when it is in fact false.
by this other one, more restrictive but also more instrumental :
A Type II error is the error made by failing to reject the null hypothesis when the alternative hypothesis is true.
This second definition is less general than the first one because it may very well be that both the null and the alternative hypotheses are false (think of the following situation : H0 : µ = µ0 against H1: µ > µ0 when the reality is µ < µ0 ).
With this new definition of a Type II error, we can return to the issue of the probability of a Type II error.
The question is now :
If H1, the alternative hypothesis, is true (and therefore H0 is false), what is the probability for the value of the test statistic to be outside the critical region ?
The probability is now unambiguously defined. It is denoted β :
P{Not rejecting H0 when H1 is true} = β
It is more common to reason in terms of (1 - β) rather than β itself. The quantity (1 - β) is called the power of the test. It is the probability for the value of the test statistic to fall inside the critical region when H1 is true.
Power = 1 - β = P{Rejecting H0 when H1 is true}
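For the coin test described at the top of the page, the power can be computed exactly once a specific alternative is fixed. The sketch below assumes simple alternatives of the form "the probability of Heads is p_true" and reuses the critical region {0, 1, 9, 10}. The function name is ours.

```python
from math import comb

def power(p_true, n=10, region=(0, 1, 9, 10)):
    """Probability that the number of Heads falls inside the critical
    region when the true probability of Heads is p_true."""
    return sum(
        comb(n, k) * p_true ** k * (1 - p_true) ** (n - k) for k in region
    )

for p_true in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(p_true, round(power(p_true), 3))
```

For p_true = .5 the function returns the significance level of the test. The power increases as the true value of p moves away from .5, but with only 10 tosses it remains modest for alternatives close to .5.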
The interpretation of the power of a test now appears clearly. If the test rejects the null hypothesis, we certainly want it to also hint strongly at the validity of the alternative hypothesis. The above expression shows us that a powerful test will, when the alternative hypothesis is true, reject the null hypothesis with a high probability, and therefore has a low probability of committing a Type II error.
Once the sample size, the test statistic and the significance level are fixed, the power of the test will depend only on the choice of a critical region. The key point is that any region such that the probability for the value of the test statistic to be in this region is equal to α when H0 is true is acceptable as a critical region as long as we consider Type I errors only.
For example, suppose that we are testing the mean µ of a normal distribution to be equal to 0 :
* H0 : µ = 0,
* H1 : µ ≠ 0
The test statistic is the standardized sample mean, which is t distributed.
It would be conceivable (although pretty silly) to use the following critical region at the α = 0.05 significance level :

for it is indeed true that the test statistic has a 0.05 probability to land in the shaded area when H0 is true.
But consider the alternative hypothesis H1 : µ ≠ 0. Then the value of the test statistic falling inside the critical region of H0 could certainly not be regarded as a clue in favor of H1.
The test would have a very low power.
But if we choose the critical region as in this illustration :

then certainly the value of the test statistic falling inside the critical region of H0 would suggest the mean of the distribution to be substantially smaller (or larger) than 0, and the power of the test would be much higher than before.
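The difference in power between two critical regions of the same level can be made concrete with a small calculation. The sketch below rests on several assumptions of ours : the standard normal distribution is used in place of Student's t distribution to keep the code short, the poorly chosen region is taken to be a narrow band centred on 0, and the alternative is represented by a statistic whose distribution is shifted by a given amount.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, used here in place of Student's t (assumption)
alpha = 0.05

# Poorly chosen region (assumed) : band |z| < c_band, probability alpha under H0
c_band = Z.inv_cdf(0.5 + alpha / 2)
# Usual two-tailed region : |z| > c_tail, probability alpha under H0
c_tail = Z.inv_cdf(1 - alpha / 2)

def power_band(shift):
    """P{|z| < c_band} when the statistic is centred on `shift` instead of 0."""
    return Z.cdf(c_band - shift) - Z.cdf(-c_band - shift)

def power_tails(shift):
    """P{|z| > c_tail} when the statistic is centred on `shift` instead of 0."""
    return Z.cdf(-c_tail - shift) + 1 - Z.cdf(c_tail - shift)

for shift in (0.0, 1.0, 2.0, 3.0):
    print(shift, round(power_band(shift), 3), round(power_tails(shift), 3))
```

Both regions have probability .05 when the shift is 0 (H0 true). As the shift grows, the probability of landing in the central band drops below .05, whereas the probability of landing in the tails rises sharply.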
-----
So we see that choosing a critical region is dictated, for a given α, by the desire to maximize the power of the test so as to make it efficiently discriminate between the null and the alternative hypothesis.
Facing a pair of mutually exclusive hypotheses (null and alternative), one will then try to identify the critical region conducive to the most powerful test, and therefore to identify the Best Critical Region (BCR).
This theoretical problem is usually difficult but has been solved once and for all for just about every practical situation that the analyst will ever encounter. The rather intuitive results are given in the next section.
The Neyman-Pearson lemma identifies BCRs for an important class of tests.
Proper methodology demands that the analyst decide on the values α and β before any data is collected. The values assigned to these two probabilities (as well as the choice of a critical region) then impose a certain sample size N, as the power of a test increases with the sample size. In practice, the analyst is usually facing data that was collected without regard for tests, and will therefore have to be satisfied with the value of β that will be calculated from α, the alternative hypothesis (and therefore the critical region) and N.
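How α and β impose a sample size can be sketched in one simple case. The setting below is an assumption of ours, not a general recipe : a right one-sided test on the mean of a normal distribution with known standard deviation sigma, with H0 : µ = µ0 against the specific alternative µ = µ0 + delta. The required size is then N = ((z_(1-α) + z_(1-β)) · sigma / delta)^2, rounded up.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(alpha, beta, delta, sigma=1.0):
    """Smallest N for which a right one-sided z test of level alpha
    has power at least 1 - beta against a shift of the mean equal to delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # critical value of the test
    z_beta = z.inv_cdf(1 - beta)    # quantile associated with the desired power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

print(required_sample_size(alpha=0.05, beta=0.20, delta=0.5))
print(required_sample_size(alpha=0.05, beta=0.05, delta=0.5))
```

Demanding a smaller β (a higher power), or detecting a smaller shift delta, increases the required sample size.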
For many classical tests, BCRs often have the canonical shapes described below.
We'll consider only the very common situation in which the test bears on the value of a parameter θ of a probability distribution, and in which the null hypothesis reads :
H0 : θ = θ0
The three most common alternative hypotheses are then :
----------
* H1 : θ ≠ θ0
The question is just "Does data hint at θ = θ0 or not ?".
The critical region is then equally distributed under the two tails of the probability density curve (for a continuous distribution) of the test statistic.

----------
* H1 : θ > θ0
The question is "Does data hint not just at θ ≠ θ0 , but more specifically at θ > θ0 ?".
The critical region is then under the right tail of the probability density curve (for a continuous distribution) of the test statistic : the value of this statistic being in this region is certainly an argument in favor of rejecting the null hypothesis (θ = θ0) and accepting the alternative hypothesis (θ > θ0) instead.

----------
* H1 : θ < θ0
The question is "Does data hint not just at θ ≠ θ0 , but more specifically at θ < θ0 ?".
The critical region is then under the left tail of the probability density curve (for a continuous distribution) of the test statistic.

---------------
For obvious reasons :
* The first test is called a two-sided test,
* Whereas the last two are called one-sided tests.
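The three canonical critical regions can be sketched for a test statistic that follows the standard normal distribution when H0 is true. This distribution, the function name and the labels of the alternatives are assumptions made for illustration.

```python
from statistics import NormalDist

def critical_values(alpha, alternative):
    """Critical values (left, right) of a test whose statistic is standard
    normal under H0. None means that there is no critical value on that side."""
    z = NormalDist()
    if alternative == "two-sided":   # H1 : θ ≠ θ0, alpha split between both tails
        c = z.inv_cdf(1 - alpha / 2)
        return (-c, c)
    if alternative == "greater":     # H1 : θ > θ0, right tail only
        return (None, z.inv_cdf(1 - alpha))
    if alternative == "less":        # H1 : θ < θ0, left tail only
        return (z.inv_cdf(alpha), None)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

for alt in ("two-sided", "greater", "less"):
    print(alt, critical_values(0.05, alt))
```

For a given α, the critical value of a one-sided test is closer to the center of the distribution than those of the two-sided test, because the whole area α sits under a single tail.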
Best Critical Regions do not always have these nice and intuitive shapes. We give here some examples of BCRs with somewhat exotic behaviors.
ANOVA tests the equality of the means of more than two normal distributions (null hypothesis). The only alternative hypothesis is that not all the means are equal.
The F test statistic is a ratio such that :
* The numerator is a non negative number that measures the separation between the group centers,
* The denominator is a non negative number that measures the spread within the groups. It can be regarded as a normalization factor.
The test statistic is always non negative, and a very small value means that the group centers are closely packed, certainly not a convincing argument for rejecting the null hypothesis in favor of the alternative hypothesis. Only large positive values of F can lead to rejecting the null hypothesis.

So ANOVA is inherently one-sided.
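The structure of the F statistic can be sketched as follows, using the usual one-way ANOVA form : the between-group mean square divided by the within-group mean square. The function name and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Numerator : separation between the group centers
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    # Denominator : spread within the groups
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_total - k)
    return between / within

# Hypothetical data : three groups of three observations
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(groups))
```

Both terms are sums of squares, so the ratio cannot be negative. Moving the group centers apart increases the numerator only, and therefore increases F.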
-----
The same can be said for other tests like Chi-square tests.
In a one-sided test (say, to the right), the area under the probability density curve to the right of the measured value of the test statistic is the probability for this statistic to be at least as large as the measured value when H0 is true (green area in this illustration).
It is called the p-value (of the statistic).
So :
* The null hypothesis will be rejected if the p-value is lower than the significance level (green area is less than α).
* Otherwise, the null hypothesis will not be rejected (lower image of the above illustration).
1) A similar definition holds for left one-sided tests.
2) A similar definition can be given for two-sided tests, but is then somewhat artificial. Two-sided tests usually refer to critical values rather than p-values.
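As an illustration, here is a sketch of the p-value of a right one-sided version of the coin test, H0 : p = .5 against H1 : p > .5. This one-sided variant is an assumption of ours (the test described at the top of the page was two-sided), and the function name is hypothetical.

```python
from math import comb

def p_value_right(observed_heads, n=10, p0=0.5):
    """Probability, when H0 is true, of observing at least
    `observed_heads` Heads in n tosses."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(observed_heads, n + 1)
    )

alpha = 0.05
for observed in (8, 9):
    p_val = p_value_right(observed)
    print(observed, round(p_val, 4), "reject H0" if p_val < alpha else "do not reject H0")
```

With 8 Heads the p-value is 56/1024, slightly above .05, so H0 is not rejected at this level. With 9 Heads the p-value is 11/1024, and H0 is rejected.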
On the next page is a short list of some of the most frequently encountered tests.
_____________________________________
This page described the classical approach to the concept of a test. It relies on the distribution of a test statistic and on the subsequent identification of a Best Critical Region for the alternative hypothesis.
This approach is not the only possible one. For example, the Neyman-Pearson lemma identifies BCRs for an important class of tests without resorting to a test statistic.
____________________________________________