Comparisons (Multiple)

Let E1 and E2 be two samples drawn from independent normal distributions with identical variances. A t test tests the hypothesis that these two distributions are identical, that is, that they have the same mean.
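As an illustration, the statistic behind this test can be computed directly. The sketch below assumes the classical pooled-variance (Student) form of the two-sample t test; the function name `pooled_t_statistic` and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import fmean, variance


def pooled_t_statistic(e1, e2):
    """Return (t, degrees_of_freedom) for the equal-variance two-sample t test."""
    n1, n2 = len(e1), len(e2)
    # Pooled estimate of the common variance (identical variances are assumed).
    pooled_var = ((n1 - 1) * variance(e1) + (n2 - 1) * variance(e2)) / (n1 + n2 - 2)
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    t = (fmean(e1) - fmean(e2)) / standard_error
    return t, n1 + n2 - 2


# Hypothetical samples standing in for E1 and E2.
t, df = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t, df)
```

The value of t is then compared with the quantiles of Student's t distribution with n1 + n2 - 2 degrees of freedom to decide whether the null hypothesis is rejected.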

What if there are more than two samples? ANOVA provides a satisfactory answer, as it is a global test of the equality of the means of the distributions. But another approach could be envisioned: use the same t test on all pairs of samples, and reject the hypothesis of equality of the means if at least one of the tests rejects the null hypothesis for a pair of samples.

 

This approach is defective for the following reason. Suppose that the normal distributions that generated the samples are indeed identical. Then the samples might as well have been drawn from the same distribution. Although it is expected that all the samples will have their means close to the mean of this unique distribution, it is quite possible that, just by chance, one of the samples has its mean far away from the population mean. As a matter of fact, as more and more samples are drawn from the distribution, the probability for this to happen increases.

 

ANOVA takes this possibility into account automatically, but a series of t tests on pairs of samples does not. As the number of tested pairs increases, the probability that at least one pair exhibits a significant difference between its means in a t test increases as well. Yet it cannot be inferred from such a result that the set of all sample means is significantly heterogeneous.
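This effect can be observed with a small simulation. The sketch below draws k samples from the same normal distribution and records how often at least one pairwise test rejects. It makes a simplifying assumption that is not in the text: the common variance is treated as known, so a two-sided z test stands in for the t test (this keeps the example within the Python standard library). The function names, sample size and number of trials are illustrative choices.

```python
import itertools
import random
from statistics import NormalDist, fmean


def any_pair_rejects(samples, alpha, sigma=1.0):
    """True if at least one pairwise two-sided z test rejects at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    for a, b in itertools.combinations(samples, 2):
        # Standard error of the difference of means, variance assumed known.
        se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
        if abs(fmean(a) - fmean(b)) / se > z_crit:
            return True
    return False


def familywise_rejection_rate(k, n=10, alpha=0.05, trials=2000, seed=0):
    """Fraction of trials in which some pair is declared different,
    although all k samples come from the same N(0, 1) distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
        hits += any_pair_rejects(samples, alpha)
    return hits / trials


if __name__ == "__main__":
    for k in (2, 3, 5, 8):
        print(k, familywise_rejection_rate(k))
```

With two samples the rejection rate should stay close to the nominal level of the individual test; it grows as k, and therefore the number of pairs, increases.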

 

More specifically, let α be the significance level chosen for an ANOVA on a set of k samples, and suppose that this ANOVA fails to reject the null hypothesis at this significance level.
Now run a series of t tests on all pairs of samples at the same significance level α. It is quite possible that at least one of these tests will detect a significant difference between means for one pair of samples. In other words, the set of t tests behaves as a single test, and this test will (wrongly) reject the hypothesis that there is no significant difference between means. Everything happens as if an ANOVA had been conducted at a significance level α’ larger than the original significance level α. This "global" significance level is called the familywise error rate (FWE).
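To give an order of magnitude for α’, the relations below may help. They are a sketch under an assumption the text does not make: the first expression holds only if the c tests were independent, which pairwise tests sharing samples are not, so it should be read as an approximation. The symbol c for the number of pairs is introduced here for convenience.

```latex
c = \binom{k}{2} = \frac{k(k-1)}{2},
\qquad
\alpha' = 1 - (1 - \alpha)^{c} \quad \text{(if the $c$ tests were independent)},
\qquad
\alpha' \le c\,\alpha \quad \text{(always, by the Bonferroni inequality)}.
```

For example, with k = 5 samples and α = 0.05, there are c = 10 pairs; the independence approximation gives α’ ≈ 0.40, and the bound gives α’ ≤ 0.50.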

 

This problem occurs every time :

In both situations, the series of tests behaves as a single test with an overall significance level larger than the nominal significance level α of the individual tests. In other words, it becomes too easy to reject the null hypothesis.

-----

There are essentially two ways around this problem.

 

    1) The first one is to "mollify" the tests in the series so that it becomes more difficult for any of them to reject the null hypothesis. It is then hoped that the overall significance level of the series of tests will be the one originally wanted, had a single, global test existed. For example, if a significance level α is requested after c pairwise comparisons, each comparison will be done at a stricter significance level α’, with :

 

    This first approach is limited, for it is generally impossible to guarantee a well-defined significance level at the end of the c comparisons.

    2) The second one, more appropriate but also more difficult, is to devise specific tests tailored to specific multiple comparison situations. Then, for each type of problem, one has to :

This is typically what is done after an ANOVA has rejected the null hypothesis at a certain significance level α. This rejection is global, with no clue as to which sample was responsible for the rejection. Many a posteriori (or "post hoc") tests have been designed for the purpose of finely analyzing the situation and understanding why the hypothesis was rejected. For example :
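One simple procedure, not necessarily among those the text has in mind, combines both ideas: after a global rejection, every pair is tested at an adjusted per-comparison level. The sketch below shows two common adjustments, Bonferroni (α/c) and Šidák (1 - (1 - α)^(1/c)), and applies the first to all pairs. As a simplification the variance is assumed known, so a z test replaces the t test; the function names and the data are hypothetical.

```python
import itertools
from statistics import NormalDist, fmean


def bonferroni_level(alpha, c):
    """Per-comparison level ensuring a familywise level of at most alpha."""
    return alpha / c


def sidak_level(alpha, c):
    """Per-comparison level giving exactly alpha if the c tests were independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / c)


def pairwise_differences(groups, alpha=0.05, sigma=1.0):
    """groups maps a name to a list of observations.
    Returns (name_a, name_b, p_value, significant) for every pair,
    using a two-sided z test at the Bonferroni-adjusted level."""
    pairs = list(itertools.combinations(groups, 2))
    level = bonferroni_level(alpha, len(pairs))
    std_normal = NormalDist()
    results = []
    for a, b in pairs:
        xa, xb = groups[a], groups[b]
        se = sigma * (1 / len(xa) + 1 / len(xb)) ** 0.5
        z = (fmean(xa) - fmean(xb)) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        results.append((a, b, p_value, p_value < level))
    return results


# Hypothetical data: group C is shifted away from A and B.
data = {
    "A": [0.1, -0.2, 0.3, 0.0],
    "B": [0.2, 0.1, -0.1, 0.0],
    "C": [3.1, 2.9, 3.2, 3.0],
}
for row in pairwise_differences(data):
    print(row)
```

Dedicated post hoc tests are generally more powerful than this generic adjustment, which is why they are preferred when one exists for the situation at hand.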

________________________________

 

We described the problem caused by multiple comparisons within the context of ANOVA, but the same phenomenon takes place in the world of non-parametric tests as well. The "equivalent" of ANOVA is then the Kruskal-Wallis test. When the Kruskal-Wallis test rejects the null hypothesis, running a series of pairwise Mann-Whitney tests (the non-parametric "equivalent" of the t test) for the purpose of detecting the "guilty" pair of samples is not a good option, and specific tests have been designed to address this issue. There are also specific post hoc tests for comparing groups to a reference group, just as Dunnett's test does for ANOVA.

 

This text was contributed by the company AdScience.

__________________________________________________

 

Related readings :

ANOVA

Student's t test

Dunnett's test

Kruskal-Wallis test

Mann-Whitney test
