Rank (of an observation)
We address here the following issues :
_____________________________________________
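Let x be a numerical (or quantitative) variable. In a sample of size n, the observations can be sorted in increasing order of their values of x. They are then said to be ranked: the observation with the smallest value of x has rank 1, the next smallest has rank 2, and so on up to rank n for the largest.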
Below is an example of a 7-observation sample, together with the corresponding ranks of the observations.

The rank of an observation may be perceived as a "downgraded" version of its coordinate x. Ranks only describe the relative positions of the observations, and therefore carry a lot less information than the true values of x, as nothing is retained of the notion of "distance between two observations".
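The ranking operation can be sketched in a few lines of Python. This is only an illustration: the function name `ranks` and the seven sample values are made up for the purpose, and the sketch assumes that no two observations share the same value of x.

```python
def ranks(values):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    # Indices of the observations, sorted by increasing value of x.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Hypothetical 7-observation sample (illustrative values only).
x = [3.2, 0.5, 7.1, 1.8, 9.4, 4.0, 2.6]
print(ranks(x))  # -> [4, 1, 6, 2, 7, 5, 3]
```

Note that the output says nothing about how far apart the observations are: 7.1 and 9.4 are simply "6th" and "7th".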
So if genuine x values carry more information about the sample than ranks, why bother with ranks? There are two reasons.
Some situations are not described adequately by numerical coordinates, but can be described quite appropriately by ranks. For example :
Variables that only carry information about the relative ordering of observations are said to be "ordinal".
Using ranks instead of numerical values is not just a matter of convenience. It also has some deep and useful consequences, because ranks do not change when the scale on which the corresponding numerical variable x (when it exists) is measured is changed. The change of scale may even be non-uniform throughout the range of x without altering ranks. Quite generally, any strictly increasing transformation of x keeps ranks unchanged.
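This invariance is easy to check numerically. The sketch below uses made-up positive values and two arbitrary increasing transformations (logarithm and cube); the helper `ranks` is a hypothetical name and assumes no ties. A decreasing transformation is included for contrast: it does not preserve ranks, it reverses them.

```python
import math


def ranks(values):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Hypothetical positive measurements (illustrative values only).
x = [0.2, 3.0, 1.5, 10.0, 4.7]

log_x = [math.log(v) for v in x]   # non-uniform change of scale
cubed = [v ** 3 for v in x]        # another increasing transformation
negated = [-v for v in x]          # decreasing transformation

print(ranks(x))                                   # -> [1, 3, 2, 5, 4]
print(ranks(x) == ranks(log_x) == ranks(cubed))   # -> True
print(ranks(negated))                             # -> [5, 3, 4, 1, 2]
```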
Many of the classical parametric tests (e.g. ANOVA, t-tests) rely heavily on the assumption that the considered distributions are normal. When they are not, the tests break down because they are quite sensitive to this normality assumption.
It is often possible to devise tests that serve the same purpose as parametric tests by using ranks instead of numerical values. These tests do not rely on any assumption about the distribution of the data, and are therefore quite robust (see below).
The simplest use of ranks is extending the notion of correlation coefficient to ranked observations. Just as the ordinary (Pearson's) correlation coefficient is a measure of the similarity between two numerical variables, rank correlation is a measure of the similarity between two rankings on the same group of observations.
A classical example comes from the issue of "related abilities". Are :
in any way related ?
One way to address this question is to establish rankings of students in these two disciplines, and try to detect any "correlation" between these rankings.
-----
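There are two main (and rival) measures of rank correlation: Kendall's τ and Spearman's rS.
Both these quantities are equal to +1 when the two rankings are identical, and to -1 when one ranking is the exact reverse of the other.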
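The extreme values +1 (identical rankings) and -1 (exactly reversed rankings) of Kendall's τ and Spearman's rS can be checked with a short sketch. It uses the textbook formulas for untied rankings: τ as (concordant pairs minus discordant pairs) divided by the number of pairs, and rS as 1 - 6 Σd² / (n(n² - 1)). The function names and the rankings of six students in two unnamed disciplines are invented for illustration.

```python
from itertools import combinations


def kendall_tau(r1, r2):
    """Kendall's tau for two untied rankings of the same observations."""
    concordant = discordant = 0
    for i, j in combinations(range(len(r1)), 2):
        product = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if product > 0:
            concordant += 1      # the pair is ordered the same way twice
        elif product < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n = len(r1)
    return (concordant - discordant) / (n * (n - 1) / 2)


def spearman_rs(r1, r2):
    """Spearman's rS for two untied rankings (sum of squared rank gaps)."""
    n = len(r1)
    d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of six students in two disciplines.
discipline_a = [1, 2, 3, 4, 5, 6]
same = [1, 2, 3, 4, 5, 6]
reverse = [6, 5, 4, 3, 2, 1]
mixed = [2, 1, 4, 3, 6, 5]

print(kendall_tau(discipline_a, same), spearman_rs(discipline_a, same))
print(kendall_tau(discipline_a, reverse), spearman_rs(discipline_a, reverse))
print(kendall_tau(discipline_a, mixed), spearman_rs(discipline_a, mixed))
```

The third line shows that the two coefficients generally differ on the same data, which is one reason their intermediate values are hard to interpret directly.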
Other than in these extreme cases, the actual values of Kendall's τ or Spearman's rS carry little significance. The really important fact about them is that it is possible to design tests of the hypothesis that the population values of these coefficients are actually 0. The sample is considered as drawn from an infinite population, and these tests assess the plausibility of the null hypothesis H0: τ = 0 or H0: rS = 0.
These tests address the question: "Is the data compatible with the hypothesis that there is no relationship between these two variables?".
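One possible way to build such a test, shown here only as a sketch, is an exact permutation test: if there is no relationship between the two variables, every pairing of the two rankings is equally likely, so the observed coefficient can be compared with its value over all n! re-pairings. This is not necessarily the procedure used by any particular software, it assumes untied rankings, and it is practical only for small n. The function names and data are hypothetical.

```python
from itertools import combinations, permutations


def kendall_score(r1, r2):
    """Concordant minus discordant pairs (the numerator of Kendall's tau)."""
    score = 0
    for i, j in combinations(range(len(r1)), 2):
        product = (r1[i] - r1[j]) * (r2[i] - r2[j])
        score += (product > 0) - (product < 0)
    return score


def permutation_p_value(r1, r2):
    """Two-sided exact p-value for H0: no relationship (tau = 0)."""
    observed = abs(kendall_score(r1, r2))
    hits = total = 0
    for shuffled in permutations(r2):
        total += 1
        # Count re-pairings at least as extreme as the observed one.
        if abs(kendall_score(r1, shuffled)) >= observed:
            hits += 1
    return hits / total


# Hypothetical rankings of six observations on two variables.
first = [1, 2, 3, 4, 5, 6]
second = [2, 1, 4, 3, 6, 5]
print(permutation_p_value(first, second))
```

A small p-value means that a coefficient as far from 0 as the observed one would rarely arise by chance pairing alone, which casts doubt on the null hypothesis.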
As previously mentioned, parametric tests are not robust because of their heavy reliance on the normality assumption. Some classical parametric tests have non-parametric counterparts based on ranks. Here are some examples:
| Parametric | Non-parametric using ranks |
|---|---|
| Test on Pearson's r = 0 | Test on Kendall's τ = 0 |
| t-test on two independent samples | Wilcoxon-Mann-Whitney test |
| 1-Way ANOVA | Kruskal-Wallis test |
| 2-Way ANOVA | Friedman test |
Recall that 1-Way ANOVA is an extension of the t-test on independent samples to more than two samples.
The last three tests are identity tests. They test the null hypothesis that the samples (two independent samples, k independent samples, or k matched samples, respectively) were drawn from the same population.
For more on independent or matched samples, please see here.
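To give a feel for how a rank-based identity test uses the data, here is a minimal sketch of the statistic behind the Wilcoxon-Mann-Whitney test. The two samples are pooled and ranked, and the rank sum of the first sample is converted into the U statistic. The sketch assumes no tied values, computes only the statistic (judging its significance requires tables or an approximation, not shown), and uses invented data and function names.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples without ties."""
    pooled = sorted(sample_a + sample_b)
    # Rank of each value in the pooled sample (valid only without ties).
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(rank_of[value] for value in sample_a)
    # U_a counts the pairs (a, b) in which a is larger than b.
    u_a = rank_sum_a - n_a * (n_a + 1) // 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical measurements from two independent groups.
group_a = [1.1, 2.3, 3.0, 4.8]
group_b = [3.9, 5.2, 6.7, 7.5]
print(mann_whitney_u(group_a, group_b))  # -> (1, 15)
```

If both samples came from the same population, U_a and U_b would tend to be close to each other; here they are far apart because group_a occupies almost all of the lowest ranks.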
Note that many non-parametric tests are not based on ranks. For example, an alternative to the Kruskal-Wallis test is the Chi-Square test for k independent samples.
A "native" ordinal variable usually has only a small number of values ("Hot", "Lukewarm", "Cold"). The extreme case is that of binary variables, which have only two values ("Male", "Female"). It is then nearly impossible to avoid having two or more observations share the same value of the ranking variable. Assigning ranks to observations then becomes ambiguous, and the observations are said to be tied.
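One common convention for resolving this ambiguity, among several, is to give every tied observation the average of the ranks the group would have occupied ("midranks"). The sketch below illustrates that convention only; the function name, the numeric coding of the categories and the six observations are assumptions made for the example.

```python
def midranks(values):
    """Rank values (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block while the next value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start+1 .. end+1 are shared; use their average.
        average = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


# Hypothetical ordinal observations, coded by increasing temperature.
codes = {"Cold": 1, "Lukewarm": 2, "Hot": 3}
observed = ["Hot", "Cold", "Lukewarm", "Hot", "Cold", "Hot"]
print(midranks([codes[label] for label in observed]))
# -> [5.0, 1.5, 3.0, 5.0, 1.5, 5.0]
```

Midranks keep the sum of the ranks equal to n(n+1)/2, as it would be without ties, which is one reason this convention is popular.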
Here is an example illustrating ties: the same task is assigned to every person in a group of n people containing both men (M) and women (F). Once he or she has completed the task, a (numeric) grade is assigned to the result. The question is: "Is there a significant difference between men and women as far as accomplishing this task is concerned?".
The grades are pooled irrespective of the gender of the persons, then ranked as in the following table :
| Rank of performance | 1 | 2 | 3 | ..... | n-1 | n |
|---|---|---|---|---|---|---|
| Gender | M | F | F | ..... | M | F |
The answer is to calculate a rank correlation coefficient on the data (for example Kendall's τ), and run a test on the null hypothesis H0: τ = 0.
But we are in a situation with many ties, and handling these ties may usually be done in several (and somewhat arbitrary) ways.
Ties are ubiquitous in rank-based tests. Software usually takes care of that, but the analyst has to be aware of the existence of the problem.
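As an illustration of one such tie-handling choice, the sketch below computes the variant of Kendall's coefficient usually called tau-b, which shrinks the denominator by the number of pairs tied on each variable. The text above does not say which correction is intended, so tau-b is an assumption here, as are the function name, the 0/1 coding of gender and the eight-person data set.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0:
            tied_x += 1          # pair tied on the first variable
        if dy == 0:
            tied_y += 1          # pair tied on the second variable
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((pairs - tied_x) * (pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical group of 8 people: performance rank and gender (M=0, F=1).
performance_rank = [1, 2, 3, 4, 5, 6, 7, 8]
gender = [0, 1, 1, 0, 1, 0, 0, 1]
print(kendall_tau_b(performance_rank, gender))
```

With a binary variable, most pairs are tied on gender, so the correction matters a great deal; a different convention would give a different value on the same data.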