Rank (of an observation)

We address here the following issues :

 Ranking observations

Why ranks ?

Rank correlation

Non parametric tests based on ranks

Ties

_____________________________________________

Ranking observations

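        Let x be a numerical (or quantitative) variable. In a sample of size n, observations can be sorted by increasing order of their values of x. They are then said to be ranked, and the rank of an observation is its position in this sorted list : the observation with the smallest value of x has rank 1, the next one has rank 2, and so on up to rank n for the observation with the largest value of x.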

 

Below is an example of a 7-observation sample, together with the corresponding ranks of the observations.

 

 

The rank of an observation may be perceived as a "downgraded" version of its coordinate x. Ranks only describe the relative positions of the observations, and therefore carry a lot less information than the true values of x, as nothing is retained of the notion of "distance between two observations".
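The ranking described above can be sketched in a few lines of Python. The sample values below are hypothetical (they are not the ones of the original example), and the helper `ranks` is an illustrative name that assumes all values are distinct.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes distinct values."""
    # Indices of the observations, sorted by increasing value of x.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

# Hypothetical 7-observation sample (illustrative values only).
x = [3.1, 0.4, 7.9, 5.2, 1.8, 9.6, 4.3]
r = ranks(x)
print(r)  # [3, 1, 6, 5, 2, 7, 4]
```

Note that the ranks would be the same if 9.6 were replaced by 960 : the distance between observations is lost, only their order is kept.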

Why ranks ?

        So if genuine x values carry more information about the sample than ranks, why bother with ranks ? There are two reasons.

Ordinal variables

            Some situations are not described adequately by numerical coordinates, but can be described quite appropriately by ranks. For example :

Robustness

            Using ranks instead of numerical values is not just a matter of convenience. It also has some deep and useful consequences, because ranks are not changed when the scale on which the corresponding numerical variable x (when it exists) is measured is changed. The change of scale may even be non uniform throughout the range of x without altering ranks. Quite generally, any strictly increasing transformation of x (a logarithm, for example) keeps ranks unchanged, whereas a decreasing transformation reverses them.
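As a quick check of this invariance, the sketch below ranks a hypothetical sample before and after two strictly increasing transformations (logarithm and cube), then after a decreasing one (sign change). The helper `ranks` and the data are assumptions made for illustration, and the values are assumed to be distinct.

```python
import math

def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes distinct values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

# Hypothetical positive measurements (illustrative values only).
x = [0.5, 12.0, 3.0, 150.0, 41.0]

log_x = [math.log(v) for v in x]   # non uniform change of scale, increasing
cubed = [v ** 3 for v in x]        # another increasing transformation
neg = [-v for v in x]              # decreasing transformation

print(ranks(x))      # [1, 3, 2, 5, 4]
print(ranks(log_x))  # same ranks as x
print(ranks(cubed))  # same ranks as x
print(ranks(neg))    # [5, 3, 4, 1, 2] : the ranks are reversed
```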

Many of the classical parametric tests (e.g. ANOVA, t-tests) rely heavily on the assumption that the considered distributions are normal. When they are not, the conclusions of these tests may become unreliable, because the tests are sensitive to this normality assumption.

It is often possible to devise tests that serve the same purpose as parametric tests by using ranks instead of numerical values. These tests do not assume any particular form (such as normality) for the distribution of the data, and are therefore quite robust (see below).

Rank correlation

        The simplest use of ranks is extending the notion of correlation coefficient to ranked observations. Just as the ordinary (Pearson's) correlation coefficient is a measure of the similarity between two numerical variables, rank correlation is a measure of the similarity between two rankings on the same group of observations.

A classical example comes from the issue of "related abilities". Are :

in any way related ?

One way to address this question is to establish rankings of students in these two disciplines, and try to detect any "correlation" between these rankings.

-----

There are two main (and rival) measures of rank correlation : Kendall's τ and Spearman's ρS.

 

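Both these quantities are equal to +1 when the two rankings are identical, and to -1 when one ranking is the exact reverse of the other.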

 
Other than in these extreme cases, the actual values of Kendall's τ or Spearman's ρS carry little significance. The really important fact about them is that it is possible to design tests of whether the values of these coefficients in the population are actually 0. The sample is considered as extracted from an infinite population, and these tests assess the plausibility of the null hypothesis H0 : τ = 0 (or ρS = 0).

 

These tests address the question : "Is the data compatible with the hypothesis that there is no relationship between these two variables ?".
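To make the two coefficients concrete, here is a minimal sketch that computes them from two rankings without ties. It uses the textbook definitions (number of concordant minus discordant pairs for Kendall's τ, sum of squared rank differences for Spearman's ρS). The function names and the two rankings are hypothetical, and the sketch does not compute the p-value of the tests : that part is usually left to statistical software (for instance `scipy.stats.kendalltau` and `scipy.stats.spearmanr`).

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Kendall's tau for two rankings of the same observations (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(r1)), 2):
        sign = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in both rankings
        elif sign < 0:
            discordant += 1   # the pair is ordered in opposite ways
    n = len(r1)
    return (concordant - discordant) / (n * (n - 1) / 2)

def spearman_rho(r1, r2):
    """Spearman's rho from rank differences (formula valid without ties)."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical rankings of 6 students in two disciplines.
discipline_a = [1, 2, 3, 4, 5, 6]
discipline_b = [2, 1, 4, 3, 6, 5]

print(kendall_tau(discipline_a, discipline_b))        # 0.6
print(spearman_rho(discipline_a, discipline_b))       # about 0.829
print(kendall_tau(discipline_a, discipline_a))        # 1.0 (identical rankings)
print(spearman_rho(discipline_a, discipline_a[::-1])) # -1.0 (reversed ranking)
```

The two coefficients generally differ on the same data, which is one reason why their raw values are hard to interpret outside the extreme cases.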

Non parametric tests based on ranks

        As previously mentioned, parametric tests are not robust because of their heavy reliance on the normality assumption. Some classical parametric tests have non parametric counterparts based on ranks. Here are some examples :

 

Parametric                            Non parametric using ranks

Test on Pearson's ρ = 0               Test on Kendall's τ = 0
                                      Test on Spearman's ρS = 0

t-test on two independent samples     Wilcoxon-Mann-Whitney test

1-Way ANOVA                           Kruskal-Wallis test

2-Way ANOVA                           Friedman test

 

Recall that 1-Way ANOVA is an extension of the t-test on independent samples to more than two samples.

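The last three tests are identity tests. They test the null hypothesis according to which two independent samples (Wilcoxon-Mann-Whitney), several independent samples (Kruskal-Wallis) or several matched samples (Friedman) were drawn from the same population.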


For more on independent or matched samples, please see here.

Note that many non parametric tests are not based on ranks. For example, an alternative to the Kruskal-Wallis test is the Chi-Square test for k independent samples, which is not based on ranks.
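To show how one of the rank-based tests above actually uses ranks, the sketch below computes the Mann-Whitney U statistic of two independent samples from the sum of the ranks of one sample in the pooled data. The function name and the data are hypothetical, all values are assumed to be distinct (no ties), and the p-value, which requires the null distribution of U, is deliberately left out.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from pooled ranks; assumes no ties."""
    pooled = sorted(sample_a + sample_b)
    # Rank of each value in the pooled sample (1 = smallest).
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(rank_of[value] for value in sample_a)
    # U_a counts the pairs (a, b) in which a is larger than b.
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Hypothetical measurements in two independent groups.
group_a = [1.1, 2.3, 3.8, 4.0]
group_b = [2.9, 5.5, 6.1, 7.4, 8.2]

print(mann_whitney_u(group_a, group_b))  # (2.0, 18.0)
```

Only the ranks enter the calculation : any strictly increasing transformation of the measurements would give the same value of U.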

Ties

        A "native" ordinal variable usually has only a small number of values ("Hot", "Lukewarm", "Cold"). The extreme case is that of binary variables, which have only two values ("Male", "Female"). It is then nearly impossible to avoid having two or more observations with the same value of the ranking variable. Assigning ranks to these observations then becomes ambiguous, and the observations are said to be tied.
 

Here is an example illustrating ties : the same task is assigned to every person in a group of n people containing both men (M) and women (F). Once he or she has completed the task, a (numeric) grade is assigned to the result. The question is : "Is there a significant difference between Men and Women as far as accomplishing this task is concerned ?".

The grades are pooled irrespective of the gender of the persons, then ranked as in the following table :

 

Rank of performance     1     2     3     .....     n-1     n

Gender                  M     F     F     .....     M       F

 

 

The answer is to calculate a rank correlation coefficient on the data (for example Kendall's τ), and run a test on the null hypothesis H0 : τ = 0.

But we are in a situation with many ties (each gender is shared by many persons), and these ties can be handled in several (and somewhat arbitrary) ways.

 

Ties are ubiquitous in rank-based tests. Software usually takes care of that, but the analyst has to be aware of the existence of the problem.
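One common convention, shown here only as an illustration, is to give every tied observation the average of the ranks that the tied group occupies ("midranks"). The function name and the numeric coding of the ordinal values are assumptions. Other conventions exist, and software often lets the analyst choose between them (for instance the `method` argument of `scipy.stats.rankdata`).

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        midrank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            result[order[k]] = midrank
        start = end + 1
    return result

# Hypothetical coding of an ordinal variable : Cold = 1, Lukewarm = 2, Hot = 3.
codes = [1, 3, 2, 1, 3, 3]
print(average_ranks(codes))  # [1.5, 5.0, 3.0, 1.5, 5.0, 5.0]
```

With this convention the sum of the ranks is still n(n+1)/2 (here 21), as it would be without ties.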

 

____________________________________________________________

 

Related readings :

Parametric test

Chi-square test

Wilcoxon-Mann-Whitney test

Kruskal-Wallis test

Friedman test
