Rank (of an observation)

We address here the following issues :

• Ranking observations
• Why ranks ?
• Rank correlation
• Non parametric tests based on ranks
• Ties

_____________________________________________

# Ranking observations

Let x be a numerical (or quantitative) variable. In a sample of size n, observations can be sorted by increasing order of their values of x. They are then said to be ranked. So :

• The observation with the smallest value of x is assigned rank 1.
• The observation with the next smallest value of x is assigned rank 2.
• ....
• The observation with the largest value of x is assigned rank n.

As an example, consider a 7-observation sample : once sorted by increasing value of x, its observations receive the ranks 1 through 7.

The rank of an observation may be perceived as a "downgraded" version of its coordinate x. Ranks only describe the relative positions of the observations, and therefore carry a lot less information than the true values of x, as nothing is retained of the notion of "distance between two observations".
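As a minimal sketch of the ranking rule above, SciPy's `rankdata` assigns rank 1 to the smallest value and rank n to the largest (the sample values here are invented for the illustration) :

```python
# Illustrative sketch : ranking a 7-observation sample with SciPy.
# The sample values are made up for the example.
from scipy.stats import rankdata

x = [12.4, 7.1, 9.8, 15.0, 7.9, 11.2, 8.5]   # 7 observations
ranks = rankdata(x)                          # rank 1 = smallest value of x

# The smallest value (7.1) gets rank 1, the largest (15.0) gets rank 7.
print(list(ranks))
```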

# Why ranks ?

So if genuine x values carry more information about the sample than ranks, why bother with ranks ? There are two reasons.

## Ordinal variables

Some situations are not described adequately by numerical coordinates, but can be described quite appropriately by ranks. For example :

• In a shuffled deck, the relative positions of the cards are described by ranks, not "coordinates".
• The full numerical information about the observations may in principle be measurable, but is missing for some reason. Ranking the observations may still be possible in some cases. For example, drinks may be sorted by increasing "sweetness" just by tasting, even if the (numerical) characteristic "Sugar content" is not made available.
• Some quantities can be compared pairwise, but not actually measured. For example, gem A will be considered "harder" than gem B if A can scratch B (assuming that B cannot scratch A). Although this particular definition of "hardness" is not amenable to quantitative measurement, gems may still be ranked by increasing order of "hardness".

Variables that only carry information about the relative ordering of observations are said to be "ordinal".

## Robustness

Using ranks instead of numerical values is not just a matter of convenience. It also has some deep and useful consequences, because ranks are unchanged when the scale on which the corresponding numerical variable x (when it exists) is measured is modified. The change of scale may even be non uniform throughout the range of x without altering ranks. Quite generally, any monotonic transformation of x leaves ranks unchanged.
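This invariance is easy to check numerically ; the sketch below (with invented data) verifies that three different monotonic rescalings of x leave the ranks untouched :

```python
# Sketch : ranks are invariant under any monotonic (order-preserving)
# transformation of x. The data below are made up for the example.
import numpy as np
from scipy.stats import rankdata

x = np.array([3.0, 1.5, 8.2, 0.7, 5.1])

# Three different monotonic rescalings of x (log, square root, affine) :
same_ranks = all(
    np.array_equal(rankdata(x), rankdata(f(x)))
    for f in (np.log, np.sqrt, lambda v: 10 * v + 3)
)
print(same_ranks)   # True : the ranking is unchanged by each rescaling
```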

Many of the classical parametric tests (e.g. ANOVA, t-tests) rely heavily on the assumption that the considered distributions are normal. When they are not, these tests may break down, as they are quite sensitive to the normality assumption.

It is often possible to devise tests that serve the same purpose as parametric tests by using ranks instead of numerical values. These tests do not rely on any distribution assumption about the data, and are therefore quite robust (see below).

# Rank correlation

The simplest use of ranks is extending the notion of correlation coefficient to ranked observations. Just as the ordinary (Pearson's) correlation coefficient is a measure of the similarity between two numerical variables, rank correlation is a measure of the similarity between two rankings on the same group of observations.

A classical example comes from the issue of "related abilities". Are :

• Ability in music,   and
• Ability in mathematics

in any way related ?

One way to address this question is to establish rankings of students in these two disciplines, and try to detect any "correlation" between these rankings.

-----

There are two main (and rival) measures of rank correlation :

• Kendall's τ, and
• Spearman's ρS .

Both these quantities are equal to :

• +1 when the two rankings are identical,
• -1 when the two rankings are in reversed order,
• 0 when the two rankings show no relationship whatever, either positive or negative.

Other than in these extreme cases, the actual values of Kendall's τ or Spearman's ρS carry little significance. What really matters is that it is possible to design tests of the hypothesis that these coefficients are actually 0. The sample is considered as extracted from an infinite population, and these tests assess the plausibility of the null hypothesis :

• H0 :   τ = 0,

or

• H0 :   ρS = 0.

These tests address the question : "Is the data compatible with the hypothesis according to which there is no relationship between these two variables ?".
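A minimal sketch of such tests, using SciPy's `kendalltau` and `spearmanr` on invented rankings of 8 hypothetical students in music and mathematics :

```python
# Sketch of testing H0 : tau = 0 (and H0 : rho_S = 0) with SciPy.
# The two rankings below are made up for the example.
from scipy.stats import kendalltau, spearmanr

music = [1, 2, 3, 4, 5, 6, 7, 8]
maths = [2, 1, 4, 3, 6, 5, 8, 7]   # broadly similar ranking

tau, p_tau = kendalltau(music, maths)
rho, p_rho = spearmanr(music, maths)

# Both coefficients are close to +1 and both p-values are small,
# so H0 (no relationship between the two rankings) is rejected.
print(tau, p_tau)
print(rho, p_rho)
```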

# Non parametric tests based on ranks

As previously mentioned, parametric tests are not robust because of their heavy reliance on the normality assumption. Some classical parametric tests have non parametric counterparts based on ranks. Here are some examples :

| Parametric | Non parametric using ranks |
|---|---|
| Test on Pearson's ρ = 0 | Test on Kendall's τ = 0, or test on Spearman's ρS = 0 |
| t-test on two independent samples | Wilcoxon-Mann-Whitney test |
| 1-Way ANOVA | Kruskal-Wallis test |
| 2-Way ANOVA | Friedman test |

Recall that 1-Way ANOVA is an extension of the t-test on independent samples to more than two samples.

The last three tests are identity tests. They test the null hypothesis according to which :

• Two independent samples (Wilcoxon-Mann-Whitney),
• Three or more independent samples (Kruskal-Wallis),
• Two or more matched samples (Friedman),

were drawn from the same population.


Note that many non parametric tests are not based on ranks. For example, an alternative to the Kruskal-Wallis test is the Chi-Square test for k independent samples, which is not based on ranks.
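Two of the identity tests above can be sketched with SciPy's `mannwhitneyu` and `kruskal` ; the three samples below are invented and deliberately well separated :

```python
# Sketch : rank-based counterparts of the t-test and 1-Way ANOVA.
# The sample values are made up for the example.
from scipy.stats import mannwhitneyu, kruskal

a = [1.1, 2.3, 2.9, 3.8, 4.0]
b = [5.2, 6.1, 6.6, 7.4, 8.0]
c = [9.1, 9.9, 10.4, 11.2, 12.0]

# Wilcoxon-Mann-Whitney : were a and b drawn from the same population ?
u_stat, p_mw = mannwhitneyu(a, b, alternative='two-sided')

# Kruskal-Wallis : same question for three (or more) independent samples.
h_stat, p_kw = kruskal(a, b, c)

# Both p-values are small here : the samples clearly differ.
print(p_mw, p_kw)
```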

# Ties

A "native" ordinal variable usually has only a small number of values ("Hot", "Lukewarm", "Cold"). The extreme case is that of binary variables, which have only two values ("Male", "Female"). It is then nearly impossible to prevent two or more observations from having the same value of the ranking variable. Assigning ranks to observations then becomes ambiguous, and the observations are said to be tied.

Here is an example illustrating ties : the same task is assigned to every person in a group of n people containing both men (M) and women (F). Once he or she has completed the task, a (numeric) grade is assigned to the result. The question is : "Is there a significant difference between Men and Women as far as accomplishing this task is concerned ?".

The grades are pooled irrespective of the gender of the persons, then ranked as in the following table :

| Rank of performance | 1 | 2 | 3 | ..... | n-1 | n |
|---|---|---|---|---|---|---|
| Gender | M | F | F | ..... | M | F |

One way to answer is to calculate a rank correlation coefficient on the data (for example Kendall's τ), and run a test of the null hypothesis H0 :   τ = 0.

But we are in a situation with many ties, and these ties can usually be handled in several (and somewhat arbitrary) ways.

Ties are ubiquitous in rank-based tests. Software usually takes care of that, but the analyst has to be aware of the existence of the problem.
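As a sketch of the most common tie-handling rule, SciPy's `rankdata` with `method='average'` gives each group of tied observations the average of the ranks they would jointly occupy (the data are invented) :

```python
# Sketch of the usual tie-handling rule : tied observations share the
# average of the ranks they would jointly occupy. Data are made up.
from scipy.stats import rankdata

x = [7, 3, 3, 9, 7, 7]
ranks = list(rankdata(x, method='average'))

# The two 3s share ranks 1 and 2 (average 1.5) ; the three 7s share
# ranks 3, 4 and 5 (average 4.0) ; 9 alone gets rank 6.
print(ranks)
```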
