Correlation coefficient

# Correlation coefficient and Covariance

One problem with covariance is that it is sensitive to the scales on which the values of the random variables are measured. Say you are computing the covariance between "Height" and "Weight" in a population. Measure weights in kilograms instead of pounds, or heights in centimeters instead of inches, and the value of the covariance changes, whereas the strength of the link between "Height" and "Weight" of course remains the same. So we would like a measure of the strength of the link between "Height" and "Weight" that does not depend on the units used to measure these quantities.

Now suppose that the unit in which X1 is measured is divided by 2 (so that the values of X1 are multiplied by 2). Then the covariance Cov(X1, X2) is also multiplied by 2. But the standard deviation of X1 (the square root of its variance) is also multiplied by 2, so the ratio:

Cov(X1, X2) / √Var(X1)

is unchanged. The same argument applies to X2 and, more generally, to any change of the units in which X1 or X2 is measured.

So, quite generally, the number:

ρ(X1, X2) = Cov(X1, X2) / √( Var(X1) · Var(X2) )

does not depend on the units in which X1 and X2 are expressed. The Correlation Coefficient may therefore be perceived as the standardized version of the Covariance.

This number is known as the Correlation Coefficient of the pair of variables (X1, X2). It will be denoted:

* ρ(X1, X2) when the coefficient is calculated from the distributions of X1 and X2.

* r(X1, X2) when these distributions are known only through a sample (see below).
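The scale-invariance argument above can be checked numerically. The sketch below (pure standard-library Python; the simulated "Height"/"Weight" data and the unit-conversion factors are illustrative assumptions, not part of the original text) computes covariance and correlation in two systems of units:

```python
import math
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr(xs, ys):
    """Sample correlation coefficient: covariance standardized by both SDs."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

random.seed(0)
# Simulated heights (cm) and weights (kg) with a positive link
height_cm = [random.gauss(170, 10) for _ in range(1000)]
weight_kg = [0.9 * h - 90 + random.gauss(0, 8) for h in height_cm]

# The same quantities expressed in inches and pounds
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]

print(cov(height_cm, weight_kg), cov(height_in, weight_lb))    # covariances differ
print(corr(height_cm, weight_kg), corr(height_in, weight_lb))  # correlations coincide
```

Changing units rescales the covariance, but the correlation coefficient comes out identical (up to floating-point error), as the argument predicts.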

# Properties of the Correlation Coefficient

* The value of the correlation coefficient is always between -1 and +1 :

 -1 ≤ ρ ≤ +1

* If X1 = X2 then Cov(X1, X2) = Var(X1) = Var(X2). Therefore ρ(X1, X1) = +1: any variable is perfectly correlated with itself.

* The Correlation Coefficient is symmetrical : ρ(X1, X2) = ρ(X2, X1).

* If both variables have unit variances, then their Covariance is the same as their Correlation Coefficient.

* When the two distributions are known only through a sample {(x1i, x2i)}, i = 1, ..., n, the common estimate of the Correlation Coefficient is:

r = Σi (x1i - x̄1)(x2i - x̄2) / √( Σi (x1i - x̄1)² · Σi (x2i - x̄2)² )

Little is known about its behavior in the general case. When (X1, X2) is binormal, r is known to exhibit a slight bias toward 0 (i.e. |r| tends to underestimate |ρ|: the variables tend to be slightly more correlated than the sample suggests).

Furthermore, when ρ = 0 (the variables are uncorrelated, and therefore independent because they are jointly normal), the distribution of r is known to a good approximation, and a test for uncorrelatedness (H0: ρ = 0) can be devised.
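One classical form of this test uses the statistic t = r·√(n-2)/√(1-r²), which, under H0: ρ = 0 and binormality, follows a Student t distribution with n-2 degrees of freedom (approximately N(0, 1) for large n). A minimal sketch, with simulated independent normal data as an illustrative assumption:

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient (the estimate r defined above)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]  # generated independently of x1

r = pearson_r(x1, x2)
# Under H0: rho = 0, t follows Student's t with n-2 degrees of freedom,
# which is close to N(0, 1) for n this large.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(r, t)  # |t| < 1.96 -> H0 is not rejected at the 5% level
```

With genuinely independent samples, r stays close to 0 and the test (correctly) fails to reject H0 most of the time.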

# Interpretation of the correlation coefficient

* ρ = +1 or ρ = -1 implies a perfect linear functional relationship between X1 and X2. Then there are three numbers a, b and c (with a and b not both zero) such that:

aX1 + bX2 + c = 0

See top illustration below.

* What if ρ is near 0 ?  If (and only if) the relationship between  X1 and X2  is indeed linear, then it can be said that this relationship is weak.

See bottom illustration below.

A strong but non-linear relationship between X1 and X2 may nevertheless lead to a low value of the correlation coefficient, as shown in the top and bottom images of the illustration below:

So, unless the relationship between X1 and X2 is known to be linear, no conclusion can be drawn from a low correlation coefficient. It is sometimes said that the correlation coefficient captures only "the linear part" of the link between X1 and X2.

Two variables with a near 0 correlation coefficient are said to be uncorrelated.
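The classic counter-example can be checked numerically: take X uniform on (-1, 1) and Y = X². The dependence is perfect (Y is a function of X), yet the covariance, and hence the correlation, is essentially zero because the relationship is symmetric rather than linear. A short sketch (the uniform distribution and sample size are illustrative choices):

```python
import random

random.seed(0)
n = 20000
x = [random.uniform(-1, 1) for _ in range(n)]
y = [v * v for v in x]   # perfect, but non-linear, dependence: Y = X^2

mx, my = sum(x) / n, sum(y) / n
cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
print(cov_xy)  # very close to 0: X and Y are (nearly) uncorrelated
```

Analytically Cov(X, X²) = E[X³] = 0 for any distribution symmetric about 0, so the sample covariance hovers near zero despite the exact functional link.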

Lack of correlation should not be confused with genuine independence :

* Two independent variables are also uncorrelated,

* But two uncorrelated variables may be far from independent (see the example above). Only when both variables are normal and have a bivariate normal joint distribution are the concepts of "lack of correlation" and "independence" equivalent: two uncorrelated normal variables in a binormal relationship are independent.

So, in general, independence is a much stronger property than lack of correlation.

# Correlation coefficient and Linear Regression

Simple Linear Regression is intimately linked to the Correlation Coefficient. In particular, if the two variables x1 and x2 of the regression have identical variances (e.g. after they have been standardized), then the slope of the (unique) regression line is equal to the Correlation Coefficient.

# Correlation coefficient may be misleading

Popular and commonplace as it is, the concept of correlation is deceptively simple:

1) A high value of the Correlation Coefficient is often perceived as implying a causal relationship between the two variables. This is totally unjustified. For example, the phenomena represented by X1 and X2 might simply have a common cause.

A popular example of this phenomenon is as follows. An insurance company has detected that if :

* X1 is the number of firemen on the site of a conflagration,

* and X2 is the amount of money claimed by the victims of the fire,

then there is a strong positive correlation between X1 and X2. Should the insurance company deduce that the local fire department is a nuisance, because "the more firemen, the more damage"? Of course not. Now, if a third variable X3, the extent of the conflagration, is taken into account, it becomes clear that, despite the strong positive value of their correlation coefficient, there is no causal relationship between "number of firemen" and "damage caused by the fire": both are in fact caused by the third variable, "extent of the fire".

This very important idea is formalized by the concept of  Partial Correlation, which deserves an entry of its own.
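The firemen example can be simulated, and the usual first-order partial correlation formula, ρ12·3 = (ρ12 - ρ13·ρ23) / √((1 - ρ13²)(1 - ρ23²)), applied to it. All numbers below (fire extents, noise levels, coefficients) are illustrative assumptions:

```python
import math
import random

def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(r12, r13, r23):
    """Correlation between variables 1 and 2 once variable 3 is held fixed."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

random.seed(0)
extent = [random.uniform(1, 10) for _ in range(5000)]       # X3: extent of the fire
firemen = [2 * e + random.gauss(0, 1) for e in extent]      # X1: driven only by X3
damage = [5 * e + random.gauss(0, 3) for e in extent]       # X2: driven only by X3

r12 = corr(firemen, damage)
r13 = corr(firemen, extent)
r23 = corr(damage, extent)
print(r12)                           # strongly positive: "more firemen, more damage"
print(partial_corr(r12, r13, r23))   # close to 0: no direct link once X3 is controlled
```

The total correlation between "firemen" and "damage" is large, but the partial correlation controlling for "extent" collapses to nearly zero, exposing the common cause.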

2) A low value of the correlation coefficient is not enough to conclude about the lack of a strong link between the two variables under consideration.

* First, as we already mentioned, because of a possibly strong but non linear relationship between the variables.

* Second, because when more than two variables are necessary to describe a phenomenon, the strength of the link is better captured by their partial correlation coefficient than by their simple (or "total") correlation coefficient, and the two may have very different values.

3) Large number of variables

Suppose that you calculate the pairwise correlation coefficients of a large number of variables in a database with a rather limited number of observations. Then it can be shown (and it is rather intuitive) that at least one large correlation coefficient is likely to show up purely by chance, even if all the variables are mutually independent: randomness alone can produce seemingly strong patterns.

Therefore, when examining large correlation matrices, it is necessary to resort to countermeasures meant to protect the user against hazardous interpretations of large Correlation Coefficients.
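The phenomenon is easy to reproduce: generate many purely independent variables over few observations and look at the largest pairwise |r|. The sample sizes below are illustrative assumptions:

```python
import math
import random

def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n_obs, n_vars = 20, 50   # few observations, many variables: 1225 pairs to scan
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

best = max(abs(corr(data[i], data[j]))
           for i in range(n_vars) for j in range(i + 1, n_vars))
print(best)  # a substantial value, even though every variable is pure noise
```

With 1225 pairs and only 20 observations each, the maximum |r| is typically far above what any single pair of independent variables would produce, which is exactly why large correlation matrices must be screened with multiple-comparison safeguards.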

# Multiple Correlation Coefficient

The Correlation Coefficient generalizes to the situation where one variable Y is pitted against a set of variables {X1, X2, ..., Xn}. The strength of the linear link between Y and {X1, X2, ..., Xn} is then measured by the so-called Multiple Correlation Coefficient.
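One standard way to compute it: regress Y on the set of predictors by least squares and take the (simple) correlation between Y and the fitted values Ŷ; that correlation is the multiple correlation coefficient R. A sketch with two predictors, solving the normal equations by hand (the simulated data and coefficients are illustrative assumptions):

```python
import math
import random

def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * a - 2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Least-squares fit of y on (1, x1, x2): solve the 2x2 normal equations
# for the two centered predictors, then add back the intercept.
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
c1 = [a - m1 for a in x1]
c2 = [b - m2 for b in x2]
cy = [v - my for v in y]
s11 = sum(a * a for a in c1)
s22 = sum(b * b for b in c2)
s12 = sum(a * b for a, b in zip(c1, c2))
s1y = sum(a * v for a, v in zip(c1, cy))
s2y = sum(b * v for b, v in zip(c2, cy))
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
y_hat = [my + b1 * a + b2 * b for a, b in zip(c1, c2)]

R = corr(y, y_hat)   # the multiple correlation coefficient
print(R)
```

Because the noise here is small relative to the linear signal, R comes out close to 1; R² is the familiar coefficient of determination of the regression.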
