Correlation coefficient
Please read the entry on Covariance first.
One problem with covariance is that it is sensitive to the scales on which the values of the random variables are measured. Say you're computing the covariance between "Height" and "Weight" in a population. Measure weights in kilograms instead of pounds, or heights in centimeters instead of inches, and the value of the covariance changes, whereas the strength of the link between "Height" and "Weight" of course remains the same. So we would like a measure of the strength of the link between "Height" and "Weight" that does not depend on the units used to measure these quantities.
Now suppose that the unit measuring X1 is divided by 2 (so that values of X1 are multiplied by 2). Then the covariance Cov(X1, X2) is also multiplied by 2. But the standard deviation of X1 (square root of the variance) is also multiplied by 2, so the ratio :
$$\frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)}}$$
is unchanged. The same argument applies to X2, and, more generally, to any change of units in which X1 or X2 are measured.
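The argument above can be checked numerically. The sketch below uses made-up height and weight values (an assumption for illustration only) and converts heights from inches to centimeters, i.e. multiplies them by 2.54 rather than by 2. The helper names `covariance` and `std_dev` are hypothetical.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def std_dev(xs):
    """Population standard deviation (square root of the variance)."""
    return sqrt(covariance(xs, xs))


# Hypothetical data: heights in inches, weights in pounds.
heights_in = [60.0, 64.0, 66.0, 70.0, 74.0]
weights_lb = [110.0, 130.0, 150.0, 165.0, 200.0]

# The same heights expressed in centimeters.
heights_cm = [2.54 * h for h in heights_in]

cov_in = covariance(heights_in, weights_lb)
cov_cm = covariance(heights_cm, weights_lb)

# Dividing by the standard deviation of the rescaled variable
# cancels the change of units.
ratio_in = cov_in / std_dev(heights_in)
ratio_cm = cov_cm / std_dev(heights_cm)

print("covariance scale factor:", cov_cm / cov_in)
print("ratios:", ratio_in, ratio_cm)
```

The covariance is multiplied by the conversion factor, while the two ratios agree up to rounding error.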
So, quite generally, the number :
$$\rho(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)}\,\sqrt{\mathrm{Var}(X_2)}}$$
does not depend on the units in which X1 and X2 are expressed.
This number is known as the Correlation Coefficient of the pair of variables (X1, X2); it may be perceived as the standardized version of the Covariance. It will be noted :
* ρ(X1, X2) when the coefficient is calculated from the distributions of X1 and X2.
* r(X1, X2) when these distributions are known only through a sample (see below).
* The value of the correlation coefficient is always between -1 and +1 :
$$-1 \le \rho(X_1, X_2) \le +1$$
* If X1 = X2 then Cov(X1, X2) = Var(X1) = Var(X2). Therefore, ρ(X1, X1) = +1.
* The Correlation Coefficient is symmetrical : ρ(X1, X2) = ρ(X2, X1).
* If both variables have unit variances, then their Covariance is the same as their Correlation Coefficient.
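* When the two distributions are known only through a sample of n pairs (x1i, x2i), with sample means x̄1 and x̄2, the common estimate of the Correlation Coefficient is :
$$r(X_1, X_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}\,\sqrt{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}}$$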
Little is known about its behavior in the general case. When (X1, X2) is binormal, r is known to be slightly biased toward 0 (i.e. |r| tends to underestimate |ρ| : the variables are probably slightly more correlated than the sample suggests).
Furthermore, when ρ = 0 (the variables are uncorrelated, and therefore independent because they are normal), the distribution of r is known to a good approximation, and a test for uncorrelatedness (H0 : ρ = 0) can be devised.
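As a minimal sketch, the sample estimate r can be computed in its usual product-moment form (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). One standard test of H0 : ρ = 0, assumed here and not necessarily the one the entry has in mind, uses the statistic t = r √((n - 2) / (1 - r²)), which follows a Student t distribution with n - 2 degrees of freedom when the pair is binormal and ρ = 0. The data and function names below are hypothetical.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Product-moment estimate r from paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """Statistic for H0: rho = 0 (Student t, n - 2 degrees of freedom,
    under the assumption that the pair is binormal)."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Hypothetical sample of six paired observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 1.9, 3.8, 3.5, 5.2, 4.9]

r = sample_correlation(x1, x2)
t = t_statistic(r, len(x1))
print("r =", r, " t =", t)
```

The value of t would then be compared with the quantiles of the Student t distribution with n - 2 degrees of freedom; a large |t| leads to rejecting H0.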
* ρ = +1 or ρ = -1 implies a perfect linear functional relationship between X1 and X2 : there are then 3 numbers a, b, and c, with a and b both nonzero, such that :
aX1 + bX2 + c = 0
See top illustration below.
* What if ρ is near 0 ? If (and only if) the relationship between X1 and X2 is indeed linear, then it can be said that this relationship is weak.
See bottom illustration below.
A strong but non-linear relationship between X1 and X2 may lead to a low value of the correlation coefficient, as is illustrated in the top and bottom images of the illustration below :
So, unless the relationship between X1 and X2 is known to be linear, no conclusion can be drawn from a low correlation coefficient. It is sometimes said that the correlation coefficient captures only "the linear part" of the link between X1 and X2.
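A small numerical illustration of this point, under the assumption of a deliberately artificial data set : X2 is an exact function of X1 (its square), yet because the values of X1 are symmetric around 0, the correlation coefficient vanishes. The function name `correlation` is hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# X1 takes values symmetric around 0; X2 is fully determined by X1.
x1 = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
x2 = [x * x for x in x1]

rho = correlation(x1, x2)
print("correlation between X1 and X1 squared:", rho)
```

The link between the two variables could not be stronger (knowing X1 gives X2 exactly), but the coefficient is essentially zero because the link has no linear component.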
Two variables whose correlation coefficient is 0 (or, more loosely, near 0) are said to be uncorrelated.
Lack of correlation should not be confused with genuine independence :
* Two independent variables are also uncorrelated,
* But two uncorrelated variables may be far from independent (see the example above). Only when both variables are normal and have a bivariate normal joint distribution are the concepts of "lack of correlation" and "independence" equivalent : two uncorrelated normal variables in a binormal relationship are independent.
So, in general, independence is a much stronger property than lack of correlation.
Simple Linear Regression is intimately linked to the Correlation Coefficient. In particular, if the two variables x1 and x2 of the regression have identical variances (e.g. after they have been standardized), then the slope of the (unique) regression line is equal to the Correlation Coefficient.
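This link can be verified on a small example. The sketch below standardizes two hypothetical samples, computes the least-squares slope as the ratio of the sum of cross-products to the sum of squares of the predictor, and compares it with the correlation coefficient. All names and data are assumptions made for illustration.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def standardize(xs):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m = mean(xs)
    s = sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / s for x in xs]


def ols_slope(xs, ys):
    """Least-squares slope of the regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations measured on different scales.
x1 = [150.0, 160.0, 165.0, 172.0, 180.0, 191.0]
x2 = [52.0, 58.0, 66.0, 64.0, 79.0, 85.0]

z1, z2 = standardize(x1), standardize(x2)

r = correlation(x1, x2)
slope_2_on_1 = ols_slope(z1, z2)  # regression of x2 on x1, standardized
slope_1_on_2 = ols_slope(z2, z1)  # regression of x1 on x2, standardized

print(r, slope_2_on_1, slope_1_on_2)
```

After standardization both regressions have the same slope, which coincides with r up to rounding error. Without standardization, the slope of the regression of x2 on x1 is r multiplied by the ratio of the standard deviation of x2 to that of x1.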
Popular and commonplace as it is, the concept of correlation is deceptively simple :
1) A high value of the Correlation Coefficient is often perceived as implying a causal relationship between the two variables. This is totally unjustified. For example, the two events represented by X1 and X2 might simply have a common cause.
A popular example of this phenomenon is as follows. An insurance company has detected that if :
* X1 is the number of firemen on the site of a conflagration,
* and X2 is the amount of money claimed by the victims of the fire,
then there is a strong positive correlation between X1 and X2. Should the insurance company deduce that the local fire department is a nuisance, because "the more firemen, the more damage" ? Of course not. Now, if a third variable X3, the extent of the conflagration, is taken into account, it becomes clear that, despite the strong positive value of their correlation coefficient, there is no causal relationship between "number of firemen" and "damage caused by the fire", and that both are in fact caused by the third variable "Extent of the fire".
This very important idea is formalized by the concept of Partial Correlation, which deserves an entry of its own.
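The firemen example can be imitated with synthetic data. In the sketch below, which is only an illustration under stated assumptions, both "firemen" and "damage" are generated as the extent of the fire plus independent noise, so that by construction neither causes the other. The partial correlation is computed with the standard first-order formula r12.3 = (r12 - r13 r23) / √((1 - r13²)(1 - r23²)). The variable names, noise levels, and sample size are all hypothetical.

```python
import random
from math import sqrt


def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_correlation(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 given 3."""
    return (r12 - r13 * r23) / sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 2000

# X3: extent of the fire (the common cause), in arbitrary units.
extent = [rng.gauss(0.0, 1.0) for _ in range(n)]
# X1 and X2 each depend on X3 plus their own independent noise.
firemen = [e + rng.gauss(0.0, 0.5) for e in extent]
damage = [e + rng.gauss(0.0, 0.5) for e in extent]

r12 = correlation(firemen, damage)
r13 = correlation(firemen, extent)
r23 = correlation(damage, extent)
r12_given_3 = partial_correlation(r12, r13, r23)

print("simple correlation firemen/damage :", r12)
print("partial correlation given extent  :", r12_given_3)
```

The simple correlation between the two variables is strongly positive, whereas the partial correlation, once the extent of the fire is accounted for, is close to 0.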
2) A low value of the correlation coefficient is not enough to conclude that there is no strong link between the two variables under consideration.
* First, as we already mentioned, because of a possibly strong but non-linear relationship between the variables.
* Second, because when more than two variables are necessary to describe a phenomenon, the strength of the link is better captured by their partial correlation coefficient than by their simple (or "total") correlation coefficient, and the two may have very different values.
3) Large number of variables
Suppose that you calculate the pairwise correlation coefficients of a large number of variables in a database with a rather limited number of observations. Then it can be shown (and it is rather intuitive) that at least one high correlation coefficient is likely to show up purely by chance : the sample values are random, and with enough pairs to choose from, some pair will look strongly correlated by accident.
Therefore, when considering large correlation matrices, it is necessary to resort to countermeasures that protect the user against spurious interpretations of high Correlation Coefficients.
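This effect is easy to reproduce by simulation. The sketch below, with arbitrarily chosen sizes (50 variables, 10 observations each, both assumptions), draws every variable independently of all the others and then looks for the largest absolute pairwise correlation coefficient among the 1225 pairs.

```python
import random
from itertools import combinations
from math import sqrt


def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is reproducible
n_vars, n_obs = 50, 10

# Every variable is pure independent noise: all true correlations are 0.
data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]

n_pairs = n_vars * (n_vars - 1) // 2
max_abs_r = max(abs(correlation(a, b)) for a, b in combinations(data, 2))

print("number of pairs:", n_pairs)
print("largest |r| among independent variables:", max_abs_r)
```

Although no variable is related to any other, the largest coefficient found is high, which is why a single large entry in a big correlation matrix should not be interpreted on its own.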
The Correlation Coefficient generalizes to the situation where one variable Y is pitted against a set of variables {X1, X2, ..., Xn}. The strength of the linear link between Y and {X1, X2, ..., Xn} is then measured by the so-called Multiple Correlation Coefficient.