Correlation (Partial)
Two variables may have a high correlation coefficient and yet have only a weak direct (linear) relationship. For instance, imagine that a bank discovers that, among its customers aged 25 to 60, the two variables:
* "Age", and
* "Stock_Portfolio",
have a high correlation coefficient. Should it conclude that investment boldness comes with age?
Now suppose that the bank takes a third variable, "Income", into consideration. It forms groups of individuals with similar incomes, and finds that both "Age" and "Stock_Portfolio" increase on average as higher and higher incomes are considered.
It is then very likely that, within each "Income" group, the correlation between "Age" and "Stock_Portfolio" will be much less pronounced than when calculated on the entire customer base. The correlation observed between "Age" and "Stock_Portfolio" then appears as a spurious consequence of the tendency of income to increase with age.
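The bank's reasoning can be tried out on simulated data. The sketch below is only an illustration under assumed conditions: the generative model, its coefficients and the choice of 20 income groups are all hypothetical, and the portfolio is deliberately made to depend on income only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical model (an assumption, not bank data): income rises with age,
# and the portfolio depends on income only, never directly on age.
age = rng.uniform(25, 60, n)
income = 1_000 * age + rng.normal(0, 4_000, n)
stock_portfolio = 0.5 * income + rng.normal(0, 2_000, n)

def corr(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Correlation computed on the entire base.
total_corr = corr(age, stock_portfolio)

# Split the sample into 20 income groups of equal size.
n_groups = 20
edges = np.quantile(income, np.linspace(0, 1, n_groups + 1))
group = np.digitize(income, edges[1:-1])  # group index 0 .. n_groups - 1
within_corr = np.array([
    corr(age[group == g], stock_portfolio[group == g]) for g in range(n_groups)
])

print(f"whole sample: {total_corr:.2f}")
print("within income groups:", np.round(within_corr, 2))
```

The lowest and highest income groups span a wider range of incomes than the middle ones, so their within-group correlation may stay further from zero.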
The concept needed to unmask this pernicious phenomenon is called partial correlation. It means "correlation when all other variables are kept at fixed values", or "controlling for other variables". When the question is whether two variables are directly related, partial correlation is a great deal more meaningful than the total correlation considered so far.
To illustrate the concept of partial correlation, imagine that you generate a 3D volume by translating a circular disk parallel to itself as in the picture below. Now fill this volume with a uniform distribution.
[Figure: the volume generated by the translated disk]
The projection of this distribution onto the "Age" vs. "Stock_Portfolio" plane is elongated and hugs a straight line rather closely. As a consequence, the correlation coefficient of this pair of variables has a high value.
But if you now do the same thing with only the individuals whose "Income" lies in a narrow, predefined bracket (rather than with the whole sample), you end up with a uniform distribution in a circular disk, which means that the correlation coefficient for this subsample is almost zero.
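The construction can be reproduced numerically. The following is a minimal sketch with arbitrary units: the disk radius, the range of the third variable and the bracket width are assumptions chosen only to make the effect visible, and x, y, z stand in for "Age", "Stock_Portfolio" and "Income".

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Uniform points in a disk of radius 1 (the cross-section of the volume).
radius = np.sqrt(rng.uniform(0, 1, n))
angle = rng.uniform(0, 2 * np.pi, n)
dx, dy = radius * np.cos(angle), radius * np.sin(angle)

# Translate the disk along an oblique axis: its centre moves with z.
z = rng.uniform(0, 10, n)  # stands in for "Income"
x = z + dx                 # stands in for "Age"
y = z + dy                 # stands in for "Stock_Portfolio"

# Correlation on the whole sample (projection onto the x-y plane).
total_corr = float(np.corrcoef(x, y)[0, 1])

# Correlation restricted to a narrow bracket of z.
in_bracket = (z > 4.9) & (z < 5.1)
bracket_corr = float(np.corrcoef(x[in_bracket], y[in_bracket])[0, 1])

print(f"whole sample:  {total_corr:.2f}")
print(f"narrow bracket: {bracket_corr:.2f}")
```

Moving the bracket to another value of z should leave the second figure close to zero, since every cross-section is the same disk.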
This example is not as academic as it looks. In fact, it is only too common to see high correlation coefficients taken at face value as an indication that two variables have a strong relationship when in reality they have none, or maybe just a weak one. The high value is then just an artefact caused by other variables (here, "Income").
This example should not lead you to believe that partial correlation is always lower than total correlation: the reverse may also happen.
Let's take one step toward professionalism and replace the pseudo-realistic names with three variables x, y, z. This time, we define the volume as a thin elliptic slice whose orientation in space is such that its projection onto the (x, y) plane is nearly circular. Let's now fill this volume with a uniform distribution.
[Figure: the thin elliptic slice in (x, y, z) space]
By construction, the total correlation of x and y is nearly zero, as the projected distribution uniformly fills a nearly circular disk.
Now consider the part of the distribution with an arbitrary but fixed value z0 of z. It fills a narrow, elongated, rectilinear zone in the z = z0 plane, which means that the correlation between x and y is high for this subpopulation.
So the situation is now exactly the opposite of what it was before: the two variables x and y have a low (total) correlation, but a high partial correlation. The link between x and y is now masked by the third variable z.
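This second construction can also be checked on simulated points. The sketch below assumes one particular slice, z = 2(x - y) plus a small thickness, with (x, y) uniform in the unit disk; the tilt, the thickness and the bracket width are hypothetical choices, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# (x, y) uniform in the unit disk: the circular projection of the slice.
radius = np.sqrt(rng.uniform(0, 1, n))
angle = rng.uniform(0, 2 * np.pi, n)
x, y = radius * np.cos(angle), radius * np.sin(angle)

# Thin tilted slice: z is fixed by (x, y) up to a small uniform thickness.
z = 2.0 * (x - y) + rng.uniform(-0.05, 0.05, n)

# Total correlation of x and y, ignoring z.
total_corr = float(np.corrcoef(x, y)[0, 1])

# Correlation of x and y for points whose z lies close to a fixed z0.
z0 = 0.0
in_slice = np.abs(z - z0) < 0.05
slice_corr = float(np.corrcoef(x[in_slice], y[in_slice])[0, 1])

print(f"total correlation:     {total_corr:.2f}")
print(f"correlation at z = z0: {slice_corr:.2f}")
```

Changing z0 to another value well inside the range of z selects a different chord of the disk, along which x and y should again be strongly correlated.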
______________
In both of the above examples, we built artificial distributions in order to depict the concept of partial correlation graphically and to illustrate how deceptive total correlation may be. Just how artificial these distributions were also shows in the fact that, in both cases, the value of the correlation coefficient in the subpopulation was nearly the same whatever value was imposed on the "control" variable.
Of course, such a thing is not expected to happen in real life: for each new value of the control variable, one would expect to observe a new value of the correlation coefficient for the subpopulation.
Yet statistical software always delivers a single value for the "partial correlation coefficient" between two variables. Why is that?
There is a nearly realistic distribution for which the partial correlation between two variables does not change at all when you change the values imposed on all the other variables: the multinormal distribution. Recall that the shape of a multinormal distribution is completely determined by the pairwise (total) correlation coefficients of its variables. This is why, by convention, the partial correlation coefficient of a distribution is defined as the single, unambiguous value that this coefficient takes for the "equivalent" multinormal distribution, that is, the multinormal distribution with the same pairwise correlation coefficients.
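As an illustration of this convention, the single value can be computed from the pairwise correlation coefficients alone. The sketch below uses a standard identity (partial correlations are the rescaled, sign-flipped off-diagonal entries of the inverse correlation matrix) and, for three variables, the equivalent closed formula. The function name `partial_corr_matrix`, the correlation matrix and the bracket width are hypothetical choices made for this example.

```python
import numpy as np

def partial_corr_matrix(data):
    """Partial correlation of every pair of columns, controlling for all others.

    Uses the identity r_ij.rest = -P_ij / sqrt(P_ii * P_jj),
    where P is the inverse of the correlation matrix.
    """
    corr = np.corrcoef(data, rowvar=False)
    precision = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return partial

rng = np.random.default_rng(3)

# Hypothetical correlation matrix for (x, y, z), chosen only for illustration.
target = np.array([[1.0, 0.6, 0.7],
                   [0.6, 1.0, 0.5],
                   [0.7, 0.5, 1.0]])
sample = rng.multivariate_normal(np.zeros(3), target, size=200_000)
x, y, z = sample.T

partial_xy = float(partial_corr_matrix(sample)[0, 1])

# Closed formula for three variables, for comparison.
r = np.corrcoef(sample, rowvar=False)
formula_xy = float(
    (r[0, 1] - r[0, 2] * r[1, 2])
    / np.sqrt((1 - r[0, 2] ** 2) * (1 - r[1, 2] ** 2))
)

# Correlation of x and y inside narrow brackets of z centred on several values.
bracket_corr = []
for z0 in (-1.0, 0.0, 1.0):
    mask = np.abs(z - z0) < 0.1
    bracket_corr.append(float(np.corrcoef(x[mask], y[mask])[0, 1]))

print(f"partial (matrix inverse):     {partial_xy:.3f}")
print(f"partial (3-variable formula): {formula_xy:.3f}")
print("within z brackets:", np.round(bracket_corr, 3))
```

For a multinormal sample such as this one, the three bracket values should agree with the partial correlation coefficient up to sampling noise. For other distributions the coefficient is still computable, but the bracket values need not match it.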