Back To "Correlation coefficient"
INTERACTIVE ANIMATION : COVARIANCE AND CORRELATION COEFFICIENT
This animation illustrates both the concepts
of Covariance and of Correlation Coefficient.
We suggest that you also take a look at the interactive animations:
* about the Covariance Matrix.
* about the Bivariate Normal distribution.
* The animation displays a set of points, together with the respective values of the Covariance, the Standard Deviations of x and y, and the Correlation Coefficient.
To give you a feel for the strong relationship between "Correlation Coefficient" and "Simple Linear Regression", the Least Squares Line (LSL) of y on x is also shown. Note that the line would have been different had we considered the Regression of x on y: regression is not symmetrical in the two variables, which is why we switched from the (x1, x2) notation to the (x, y) notation. The Correlation Coefficient is, of course, symmetrical in x and y.
* Move points around with your mouse to have all points sitting on one straight line, and observe the value of the Correlation Coefficient. Notice that this value does not depend on the slope of the line (except for the sign).
* Many different types of configurations can exhibit a near-0 Correlation Coefficient. For example, any blob that is roughly rotationally symmetrical is in this category (try with a large number of points).
Yet, it is quite possible to build configurations with low to moderate (absolute) values of the Correlation Coefficient, but with the two variables clearly exhibiting a strong, even deterministic (but not linear) relationship. It is quite instructive to spend some time building such configurations.
In particular, it often happens in practice that the data can be partitioned into "blocks", each block exhibiting a strong linear relationship between the variables. But because these relationships are different from block to block, and/or the blocks are not lined up, the resulting global value of the Correlation Coefficient is low. Try to build such "blockwise linear" sets, and observe the deceptively low value of the Correlation Coefficient. Just how close to "0" can you get with two blocks with roughly equal populations ?
Observe also that in such cases, the LSL does not reflect the structure of the data set at all. Simple Linear Regression is therefore very sensitive to the internal structure of the data set. Although this kind of pathological situation can be detected visually in SLR, this is not the case in Multiple Linear Regression (or in just about any other data modeling technique), where the problem is just as severe. This is why it is advisable to spend some time trying to detect homogeneous "blocks" in a data set (clustering) before modeling. If such clearly identified blocks exist, then it is usually good practice to first break up the data set into these blocks, and then model each block separately.
In conclusion,
a high value of the Correlation Coefficient always tells the truth, while a
low to moderate value tells essentially nothing by itself about a possible relationship
between the variables.
* "Vertical" and "Horizontal" sets.
Look again at the mathematical expression for the Correlation Coefficient.
If either of the Standard Deviations is 0, then the denominator is 0, a forbidden situation
... unless
the numerator is 0 too. And indeed, the Covariance is also 0 in such a case.
What about the value of the Correlation Coefficient? The formal expression
is then 0 / 0, which suggests that the Correlation Coefficient is undefined.
It can be shown that this is indeed the case.
Create a 2-point set, with the
two points sitting on a vertical line. Now slowly move the top point
right and left across a narrow horizontal range: the Correlation Coefficient jumps from
+1 (right position) to -1 (left position), thus displaying a discontinuous behavior
when the x Standard Deviation reaches the value 0.
* The Correlation Coefficient is very sensitive to
outliers. First create
a set of points that occupies only a small area near one edge of the
stage. Then drag one of the points far away from this region (an outlier). Observe
the large changes in the value of the Correlation Coefficient as you move the
outlier across the stage.
Also notice that as a side effect, the LSL
tends to "follow" the outlier, and is therefore meaningless.
* The value of the Correlation Coefficient does
not change if you add (or subtract) the same quantity to all x-coordinates,
and/or another quantity to all y-coordinates. The Correlation Coefficient
is "invariant by translation".
Click anywhere in the
"interior" of the set of points (but not on a point), drag the
set, and observe that the value of the Correlation Coefficient does not change.
Note that neither the Covariance nor the Standard Deviations change in the process.
The same result would have been obtained by
dragging the axes instead of the set of points (not implemented).
* The (absolute) value of the Correlation Coefficient
does not change if you multiply all x-coordinates by the same nonzero factor,
and/or the y-coordinates by another factor. The Correlation Coefficient
is invariant (except possibly for the sign) by arbitrary changes of units on the x- and y-axes.
Click anywhere on the scene "outside" the set of points,
and slowly drag your mouse. The deformations of the set of points correspond
to changes of units on the x- and y-axes. Notice that although
the values of the Covariance and Standard Deviations vary during the drag,
the value of the Correlation Coefficient remains constant ... so long as the
mouse does not cross an axis. If it does, the sign of the Correlation Coefficient
is reversed.