Covariance matrix
The variance of a variable is a measure of the dispersion of the values taken by the variable around its mean value.
The covariance matrix generalizes the concept of variance to random vectors, or sets of random variables.
Let x = {X1, X2, ..., Xp} be a random vector with mean vector µ = {µ1, µ2, ..., µp}.
* The dispersion of each Xi around its mean is measured by its variance (which is the covariance of Xi with itself).
* The covariance Cov(Xi, Xj) of the pair {Xi, Xj} is a measure of the linear coupling between these two variables.
This set of numbers (together with the set of means of the Xis) completely defines the structure of the joint probability distribution of {X1, X2, ..., Xp} up to order 2, just as the mean and variance of a single random variable completely define its distribution up to order 2.
It is common to group all these numbers into a square p x p table called the Covariance Matrix of the distribution, in which the entry in row i and column j is Cov(Xi, Xj).

The covariance matrix is often denoted Σ.
* σij is the covariance of Xi and Xj.
* σii is the covariance of Xi with itself, that is its variance σi².
So the diagonal elements of the covariance matrix are the variances of the Xis.
Just as the variance of a single r.v. X is defined by :
Var(X) = E[(X - µ)²]
the covariance matrix Σ of a random vector x with mean vector µ is formally defined by :
Σ = E[(x - µ)(x - µ)']
(where ' denotes transposition), which is easily verified to be equivalent to the informal definition given above.
-----
Just as :
Var(X) = E[X²] - E[X]²
for a single r.v., it is easily verified that for a random vector x with mean vector µ and covariance matrix Σ :
Σ = E[xx'] - µµ'
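As a minimal sketch of the two equivalent definitions, the snippet below computes the covariance matrix of a small sample both ways, using the 1/n convention. The sample values and the variable names (`sample`, `cov_centered`, `cov_moments`) are hypothetical and chosen only for illustration.

```python
# Hypothetical toy sample: 5 observations of a 2-dimensional vector (X1, X2).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n = len(sample)
p = len(sample[0])

# Mean vector: one mean per coordinate.
mu = [sum(row[j] for row in sample) / n for j in range(p)]

# First form: average of the products (xi - mu_i)(xj - mu_j).
cov_centered = [
    [sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in sample) / n
     for j in range(p)]
    for i in range(p)
]

# Second form: average of the products xi * xj, minus mu_i * mu_j.
cov_moments = [
    [sum(row[i] * row[j] for row in sample) / n - mu[i] * mu[j]
     for j in range(p)]
    for i in range(p)
]

print(cov_centered)
print(cov_moments)
```

Both matrices agree up to round-off, and both are symmetric with the variances on the diagonal.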
______________
If all variables are standardized, the covariance matrix is identical to the Correlation Matrix.
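The following sketch checks this statement numerically on a hypothetical toy sample: the covariance matrix of the standardized variables is compared with the correlation matrix of the original variables. The helper `covariance_matrix` is an illustrative name, and the 1/n convention is an assumption.

```python
import math

def covariance_matrix(rows):
    """Covariance matrix (1/n convention) of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
             for j in range(p)] for i in range(p)]

# Hypothetical toy sample of 5 observations of (X1, X2).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n, p = len(sample), len(sample[0])

cov = covariance_matrix(sample)
sd = [math.sqrt(cov[j][j]) for j in range(p)]

# Correlation matrix: each covariance divided by both standard deviations.
corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

# Standardize each variable: subtract its mean, divide by its standard deviation.
mu = [sum(r[j] for r in sample) / n for j in range(p)]
standardized = [[(r[j] - mu[j]) / sd[j] for j in range(p)] for r in sample]

cov_of_standardized = covariance_matrix(standardized)
print(corr)
print(cov_of_standardized)
```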
The covariance matrix of a random vector is not an intrinsic quantity attached to its distribution : it depends on the basis in which it is calculated. We'll see below that some bases allow a covariance matrix to take a particularly simple and useful form.
The multivariate normal distribution (or "multinormal distribution") plays a central role in data modeling as real-life multivariate data is often at least approximately multinormally distributed.
Recall that the multinormal distribution is entirely determined by its mean vector and its covariance matrix. So it is equivalent to say :
* We develop a theory where data distribution is assumed to be multinormal.
* We develop a theory where we make no assumption about data distribution, but the theory is developed only up to the second order.
This is in particular the approach chosen by Discriminant Analysis.
We defined the covariance matrix of a multivariate distribution. But the same definition applies to a sample drawn from this distribution (just as in the univariate case). The terms "variance" and "covariance" just have to be replaced by "sample variance" and "sample covariance". The matrix thus obtained is then called the "sample covariance matrix" (or "empirical covariance matrix").
Let X be the data matrix of a centered sample of size n :

* The first draw from the multivariate distribution delivers the first realization of the random vector, whose coordinates make up the first row of X.
* The second draw from the multivariate distribution delivers the second realization of the random vector, whose coordinates make up the second row of X.
* ...
and there are n draws so we have a sample of size n from the multivariate distribution.
Then it is easily seen that the sample covariance matrix Σ is 1/n times the product of the transpose of X by X :

X'X = nΣ
The above illustration represents the most commonly encountered case where n > p (the number of observations is larger than the number of variables).
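A minimal sketch of this matrix product, assuming a hypothetical sample with n = 5 observations and p = 2 variables. The helpers `transpose` and `matmul` are illustrative names written for this example, not part of any library.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# Hypothetical raw sample: n = 5 draws (rows), p = 2 variables (columns).
raw = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n, p = len(raw), len(raw[0])

# Center the sample: subtract the column means to obtain the data matrix X.
mu = [sum(r[j] for r in raw) / n for j in range(p)]
X = [[r[j] - mu[j] for j in range(p)] for r in raw]

# X is n x p, so X'X is p x p; dividing by n gives the sample covariance matrix.
XtX = matmul(transpose(X), X)
sample_cov = [[v / n for v in row] for row in XtX]
print(sample_cov)
```

Note that the product is taken in the order X'X, which yields a p x p matrix; the product XX' would be n x n.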
The covariance matrix is not just a convenient way
of displaying numbers. As a matrix, it has several important properties
which derive from the fact that a covariance matrix is always positive
semidefinite. The converse is also true : any positive semidefinite matrix
is the covariance matrix of a random vector (in fact, of many).
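The sketch below illustrates both directions on hypothetical data. Part 1 checks that the quadratic form u'Σu equals the variance of the projection u'x, which is why it can never be negative. Part 2 builds a random vector whose covariance matrix is a chosen matrix, under the stronger assumption that this matrix is positive definite (so that a simple Cholesky factor exists). All names and numerical values are illustrative.

```python
import math

# Hypothetical toy sample of 5 observations of (X1, X2).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n = len(sample)
mx = sum(x for x, _ in sample) / n
my = sum(y for _, y in sample) / n
a = sum((x - mx) ** 2 for x, _ in sample) / n        # Var(X1)
c = sum((y - my) ** 2 for _, y in sample) / n        # Var(X2)
b = sum((x - mx) * (y - my) for x, y in sample) / n  # Cov(X1, X2)

def quadratic_form(u1, u2):
    """u' Sigma u for Sigma = [[a, b], [b, c]]."""
    return a * u1 * u1 + 2 * b * u1 * u2 + c * u2 * u2

def projected_variance(u1, u2):
    """Variance of the scalar u'x over the sample."""
    proj = [u1 * x + u2 * y for x, y in sample]
    m = sum(proj) / n
    return sum((v - m) ** 2 for v in proj) / n

# Part 1: for several directions u, compare u' Sigma u with Var(u'x).
checks = []
for deg in range(0, 180, 15):
    t = math.radians(deg)
    u1, u2 = math.cos(t), math.sin(t)
    checks.append((quadratic_form(u1, u2), projected_variance(u1, u2)))

# Part 2: build a vector x = L z whose covariance matrix is `target`.
target = [[4.0, 2.0], [2.0, 3.0]]   # assumed positive definite
l11 = math.sqrt(target[0][0])       # Cholesky factor L, lower triangular
l21 = target[1][0] / l11
l22 = math.sqrt(target[1][1] - l21 ** 2)

# z has zero mean and identity covariance: four equally likely points.
z_values = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
x_values = [(l11 * z1, l21 * z1 + l22 * z2) for z1, z2 in z_values]

# The mean of x is zero, so the covariance is the average of the products.
m = len(x_values)
built = [[sum(v[i] * v[j] for v in x_values) / m for j in range(2)]
         for i in range(2)]
print(built)
```

Many other choices of z with identity covariance would give the same covariance matrix, which is one way to see that the random vector is far from unique.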
In particular, the spectral decomposition of the covariance matrix of a random vector x shows that :
* There exists an orthonormal basis such that the covariance matrix Σ of x, expressed in this basis, is diagonal. The axes of this new basis are called the Principal Components of Σ (or of the distribution of x).
* As the off-diagonal elements of this new matrix are 0, the new variables defined by this new basis (the projections of x on the Principal Components) are uncorrelated.
* The diagonal elements of this new, diagonal covariance matrix are the eigenvalues of Σ. So the variances of the projections of x on the Principal Components are equal to the corresponding eigenvalues of Σ.
* If units are changed so that all Principal Components now carry the same variance, the distribution becomes spherically symmetrical. The distribution is then said to be "standardized".
Note that this is not true if a change of units is made so that the original axes all carry the same variance (for example, if the original variables are standardized). The resulting cloud, although it has the same variance on all original axes, is not standardized (the variables are still correlated). You may experiment with this idea in the interactive animation below.
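The points above can be sketched numerically in two dimensions, where the spectral decomposition of a symmetric 2 x 2 matrix has a closed form. The sample is hypothetical, the helper `cov2` is an illustrative name, and the 1/n convention is an assumption. The code rotates the sample onto its principal axes, rescales those axes to unit variance, and compares the result with a plain standardization of the original variables.

```python
import math

def cov2(points):
    """Return (var_x, cov_xy, var_y) of 2-D points, 1/n convention."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    return a, b, c

# Hypothetical toy sample of 5 observations of (x, y).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
a, b, c = cov2(sample)

# Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
half_gap = math.hypot((a - c) / 2, b)
lam1 = (a + c) / 2 + half_gap
lam2 = (a + c) / 2 - half_gap

# Angle of the first principal axis with the x axis.
theta = 0.5 * math.atan2(2 * b, a - c)
cos_t, sin_t = math.cos(theta), math.sin(theta)

# Coordinates of the sample in the principal-axes basis (x', y').
rotated = [(cos_t * x + sin_t * y, -sin_t * x + cos_t * y) for x, y in sample]
ra, rb, rc = cov2(rotated)      # expected: (lam1, 0, lam2) up to round-off

# Rescale each principal axis to unit variance (requires lam2 > 0).
whitened = [(xp / math.sqrt(lam1), yp / math.sqrt(lam2)) for xp, yp in rotated]
wa, wb, wc = cov2(whitened)     # expected: identity covariance

# Standardizing the ORIGINAL variables leaves them correlated.
standardized = [(x / math.sqrt(a), y / math.sqrt(c)) for x, y in sample]
sa, sb, sc = cov2(standardized)

print("rotated:     ", ra, rb, rc)
print("whitened:    ", wa, wb, wc)
print("standardized:", sa, sb, sc)
```

In the standardized sample both variances equal 1, but the covariance stays well away from 0, which is the distinction made in the note above.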
__________________________________________________________
These remarks are the starting point of Principal Components Analysis (PCA).
They are illustrated by the following interactive animation, and demonstrated in the Tutorial below.
This animation illustrates the concept of Covariance Matrix.
Upper frame
Many Data Modeling techniques are based on means and
covariance matrices only (the most visible one being
Discriminant Analysis). Textbooks will often state : "The distribution
is assumed to be multinormal". What this sentence actually means is
"The technique we now describe takes into account only the first and second
moments of the distributions, and ignores all higher-order moments".
Lower frame
The green (x', y') axes of the upper frame are rotated so that the x' axis is now in the familiar horizontal direction. The sample and the ellipse have been rotated along with the (x', y') reference frame. The axes of the ellipse are now horizontal and vertical, but it has exactly the same size and shape as the ellipse in the upper frame.
Recall that x' is the direction of the maximum elongation of the sample. So now the sample looks stretched out horizontally (but in fact, its shape is exactly the same as in the upper frame).
In a similar way, y' is the direction of minimum elongation. The sample looks "squashed" in the y' direction.
Covariance Matrix
To the right of the upper frame is the sample's Covariance Matrix.
Diagonal elements
They are the variances of the projections of the sample respectively on the (horizontal and vertical) x and y axes.
Off-diagonal elements
They are equal (the matrix is said to be "symmetrical"), and their common value is the covariance Cov(x, y) = Cov(y, x).
Diagonalized Covariance Matrix
To the right of the lower frame is the so-called "Diagonalized Covariance Matrix". It is the Covariance Matrix of the sample as shown in the lower frame.
Diagonal elements
They are the variances of the projections of the sample respectively on the (horizontal and vertical) x' and y' axes.
* The first value is the largest possible variance of the projection of a sample on any axis. Notice that it is larger than either variance as read in the upper Covariance Matrix. In the vocabulary of Linear Algebra (and of PCA), this value is the First (or Largest) Eigenvalue of the original Covariance Matrix.
The half-length of the long axis of the ellipse is
the square root of the first eigenvalue. It is denoted by a horizontal orange
segment.
* The second value is the smallest possible variance of the projection of a sample on any axis. Notice that it is smaller than either variance as read in the upper Covariance Matrix. It is the second eigenvalue of the original Covariance Matrix.
The half-length of the short axis of the ellipse
is the square root of the second eigenvalue. It is denoted by a vertical orange
segment.
* The sum of the two variances in the Covariance Matrix is equal (within round-off errors) to the sum of the variances in the Diagonalized Covariance Matrix. First, this is a theorem of Linear Algebra (the so-called "trace" of a square matrix is invariant under a change of orthonormal frame). Second, in the framework of PCA, this sum receives an interpretation which is independent of any reference frame.
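These three properties can be checked by brute force, as in the sketch below: the sample is projected on axes at many angles, and the largest and smallest projected variances are compared with the two eigenvalues. The sample, the number of angle steps, and all names are hypothetical choices made for this illustration.

```python
import math

# Hypothetical toy sample of 5 observations of (x, y).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n = len(sample)
mx = sum(x for x, _ in sample) / n
my = sum(y for _, y in sample) / n
centered = [(x - mx, y - my) for x, y in sample]

# Entries of the covariance matrix [[a, b], [b, c]] (1/n convention).
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_gap = math.hypot((a - c) / 2, b)
lam1 = (a + c) / 2 + half_gap
lam2 = (a + c) / 2 - half_gap

def projected_variance(angle):
    """Variance of the centered sample projected on the axis at `angle`."""
    u1, u2 = math.cos(angle), math.sin(angle)
    proj = [u1 * x + u2 * y for x, y in centered]
    return sum(v * v for v in proj) / n   # projections of centered data have mean 0

# Scan axis directions from 0 to 180 degrees in steps of 0.05 degree.
steps = 3600
variances = [projected_variance(math.pi * k / steps) for k in range(steps)]
largest, smallest = max(variances), min(variances)

trace_original = a + c          # sum of the variances on the x and y axes
trace_diagonal = lam1 + lam2    # sum of the variances on the x' and y' axes

# Half-lengths of the ellipse axes, as described above.
semi_axes = (math.sqrt(lam1), math.sqrt(lam2))

print(largest, lam1)
print(smallest, lam2)
print(trace_original, trace_diagonal)
```

The scan only approaches the extreme values from inside, so `largest` is at most the first eigenvalue and `smallest` is at least the second one.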
Off-diagonal elements
Both off-diagonal elements are zero (so the matrix is still symmetrical). This reads "x' and y' have zero covariance, they are uncorrelated". This can be demonstrated, but it is quite intuitive : in the (x', y') reference frame, as you move along the x' axis to the right, there is no systematic tendency for y' to go up or down. Discovering such a tendency was the very reason for inventing the concept of Covariance in the first place, so "no tendency" certainly leads one to expect "zero covariance".
Animation
In the upper frame, move red points about with the tip of your mouse, and observe changes of :
In the "general case", the sample has a somewhat elongated shape at an angle with x.
The two lines are close to each other, but are different. In older textbooks, the First Principal Component is sometimes called "the orthogonal regression line".
A word about the green axes. Whereas their directions are defined unambiguously, their + and - orientations are arbitrary. This animation chooses orientations such that :
* Increasing values of x' always go to the right.
* Increasing values of y' always go upward.
This causes abrupt changes in axes orientations when going through a vertical (x') or horizontal (y') position, with a corresponding discontinuous change in the displayed sample in the lower frame.
____________________________________________
Tutorial
This Tutorial addresses some of the basic properties of covariance matrices.
* We first show that a covariance matrix is positive semidefinite and that, conversely, any positive semidefinite matrix is the covariance matrix of a random vector (in fact, of infinitely many).
* When a covariance matrix is only positive semidefinite instead of being positive definite, we'll show that the distribution of the random vector is degenerate : it occupies only a subspace (whose dimension we'll calculate) of the complete space.
* The spectral decomposition of a covariance matrix is in fact its diagonalization. This will lead us to demonstrate the following well-known facts :
- The eigenvalues of a covariance matrix are equal to the variances of the projections of the random vector on the eigenvectors of this covariance matrix.
- The direction of the largest variance of a projection of x is that defined by the eigenvector associated with the largest eigenvalue of the covariance matrix.
- More generally, we'll show that if the eigenvectors are sorted by decreasing values of the eigenvalues, then the direction orthogonal to {u1, u2, ..., uk}, k < p, which maximizes the projected variance of x is u(k+1).
- The projections of x on the eigenvectors of the covariance matrix are uncorrelated random variables.
* We finally introduce the Mahalanobis transformation that leads to the notion of Mahalanobis distance, a random variable whose properties we'll describe, with a special emphasis on the case where x is a multivariate normal vector.
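As a small preview of the last point, the sketch below computes Mahalanobis distances in two dimensions. It assumes the usual definition, in which the squared distance of a point x to the mean µ is (x - µ)' Σ⁻¹ (x - µ), and it assumes that Σ is positive definite so that its inverse exists. The sample and the function name `mahalanobis` are hypothetical.

```python
import math

# Hypothetical toy sample of 5 observations of (x, y).
sample = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0), (10.0, 8.0)]
n = len(sample)
mx = sum(x for x, _ in sample) / n
my = sum(y for _, y in sample) / n

# Sample covariance matrix [[a, b], [b, c]] (1/n convention).
a = sum((x - mx) ** 2 for x, _ in sample) / n
b = sum((x - mx) * (y - my) for x, y in sample) / n
c = sum((y - my) ** 2 for _, y in sample) / n

# Inverse of a 2x2 matrix; det must be non-zero (non-degenerate distribution).
det = a * c - b * b
inv = [[c / det, -b / det], [-b / det, a / det]]

def mahalanobis(point):
    """Mahalanobis distance of `point` to the sample mean."""
    dx, dy = point[0] - mx, point[1] - my
    d2 = (dx * (inv[0][0] * dx + inv[0][1] * dy)
          + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(max(d2, 0.0))   # guard against tiny negative round-off

distances = [mahalanobis(pt) for pt in sample]

# With the 1/n convention, the squared distances average to p (here p = 2).
mean_squared = sum(d * d for d in distances) / n
print(distances)
print(mean_squared)
```

The average of the squared distances equals the number of variables because it is the trace of Σ⁻¹Σ, whatever the sample.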
COVARIANCE MATRIX
* A covariance matrix is positive semidefinite, and conversely
  - A covariance matrix is positive semidefinite
  - A positive semidefinite matrix is a covariance matrix
* Singular covariance matrices
  - Degenerate distribution
  - Dimension of the subspace of the distribution
* Diagonalization of a covariance matrix
  - Eigenvalues are projected variances
  - Eigenvectors are directions of largest projected variance
    - The First Principal Component
    - Other Principal Components
  - Projections on eigenvectors are uncorrelated random variables
    - Direct calculation
    - By diagonalization of the covariance matrix
* Mahalanobis distance
  - Standardization of a random vector
  - Mahalanobis transformation and Mahalanobis distance
    - The general case
    - The multivariate normal case, Chi-square distribution