Variance

Let X be a random variable.

# Definition of the variance

## Rationale

The expectation of X, E[X] = µ, tells us where the center of the distribution of X is located, but it tells us nothing about the dispersion, or spread, of this distribution around its mean. In the illustration below, the red and the green distributions have the same mean, but very different spreads.

A natural idea for quantifying the spread of the possible values of X is to measure how far, on average, X is from its mean. One could therefore consider the expectation of the distance from X to its mean :

E[|X - µ|]

where |...| is the absolute value. But the absolute value turns out to be mathematically inconvenient : it is not differentiable at 0, which complicates calculations.

## Definition of the variance

So instead, we'll consider the expectation of the squared distance from X to its mean µ. The variance Var(X) of a random variable X is defined as :

 Var(X) = E[(X - µ)²]

This definition has no magic virtue other than its mathematical convenience.

-----

* If the variable is continuous with probability density function p(x) :

 Var(X) = ∫ (x - µ)² p(x) dx

where the integration takes place over the range of X.

Note that this is not a definition, but a consequence of the "Law of the Unconscious Statistician".

* If the variable is discrete :

Var(X) = Σi (xi - µ)².P{X = xi}

where P{X = xi} is the probability for X to take the value xi.
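As a small illustration of the discrete formula, the sketch below computes the variance of a fair six-sided die with exact fractions. The die example and the function name `discrete_variance` are assumptions made here for illustration; they do not come from the text.

```python
from fractions import Fraction

def discrete_variance(pmf):
    """Variance of a discrete variable given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())                 # E[X]
    return sum((x - mu) ** 2 * p for x, p in pmf.items())   # sum of (xi - mu)^2 . P{X = xi}

# Fair six-sided die: each face has probability 1/6 (illustrative assumption).
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_variance = discrete_variance(die)
print(die_variance)  # 35/12
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared with a hand calculation.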

-----

Just as the expectation of a random variable is usually denoted µ, its variance is usually denoted σ².

## Alternative form of the variance

Throughout this Glossary, we'll often use the alternative form of the variance :

 Var(X) = E[X²] - (E[X])²

that we establish in the Tutorial.
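The alternative form follows from expanding (X - µ)² and using the linearity of expectation. As a quick numerical check (not a proof), the sketch below evaluates both forms on a small made-up distribution; the distribution and variable names are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical distribution: P{X = 0} = 1/2, P{X = 1} = 1/3, P{X = 4} = 1/6.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

mean = sum(x * p for x, p in pmf.items())           # E[X]
mean_sq = sum(x * x * p for x, p in pmf.items())    # E[X^2]

by_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
by_alternative = mean_sq - mean ** 2                              # E[X^2] - (E[X])^2

print(by_definition, by_alternative)  # 2 2
```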

## Caveat

Not all random variables have a variance. For example, a Cauchy variable has no expectation, and a fortiori no variance. More generally, a variance fails to exist when the integral that defines it takes an infinite value : the tails of the probability density p(x) are then too heavy, thus giving too much importance to the (x - µ)² terms for values of X that are far from the mean µ.
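To make the heavy-tail argument concrete, the sketch below evaluates the integral of x².p(x) for a standard Cauchy density over a truncated range [-t, t]. Since a Cauchy variable has no mean, the squared distance is measured from the center of symmetry 0, which is an assumption made here for illustration. The closed form used in the code comes from the antiderivative x - arctan(x) of x²/(1 + x²).

```python
import math

def truncated_second_moment(t):
    """Integral of x^2 * p(x) over [-t, t] for the standard Cauchy density
    p(x) = 1 / (pi * (1 + x^2)). Closed form: (2/pi) * (t - atan(t))."""
    return (2.0 / math.pi) * (t - math.atan(t))

# The truncated integral keeps growing (roughly like 2t/pi) instead of settling.
for t in (10, 100, 1000, 10000):
    print(t, round(truncated_second_moment(t), 2))
```

Because the value grows without bound as the range widens, the full integral cannot be finite.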

# Calculating a variance

The above two expressions of the variance are very simple. Yet, they often lead to complicated or even intractable calculations. Recall that the moment generating function is a very powerful means of calculating the moments (and in particular, the variance) of a probability distribution. We'll use it very often throughout this Glossary, in particular for calculating the variance of the classical probability distributions.

In addition, we'll describe a third expression for the variance known as the "law of conditional variance", and we'll give an example of its usefulness for calculating a variance.

# Standard deviation

A shortcoming of the variance is that it is expressed in the squared units of X. Hence, if X is the height of an individual in a human population, expressed in centimeters, the variance of X is expressed in centimeters squared.

In order to have a measure of spread that comes in the same units as the variable itself, one considers the square root of the variance, which is called the standard deviation :

Standard Deviation(X) = [Var(X)]^(1/2)

The standard deviation is usually denoted σ.
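The difference in units can be seen in a short sketch. It treats five made-up heights as an entire population in which each value is equally likely (an assumption for illustration) and uses the `pvariance` and `pstdev` functions of Python's standard `statistics` module.

```python
import statistics

heights_cm = [160, 165, 170, 175, 180]  # made-up heights, in centimeters

variance_cm2 = statistics.pvariance(heights_cm)  # expressed in cm^2
std_dev_cm = statistics.pstdev(heights_cm)       # expressed in cm, like the data

print(variance_cm2, std_dev_cm)
```

The standard deviation (about 7.07 cm) can be compared directly with the heights themselves, whereas the variance (50 cm²) cannot.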

# Basic properties of the variance

## Linear transformation of the variable

For any pair of numbers a and b we have :

 Var(aX + b) = a²Var(X)

In particular, note that :

* A translation (a = 1) does not change the variance.

* The variance of a constant "random variable" (a = 0) is 0.
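The property and its two special cases can be checked on a small example. The distribution below and the helper names `variance` and `linear_transform` are hypothetical, introduced only for this sketch.

```python
from fractions import Fraction

def variance(pmf):
    """Variance of a discrete variable given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def linear_transform(pmf, a, b):
    """Distribution of aX + b, merging values that coincide (e.g. when a = 0)."""
    out = {}
    for x, p in pmf.items():
        y = a * x + b
        out[y] = out.get(y, 0) + p
    return out

# Hypothetical distribution of X; its variance is 2.
pmf_x = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

var_x = variance(pmf_x)
var_scaled = variance(linear_transform(pmf_x, 3, 7))       # a = 3: variance times 9
var_shifted = variance(linear_transform(pmf_x, 1, -5))     # a = 1: pure translation
var_constant = variance(linear_transform(pmf_x, 0, 7))     # a = 0: constant variable

print(var_x, var_scaled, var_shifted, var_constant)  # 2 18 2 0
```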

## Variance of the sum of two variables

Let X and Y be two random variables, both having a variance. Then the variance of their sum X + Y is given by :

 Var(X + Y) = Var(X) + Var(Y) + 2.Cov(X, Y)

where Cov(X, Y) is the covariance of X and Y.

We establish in the Tutorial below the slightly more general result about the variance of a linear combination of random variables, of which the above result is a special case.
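As a numerical check of the formula for Var(X + Y), the sketch below uses a small made-up joint distribution in which X and Y are dependent, so the covariance term does not vanish. The distribution and the helper `expect` are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Variance of the sum, computed directly from the distribution of X + Y.
mean_sum = mean_x + mean_y
var_sum = expect(lambda x, y: (x + y - mean_sum) ** 2)

print(var_sum, var_x + var_y + 2 * cov_xy)  # 7/2 7/2
```

Dropping the covariance term would give 15/8 here, which shows that Var(X) + Var(Y) alone is not enough when the variables are correlated.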

# Conditional variance

Let X and Y be two random variables. We then consider the variance of X conditionally to Y = y0 :

 Var(X | Y = y0) = E[(X - E[X | Y = y0])² | Y = y0] = E[X² | Y = y0] - (E[X | Y = y0])²

It is a number, not a random variable. Note the presence of E[X | Y = y0], the conditional expectation of X.

In loose terms, this quantity is the variance of X when only those draws of (X, Y) that yield Y = y0 are retained, all the other draws being ignored. For example, if X is "Height" and Y is "Weight" for a certain population, one would consider the variance of the height of people in the subpopulation with a given weight y0.

-----

This illustration shows the joint probability distribution of two random variables X and Y. For a given value y0 of Y, the horizontal cut through this density defines a curve (lower image of the illustration).

Once properly normalized, this curve is the probability density of X given Y = y0 (conditional probability distribution).

The conditional variance of X given Y = y0 is the variance of this probability density.

-----

By definition, the conditional variance of X with respect to Y is :

Var(X | Y) = E[(X - E[X | Y])² | Y]

It is a random variable.

We'll prove the important Theorem of Conditional Variance :

 Var(X) = E[Var(X |Y )] + Var(E[X |Y])

It is useful for calculating ("total" or "marginal") variances in some difficult cases. We'll thus easily calculate the variance of the length of the second break in the "broken stick" problem, which would be rather difficult to calculate directly.

It is also an essential ingredient of the Rao-Blackwell theorem, that shows how to reduce the variance of an unbiased estimator.
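The Theorem of Conditional Variance can be checked numerically on a small discrete example. The joint distribution below and the helper names `cond_mean` and `cond_var` are hypothetical; the sketch only verifies the identity on one case and is not a proof.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 8),
    (2, 0): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (5, 1): Fraction(1, 4),
}

# Marginal distribution of Y.
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_y[y] += p

def cond_mean(y0):
    """E[X | Y = y0]: a number for each value y0."""
    return sum(x * p for (x, y), p in joint.items() if y == y0) / p_y[y0]

def cond_var(y0):
    """Var(X | Y = y0): a number for each value y0."""
    m = cond_mean(y0)
    return sum((x - m) ** 2 * p for (x, y), p in joint.items() if y == y0) / p_y[y0]

mean_x = sum(x * p for (x, y), p in joint.items())
var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint.items())

# E[Var(X | Y)] and Var(E[X | Y]); the latter uses E[E[X | Y]] = E[X].
expected_cond_var = sum(cond_var(y) * p for y, p in p_y.items())
var_of_cond_mean = sum((cond_mean(y) - mean_x) ** 2 * p for y, p in p_y.items())

print(var_x, expected_cond_var + var_of_cond_mean)  # 47/16 47/16
```

The functions `cond_mean` and `cond_var` return numbers for a given y0; evaluating them at the random value of Y is what makes E[X | Y] and Var(X | Y) random variables.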

-----

We'll also give a geometric interpretation of the Theorem of Conditional Variance.

# Estimation of a variance

## The sample variance

Let {x1, x2, ..., xn} be a set of n observations drawn from a probability distribution. Following the same line of reasoning as for a probability distribution, we'll define the sample variance "s²" as the average squared difference between the observations and the sample mean x̄ :

s² = (1/n).Σi (xi - x̄)²                           i = 1, 2, ..., n

We leave it as an exercise to show that s² can also be written :

s² = (1/n).(Σi xi²) - x̄²

-----

For reasons that will appear shortly, n is usually replaced by (n - 1). This change yields the alternative definition of the sample variance S² :

S² = (1/(n - 1)).Σi (xi - x̄)²                           i = 1, 2, ..., n

Although there isn't much difference between the two when n is large, the difference is appreciable for small values of n. In any case, whenever the expression "sample variance" is encountered, it is a good idea to check which one of the two definitions is used.
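Software libraries face the same ambiguity. As an illustration, Python's standard `statistics` module provides both versions: `pvariance` divides by n and `variance` divides by (n - 1). The data below are made up for the example.

```python
import statistics

observations = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, n = 8
n = len(observations)

s2 = statistics.pvariance(observations)  # divisor n       (s² in the text)
S2 = statistics.variance(observations)   # divisor n - 1   (S² in the text)

print(s2, S2)
```

With n = 8 the two values already differ by a factor of 8/7, which is why it pays to check which definition a given tool uses.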

## Unbiased estimation of a variance

The change of n to (n - 1) has to do with the question of estimating the variance of an unknown probability distribution, using the sample variance as an estimator. We'll show that :

* The sample variance s² is a biased estimator of the distribution variance σ².

* Whereas the "corrected" sample variance S² is an unbiased estimator of the distribution variance σ² :

 E[(1/(n - 1)).Σi (xi - x̄)²] = σ²

The need to replace n by (n - 1) originates from the fact that the mean µ of the distribution is assumed to be unknown, and therefore has to be replaced by its estimate x̄. If the mean µ were known, then :

s² = (1/n).Σi (xi - µ)²

would be an unbiased estimator of the variance of the distribution.
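The bias can be exhibited exactly on a tiny example by enumerating every possible sample. The sketch below assumes a fair coin coded as 0/1 (so µ = 1/2 and σ² = 1/4) and samples of size n = 3; these choices and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

values = (0, 1)          # fair coin coded as 0/1: mu = 1/2, sigma^2 = 1/4
mu = Fraction(1, 2)
n = 3

def average_over_samples(estimator):
    """Exact expectation of an estimator over all equally likely samples of size n."""
    samples = list(product(values, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)

def s2(sample):
    """Divisor n, deviations taken from the sample mean."""
    m = Fraction(sum(sample), n)
    return sum((x - m) ** 2 for x in sample) / n

def S2(sample):
    """Divisor n - 1, deviations taken from the sample mean."""
    m = Fraction(sum(sample), n)
    return sum((x - m) ** 2 for x in sample) / (n - 1)

def s2_known_mean(sample):
    """Divisor n, deviations taken from the true mean mu."""
    return sum((x - mu) ** 2 for x in sample) / n

expected_s2 = average_over_samples(s2)
expected_S2 = average_over_samples(S2)
expected_known = average_over_samples(s2_known_mean)

print(expected_s2, expected_S2, expected_known)  # 1/6 1/4 1/4
```

The expectation of s² is (n - 1)/n times σ², here 1/6 instead of 1/4, while the other two estimators hit 1/4 exactly.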

Removing the bias of an estimator does not always improve this estimator. For example, we show in the Tutorial that, for a normal distribution, the unbiased estimator of the variance is a poorer estimator (in the sense of mean squared error) than its biased counterpart.
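One common way to compare two estimators is their mean squared error (MSE), which is the reading assumed in this sketch. For a normal sample of size n, standard results give an MSE of 2σ⁴/(n - 1) for the estimator with divisor (n - 1) and (2n - 1)σ⁴/n² for the estimator with divisor n. These closed forms are quoted here, not derived, and the function names are hypothetical.

```python
from fractions import Fraction

def mse_unbiased(n, sigma2=1):
    """MSE of the divisor-(n - 1) estimator for a normal sample: 2*sigma^4/(n - 1)."""
    return Fraction(2 * sigma2 ** 2, n - 1)

def mse_biased(n, sigma2=1):
    """MSE of the divisor-n estimator for a normal sample: (2n - 1)*sigma^4/n^2."""
    return Fraction((2 * n - 1) * sigma2 ** 2, n ** 2)

for n in (2, 5, 10, 100):
    print(n, mse_unbiased(n), mse_biased(n))
```

For n = 5 and σ² = 1 this gives 1/2 against 9/25: the biased estimator trades a small bias for a larger reduction in variance.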

-----

It is quite remarkable that an unbiased estimator of the variance of a distribution can be calculated without knowing anything about the distribution itself, not even its mean. The only other elementary setting when this occurs is with the distribution mean, which is unbiasedly estimated by the sample mean x̄.

## Convergence

We show in the Tutorial that the sample variance (corrected or not) is a convergent (consistent) estimator of the distribution variance. In other words, however small ε > 0, the probability for the value of s² (or S²) to be within ε of the value of the variance σ² of the distribution tends to 1 as the sample size grows without limit.
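Convergence can be observed, though not proved, by simulation. The sketch below draws normal samples of increasing size with a fixed seed and records how far S² falls from the true variance. The normal distribution, the value σ = 2 and the sample sizes are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is reproducible

def sample_variance_error(n, sigma=2.0):
    """|S^2 - sigma^2| for one normal sample of size n (true variance sigma^2)."""
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    return abs(statistics.variance(sample) - sigma ** 2)

errors = {n: sample_variance_error(n) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(n, err)
```

A single run only shows one realization per sample size; the statement in the text concerns the probability of a small error, which tends to 1.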

## Estimation of the variance of a finite population

The foregoing result applies to estimating the variance of an infinite population, or of a finite population when using the "sampling with replacement" scheme.

Things are quite different if the population is finite and sampling is done without replacement, a quite common situation. Estimating the variance of this finite population is then a somewhat more complex problem, which is addressed elsewhere in this Glossary.

# Sampling distribution of variances

Even when the distribution is completely determined, the distribution of the sample variance is usually intractable, with the notable exception of the normal distribution. The distribution of the sample variance is then closely related to the Chi-square distribution. In fact, the Chi-square distribution originated from attempts to calculate the distribution of the sample variance when the distribution is normal.
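For a normal sample of size n, the standard result is that (n - 1)S²/σ² follows a Chi-square distribution with (n - 1) degrees of freedom, whose mean is (n - 1) and whose variance is 2(n - 1). The simulation below is a rough check of these two moments; the sample size, the parameters of the normal distribution and the number of replications are assumptions for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is reproducible

n, sigma = 5, 3.0
replications = 20_000

scaled = []
for _ in range(replications):
    sample = [random.gauss(10.0, sigma) for _ in range(n)]
    # (n - 1) * S^2 / sigma^2, with S^2 the corrected sample variance
    scaled.append((n - 1) * statistics.variance(sample) / sigma ** 2)

mean_scaled = statistics.fmean(scaled)     # expected to be close to n - 1 = 4
var_scaled = statistics.variance(scaled)   # expected to be close to 2(n - 1) = 8

print(mean_scaled, var_scaled)
```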

# Tests on variances

Consequently, the classical tests on variances assume that all the distributions involved are normal.

Among the most used tests about variances, let us mention :

* Comparing a sample variance to a theoretical reference value. The test is based on the Chi-square distribution of the sample variance.

* Testing for the equality of variances:

- Comparing two variances : Fisher's F test.

- Comparing more than two variances (testing for homogeneity of variances) : Levene's test, Bartlett's test, Hartley's test.

These tests are important because many parametric tests (t tests, ANOVA) rely on the assumption that the normal distributions involved have identical variances.

# Generalization of the variance

The variance is defined for univariate distributions. Multivariate distributions also have (multidimensional) spreads around their means, and the straightforward multivariate generalisation of variance is the covariance matrix.

Yet, two scalar generalizations of variance are also used :

* The inertia, that is the sum of the eigenvalues of the covariance matrix (equivalently its trace, the sum of the variances of the individual coordinates).

* The so-called "generalized variance", which is the determinant of the covariance matrix (and therefore the product of the eigenvalues of the matrix).
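For a bivariate example these scalar summaries can be computed by hand. The sketch below builds the 2x2 covariance matrix of some made-up points, then checks that its trace equals the sum of its eigenvalues and that its determinant equals their product. The data and variable names are assumptions for illustration, and the divisor n is used for simplicity.

```python
import math

# Made-up bivariate observations (x, y).
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.0), (5.0, 8.0)]
n = len(points)

mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Entries of the 2x2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
var_y = sum((y - mean_y) ** 2 for _, y in points) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

trace = var_x + var_y                         # sum of the coordinate variances
determinant = var_x * var_y - cov_xy ** 2     # "generalized variance"

# Eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial.
half_gap = math.sqrt((trace / 2) ** 2 - determinant)
eigenvalues = (trace / 2 + half_gap, trace / 2 - half_gap)

print(trace, determinant, eigenvalues)
```

A determinant close to 0 signals that the points lie near a line, even when the trace is large.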

_______________________________

# Variance and Regression

Regression assumes that the data was generated by a probability distribution p(x, y) that is adequately described by a deterministic function f(x) corrupted by a random noise :

y = f(x) + ε(x)

where ε(x) is a random variable that depends on x. This randomness scatters the observed values of y around f(x) along the y axis.

## Estimation of the noise variance

A considerable challenge of regression is to estimate the variance of the noise ε(x). The only setting where this estimation can be completely carried out by theory is Linear Regression (Simple or Multiple), under some additional assumptions (in particular, that the variance of ε(x) does not, in fact, depend on x. This assumption is called "homoskedasticity").

The result is then that the variance of the noise can, indeed, be unbiasedly estimated by a quantity that is akin to the sample variance as described above, but whose derivation is substantially more difficult than that of the sample variance.
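For Simple Linear Regression, the usual unbiased estimate of the noise variance is the residual sum of squares divided by (n - 2), the two lost degrees of freedom corresponding to the estimated intercept and slope. The sketch below applies it to made-up data, assuming a straight-line model with homoskedastic noise; data and variable names are illustrative.

```python
# Made-up data, assumed to follow y = intercept + slope*x + noise
# with a noise variance that does not depend on x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares estimates of the two parameters.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
rss = sum(r ** 2 for r in residuals)

# Divisor n - 2: two parameters were estimated from the data.
noise_variance_estimate = rss / (n - 2)

print(slope, intercept, noise_variance_estimate)
```

The parallel with S² is direct: there, one estimated parameter (the mean) costs one degree of freedom; here, two estimated parameters cost two.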

## Estimation of the variance of the model parameters

The regression model is an equation that contains parameters, whose numerical values have been calculated from the data.  Because the model is built on a random sample, its parameters are random variables.

The values of these parameters are hoped not to depend too much on the particular sample at hand, that is, to have low variances. Again, only Linear Regression offers a complete theoretical framework for calculating the variance of the parameters.

## Estimation of the variance of the predictions

A regression model makes predictions about the value of y that would be observed for any arbitrary value of x. Because the model is built on a random sample, its predictions are random variables.

The analyst is particularly interested in the variance of these predictions : assuming that the predictions are unbiased, then only the predictions with a low variance can be considered useful.

Again, only Linear Regression offers a complete theoretical framework for calculating the variance of the model predictions.
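For Simple Linear Regression with homoskedastic, uncorrelated noise of variance σ², the textbook formulas are Var(slope) = σ²/Sxx and, for the fitted value at a point x0, σ².(1/n + (x0 - x̄)²/Sxx), where Sxx is the sum of squared deviations of the x values from their mean. The sketch below plugs the estimated noise variance into these formulas. The data and the function name `simple_linear_fit_variances` are hypothetical, and the second quantity is the variance of the estimated mean response, not of a new observation.

```python
def simple_linear_fit_variances(xs, ys, x0):
    """Estimated variances of the slope and of the fitted value at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    noise_var = rss / (n - 2)                 # estimated noise variance

    slope_var = noise_var / sxx
    fitted_var = noise_var * (1 / n + (x0 - mean_x) ** 2 / sxx)
    return slope_var, fitted_var

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope_var, var_center = simple_linear_fit_variances(xs, ys, x0=3.5)  # at the mean of x
_, var_far = simple_linear_fit_variances(xs, ys, x0=10.0)            # far from the data

print(slope_var, var_center, var_far)
```

The fitted value is most precise at the mean of the x values and becomes less precise as x0 moves away from the data.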

One of the most important things that the analyst should keep in mind is that the variance of the model predictions depends on the particular type of model chosen : in very loose terms, "big" models (i.e. with many parameters) exhibit larger prediction variances than smaller models.

But model predictions may also exhibit bias, and this bias decreases as the model gets bigger.

So the analyst is confronted with a dilemma : a certain level of bias has to be accepted in order to reduce the variance of the model predictions, and vice versa.

This important question is called the bias-variance tradeoff.

_______________________________________________

# Tutorial

We here demonstrate some basic results about the variance.

The importance of the Law of conditional variance is illustrated by the broken stick problem. We easily calculate the variance of the second break, which would be more difficult to calculate directly.

We calculated the expected length of this second break by using the Theorem of iterated expectation.
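The sketch below shows how the Law of conditional variance can be applied to one common version of the broken stick problem. The setup is an assumption, since the problem is not stated here: a stick of length 1 is broken at a point X uniform on (0, 1), then the piece of length X is broken at a point Y uniform on (0, X), and Y is taken as the second break. Given X, E[Y | X] = X/2 and Var(Y | X) = X²/12.

```python
from fractions import Fraction

def uniform_moment(k):
    """E[X^k] for X uniform on (0, 1)."""
    return Fraction(1, k + 1)

# Law of conditional variance: Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
expected_cond_var = uniform_moment(2) / 12                 # E[X^2 / 12]
var_x = uniform_moment(2) - uniform_moment(1) ** 2         # Var(X) = 1/12
var_of_cond_mean = var_x / 4                               # Var(X / 2)
var_y = expected_cond_var + var_of_cond_mean

# Direct check from E[Y^2] = E[X^2] / 3 and E[Y] = E[X] / 2.
var_y_direct = uniform_moment(2) / 3 - (uniform_moment(1) / 2) ** 2

print(var_y, var_y_direct)  # 7/144 7/144
```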

Basic properties of the variance

* Alternative expression of the variance

* Variance of the linear transform of a r.v.

* Variance of a linear combination of r.v.

- General case

- Special case : variance of the sum of r.v.

Estimating the variance of a distribution

* The "natural" estimator

* The unbiased estimator

* Is "unbiased" better ?

* Convergence of the sample variance

The Theorem of Conditional Variance

* Demonstration

* Geometric interpretation

* The "broken stick" problem

____________________________________________________

Related entries : Expectation, Conditional expectation, Covariance, Covariance matrix, Inertia
