
Mean Squared Error

Let :

• X be a random variable.
• c be a number.

Denote by ε the difference X - c :

ε  = X - c

ε is a random variable.

-----

There are two circumstances where ε may be regarded as an error.

# Mean Square Error of an estimator

c is the unknown value of a parameter of a distribution, and X is an estimator of this parameter. In this context, it is usual to denote the estimator by θ* (instead of X), and the value of the parameter by θ0 (instead of c). ε is then called the "estimation error" of θ*.
A good estimator is close to θ0 on average. Just how close is usually measured by the mean of the squared estimation error ε. This quantity is called the Mean Square Error (MSE) of the estimator θ* :

MSE = E[(θ* - θ0)²]

In the Tutorial, we show that :

 MSE = Var(θ*) + Bias(θ*)²

It is then clear that there is no reason why an unbiased estimator should always be preferred to a biased one with a smaller variance. For example, we show here that for normal distributions, the classic unbiased estimator of the variance has a larger MSE than its "uncorrected", and therefore biased, counterpart.

In practice, though, unbiased estimators are usually preferred because they are easier to identify, and have convenient mathematical properties.
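The comparison between the two variance estimators can be checked by simulation. Below is a minimal sketch, assuming NumPy; the sample size, true variance and number of trials are arbitrary choices for illustration :

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0            # true variance of the normal distribution (arbitrary)
n, trials = 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
# Unbiased estimator: divide by n - 1 (ddof=1);
# "uncorrected", biased estimator: divide by n (ddof=0).
s2_unbiased = samples.var(axis=1, ddof=1)
s2_biased = samples.var(axis=1, ddof=0)

mse_unbiased = np.mean((s2_unbiased - sigma2) ** 2)
mse_biased = np.mean((s2_biased - sigma2) ** 2)
print(mse_biased < mse_unbiased)   # the biased estimator has the smaller MSE
```

For normal samples the exact values are 2σ⁴/(n-1) for the unbiased estimator and (2n-1)σ⁴/n² for the uncorrected one, so the simulation merely illustrates a known inequality.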

-----

There are certain circumstances when the analyst will force a bias on an estimator that would otherwise be unbiased, for the purpose of reducing its MSE. A typical example of forced bias is Ridge Regression : the parameters of a Multiple Linear Regression model are natively unbiased estimators of the true values of the parameters, but their variance is very large when the predictors exhibit strong collinearity. Ridge regression forces a certain level of bias on the estimated parameters, with the effect of reducing their MSE. As a side benefit, the MSE of the model predictions is also reduced.
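This bias-for-variance trade can be illustrated with a small simulation; a sketch, assuming NumPy, with an arbitrary two-predictor design where the second predictor nearly duplicates the first, and an arbitrary ridge penalty λ = 10 :

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 1.0])
n, trials, lam = 50, 2000, 10.0

mse_ols = mse_ridge = 0.0
for _ in range(trials):
    # Two strongly collinear predictors.
    x1 = rng.normal(size=n)
    X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])
    y = X @ beta_true + rng.normal(size=n)

    XtX = X.T @ X
    b_ols = np.linalg.solve(XtX, X.T @ y)                      # unbiased, huge variance
    b_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)  # biased, small variance
    mse_ols += np.sum((b_ols - beta_true) ** 2)
    mse_ridge += np.sum((b_ridge - beta_true) ** 2)

print(mse_ridge < mse_ols)   # ridge's bias is outweighed by its variance reduction
```

The collinearity makes X'X nearly singular, so the OLS coefficients swing wildly from sample to sample; adding λI stabilizes the inversion at the cost of a small shrinkage bias.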

# Guessing the value of a r.v.

The remainder of this page is dedicated to the second circumstance where ε may be regarded as an "error".

## Single random variable

X is a r.v., and the problem is to guess (estimate) the value of a realization of X. This situation is, in a sense, dual to the previous one :

• Instead of estimating a fixed number θ0 by a r.v. θ*,
• We now estimate a r.v. X with a fixed number c.

Again, we'll measure the accuracy of the guess by the mean of the squared estimation error, a quantity that is also called the Mean Square Error (MSE) of the guess c :

MSE = E[(X - c)²]

The question is then to identify the value of c that will minimize the MSE. We show in the Tutorial (two demonstrations) that :

 c = E[X]

is the choice that minimizes the MSE.

In words : "If you're asked to guess the value of a realization of X, then your best guess (in the MSE sense of the word) is the expectation of X."

The MSE is then called the Minimum Mean Square Error (MMSE) and is clearly equal to the variance of X.
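Both claims can be observed numerically by scanning candidate values of c; a minimal sketch, assuming NumPy, with an arbitrarily chosen exponential distribution (mean 2) :

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)   # any distribution will do; E[X] = 2

# Scan candidate guesses c and measure the empirical MSE of each.
cs = np.linspace(0.0, 6.0, 601)
mse = np.array([np.mean((x - c) ** 2) for c in cs])
c_best = cs[np.argmin(mse)]

print(abs(c_best - x.mean()) < 0.05)       # the minimizer is (close to) E[X]
print(abs(mse.min() - x.var()) < 0.01)     # the MMSE is (close to) Var(X)
```

The grid search only approximates the minimizer to the grid spacing, but the empirical MSE curve is an exact parabola in c, with vertex at the sample mean.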

## Conditioning to a second r.v.

* Single shot

We now consider a slightly more complicated problem. We are still asked to guess the value of a realization of X, but some additional information is now available, namely the value y0 of a realization of a second r.v. Y.

If Y and X are independent, this information is clearly of no help. But if they are not independent, we'll be able to improve on our basic guess c = E[X] with a (probabilistically) more accurate guess c'.

We will show that :

• The best guess is now :

 c' = E[X | Y = y0]

• And that it does indeed improve on our previous guess, that is :

MMSE = MSE(c') ≤ MSE(c)

In words : "The best guess c' is the expectation of X conditionally to the observed value y0 of Y."

-----

Note that the "quality" of this estimation clearly depends on the choice of Y :

• As mentioned, if Y is independent of X, then no improvement of the estimation of X over E[X] is to be anticipated.
• Some coupling between X and Y will bring about some improvement in the estimation (blue segments in the illustration below).

Although this sounds like a result of Probability Theory, it is in fact a result in Geometry which states, in very loose terms, that the dispersion around the mean point of a "cut" parallel to the x axis is smaller than the dispersion around the mean point of the projection on the x axis.

• If the coupling between Y and X is strong (lower image in the illustration above), then c' = E[X | Y = y0] will be a much better estimate (lower MSE) than just c = E[X].
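The improvement brought by conditioning can be checked directly by sampling X conditionally on Y = y0; a sketch, assuming NumPy, with an arbitrary linear coupling X = aY + noise (so that E[X] = 0 and E[X | Y = y0] = a·y0) :

```python
import numpy as np

rng = np.random.default_rng(3)
a, y0, trials = 2.0, 1.5, 100_000   # arbitrary coupling strength and observed value

# Model: Y ~ N(0, 1), X = a*Y + N(0, 1).
# Conditionally on Y = y0, X is distributed as N(a*y0, 1).
x_given_y0 = a * y0 + rng.normal(size=trials)

c = 0.0            # unconditional best guess: E[X]
c_prime = a * y0   # conditional best guess:   E[X | Y = y0]

mse_c = np.mean((x_given_y0 - c) ** 2)
mse_c_prime = np.mean((x_given_y0 - c_prime) ** 2)
print(mse_c_prime < mse_c)   # conditioning on Y improves the guess
```

Here MSE(c) ≈ 1 + (a·y0)², while MSE(c') ≈ 1 : the stronger the coupling a, the larger the gain from conditioning.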

* Multiple shot

If we repeat the experiment over and over, then c' is not a number anymore : it becomes a random variable, which we denote by X *. Because each value y of Y determines c' uniquely, X * is a function of Y, call it X *(Y). We now want to find the X *(Y) such that :

MSE = E[(X - X *(Y))²]

is minimal.

The fairly intuitive result is :

 X *(Y) = E[X | Y ]

We give two demonstrations of this result in the Tutorial.

Note that we are now "estimating" a r.v. X by another r.v. X *.

# Properties of MMSE estimators

What are the properties of the r.v. X * ? In the Tutorial, we show that :

1) X * is an unbiased estimator of E[X]. In other words, X and X * have the same expectation :

E[X *] = E[X]

2) The expectation of the error ε = (X * - X)  is 0 :

E[ε] = 0

3) The estimator X * and the error ε are orthogonal :

E[X *.ε] = 0

This in turn implies that the estimator X * and the error ε are uncorrelated.

4) Decomposition of the variance of X :

The variance of X is the sum of the variance of the estimator X * and the variance of the error ε :

Var(X) = Var(X *)  + Var(ε)

Note that the variance of the MMSE estimator is never larger than the variance of the estimated r.v. : the estimator X * is always less dispersed than X itself.
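These four properties can be verified by simulation in a model where E[X | Y] is known in closed form; a sketch, assuming NumPy, with the arbitrary linear model X = 2Y + noise, for which X * = E[X | Y] = 2Y exactly :

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Model: Y ~ N(0, 1), X = 2*Y + N(0, 1), hence X* = E[X | Y] = 2*Y.
y = rng.normal(size=n)
x = 2.0 * y + rng.normal(size=n)
x_star = 2.0 * y
eps = x_star - x

print(np.isclose(x_star.mean(), x.mean(), atol=0.01))             # 1) same expectation
print(np.isclose(eps.mean(), 0.0, atol=0.01))                     # 2) E[eps] = 0
print(np.isclose(np.mean(x_star * eps), 0.0, atol=0.02))          # 3) orthogonality
print(np.isclose(x.var(), x_star.var() + eps.var(), atol=0.02))   # 4) variance decomposition
```

In this model Var(X) = 5, Var(X *) = 4 and Var(ε) = 1, so the variance decomposition can also be read off directly.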

-----

We encourage the reader to develop a geometric interpretation of these expressions along the lines of the geometric interpretation of Linear Regression. Yet, this interpretation is to be taken with a grain of salt :

• In Linear Regression, the error ε is unknown, but is a primary variable that pre-exists any attempt to build a model.
• While here, ε is a secondary, or derived, variable, which we create in the process of estimating X.

# Regression

Although these probabilistic considerations may seem quite remote from the concerns of the analyst practicing regression, they are in fact at the heart of the problem.

Regression addresses the problem of predicting a response variable Y given the value x of an independent variable X. Although only some realizations of the pair {X, Y} are known (the sample), one considers the joint probability density g(x, y) of the pair {X, Y}. The foregoing results tell us that the best prediction one could possibly hope for (in the MSE sense of the word) would be to identify the function :

f(x) = E[Y | X = x]

This expression defines the regression function of Y on X.

Doing regression amounts to building, from the sample (and, usually, some additional assumptions about the joint probability density g(x, y)), an "empirical regression function" that is as close as possible to f(x).

Note that the roles of X and Y are reversed with respect to the previous paragraphs.
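A crude empirical regression function can be built without any model assumptions, by averaging the responses within narrow bins of x; a sketch, assuming NumPy, with an arbitrary true regression function f(x) = sin(x) (here Y plays the role of the predicted variable, per the reversed convention noted above) :

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x) + 0.5 * rng.normal(size=n)   # true regression function: f(x) = sin(x)

# Empirical regression function: average of y within narrow bins of x.
edges = np.linspace(-2.0, 2.0, 41)
idx = np.digitize(x, edges) - 1
f_hat = np.array([y[idx == k].mean() for k in range(40)])
centers = 0.5 * (edges[:-1] + edges[1:])

print(np.max(np.abs(f_hat - np.sin(centers))) < 0.05)   # close to E[Y | X = x]
```

Each bin average approximates E[Y | X = x] at the bin center; practical regression methods replace this brute-force binning with smoother, more sample-efficient estimates.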

The MSE of a regression model is illustrated by an interactive animation that you'll find here.

____________________________________________________________

 Tutorial

* The first section of this Tutorial is a demonstration of the above-mentioned formula of the MSE of a parameter estimator.

* The second section gives two demonstrations of the fact that E [X] is the best estimate of X in the MSE sense.

* The third section addresses the same problem when additional information from another r.v. Y is available.

* The fourth section demonstrates the above-mentioned properties of Minimum Mean Square Error estimators.

MEAN SQUARE ERROR OF ESTIMATORS

• MSE of a parameter estimator
• Best estimate of a random variable X
  • By calculus
    • Finding the extremum
    • The extremum is a minimum
  • By properties of expectation
• Best estimate of X when a second r.v. Y is available
  • Single shot
  • Multiple shot
    • By calculus
    • By properties of expectation
• Properties of Minimum Mean Square Error (MMSE) estimators
  • Bias
  • Expectation of the error ε
  • Orthogonality of estimator and error
  • Estimator and error are uncorrelated
  • Variance of the MMSE estimator

_________________________________________________________

Related readings :

• Estimation
• Regression
• Simple Linear Regression
• Multiple Linear Regression
• Ridge Regression
• Bias-variance tradeoff
• Standard Error of an estimator