
Mean Squared Error

Let X be a random variable and c be a number.

Denote by ε the difference X - c :

ε  = X - c

ε is a random variable.

-----

There are two circumstances where ε may be regarded as an error.

Mean Square Error of an estimator

c is the unknown value of a parameter of a distribution, and X is an estimator of this parameter. In this context, it is usual to denote the estimator by θ* (instead of X), and the value of the parameter by θ0 (instead of c). ε is called the "estimation error" of θ*.
A good estimator is close to θ0 on average. Just how close is usually measured by the mean of the square of the estimation error ε. This quantity is called the Mean Square Error (MSE) of the estimator θ* :

MSE = E[(θ* - θ0)²]

In the Tutorial, we show that :
 

MSE = Var(θ*) + Bias(θ*)²
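The decomposition of the MSE into a variance term and a squared bias term can be sketched by adding and subtracting E[θ*] inside the square. The notation Bias(θ*) = E[θ*] - θ0 is assumed here; the page does not spell it out.

```latex
\begin{aligned}
\mathrm{MSE}
 &= E\big[(\theta^* - \theta_0)^2\big] \\
 &= E\big[\big((\theta^* - E[\theta^*]) + (E[\theta^*] - \theta_0)\big)^2\big] \\
 &= E\big[(\theta^* - E[\theta^*])^2\big]
    + 2\,(E[\theta^*] - \theta_0)\,E\big[\theta^* - E[\theta^*]\big]
    + (E[\theta^*] - \theta_0)^2 \\
 &= \mathrm{Var}(\theta^*) + \mathrm{Bias}(\theta^*)^2
\end{aligned}
```

The cross term vanishes because E[θ* - E[θ*]] = 0, and E[θ*] - θ0 is a constant that can be taken out of the expectation.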

 

 

It is then clear that there is no reason why an unbiased estimator should always be preferred to a biased one with a smaller variance. For example, for normal distributions, the classic unbiased estimator of the variance has a larger MSE than its "uncorrected", and therefore biased, counterpart.
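The comparison between the two variance estimators can be checked numerically. The following is a minimal sketch, not a definitive implementation: the sample size n = 10, the number of trials and the seed are arbitrary choices, and the function names are hypothetical. The closed-form values 2σ⁴/(n-1) and (2n-1)σ⁴/n² are the standard MSEs of the two estimators for normal samples.

```python
import random

def variance_estimates(sample):
    """Return (unbiased, uncorrected) variance estimates of a sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / (n - 1), ss / n

def simulated_mse(n, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo MSE of both estimators for N(0, sigma^2) samples of size n."""
    rng = random.Random(seed)
    true_var = sigma ** 2
    se_unbiased = 0.0
    se_uncorrected = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2_unbiased, s2_uncorrected = variance_estimates(sample)
        se_unbiased += (s2_unbiased - true_var) ** 2
        se_uncorrected += (s2_uncorrected - true_var) ** 2
    return se_unbiased / trials, se_uncorrected / trials

def theoretical_mse(n, sigma=1.0):
    """Closed-form MSEs for normal samples: 2*s^4/(n-1) and (2n-1)*s^4/n^2."""
    s4 = sigma ** 4
    return 2 * s4 / (n - 1), (2 * n - 1) * s4 / n ** 2

n = 10  # arbitrary sample size, for illustration only
sim = simulated_mse(n)
theory = theoretical_mse(n)
print("simulated   (unbiased, uncorrected):", sim)
print("theoretical (unbiased, uncorrected):", theory)
```

In both the simulated and the theoretical pair, the first value (unbiased estimator) should come out larger than the second (uncorrected estimator).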

 

In practice, though, unbiased estimators are usually preferred because they are easier to identify, and have convenient mathematical properties.

-----

There are circumstances in which the analyst will deliberately force a bias on an estimator that would otherwise be unbiased, for the purpose of reducing its MSE. A typical example of forced bias is Ridge Regression : the least-squares estimates of the parameters of a Multiple Linear Regression model are unbiased estimators of the true values of the parameters, but their variance is very large when the predictors exhibit strong collinearity. Ridge Regression forces a certain level of bias on the estimated parameters, with the effect, when the amount of bias is well chosen, of reducing their MSE. As a side benefit, the MSE of the model predictions is also reduced.
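The trade-off made by Ridge Regression can be illustrated with the standard closed-form expression of the MSE of the ridge estimator (X'X + λI)⁻¹X'y, written in the eigenvector basis of X'X. This is only a sketch under stated assumptions: uncorrelated noise of constant variance, strictly positive eigenvalues, and purely illustrative numbers (one large and one very small eigenvalue to mimic strong collinearity). The function name and its arguments are hypothetical.

```python
def ridge_total_mse(eigenvalues, alpha, sigma2, lam):
    """Total MSE of the ridge coefficient estimates, summed over components.

    eigenvalues: eigenvalues d_i > 0 of X'X (small values signal collinearity)
    alpha:       true coefficients expressed in the eigenvector basis of X'X
    sigma2:      variance of the regression noise
    lam:         ridge penalty (lam = 0 gives ordinary least squares)

    Returns (total MSE, variance part, squared-bias part).
    """
    variance = sum(sigma2 * d / (d + lam) ** 2 for d in eigenvalues)
    bias_sq = sum((lam * a / (d + lam)) ** 2 for d, a in zip(eigenvalues, alpha))
    return variance + bias_sq, variance, bias_sq

# Illustrative values only: the second eigenvalue is tiny (near-collinear predictors).
eigenvalues = [100.0, 0.1]
alpha = [1.0, 1.0]
sigma2 = 1.0

for lam in (0.0, 0.1, 1.0, 10.0):
    total, variance, bias_sq = ridge_total_mse(eigenvalues, alpha, sigma2, lam)
    print(f"lambda={lam:5.1f}  MSE={total:8.4f}  variance={variance:8.4f}  bias^2={bias_sq:8.4f}")
```

With these numbers, a moderate penalty trades a small squared bias for a large drop in variance, so the total MSE is lower than at λ = 0; a penalty that is too large lets the squared bias dominate.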

Guessing the value of a r.v.

The remainder of this page is dedicated to the second circumstance where ε may be regarded as an "error".

Single random variable

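X is a r.v., and the problem is to guess (estimate) the value of a realization of X. This situation is, in a sense, the dual of the previous one : there, a random variable (the estimator) was used to approximate a fixed number (the parameter), whereas here a fixed number c (the guess) is used to approximate a random variable.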

Again, we'll measure the accuracy of the guess by the mean of the squared estimation error, a quantity that is also called the Mean Square Error (MSE) of the guess c :

MSE = E[(X - c)²]

 

The question is then to identify the value of c that will minimize the MSE. We show in the Tutorial (two demonstrations) that :

 

c = E[X]


 

is the choice that minimizes the MSE.

In words : "If you're asked to guess the value of a realization of X, then your best guess (in the MSE sense of the word) is the expectation of X."

The MSE is then called the Minimum Mean Square Error (MMSE) and is clearly equal to the variance of X.
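This result can be checked exactly on a small example. The sketch below assumes, purely for illustration, that X is the outcome of a fair six-sided die; it scans a grid of candidate guesses and also verifies the identity E[(X - c)²] = Var(X) + (E[X] - c)², which is one way to see why c = E[X] is optimal. The helper names are hypothetical.

```python
from fractions import Fraction

def expectation(values, probs):
    """E[X] for a discrete r.v. taking values[i] with probability probs[i]."""
    return sum(p * x for x, p in zip(values, probs))

def mse_of_guess(values, probs, c):
    """E[(X - c)^2] for the same discrete r.v. and a guess c."""
    return sum(p * (x - c) ** 2 for x, p in zip(values, probs))

# Illustrative choice: X is the outcome of a fair die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

mean = expectation(values, probs)               # E[X]
variance = mse_of_guess(values, probs, mean)    # MSE at c = E[X], i.e. Var(X)

# Scan candidate guesses 0, 0.1, ..., 7 and keep the one with the smallest MSE.
grid = [Fraction(k, 10) for k in range(0, 71)]
best_c = min(grid, key=lambda c: mse_of_guess(values, probs, c))

print("E[X] =", mean, " best guess on the grid =", best_c, " MMSE =", variance)
```

Exact fractions are used instead of floats so that the comparison between the best grid point and E[X] is not blurred by rounding.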

Conditioning on a second r.v.

            * Single shot

                We now consider a slightly more complicated problem. We are still asked to guess the value of a realization of X, but some additional information is now available, namely the value y0 of a realization of a second r.v. Y.

If Y and X are independent, this information is clearly of no help. But if they are not independent, we'll be able to improve on our basic guess c = E[X] with a (probabilistically) more accurate guess c'.

We will show that :

 

c' = E[X | Y = y0]

 

MMSE = MSE(c') ≤ MSE(c)

 

In words : "The best guess c' is the expectation of X conditional on the observed value y0 of Y."
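A small discrete example may help to make the single-shot result concrete. The joint distribution below is hypothetical and chosen only for illustration; both MSEs are computed under the distribution of X conditional on the observed value Y = y0, which is the sense in which c' improves on c in this sketch.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), for illustration only.
joint = {
    (0, 0): F(3, 10), (1, 0): F(1, 10), (2, 0): F(1, 10),
    (0, 1): F(1, 10), (1, 1): F(1, 10), (2, 1): F(3, 10),
}

def conditional_pmf(joint, y0):
    """Distribution of X given Y = y0, as a dict {x: P(X = x | Y = y0)}."""
    p_y0 = sum(p for (x, y), p in joint.items() if y == y0)
    return {x: p / p_y0 for (x, y), p in joint.items() if y == y0}

def mean(pmf):
    """Expectation of a distribution given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def mse(pmf, c):
    """E[(X - c)^2] under the distribution pmf."""
    return sum(p * (x - c) ** 2 for x, p in pmf.items())

# Marginal distribution of X and the basic guess c = E[X].
marginal = {}
for (x, y), p in joint.items():
    marginal[x] = marginal.get(x, 0) + p
c = mean(marginal)

# Observed value y0 = 1: improved guess c' = E[X | Y = y0].
y0 = 1
given_y0 = conditional_pmf(joint, y0)
c_prime = mean(given_y0)

mse_c_prime = mse(given_y0, c_prime)   # conditional MSE of the improved guess c'
mse_c = mse(given_y0, c)               # conditional MSE of the basic guess c

print("c =", c, " c' =", c_prime)
print("MSE(c') =", mse_c_prime, " MSE(c) =", mse_c)
```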

-----

Note that the "quality" of this estimation clearly depends on the choice of Y.


Although this sounds like a result of Probability Theory, it is in fact a result in Geometry which states, in very loose terms, that the dispersion around the mean point of a "cut" parallel to the x axis is smaller than the dispersion around the mean point of the projection on the x axis.

 

 

 

            * Multiple shot

If we repeat the experiment over and over, then c' is no longer a number : it becomes a random variable, which we denote by X*. Because each y determines c' uniquely, X* is a function of Y, call it X*(Y). We now want to find the function X*(Y) such that :

MSE = E[(X - X*(Y))²]

is minimal.

The fairly intuitive result is :

X*(Y) = E[X | Y]

 

 

We give two demonstrations of this result in the Tutorial.

Note that we are now "estimating" a r.v. X by another r.v. X*.
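The multiple-shot result can be illustrated by brute force on a small discrete example: among all guessing rules that map each value of Y to a guess taken from a grid, the rule y → E[X | Y = y] should achieve the smallest overall MSE. The joint distribution and the grid are hypothetical choices made for this sketch (the grid is chosen so that it contains the conditional means).

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint distribution P(X = x, Y = y), for illustration only.
joint = {
    (0, 0): F(3, 10), (1, 0): F(1, 10), (2, 0): F(1, 10),
    (0, 1): F(1, 10), (1, 1): F(1, 10), (2, 1): F(3, 10),
}

def cond_mean(joint, y0):
    """E[X | Y = y0]."""
    p_y0 = sum(p for (x, y), p in joint.items() if y == y0)
    return sum(p * x for (x, y), p in joint.items() if y == y0) / p_y0

def overall_mse(joint, rule):
    """E[(X - rule[Y])^2], where rule maps each value of Y to a guess."""
    return sum(p * (x - rule[y]) ** 2 for (x, y), p in joint.items())

y_values = sorted({y for (_, y) in joint})

# Candidate optimum: X*(y) = E[X | Y = y].
x_star = {y: cond_mean(joint, y) for y in y_values}
mmse = overall_mse(joint, x_star)

# Brute-force search over all rules with guesses on the grid 0, 1/5, ..., 2.
grid = [F(k, 5) for k in range(0, 11)]
best_rule = min(
    (dict(zip(y_values, guesses)) for guesses in product(grid, repeat=len(y_values))),
    key=lambda rule: overall_mse(joint, rule),
)

# For comparison: the rule that ignores Y and always answers E[X] = 1.
mse_ignoring_y = overall_mse(joint, {y: F(1) for y in y_values})

print("X*(y) =", x_star, " MMSE =", mmse)
print("best rule on the grid =", best_rule)
print("MSE when Y is ignored =", mse_ignoring_y)
```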

Properties of MMSE estimators

What are the properties of the r.v. X* ? In the Tutorial, we show that :

    1) X* is an unbiased estimator of E[X]. In other words, X and X* have the same expectation :

E[X*] = E[X]

    2) The expectation of the error ε = (X* - X) is 0 :

E[ε] = 0

    3) The estimator X* and the error ε are orthogonal :

E[X*·ε] = 0

Since E[ε] = 0, this implies that the estimator X* and the error ε are uncorrelated.

    4) Decomposition of the variance of X :

The variance of X is the sum of the variance of the estimator X* and the variance of the error ε :

Var(X) = Var(X*) + Var(ε)

Note that the variance of the MMSE estimator is therefore never larger than the variance of the estimated r.v. : in this sense, the estimator X* is always optimistic.
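The four properties above can be verified exactly on a small discrete example. This is only a numerical check on one hypothetical joint distribution, not a proof; the helper names are assumptions made for the sketch.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), for illustration only.
joint = {
    (0, 0): F(3, 10), (1, 0): F(1, 10), (2, 0): F(1, 10),
    (0, 1): F(1, 10), (1, 1): F(1, 10), (2, 1): F(3, 10),
}

def expect(h):
    """E[h(X, Y)] under the joint distribution."""
    return sum(p * h(x, y) for (x, y), p in joint.items())

def var(h):
    """Var(h(X, Y)) under the joint distribution."""
    m = expect(h)
    return expect(lambda x, y: (h(x, y) - m) ** 2)

def cond_mean(y0):
    """E[X | Y = y0]."""
    p_y0 = sum(p for (x, y), p in joint.items() if y == y0)
    return sum(p * x for (x, y), p in joint.items() if y == y0) / p_y0

y_values = {y for (_, y) in joint}
x_star_table = {y: cond_mean(y) for y in y_values}

def x_val(x, y):
    """The r.v. X."""
    return x

def x_star(x, y):
    """The MMSE estimator X* = E[X | Y]."""
    return x_star_table[y]

def eps(x, y):
    """The error eps = X* - X."""
    return x_star(x, y) - x

checks = {
    "1) E[X*] = E[X]": expect(x_star) == expect(x_val),
    "2) E[eps] = 0": expect(eps) == 0,
    "3) E[X*.eps] = 0": expect(lambda x, y: x_star(x, y) * eps(x, y)) == 0,
    "4) Var(X) = Var(X*) + Var(eps)": var(x_val) == var(x_star) + var(eps),
}
for name, ok in checks.items():
    print(name, "->", ok)
```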

-----

We encourage the reader to develop a geometric interpretation of these expressions along the lines of the geometric interpretation of Linear Regression. Yet, this interpretation is to be taken with a grain of salt.

Regression

Although these probabilistic considerations may seem quite remote from the concerns of the analyst practicing regression, they are in fact at the heart of the problem.

Regression addresses the problem of predicting a response variable Y given the value x of an independent variable X. Although only some realizations of the pair {X, Y} are known (the sample), one considers the joint probability density g(x, y) of the pair {X, Y}. The foregoing results tell us that the best prediction one could possibly hope for (in the MSE sense of the word) would be obtained by identifying the function :

f(x) = E[Y | X = x]

 

This expression defines the regression function of Y on X.

Doing regression means using the sample (and, usually, some additional assumptions about the joint probability density g(x, y)) to build an "empirical regression function" that is as close as possible to f.
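In the simplest possible setting, an empirical regression function can be built without any model : when X takes only a few distinct values and each value is observed several times, the mean of the observed y for each x is a natural estimate of E[Y | X = x]. The sketch below assumes exactly that setting; the data and the function names are made up for illustration, and real regression problems usually require additional assumptions (linearity, for instance).

```python
from collections import defaultdict

def empirical_regression(sample):
    """Estimate f(x) = E[Y | X = x] by the mean of the observed y for each x."""
    groups = defaultdict(list)
    for x, y in sample:
        groups[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}

def sample_mse(sample, predict):
    """Average squared prediction error of the rule `predict` on the sample."""
    return sum((y - predict(x)) ** 2 for x, y in sample) / len(sample)

# Made-up sample of (x, y) pairs, for illustration only.
sample = [(1, 2.0), (1, 4.0), (2, 3.0), (2, 5.0), (3, 7.0), (3, 9.0)]

f_hat = empirical_regression(sample)
overall_mean = sum(y for _, y in sample) / len(sample)

mse_regression = sample_mse(sample, lambda x: f_hat[x])      # uses the value of x
mse_constant = sample_mse(sample, lambda x: overall_mean)    # ignores x

print("empirical regression function:", f_hat)
print("MSE using x:", mse_regression, " MSE ignoring x:", mse_constant)
```

On the sample itself, the group means give the smallest squared error among all prediction rules that depend on x only, which mirrors the optimality of E[Y | X = x] at the level of the distribution.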


Note that the roles of X and Y are reversed with respect to the previous paragraphs.

 

The MSE of a regression model is illustrated by an interactive animation that you'll find here .

____________________________________________________________

 

Tutorial

 

    * The first section of this Tutorial is a demonstration of the above-mentioned formula for the MSE of a parameter estimator.

    * The second section gives two demonstrations of the fact that E[X] is the best estimate of X in the MSE sense.

    * The third section addresses the same problem when additional information from another r.v. Y is available.

    * The fourth section demonstrates the above-mentioned properties of Minimum Mean Square Error estimators.

 

 

MEAN SQUARE ERROR OF ESTIMATORS

MSE of a parameter estimator

Best estimate of a random variable X

By calculus

Finding the extremum

The extremum is a minimum

 By properties of expectation

Best estimate of X when a second r.v. Y is available

Single shot

Multiple shot

By calculus

By properties of expectation

Properties of Minimum Mean Square Error (MMSE) estimators

Bias

Expectation of the error ε

Orthogonality of estimator and error

Estimator and error are uncorrelated

Variance of the MMSE estimator


 

 _________________________________________________________

 

Related readings :

Estimation

Regression

Simple Linear Regression

Multiple Linear Regression

Ridge Regression

Bias-variance tradeoff

Standard Error of an estimator
