Least Squares estimation

The two main parameter estimation techniques are :

  1. Maximizing the Likelihood of the sample,
  2. Minimizing the sum of the squares of the errors of the model predictions on the design set (or training set).

 

We address here the second approach, known as "Least Squares estimation". The term "estimation" refers to the fact that the calculated model parameters will turn out to be (under the proper assumptions) good estimates of the parameters of the mechanism that generated the sample (for more information on estimation, see the Estimation entry under Related readings).

 

Examples of Least Squares estimation

Estimating a mean

Let f(x) be any pdf with mean µ. A sample of n observations {x_1, x_2, ..., x_n} is drawn from this pdf. The sample mean:

$$m = \frac{1}{n} \sum_{i=1}^{n} x_i$$

is an unbiased estimator of the population mean µ.

The sample mean  m also has the property of minimizing the sum

$$S = \sum_{i=1}^{n} (x_i - y)^2$$

where y is an adjustable parameter. In other words, S is minimal for y = m (the value of S is then just n times the variance of the sample).

So, instead of defining our estimator as "The sample mean", we might just as well have defined it as "The quantity y that makes S minimal.".
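The equivalence between "the sample mean" and "the value of y that makes S minimal" can be checked numerically. The following is a minimal sketch on a made-up sample (the data values and the search grid are assumptions chosen for illustration, not taken from the text):

```python
# Hypothetical sample; any list of numbers would do.
x = [2.0, 3.5, 1.0, 4.5, 3.0]
n = len(x)

def S(y):
    """Sum of squared deviations of the sample from the adjustable value y."""
    return sum((xi - y) ** 2 for xi in x)

m = sum(x) / n                                   # sample mean
variance = sum((xi - m) ** 2 for xi in x) / n    # sample variance (divisor n)

# Scan a grid of candidate values of y and keep the one with the smallest S.
grid = [i / 1000 for i in range(0, 6001)]        # 0.000, 0.001, ..., 6.000
y_best = min(grid, key=S)

print(m, y_best)            # the grid minimizer coincides with the sample mean
print(S(m), n * variance)   # S at its minimum equals n times the variance
```

The grid search is deliberately naive: it only illustrates that no candidate value does better than the sample mean.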

 

This remark is the starting point of Least Squares estimation, a very general paradigm used for calculating the parameters of models that simultaneously estimate the means of several random variables linked by a special kind of relationship (see below).

Simple Linear Regression

A collection of paired observations {(x_i, y_i)} can be visualized as a scatter plot:

[Figure: scatter plot of the paired observations, with the points lying close to a straight line]

From a purely geometric (not probabilistic) standpoint, it is quite natural to consider that the straight line that makes the sum of the squared residuals (the vertical distances between the points and the line) minimal is a good embodiment of the fact that the points are closely distributed about a straight line.
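For a straight line, the minimizing intercept and slope are available in closed form. The sketch below uses made-up data points, and the names a0 (intercept) and a1 (slope) are chosen here for illustration; it also checks that nudging the fitted line in any direction increases the sum of squared residuals:

```python
# Hypothetical paired observations (x_i, y_i), roughly aligned on a line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form minimizer of the sum of squared residuals:
# slope = S_xy / S_xx, intercept = y_bar - slope * x_bar.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = s_xy / s_xx
a0 = y_bar - a1 * x_bar

def sum_sq_residuals(b0, b1):
    """Sum of squared vertical distances between the points and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

best = sum_sq_residuals(a0, a1)

# Perturb the intercept and the slope: every perturbed line does worse.
perturbed = [
    sum_sq_residuals(a0 + d0, a1 + d1)
    for d0 in (-0.1, 0.1)
    for d1 in (-0.05, 0.05)
]
print(a0, a1, best, min(perturbed))
```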
 

We now go one step further and assume that:

  1. The x_i are not random (their values are fixed),
  2. Each y_i is a random variable.

These two conditions accurately describe many realistic situations. For example, the pressure in a gas tank (the y_i) might be measured for various temperatures (the x_i). Because of measurement errors, each y_i should be considered as random. More specifically, we assume that each y_i is the sum of its mean m_i (the value that would be observed with no measurement error) and a random measurement error e_i:

$$y_i = m_i + e_i$$

The temperatures x_i are considered as almost perfectly reproducible from one series of measurements to another, and are therefore not random.

 

Because the y_i are random, one might expect that, in the particular sample at hand, some y_i are larger than their mean m_i, while others are smaller than their mean. One would also think that:


This line of reasoning leads us to wonder whether the straight line could be used for estimating the m_i, the true values of y (i.e., with no measurement error) for the set of values {x_1, x_2, ..., x_n} of x, according to the following scheme:
 



The answer to that question is "Under some special conditions, yes". In particular, these conditions demand that all the mean points (x_i, m_i) be on the same straight line. More specifically, under these conditions (that we will refer to as the Standard Statistical Model, or SSM), each m_i* is an unbiased estimate of the mean m_i. Moreover, this property extends to new measurements made for a new, as yet unused value of x, say x_{n+1}. The calculated value m_{n+1}* is an unbiased estimate of the true mean m_{n+1} of y for x = x_{n+1}.
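The unbiasedness claim can be illustrated by simulation. The sketch below assumes a specific mechanism (a true line with intercept 1 and slope 2, and normal measurement errors with standard deviation 0.5); these values, the design points and the new value x_new are all assumptions made for the example. It repeatedly draws a sample, fits the least squares line, and averages the fitted values over many replications:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" mechanism: the mean points (x_i, m_i) lie on one straight line.
a0_true, a1_true, sigma = 1.0, 2.0, 0.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # fixed, non-random design values
m = a0_true + a1_true * x                    # true means m_i
x_new = 6.0                                  # a value of x not used for fitting
m_new = a0_true + a1_true * x_new            # true mean of y at x_new

n_rep = 5000
fitted = np.empty((n_rep, x.size))
pred_new = np.empty(n_rep)
for r in range(n_rep):
    y = m + rng.normal(0.0, sigma, size=x.size)   # one noisy sample
    a1, a0 = np.polyfit(x, y, 1)                  # least squares line (slope first)
    fitted[r] = a0 + a1 * x                       # the estimates m_i*
    pred_new[r] = a0 + a1 * x_new                 # the estimate at the new point

print(fitted.mean(axis=0) - m)    # average errors of the m_i*, close to 0
print(pred_new.mean() - m_new)    # average error at x_new, close to 0
```

Individual fits miss the true means, sometimes from above and sometimes from below; it is their average over replications that settles on the true values.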

 

So it appears that Simple Linear Regression is a generalization of our first problem (estimating the mean of a distribution). Here :


Note that each y_i is by itself an unbiased estimate of the corresponding m_i. But the cooperation of the y_i in one single model makes the estimates y_i* much more accurate. This point is further developed in the tutorial on Simple Linear Regression.

Multiple Linear Regression

         The Simple Linear Regression model is :

$$y = a_0 + a_1 x + e$$

This model can be generalized to the case where the measurements y depend not just on one variable x, but on p variables x_i, i = 1, 2, ..., p. We then obtain the Multiple Linear Regression model:

$$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_p x_p + e$$

The residuals are defined as in the Simple Linear Regression case. The model is still fit to the data by minimizing the sum of the squared residuals, and its geometric representation is now a p-dimensional hyperplane in the (p+1)-dimensional space of the predictors and the response. Under the SSM, the model predictions are still unbiased estimates of the true values of the mean of y for any set of values of the predictors.

 

The values of the parameters can be expressed in a closed form, and their statistical properties are well understood, provided that the data satisfies the SSM.
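One common way of writing this closed form is through the normal equations, (D^T D) a = D^T y, where D is the matrix of predictor values with a leading column of ones for the intercept. The sketch below uses simulated data; the coefficient values, sample size and noise level are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 200, 3
X = rng.normal(size=(n, p))                     # hypothetical predictors x_1, ..., x_p
a_true = np.array([0.5, 1.0, -2.0, 3.0])        # assumed a_0, a_1, ..., a_p
e = rng.normal(0.0, 0.1, size=n)                # measurement errors
y = a_true[0] + X @ a_true[1:] + e

# Design matrix with a leading column of ones for the intercept a_0.
D = np.column_stack([np.ones(n), X])

# Closed form: solve the normal equations (D^T D) a = D^T y.
a_normal = np.linalg.solve(D.T @ D, D.T @ y)

# Same fit through a numerically more robust library routine.
a_lstsq, *_ = np.linalg.lstsq(D, y, rcond=None)

print(a_normal)
print(a_lstsq)
```

Solving the normal equations directly is fine for small, well-conditioned problems; `np.linalg.lstsq` is usually the safer choice when predictors are nearly collinear.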

"Least Squares" classification techniques

            Some classification techniques code the classes not as modalities of a categorical variable, but as numbers (called targets) according to various coding schemes. A (usually) linear  model is then built so as to predict these targets as accurately as possible. The most common way of setting the values of the coefficients of the model is again to require that the targets be predicted with the smallest possible squared error.
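As an illustration, the sketch below assumes one particular coding scheme (two classes coded as the targets -1 and +1) and a decision rule based on the sign of the predicted target. The data are simulated, and both the coding and the decision rule are assumptions made for the example rather than the only possible choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical, well-separated classes in the plane.
n = 100
class_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))
class_b = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n, 2))
X = np.vstack([class_a, class_b])
targets = np.concatenate([-np.ones(n), np.ones(n)])    # coding scheme: -1 / +1

# Linear model with intercept, fitted by minimizing the squared error on the targets.
D = np.column_stack([np.ones(2 * n), X])
coef, *_ = np.linalg.lstsq(D, targets, rcond=None)

# Assign each point to the class whose target is closest to the prediction.
predicted = np.where(D @ coef >= 0.0, 1.0, -1.0)
accuracy = np.mean(predicted == targets)
print(accuracy)
```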

Linear Least Squares estimation

What is actually meant by "linear" ?

            In the above examples, the models were linear in the variables, and, indeed, it is often believed that the term "linear" refers to linearity in the variables.

In fact, the term "linear" refers to linearity in the parameters of the model, not in the variables. Thus, a polynomial in x :

$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_p x^p$$

is to be considered a linear model, because it is linear in the parameters a0, a1, ..., ap.
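In practice, this means a polynomial can be fitted with exactly the same machinery as a multiple linear regression, by treating the powers of x as separate columns. A minimal sketch, assuming a quadratic with made-up coefficients and noise-free data:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 21)
a_true = np.array([1.0, -2.0, 0.5])                   # assumed a_0, a_1, a_2
y = a_true[0] + a_true[1] * x + a_true[2] * x**2      # noise-free, for clarity

# Columns 1, x, x^2: non linear in x, but y is linear in the parameters a_k.
D = np.vander(x, N=3, increasing=True)
a_hat, *_ = np.linalg.lstsq(D, y, rcond=None)

print(a_hat)   # recovers the assumed coefficients up to rounding
```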

The mechanics of Linear Least Squares estimation

         Least Squares estimation is particularly well suited to linear models because :


Other important quantities can also be expressed in a closed form, and more particularly :

 

and very instructive geometrical interpretations of these quantities can be given.

All these important results require only minimal assumptions about the underlying structure of the data, and make no assumption about the distribution of the measurement errors e.

 

If, in addition, the measurement errors e are assumed to be normal, then the distributions of the above quantities (and in particular, of the estimated parameters a_i) can be derived explicitly. The very important practical consequence is that interval estimates and tests can be built around the estimated values of the parameters. In particular, it becomes possible to test null hypotheses such as:

$$H_0 : a_i = 0$$

which expresses the fact that variable x_i has no influence on the response variable y, and therefore can (and should) be removed from the model. This kind of test is both common and important in Multiple Linear Regression.
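A usual form of this test compares each estimated coefficient to its standard error. Under normal errors and under H0, the ratio follows a Student's t distribution with n - k degrees of freedom, k being the number of estimated parameters. The sketch below uses simulated data in which the third predictor is constructed to have no influence; the data, the coefficient values and the hard-coded critical value (approximately the two-sided 5% point for 96 degrees of freedom) are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100
X = rng.normal(size=(n, 3))
# Assumed mechanism: the third predictor has no influence on y (coefficient 0).
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0.0, 1.0, size=n)

D = np.column_stack([np.ones(n), X])
k = D.shape[1]                                   # number of parameters (p + 1)
DtD_inv = np.linalg.inv(D.T @ D)
a_hat = DtD_inv @ D.T @ y                        # least squares estimates

residuals = y - D @ a_hat
s2 = residuals @ residuals / (n - k)             # unbiased estimate of the error variance
std_err = np.sqrt(s2 * np.diag(DtD_inv))         # standard errors of the estimates
t_stat = a_hat / std_err                         # Student's t with n - k d.o.f. under H0

t_crit = 1.985   # approximate two-sided 5% critical value for 96 d.o.f.
reject = np.abs(t_stat) > t_crit                 # True where H0: a_i = 0 is rejected

print(t_stat)
print(reject)
```

The predictors that truly influence y produce large t statistics and are kept. The one with a zero coefficient is rejected only by chance, at a rate of about 5% over repeated samples.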

Generalizations

Non linear Least Squares

In the foregoing discussion, the relationship between the response variable y and the parameters was assumed to be linear. We stated that the predictions of the linear model built by the Least Squares method were unbiased estimates of the mean of y for a given set of values {x} of the predictors. The model is therefore a genuine regression model.
 

Many regression problems are, by nature, not linear in the parameters. For example :

$$f(t) = A e^{-a t} + B e^{-b t}$$

$$f(t) = A \cos(\omega_1 t) + B \cos(\omega_2 t)$$

are definitely not linear in their parameters.


Note that it is sometimes possible to turn a non linear function into a linear one with an appropriate transformation, and then use Linear Least Squares for calculating the parameters. But the assumptions of the Standard Statistical Model will usually not be met by the transformed data. Consequently, the estimated parameters and predictions will be biased, and the usual tests on the estimated parameters will no longer be valid.

The least-squares principle may still be applied to calculate the values of the parameters of the model. But the partial derivatives of S (the sum of the squared residuals) with respect to the parameters yield a system of non linear equations that usually cannot be solved in a closed form. Minimizing S then becomes an optimization problem that can only be solved by iterative numerical procedures.
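One classical iterative procedure is the Gauss-Newton method, which repeatedly solves a linear least squares problem on a linearized version of the model. The sketch below is only an illustration under simplifying assumptions: it uses a single exponential term, f(t) = A.exp(-a.t), rather than the two-term models above, noise-free simulated data, a hand-picked starting point, and a basic step-halving safeguard. The parameter values are made up:

```python
import numpy as np

# Simplified, hypothetical model with one exponential term: f(t) = A * exp(-a * t).
t = np.linspace(0.0, 5.0, 30)
A_true, a_true = 2.0, 0.8
y = A_true * np.exp(-a_true * t)           # noise-free data, for clarity

def residuals(theta):
    A, a = theta
    return y - A * np.exp(-a * t)

def sum_sq(theta):
    r = residuals(theta)
    return r @ r

theta = np.array([1.0, 0.3])               # initial guess (an assumption)
S_start = sum_sq(theta)

for _ in range(100):
    A, a = theta
    ex = np.exp(-a * t)
    # Jacobian of the model with respect to (A, a).
    J = np.column_stack([ex, -A * t * ex])
    r = residuals(theta)
    # Gauss-Newton step: linear least squares on the linearized model.
    step, *_ = np.linalg.lstsq(J, r, rcond=None)
    # Halve the step until S does not increase (simple safeguard).
    S_now = sum_sq(theta)
    scale = 1.0
    while not (sum_sq(theta + scale * step) <= S_now) and scale > 1e-10:
        scale *= 0.5
    theta = theta + scale * step
    if np.linalg.norm(scale * step) < 1e-12:
        break                               # the step is negligible: stop iterating

S_end = sum_sq(theta)
print(theta, S_start, S_end)
```

Unlike the linear case, the result depends on the starting point, and a poor initial guess may lead to slow convergence or to a local minimum.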

 

The situation is the same when the analytical form underlying the data is unknown, or perhaps even nonexistent. One may then resort to ad hoc models that may be non linear in the parameters, like Neural Networks. "Training" a Neural Network in a regression application is just iteratively minimizing the sum of squared errors with an optimization algorithm.

Weighted Least Squares

The standard Linear Least Squares paradigm may be extended to accommodate situations where:

 

A minor modification of the Least Squares approach, called "Weighted Least Squares", allows solving these problems in a closed form.
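As a sketch of what such a closed form can look like, the example below assumes one typical situation: measurement errors whose variances differ from one observation to the next and are known. Each observation is then weighted by the inverse of its error variance, and the estimate becomes a = (D^T W D)^(-1) D^T W y with W the diagonal matrix of weights. The data, the true line and the error model are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200
x = np.linspace(1.0, 10.0, n)
sigma = 0.1 * x                                  # assumed known, unequal error std devs
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)       # assumed true line: a_0 = 1, a_1 = 2

D = np.column_stack([np.ones(n), x])
w = 1.0 / sigma**2                               # weights: inverse error variances

# Closed form: a = (D^T W D)^{-1} D^T W y, with W = diag(w).
DtW = D.T * w                                    # same as D.T @ np.diag(w)
a_wls = np.linalg.solve(DtW @ D, DtW @ y)

# Ordinary (unweighted) least squares, for comparison.
a_ols, *_ = np.linalg.lstsq(D, y, rcond=None)

print(a_wls)
print(a_ols)
```

Weighting is equivalent to dividing each row of D and each y_i by the corresponding standard deviation and then applying ordinary Least Squares to the rescaled data.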

______________________________________

 

Related readings

Estimation

Likelihood

Simple Linear Regression

Multiple Linear Regression
