Least Squares estimation

The two main parameter estimation techniques are:

1. Maximizing the Likelihood of the sample,
2. Minimizing the sum of the squares of the errors of the model predictions on the design set (or training set).

We address here the second approach, known as "Least Squares estimation". The term "estimation" refers to the fact that the calculated model parameters turn out to be, under the proper assumptions, good estimates of the parameters of the mechanism that generated the sample.

• We first give a few examples of Least Squares estimation.
• We then consider the special case of Linear Least Squares, which is extremely important in practical applications.
• We then touch upon generalizations of the Linear Least Squares paradigm.

# Examples of Least Squares estimation

## Estimating a mean

Let f(x) be any pdf with mean µ. A sample of n observations {x1, x2, ..., xn} is drawn from this pdf. The sample mean:

m = (1/n) Σi xi

is an unbiased estimator of the population mean µ.

The sample mean m also has the property of minimizing the sum

S = Σi (xi - y)²

where y is an adjustable parameter. In other words, S is minimal for y = m (the value of S is then just n times the variance of the sample).

So, instead of defining our estimator as "the sample mean", we might just as well have defined it as "the quantity y that makes S minimal".

This remark is the starting point of Least Squares estimation, a very general paradigm used to calculate the parameters of models that simultaneously estimate the means of several random variables linked by a special kind of relationship (see below).
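This equivalence can be checked numerically; here is a minimal sketch in pure Python, with an illustrative sample:

```python
# Quick numerical check that the sample mean m is the value of y
# minimizing S(y) = Σi (xi - y)², on an illustrative sample.
xs = [2.0, 3.5, 1.0, 4.5, 3.0]

def S(y):
    """Sum of squared deviations of the sample from an adjustable y."""
    return sum((x - y) ** 2 for x in xs)

m = sum(xs) / len(xs)          # sample mean, here 2.8

# Nudging y away from m in either direction can only increase S.
assert S(m) < S(m + 0.01) and S(m) < S(m - 0.01)
```

As stated above, S(m) is then just n times the sample variance.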

## Simple Linear Regression

A collection of paired observations {xi, yi} can be visualized as a scatter plot.

From a purely geometric (not probabilistic) standpoint, it is quite natural to consider that the straight line that makes the sum of the squared residuals minimal is a good embodiment of the fact that the points are closely distributed about a straight line.

We now go one step further and assume that:

• The xis are fixed, non random numbers,
• But the yis are the realizations of random variables Yi.

These two conditions accurately describe many realistic situations. For example, the pressure in a gas tank (the yis) might be measured for various temperatures (the xis). Because of measurement errors, each yi should be considered as random. More specifically, we assume that each yi is the sum:

• of the true (but unknown) value mi of y for x = xi,
• and of a random noise εi of mean 0, that causes the measurement error.

The temperatures xi are considered as almost perfectly reproducible from one series of measurements to another, and are therefore not random.

Because the yis are random, one might expect that, in the particular sample at hand, some yis are larger than their mean mi, while others are smaller. One would also think that:

• If yi is larger than the corresponding mi, then the point (xi, yi) has a fair chance to be above the line, while
• If yi is smaller than the corresponding mi, then the point (xi, yi) has a fair chance to be below the line.

This line of reasoning leads us to wonder whether the straight line could be used for estimating the mis, the true values of y (i.e., with no measurement error) for the set of values {x1, x2, ..., xn} of x.

The answer to that question is "Under some special conditions, yes". In particular, these conditions demand that all the mean points (xi, mi) be on the same straight line. More specifically, under these conditions (that we will refer to as the Standard Statistical Model, or SSM), each mi* is an unbiased estimate of the mean mi. Moreover, this property extends to new measurements made for a new, yet unused value of x, say xn+1. The calculated value m*n+1 is an unbiased estimate of the true mean mn+1 of y for x = xn+1.

So it appears that Simple Linear Regression is a generalization of our first problem (estimating the mean of a distribution). Here:

• Several means are estimated simultaneously (the mis). This is possible because the Yis are constrained by the SSM.
• The parameters of the model are calculated by imposing that the sum of the squared residuals be minimal.

Note that each yi is by itself an unbiased estimate of the corresponding mi. But the cooperation of the Yis in one single model makes the estimates mi* much more accurate. This point is further developed in the Tutorial on Simple Linear Regression.
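The closed-form Least Squares solution for the straight line can be sketched as follows (the data below are made up for illustration):

```python
# Sketch of Simple Linear Regression by Least Squares (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Closed-form estimates of the slope a1 and intercept a0 that
# minimize the sum of the squared residuals.
a1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
a0 = ybar - a1 * xbar

# The residuals of a Least Squares fit always sum to zero.
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
```

The fitted value a0 + a1·xi plays the role of mi* for each xi.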

## Multiple Linear Regression

The Simple Linear Regression model is:

y = a0 + a1x + ε

This model can be generalized to the case where the measurements y depend not just on one variable x, but on p variables xi, i = 1, 2, ..., p. We then obtain the Multiple Linear Regression model:

y = a0 + a1x1 + a2x2 + ... + apxp + ε

The residuals are defined as in the Simple Linear Regression case. The model is still fit to the data by minimizing the sum of the squared residuals, and its geometric representation is now a p-dimensional hyperplane in the (p + 1)-dimensional space of the predictors and the response. The model predictions are still unbiased estimates of the true values of the mean of y for any set of values of the predictors.

The values of the estimated parameters can be expressed in a closed form, and their statistical properties are well understood, provided that the data satisfy the SSM. In particular, these estimated parameters are unbiased, and their variances are the smallest in the family of linear unbiased estimators of the parameters (see the Gauss-Markov theorem).
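That closed form can be sketched via the normal equations; the data and the true coefficients below are made-up assumptions of the demo, with p = 2 predictors:

```python
import numpy as np

# Multiple Linear Regression by Least Squares via the normal equations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                    # 50 observations, p = 2
true = np.array([1.0, 2.0, -0.5])               # a0, a1, a2 (made up)
y = true[0] + X @ true[1:] + 0.1 * rng.normal(size=50)

# Design matrix with a leading column of ones for the intercept a0.
A = np.column_stack([np.ones(len(X)), X])

# Closed-form solution of min ||A a - y||^2: solve (A^T A) a = A^T y.
a_hat = np.linalg.solve(A.T @ A, A.T @ y)
```

With small noise, a_hat lands close to the true coefficient vector, as the unbiasedness result suggests.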

## "Least Squares" classification techniques

Some classification techniques code the classes not as modalities of a categorical variable, but as numbers (called targets) according to various coding schemes. A (usually) linear model is then built so as to predict these targets as accurately as possible. The most common way of setting the values of the coefficients of the model is again to require that the targets be predicted with the smallest possible squared error.
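A minimal sketch of such a scheme, on made-up separable data: two classes coded as targets -1 and +1, a linear model fit to the targets by Least Squares, and the sign of the prediction giving the class.

```python
import numpy as np

# "Least Squares" classification: code the two classes as targets -1/+1,
# fit a linear model to the targets by minimizing the squared error,
# then classify by the sign of the prediction.
rng = np.random.default_rng(3)
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, (n, 2)),   # class coded -1
               rng.normal(+2.0, 1.0, (n, 2))])  # class coded +1
targets = np.concatenate([-np.ones(n), np.ones(n)])

A = np.column_stack([np.ones(2 * n), X])        # intercept + 2 features
w, *_ = np.linalg.lstsq(A, targets, rcond=None)

predicted = np.sign(A @ w)
accuracy = np.mean(predicted == targets)
```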

# Linear Least Squares estimation

## What is actually meant by "linear"?

In the above examples, the models were linear in the variables, and, indeed, it is often believed that the term "linear" refers to linearity in the variables.

In fact, the term "linear" refers to linearity in the parameters of the model, not in the variables. Thus, a polynomial in x:

y = a0 + a1x + a2x^2 + ... + apx^p

is to be considered a linear model, because it is linear in the parameters a0, a1, ..., ap.
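Fitting such a polynomial is therefore an ordinary Linear Least Squares problem in the columns 1, x, x^2; a sketch on noiseless, made-up data:

```python
import numpy as np

# A polynomial in x is *linear in its parameters*, so it can be fit
# by ordinary Linear Least Squares on the columns 1, x, x^2.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 - 2.0 * x + 3.0 * x**2                  # exact quadratic, no noise

A = np.vander(x, 3, increasing=True)            # columns: 1, x, x^2
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # recovers a0, a1, a2
```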

## The mechanics of Linear Least Squares estimation

Least Squares estimation is particularly well suited to linear models because:

• the estimated parameters can then be expressed mathematically in a closed form,
• and turn out to be unbiased estimates of the true parameters.

Other important quantities can also be expressed in a closed form, in particular:

• The standard errors of the estimated parameters can be expressed as functions of the measurement errors ε.
• The calculated parameters are r.v. that turn out not to be independent, but their covariance matrix can be calculated explicitly.
• The residuals are also r.v., whose covariance matrix can also be calculated explicitly.
• The variance of the measurement errors can also be estimated in a closed form and with no bias.

and very instructive geometrical interpretations of these quantities can be given.

All these important results require only minimal assumptions about the underlying structure of the data, and make no assumption on the distribution of the measurement errors ε.

If, in addition, the measurement errors ε are assumed to be normal, then the distributions of the above quantities (and in particular, of the parameters ai) can be derived explicitly. The very important practical consequence is that interval estimation and tests can be designed around the estimated values of the parameters. In particular, it becomes possible to test null hypotheses such as:

H0 : ai = 0

which expresses the fact that the variable xi has no influence on the response variable y, and therefore can (and should) be removed from the model. This kind of test is both common and important in Multiple Linear Regression.
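A sketch of the corresponding test statistic, assuming the SSM with normal errors (the data are made up, and x2 is deliberately built with a true coefficient of 0). The t-statistic for H0: ai = 0 is the estimate divided by its standard error:

```python
import numpy as np

# t-statistics for H0: a_i = 0 in a linear model (illustrative data).
rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 2))
y = 0.5 + 1.5 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
a_hat = np.linalg.solve(A.T @ A, A.T @ y)
resid = y - A @ a_hat

p = A.shape[1]
sigma2_hat = resid @ resid / (n - p)        # unbiased estimate of the noise variance
cov = sigma2_hat * np.linalg.inv(A.T @ A)   # covariance matrix of the estimates
t_stats = a_hat / np.sqrt(np.diag(cov))     # compare to a Student t(n - p)
```

A large |t| for a coefficient is evidence against H0; a small one suggests the corresponding variable can be dropped.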

# Generalizations

## Non linear Least Squares

In the foregoing discussion, the relationship between the response variable y and the parameters was assumed to be linear. We stated that the predictions of the linear model built by the Least Squares method were unbiased estimates of the mean of y for a given set of values {x} of the predictors. The model is then a genuine regression model.

Many regression problems are, by nature, not linear in the parameters. For example:

• Decay curves like:

f(t) = A·e^(-αt) + B·e^(-βt)

• Quasi-periodic curves like:

f(t) = A·cos(ω1t) + B·cos(ω2t)

are definitely not linear in their parameters.

Note that it is sometimes possible to turn a non linear function into a linear one with an appropriate transformation, and then use Linear Least Squares for calculating the parameters. But the assumptions of the Standard Statistical Model will usually not be met by the transformed data. Consequently, the estimated parameters and predictions will be biased, and no test can be used on the estimated parameters.

The Least Squares principle may still be applied to calculate the values of the parameters of the model. But the partial derivatives of S (the sum of the squared residuals) with respect to the parameters yield a system of non linear equations that usually cannot be solved in a closed form. Minimizing S then becomes an optimization problem that can only be solved by iterative numerical procedures.

The situation is the same when the analytical form underlying the data is unknown, or perhaps even non existent. One may then resort to ad hoc models that may be non linear in the parameters, like Neural Networks. "Training" a Neural Network in a regression application is just iteratively minimizing the sum of squared errors with an optimization algorithm.
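A minimal sketch of such an iterative procedure: fitting a single-exponential decay by plain gradient descent on S (the data and the step size are made up; real solvers would use Gauss-Newton or Levenberg-Marquardt rather than plain gradient descent):

```python
import numpy as np

# Non linear Least Squares for f(t) = A * exp(-alpha * t), minimized
# by gradient descent on S = sum of squared residuals.
t = np.linspace(0.0, 4.0, 40)
y = 2.0 * np.exp(-0.7 * t)                 # noiseless data: true A=2, alpha=0.7

A, alpha = 1.0, 1.0                        # starting guesses
lr = 0.005                                 # step size (chosen by hand)
for _ in range(30000):
    pred = A * np.exp(-alpha * t)
    r = pred - y                           # residuals
    # Partial derivatives of S with respect to A and alpha.
    grad_A = 2.0 * np.sum(r * np.exp(-alpha * t))
    grad_alpha = 2.0 * np.sum(r * (-A * t) * np.exp(-alpha * t))
    A -= lr * grad_A
    alpha -= lr * grad_alpha
```

The iterations drive the parameters toward the minimizer of S, but unlike the linear case there is no closed-form solution.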

## Generalized Least Squares

The standard Linear Least Squares paradigm may be extended to accommodate situations where:

• Different weights are assigned to different observations to account for their relative importance.
• The measurement errors on the yi are correlated.
• The noise variance is not the same throughout the predictor space (heteroskedasticity).
• The mean of a population has to be estimated from several samples with different sizes.

The model parameters are then calculated by the method of Generalized Least Squares (GLS).
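A sketch of the simplest such case, weighted least squares with a diagonal weight matrix on made-up heteroskedastic data (the noise pattern and coefficients are assumptions of the demo):

```python
import numpy as np

# Generalized (here: weighted) Least Squares — observations with a
# smaller noise variance receive a larger weight.
rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0.0, 10.0, n)
sigma = np.where(x > 5.0, 3.0, 0.5)        # heteroskedastic noise (made up)
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

A = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma**2)                # inverse-variance weights

# GLS normal equations: solve (A^T W A) a = A^T W y.
a_gls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
```

With a full (non-diagonal) W, the same equations also handle correlated measurement errors.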
