Least Squares (Weighted and Generalized)

Linear Regression (either Simple or Multiple) assumes that the observations were generated by a process described by the following equation

y = Xβ + ε

where :

    * X is the matrix of predictors,

    * β is the vector of parameters of the regression function,

    * ε is the vector of random errors.

 

The standard assumptions of Linear Regression further require that the errors :

    * Have identical variances (homoskedasticity),

    * And are uncorrelated.

In other words, they assume that the covariance matrix of the errors is proportional to the identity matrix of order n (with n the number of observations) :

V(ε) = σ²I

where σ² is the (common) variance of the errors.

-----

This strong assumption may very well not be satisfied in practice. Resorting to the usual method of Least Squares estimation (then called "Ordinary Least Squares") may lead to serious mistakes :

    * The estimated parameters are no longer minimum-variance linear unbiased estimators.

    * The classical method for estimating the variance of the errors under the assumption of homoskedasticity becomes meaningless if this variance is not constant throughout the range of predictors.

    * Confidence intervals and tests pertaining to the values of the parameters and of the model predictions are not valid anymore, even if the normality assumption is preserved.

    * The coefficient of determination R² becomes meaningless because the classical variance decomposition is not valid anymore.
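
The failure of the usual confidence intervals can be made explicit. The short derivation below is a sketch under two assumptions not stated above: X is treated as fixed with full column rank, and β* denotes the Ordinary Least Squares estimator (X'X)⁻¹X'y.

```latex
\begin{aligned}
\beta^{*} &= (X'X)^{-1}X'y \;=\; \beta + (X'X)^{-1}X'\varepsilon, \\
V(\beta^{*}) &= (X'X)^{-1}X'\,V(\varepsilon)\,X\,(X'X)^{-1}
             \;=\; (X'X)^{-1}X'\,\Omega\,X\,(X'X)^{-1}.
\end{aligned}
```

This expression reduces to the classical σ²(X'X)⁻¹ only when Ω = σ²I, which is why intervals and tests built on σ²(X'X)⁻¹ are no longer valid otherwise.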

It is therefore appropriate to ask whether the standard Linear Regression paradigm can be modified in a way that accommodates an arbitrary positive definite matrix Ω as the covariance matrix V(ε) of the errors.

Weighted Least Squares

As a first step, one may wish to keep the assumption about the uncorrelatedness of the errors, and just allow the variance of the errors not to be the same across the range of the predictors.

Recall that the Ordinary Least Squares method (OLS) minimizes the Sum of Squared Residuals (SSR) :

SSR = Σi (yi - yi*)²

(the summation being over the n observations) between the observed values yi and the predicted values yi* (the "adjusted values").

Suppose that the variance of observation yi is σ²i. We'll show that minimizing the quantity

 

SSRw = Σi [(yi - yi*)²/σ²i]

 

 

is equivalent to Ordinary Least Squares after the model has been submitted to a certain transformation.

This method is called Weighted Least Squares (WLS), meaning that every residual is now "weighted" by the inverse of the standard deviation of the corresponding observation. This weighting gives observations with a small variance more influence on the fitted model than observations with a large variance.

More specifically, if we introduce the new variables

    * zi = yi /σi  

    * wi = xi /σi  (where xi is the vector of the observed values of the predictors)

the above expression becomes

SSRw = Σi (zi - wi βw*)²

where βw* is the vector of parameters estimated by the WLS method.

So it appears that WLS is equivalent to OLS applied to :

    * The original observations yi divided by their own standard deviation,

    * Regressed on the original predictors also divided by the corresponding standard deviations.


If all the variances are equal, then WLS is clearly equivalent to OLS.
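
The equivalence above can be checked numerically. The following is a minimal sketch, not a reference implementation: the data, the noise model (standard deviation proportional to x) and all variable names are assumptions made for illustration, and the σi are taken as known. It fits the same model twice, once by OLS on the rescaled variables zi and wi, and once by solving the weighted normal equations that minimize SSRw directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: intercept plus one predictor, noise s.d. proportional to x.
n = 50
x = np.linspace(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])      # row i is the predictor vector xi
sigma = 0.5 * x                           # assumed known standard deviations σi
y = X @ np.array([1.0, 2.0]) + sigma * rng.standard_normal(n)

# Route 1: OLS on the rescaled variables zi = yi/σi and wi = xi/σi.
z = y / sigma
W = X / sigma[:, None]
beta_transformed, *_ = np.linalg.lstsq(W, z, rcond=None)

# Route 2: minimize SSRw directly through the weighted normal equations
# (X' D X) β = X' D y, with D = diag(1/σi²).
d = 1.0 / sigma**2
beta_weighted = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y))

print(beta_transformed, beta_weighted)
```

Both routes should return the same coefficient vector up to rounding error.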

Animation

The following animation illustrates the concept of Weighted Least Squares estimation.

 

 


 


 

The animation displays :

    * A grey, fixed "true" regression line that passes through the origin (just for convenience),

    * A set of observations generated from this line with a noise level proportional to x,

    * The blue "Ordinary Least Squares" line (LS),

    * The red "Weighted Least Squares" line (WLS),

    * The "true" value to be predicted for the value of x where the green slider currently stands (thick black tick to the left of the y axis).

    * The average of the already predicted values of the two models (ordinary and weighted) for the value of x where the slider currently stands (thick blue and red ticks to the left of the y axis).

 

After clicking on "Go", each "average prediction" tick is sandwiched between two thin ticks positioned each one standard deviation away from the average of the set of already observed predictions. All those ticks quickly converge toward their final positions.

 

1) In the "Next" mode, compare the LS and WLS lines near the origin, where the noise level is lowest. Observe that the WLS line places more emphasis on staying close to the data points in this region than the LS line does.
 

2) In the "Run" mode, observe that both the LS and the WLS average ticks ultimately line up with the fixed black tick (y value to be predicted) : both the LS and WLS predictors are unbiased. The means of their distributions ("expected values") are just the (common) value to be predicted.
 

3) Observe that the standard deviation of the WLS prediction distribution ultimately becomes smaller than that of the LS prediction distribution. The magnitude of this improvement is certainly not breathtaking, but it is clearly visible. Its numerical value is posted in the "Results" frame ("Std. Dev. ratio").
   Equivalently, observe the predicted values for both LS and WLS ("Next" mode), and notice that the red horizontal line (WLS prediction) is more often than not sandwiched between the black (true value) and blue (LS prediction) lines, meaning that the WLS prediction is more often than not more accurate than the LS prediction. When these predictions are on either side of the value to be predicted, the WLS prediction is more often than not closer to this value than the LS prediction.

Equivalently, observe that the WLS line more often than not sits within the angle between the regression line and the LS line, meaning that it is "closer" to the regression line than the LS line is. When the LS and WLS lines are on either side of the regression line, the WLS line is, more often than not, closer to the regression line than the LS line.
 

4) Conduct several runs with a different position of the slider each time ("Reset"). Notice that there is one position where the standard deviations of LS and WLS are equal : both models have identical predicting performances for this position.
Conversely, the difference in standard deviations is largest at either end of the range of the independent variable x : resorting to WLS (instead of just LS) becomes more advantageous for predictions to be made near extreme values of x. The improvement is particularly noticeable for small values of x, which is to be expected as WLS is particularly careful to stay close to data points in regions of low noise level.
 

5) For a given number of points and slider position, conduct several runs with a different noise level each time. Observe that, although standard deviations become larger with increased noise level, their ratio remains constant. So the degree of improvement from LS to WLS does not depend on the absolute noise level, only on how this noise varies along the x axis (up to a scale factor).
 

6) Increase the number of points ("Reset" mode), and observe that the ratio of the standard deviations decreases, that is, the advantage of WLS over LS becomes more perceptible. This is to be expected as :

    * increasing the number of points may be construed as just extending the data set further to the right with additional points, and then rescaling the x axis to confine the data set within the same range,

    * and we previously noted that the edge of WLS over LS increases as you move further to the right.
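
Observation 5 (the ratio of standard deviations does not depend on the absolute noise level) can also be checked without simulation. The sketch below is an illustration under assumptions of our own: a straight-line model with intercept, uncorrelated errors with standard deviation proportional to x, and a hypothetical helper `prediction_sds` that is not part of the animation. It uses the covariance matrix (X'X)⁻¹X'ΩX(X'X)⁻¹ for the LS estimator and (X'Ω⁻¹X)⁻¹ for the WLS estimator.

```python
import numpy as np

def prediction_sds(x, sigma, x0):
    """Standard deviations of the LS and WLS predictions at x0.

    Assumes a straight-line model with intercept and uncorrelated errors
    whose standard deviations are given by `sigma`.
    """
    X = np.column_stack([np.ones_like(x), x])
    omega = np.diag(sigma**2)                 # errors covariance matrix
    a = np.array([1.0, x0])                   # the prediction at x0 is a'β*

    xtx_inv = np.linalg.inv(X.T @ X)
    cov_ls = xtx_inv @ X.T @ omega @ X @ xtx_inv
    cov_wls = np.linalg.inv(X.T @ np.diag(1.0 / sigma**2) @ X)

    return np.sqrt(a @ cov_ls @ a), np.sqrt(a @ cov_wls @ a)

x = np.linspace(1.0, 10.0, 20)
for noise_level in (0.1, 1.0, 5.0):
    sd_ls, sd_wls = prediction_sds(x, noise_level * x, x0=2.0)
    # Both standard deviations grow with the noise level; their ratio does not.
    print(noise_level, sd_ls, sd_wls, sd_wls / sd_ls)
```

Multiplying every σi by the same constant multiplies both covariance matrices by the square of that constant, so the ratio printed in the last column stays the same across noise levels.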

 



You'll find here a realistic example of heteroskedasticity.
You'll find here the calculation of the slope and intercept of the Weighted Least Squares line.

Generalized Least Squares

Weighted Least Squares discards homoskedasticity but preserves the assumption of uncorrelated errors. If this assumption is also discarded, the errors covariance matrix can be any positive definite matrix Ω.

Can Linear Regression still be salvaged under such general and weak assumptions about the process that generated data ?

We'll show that it is still possible to return to the standard linear model by again transforming the predictors and the response variable. This transformation is a bit more complex than that leading to WLS. The main result is as follows : the best linear unbiased estimator β*G of the vector of parameters β of the regression function is given by

 

β*G = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y

 

The method leading to this result is called Generalized Least Squares estimation (GLS), of which WLS is just a special case. The estimator β*G is sometimes called "Aitken's estimator".


Of course, if Ω = I, the above expression reduces to that of β* as obtained by Ordinary Least Squares.
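
The expression of β*G translates directly into a few lines of linear algebra. The sketch below is only an illustration: the function name `gls`, the simulated data and the particular Ω (correlation 0.6^|i-j| between errors i and j) are assumptions, and Ω is taken as known. It also checks the reduction to OLS when Ω = I.

```python
import numpy as np

def gls(X, y, omega):
    """Aitken's estimator (X'Ω⁻¹X)⁻¹ X'Ω⁻¹ y, using solves instead of explicit inverses."""
    omega_inv_X = np.linalg.solve(omega, X)   # Ω⁻¹X
    omega_inv_y = np.linalg.solve(omega, y)   # Ω⁻¹y
    return np.linalg.solve(X.T @ omega_inv_X, X.T @ omega_inv_y)

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, n)])

# Hypothetical Ω: correlation 0.6^|i-j| between errors i and j (positive definite).
idx = np.arange(n)
omega = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Errors with covariance Ω, obtained from a Cholesky factor of Ω.
errors = np.linalg.cholesky(omega) @ rng.standard_normal(n)
y = X @ np.array([1.0, 2.0]) + errors

beta_gls = gls(X, y, omega)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_gls_identity = gls(X, y, np.eye(n))      # Ω = I : should coincide with OLS

print(beta_gls, beta_ols, beta_gls_identity)
```

Passing a diagonal Ω to the same function gives the WLS estimator, in line with WLS being a special case of GLS.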

____________________________________________________________________

 

 

 

Tutorial

 

In this Tutorial, we establish the fundamental results of the Generalized Least Squares method of estimation, in particular those results pertaining to the estimation of the vector of parameters of the model. The resulting estimator will prove optimal in the sense of the Gauss-Markov theorem.

These results are easily obtained by noting that a Mahalanobis transformation of the vector of errors turns the general problem (no restriction on the errors covariance matrix) into a new problem that satisfies the standard assumptions of Multiple Linear Regression (homoskedasticity and uncorrelated errors).
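
The effect of such a transformation can be sketched numerically. The example below makes several assumptions for the sake of illustration: Ω is known, the data are simulated, and the transformation matrix is taken to be P = L⁻¹ where Ω = LL' is the Cholesky factorization (one possible choice among others, the symmetric inverse square root of Ω being another). It checks that PΩP' is the identity matrix and that OLS on the transformed model gives the same parameters as the closed-form GLS expression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, n)])

# Hypothetical positive definite Ω, built as A A' + I.
A = rng.standard_normal((n, n))
omega = A @ A.T + np.eye(n)
L = np.linalg.cholesky(omega)                 # Ω = L L'
y = X @ np.array([1.0, 2.0]) + L @ rng.standard_normal(n)

# Transformed model: P y = P X β + P ε, with P = L⁻¹ (applied through solves).
X_t = np.linalg.solve(L, X)                   # P X
y_t = np.linalg.solve(L, y)                   # P y

# Covariance of the transformed errors: P Ω P' (Ω is symmetric).
whitened_cov = np.linalg.solve(L, np.linalg.solve(L, omega).T)

# OLS on the transformed model ...
beta_transformed, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
# ... compared with the closed-form expression (X'Ω⁻¹X)⁻¹ X'Ω⁻¹ y.
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(omega, X),
                           X.T @ np.linalg.solve(omega, y))

print(np.allclose(whitened_cov, np.eye(n)), beta_transformed, beta_gls)
```

The transformed errors are homoskedastic and uncorrelated, so the standard results of Multiple Linear Regression apply to the transformed model.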

We'll then emphasize that this transformation can be interpreted in terms of a change of the metric defining the inner product of two vectors. It will thus appear that there is no fundamental difference between Ordinary Least Squares (OLS) and Generalized Least Squares (GLS), the former being just a special case of the latter (errors covariance matrix proportional to the identity matrix).

-----

Weighted Least Squares will not be treated separately, being just a special case of GLS.

 

 

GENERALIZED LEAST SQUARES

Linear transform of the model

General linear transformation

Back to homoskedasticity and lack of correlation

Transformation is not unique

Generalized Least Squares

Parameters of the model

The new model

Estimation of the parameters of the new model

Statistical properties of the GLS estimator

Expectation

Covariance matrix

Optimality (generalized Gauss-Markov)

Geometric interpretation of GLS

Adjusted values, oblique projection

Changing the metric

Inner product and norm

Residuals

Sum of Squared Residuals

TUTORIAL

 

 _____________________________________________________

 

Related readings :

Multiple Linear Regression

Gauss-Markov theorem

Ordinary Least Squares
