Least Squares (Weighted and Generalized)
Linear Regression (either Simple or Multiple) assumes that the observations were generated by a process described by the following equation
y = Xβ + ε
* X is the matrix of predictors,
* β is the vector of parameters of the regression function,
* ε is the vector of random errors.
The standard assumptions of Linear Regression further require that the errors :
* Have identical variances (homoskedasticity),
* And are uncorrelated.
In other words, they assume that the covariance matrix of the errors is proportional to the identity matrix of order n (with n the number of observations) :
V(ε) = σ²I
where σ² is the (common) variance of the errors.
This strong assumption may very well not be satisfied in practice. Resorting to the usual method of Least Squares estimation (then called "Ordinary Least Squares") may lead to serious mistakes :
* The estimated parameters are no longer minimum variance linear unbiased estimators.
* The classical method for estimating the variance of the errors under the assumption of homoskedasticity becomes meaningless if this variance is not constant throughout the range of predictors.
* Confidence intervals and tests pertaining to the values of the parameters and of the model predictions are not valid anymore, even if the normality assumption is preserved.
* The coefficient of determination R² becomes meaningless because the classical variance decomposition is not valid anymore.
It is therefore appropriate to ask if the standard Linear Regression paradigm can be modified in a way that accommodates any arbitrary positive definite matrix Ω as the covariance matrix V(ε) of the errors.
As a first step, one may wish to keep the assumption about the uncorrelatedness of the errors, and just allow the variance of the errors not to be the same across the range of the predictors.
Recall that the Ordinary Least Squares method (OLS) minimizes the Sum of Squared Residuals (SSR) :
SSR = Σi (yi - yi*)²
(the summation being over the n observations) between the observed values yi and the predicted values yi* (the "adjusted values").
Suppose that the variance of observation yi is σ²i. We'll show that minimizing the quantity
SSRw = Σi [(yi - yi*)²/σ²i]
is equivalent to Ordinary Least Squares after the model has been submitted to a certain transformation.
This method is called Weighted Least Squares (WLS), meaning that every residual is now "weighted" by the inverse of the standard deviation of the corresponding observation. This weighting grants observations with a small variance more influence on the fitted model than observations with a large variance.
More specifically, if we introduce the new variables
* zi = yi /σi
* wi = xi /σi (where xi is the vector of the observed values of the predictors)
the above expression becomes
SSRw = Σi (zi - wi'βw*)²
where βw* is the vector of parameters estimated by the WLS method.
So it appears that WLS is equivalent to OLS applied to :
* The original observations yi divided by their own standard deviation,
* Regressed on the original predictors also divided by the corresponding standard deviations.
If all the variances are equal, then WLS is clearly equivalent to OLS.
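The equivalence between WLS and OLS on the transformed variables can be checked numerically. The sketch below (all data and variable names are illustrative, not taken from the tutorial) assumes the per-observation standard deviations σi are known, fits OLS to the transformed variables zi = yi/σi and wi = xi/σi, and verifies that the result matches the direct weighted normal equations :

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heteroskedastic data : y = 2x + noise whose standard
# deviation grows proportionally to x (illustrative values).
n = 200
x = np.linspace(1.0, 10.0, n)
sigma = 0.5 * x                      # per-observation error std, assumed known
y = 2.0 * x + rng.normal(0.0, sigma)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# WLS via the transformation described above : divide the response and
# the predictors by each observation's own standard deviation...
z = y / sigma
W = X / sigma[:, None]
beta_wls, *_ = np.linalg.lstsq(W, z, rcond=None)

# ...which must agree with the direct weighted normal equations
# beta = (X' V^-1 X)^-1 X' V^-1 y, with V = diag(sigma_i^2).
Vinv = np.diag(1.0 / sigma**2)
beta_direct = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

assert np.allclose(beta_wls, beta_direct)
```

Note that the constant column of the design matrix is divided by σi along with the other predictors, so the transformed model no longer has an ordinary intercept.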
The following animation illustrates the concept of Weighted Least Squares estimation.
The animation displays :
* A grey, fixed "true" regression line that passes through the origin (just for convenience),
* A set of observations generated from this line with a noise level proportional to x,
* The blue "Ordinary Least Squares" line (LS),
* The red "Weighted Least Squares" line (WLS),
* The "true" value to be predicted for the value of x where the green slider currently stands (thick black tick to the left of the y axis).
* The averages of the predictions made so far by the two models (ordinary and weighted) for the value of x where the slider currently stands (thick blue and red ticks to the left of the y axis).
After clicking on "Go", each "average prediction" tick is sandwiched between two thin ticks, each positioned one standard deviation away from the average of the set of already observed predictions. All those ticks quickly converge toward their final positions.
1) In the "Next" mode, compare the LS and WLS lines near the origin, where the noise level is lowest. Observe that the WLS places more emphasis on staying close to the data points in this region than the LS does.
2) In the "Run" mode, observe that both the LS and the WLS average ticks ultimately line up with the fixed black tick (y value to be predicted) : both the LS and WLS predictors are unbiased. The means of their distributions ("expected values") are just the (common) value to be predicted.
3) Observe that the standard deviation of the WLS prediction distribution ultimately becomes smaller than that of the LS prediction distribution. The magnitude of this improvement is certainly not breathtaking, but it is clearly visible. Its numerical value is posted in the "Results" frame ("Std. Dev. ratio").
Equivalently, observe that when the LS and the WLS lines lie on either side of the true regression line, the WLS line is, more often than not, the closer of the two to the regression line.
4) Conduct several runs with a different position of the slider each time ("Reset"). Notice that there is one position where the standard deviations of LS and WLS are equal : both models have identical predictive performance for this position.
5) For a given number of points and slider position, conduct several runs with a different noise level each time. Observe that, although standard deviations become larger with increased noise level, their ratio remains constant. So the degree of improvement from LS to WLS does not depend on the absolute noise level, only on how this noise varies along the x axis (up to a scale factor).
6) Increase the number of points ("Reset" mode), and observe that the ratio of the standard deviations decreases, that is, the advantage of WLS over LS becomes more perceptible. This is to be expected as :
* increasing the number of points may be construed as just extending the data set further to the right with additional points, and then changing the x scale to confine the data set within the same range,
* and we previously noted that the edge of WLS over LS increases as you move further to the right.
You'll find here a realistic example of heteroskedasticity.
You'll find here the calculation of the slope and intercept of the Weighted Least Squares line.
Weighted Least Squares discards homoskedasticity but preserves the assumption of uncorrelated errors. If this assumption is also discarded, the covariance matrix of the errors can be any positive definite matrix Ω.
Can Linear Regression still be salvaged under such general and weak assumptions about the process that generated the data ?
We'll show that it is still possible to return to the standard linear model by again transforming the predictors and the response variable. This transformation is a bit more complex than that leading to WLS. The main result is as follows : the best estimator β*G of the vector of parameters β of the regression function is given by
β*G = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y
The method leading to this result is called Generalized Least Squares estimation (GLS), of which WLS is just a special case. The estimator β*G is sometimes called "Aitken's estimator".
Of course, if Ω = I, the above expression reduces to that of β* as obtained by Ordinary Least Squares.
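This reduction is easy to verify numerically. The sketch below (variable names are ours) computes Aitken's estimator directly from the formula and checks that with Ω = I it coincides with OLS; since the formula only involves Ω through Ω⁻¹ on both sides, it is also insensitive to a positive scale factor on Ω, consistent with V(ε) = σ²I reducing to OLS for any σ² :

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (any full-rank design matrix will do).
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def gls(X, y, Omega):
    """Aitken's estimator beta*G = (X' Omega^-1 X)^-1 X' Omega^-1 y."""
    Oinv = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# With Omega = I, GLS reduces to Ordinary Least Squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(gls(X, y, np.eye(n)), beta_ols)

# Omega matters only up to a positive scale factor.
assert np.allclose(gls(X, y, np.eye(n)), gls(X, y, 3.7 * np.eye(n)))
```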
In this Tutorial, we establish the fundamental results of the Generalized Least Squares method of estimation, in particular those results pertaining to the estimation of the vector of parameters of the model. The resulting estimator will prove optimal in the sense of the Gauss-Markov theorem.
These results are easily obtained by noting that a Mahalanobis transformation of the vector of errors turns the general problem (no restriction on the errors covariance matrix) into a new problem that satisfies the standard assumptions of Multiple Linear Regression (homoskedasticity and uncorrelated errors).
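One concrete choice of such a transformation (it is not unique) uses a Cholesky factor of Ω. The sketch below, with illustrative data, premultiplies the model by L⁻¹ where Ω = LL', so the transformed errors have covariance L⁻¹ΩL'⁻¹ = I; OLS on the transformed model then recovers Aitken's estimator :

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data.
n, p = 40, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# An arbitrary positive definite error covariance matrix Omega.
A = rng.normal(size=(n, n))
Omega = A @ A.T + n * np.eye(n)

# Omega = L L'. Premultiplying the model by L^-1 whitens the errors :
# their covariance becomes L^-1 Omega L'^-1 = I (standard OLS assumptions).
L = np.linalg.cholesky(Omega)
X_t = np.linalg.solve(L, X)   # L^-1 X
y_t = np.linalg.solve(L, y)   # L^-1 y

beta_from_ols, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)

# Direct Aitken estimator for comparison.
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

assert np.allclose(beta_from_ols, beta_gls)
```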
We'll then emphasize that this transformation can be interpreted in terms of a change of the metric defining the inner product of two vectors. It will thus appear that there is no fundamental difference between Ordinary Least Squares (OLS) and Generalized Least Squares (GLS), the former being just a special case of the latter (errors covariance matrix proportional to the identity matrix).
Weighted Least Squares will not be treated separately, being just a special case of GLS.
GENERALIZED LEAST SQUARES
Linear transform of the model
General linear transformation
Back to homoskedasticity and lack of correlation
Transformation is not unique
Generalized Least Squares
Parameters of the model
The new model
Estimation of the parameters of the new model
Statistical properties of the GLS estimator
Optimality (generalized Gauss-Markov)
Geometric interpretation of GLS
Adjusted values, oblique projection
Changing the metric
Inner product and norm
Sum of Squared Residuals