Because data are naturally dispersed, the predictions of a regression model never coincide exactly with the observations. The best one can hope for is that the model embodies a function close to the true (and forever unknown) regression function, that is, the mean value of the response for each value of the predictors.

The error of a regression model on a particular observation, that is, the difference between the observed value and the value predicted by the model, is called the residual of the model for that observation. Usually, the parameters of a regression model are calculated so as to minimize the sum of the squares of the residuals (the "Least Squares" method).
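As an illustration, here is a minimal sketch of a Least Squares fit for the simplest case, a straight line with one predictor. The function names `fit_line` and `residuals` and the data values are invented for this example; they are assumptions, not part of any particular library.

```python
def fit_line(x, y):
    """Least Squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx               # slope
    a = y_mean - b * x_mean     # intercept
    return a, b


def residuals(x, y, a, b):
    """Residual of each observation: observed value minus predicted value."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
res = residuals(x, y, a, b)
rss = sum(r ** 2 for r in res)  # the quantity that Least Squares minimizes

print("intercept:", a, "slope:", b)
print("residuals:", res)
print("sum of squared residuals:", rss)
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data, which is what "Least Squares" means.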

Taken alone, the fact that one particular observation has a large or a small residual says nothing about whether the model is good or bad in the region around that observation.

* A model may be excellent around a point, yet an observation sitting at this point may have a large residual because of dispersion.

* For similar reasons, an observation with a very low residual may be sitting in a region where the model is poor.

But of course, if all the residuals in a particular area are small (resp. large), then the model is good (resp. bad) in this area.
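The small simulation below is one way to see this. It is only a sketch under invented assumptions: a true regression function `2*x`, Gaussian dispersion with standard deviation 1, a "good" model equal to the true function, and a "poor" model that is systematically 2 units too low. None of these choices comes from a real data set.

```python
import random

random.seed(0)  # arbitrary seed, used only to make the run reproducible


def true_mean(x):
    return 2.0 * x          # assumed true regression function


def good_model(x):
    return 2.0 * x          # coincides with the true regression function


def poor_model(x):
    return 2.0 * x - 2.0    # systematically 2 units too low


# 1000 observations in the region 4 <= x <= 6, dispersed around the true mean.
xs = [random.uniform(4.0, 6.0) for _ in range(1000)]
ys = [true_mean(x) + random.gauss(0.0, 1.0) for x in xs]

res_good = [y - good_model(x) for x, y in zip(xs, ys)]
res_poor = [y - poor_model(x) for x, y in zip(xs, ys)]

# Single observations can be misleading...
largest_good = max(abs(r) for r in res_good)   # large residual, good model
smallest_poor = min(abs(r) for r in res_poor)  # small residual, poor model

# ...whereas the residuals of the whole region tell the two models apart.
mean_good = sum(res_good) / len(res_good)
mean_poor = sum(res_poor) / len(res_poor)

print("good model: largest |residual| =", largest_good, " mean residual =", mean_good)
print("poor model: smallest |residual| =", smallest_poor, " mean residual =", mean_poor)
```

The good model still produces some large individual residuals, and the poor model some very small ones. Only the behavior of all the residuals in the region (summarized here by their mean) reveals which model is off.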

-----

A detailed analysis of the residuals can be conducted only within the framework of Linear Regression (Simple or Multiple). The objective of such an analysis is to:

• Check whether the basic assumptions of linear modeling (linearity, homoskedasticity, uncorrelated errors) are satisfied.
• Identify observations whose contributions to some aspects of the model are particularly large, and that can therefore be suspected of throwing the model off course and spoiling the conclusions drawn from its analysis. This identification relies on classical indicators of "extremeness" such as DFFITS or Cook's distance, all of which incorporate one of the various transformations of the residuals (standardized, internally studentized or externally studentized residuals).
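
The indicators mentioned above can be sketched as follows for Simple Linear Regression, using the usual textbook formulas based on the leverage h of each observation. The function names, the dictionary keys and the data are invented for this example. Terminology also varies between authors: here, as an assumption, "standardized" means the residual divided by the residual standard deviation s, "internally studentized" divides by s·sqrt(1 − h), and "externally studentized" uses the value of s obtained with the observation left out.

```python
import math


def fit_line(x, y):
    """Least Squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def influence_measures(x, y):
    """Residual-based diagnostics for a straight-line fit, one dict per observation."""
    n, p = len(x), 2                      # p = number of parameters (intercept, slope)
    a, b = fit_line(x, y)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]         # raw residuals
    h = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]    # leverages
    rss = sum(ei ** 2 for ei in e)
    s2 = rss / (n - p)                    # residual variance estimate
    rows = []
    for ei, hi in zip(e, h):
        # Residual variance estimated without this observation.
        s2_del = (rss - ei ** 2 / (1.0 - hi)) / (n - p - 1)
        r_int = ei / math.sqrt(s2 * (1.0 - hi))       # internally studentized
        t_ext = ei / math.sqrt(s2_del * (1.0 - hi))   # externally studentized
        rows.append({
            "residual": ei,
            "leverage": hi,
            "standardized": ei / math.sqrt(s2),
            "internal": r_int,
            "external": t_ext,
            "cooks_d": r_int ** 2 / p * hi / (1.0 - hi),
            "dffits": t_ext * math.sqrt(hi / (1.0 - hi)),
        })
    return rows


# Made-up data: nine points close to y = 2x, plus one point far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 20.0]

for xi, row in zip(x, influence_measures(x, y)):
    print(xi, {k: round(v, 3) for k, v in row.items()})
```

In this made-up data set the last observation pulls the fitted line towards itself, so its raw residual is not the largest in absolute value, yet its Cook's distance is by far the largest. This is the reason why these indicators work on transformed residuals rather than on raw ones.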
