Regression
One of the two branches of Predictive Modeling (the other one is Classification).
____________________________________________________
In the illustration below, the upper image is the graphic representation of the (outrageously simplified) historical data base of a car-loan company. This company wants to use these data for predicting the budget that a new customer is willing to spend on a new car.
The data is distributed as a narrow strip, so it is possible to draw a curve that best "fits" the data (lower image). This curve will be considered a satisfactory approximation of the true data distribution.
The curve is the graph of a function :
Budget = f *(Age)
that is called the regression function of the variable "Budget" on the variable "Age".
The asterisk is there as a reminder that this is just an estimate of the true regression function (see below).
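As a rough sketch of how such an estimate f *(Age) could be obtained, the snippet below fits a curve to invented data by least squares. The numbers, the quadratic shape of the trend and the choice of a degree-2 polynomial are all assumptions made for illustration; they do not come from the car-loan example itself.

```python
import numpy as np

# Synthetic stand-in for the car-loan data base (every number is invented).
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=200)

# Hypothetical trend plus scatter around it (the "narrow strip").
budget = 5000 + 900 * age - 8 * age**2 + rng.normal(0, 1500, size=200)

# f*(Age): a degree-2 polynomial fitted by least squares (an assumed model).
coeffs = np.polyfit(age, budget, deg=2)
f_star = np.poly1d(coeffs)

# Use the fitted curve to predict the budget of a new 35-year-old customer.
predicted_budget = float(f_star(35))
print(round(predicted_budget))
```

The fitted curve returns a single number for a given Age, even though the customers of that age in the data spend a whole range of budgets.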
It will be used as follows :
according to the illustration below :

For the time being, we are going to set aside the fact that the data base is just a sample from a (supposedly infinite) population, and consider that it contains virtually all of the population. Accordingly, we are now going to define the true regression function f(x), and not just the estimate f *(x) built by the analyst from a limited sample.
Even with these causes of uncertainty removed, the prediction cannot be perfect. In the foregoing example, "Age" by itself cannot uniquely determine the behavior of a new customer. For a given Age, the values of y (Budget) spread out according to a certain distribution (see above illustration). As the regression function delivers a unique number f(x) for any value of x, the prediction is almost certainly wrong to a certain extent.
Additional variables may be considered for the purpose of reducing the errors on the predicted value of y. For example, Revenue, Gender, Annual mileage, Number of children, etc. could be included in the regression function. All the attributes used for the prediction are called "predictors", whereas "Budget" is called the "response variable". So, quite generally, regression consists in looking for the "best" function :
y = f(x1, x2, ..., xp)
for the purpose of predicting the value of the response variable y, knowing the values of the predictors {x1, x2, ..., xp}.
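A minimal sketch of a regression on several predictors is given below. It assumes a model that is a weighted sum of the predictors, fitted by least squares; the data, the coefficients used to generate it and the helper name `f_star` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Invented predictors: Age, Revenue, Number of children.
age = rng.uniform(20, 70, n)
revenue = rng.uniform(20_000, 100_000, n)
children = rng.integers(0, 4, n)

# Hypothetical rule generating the response variable, plus random scatter.
budget = 2000 + 100 * age + 0.25 * revenue - 1500 * children + rng.normal(0, 2000, n)

# Design matrix: a column of ones (intercept) followed by the predictors.
X = np.column_stack([np.ones(n), age, revenue, children])

# Least-squares estimate of the parameters of the assumed model.
beta, *_ = np.linalg.lstsq(X, budget, rcond=None)

def f_star(age, revenue, children):
    """Predicted budget for one customer (estimate of the regression function)."""
    return float(beta @ np.array([1.0, age, revenue, children]))

print(round(f_star(40, 50_000, 2)))
```

With enough data, the estimated parameters in `beta` land close to the values used to generate the data, but never exactly on them.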
The regression function f(x1, x2, ..., xp) has therefore to be defined so as to make the prediction errors as small as possible. This expression is ambiguous as long as we have not defined what is meant by "minimizing the errors".
In the vocabulary of Statistics, the regression function is the expectation of y, conditioned on the values of {x1, x2, ..., xp}.
In technical notation :
f(x1, x2, ..., xp) = E[y | x1, x2, ..., xp]
For the notion of conditional expectation, see here.
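The link between "making the prediction errors as small as possible" and the conditional expectation can be made explicit under one common convention, which is an assumption here: errors are measured by the mean squared error. For any candidate function g, writing f(x) for the conditional expectation of y given x:

```latex
\mathbb{E}\big[(y - g(x))^2\big]
  = \mathbb{E}\big[(y - f(x))^2\big] + \mathbb{E}\big[(f(x) - g(x))^2\big],
\qquad f(x) = \mathbb{E}[\,y \mid x\,].
```

The first term does not depend on g, and the second term is zero only when g coincides with f. Under this criterion the conditional expectation is therefore the best possible predictor, and the first term is the part of the error that no function of x can remove.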
Our first visual approach to regression was purely geometric, whereas it now appears that regression is a fundamentally probabilistic concept.
Regression meets two difficulties of its own :
1) The regression function f(.) may have just about any (but unknown) analytic form (or even no analytic form at all). The analyst will therefore have to choose, more or less arbitrarily, a functional form f *(.) (the regression model) for the purpose of approximating f(.). This choice will be made on the basis of what is known about the mechanism that generated the data.
2) Regression assumes that the data was generated by a probability density :
y = f(x) + ex
where :
This last point is a nuisance. It does not prevent building a reasonably accurate regression model, but it makes several important auxiliary techniques ineffective. This is particularly true for tests meant to decide whether or not a particular predictor has an effect on the response variable.
It is often assumed that the variance of ex does not depend on x (homoscedasticity). Further assuming that ex is normally distributed makes the above-mentioned tests fully effective if the model is linear (in the parameters, not necessarily in the predictors, see here).
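The sketch below illustrates these assumptions on invented data: the noise is normal with the same variance for every x, and the model is linear in its parameters although it contains x squared. The t-statistic used here is one standard way of testing whether a predictor has an effect; the text above does not name a specific test, so treat that choice, and the variable names, as assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, n)
irrelevant = rng.normal(0, 1, n)   # invented predictor with no effect on y

# Homoscedastic normal noise: the same standard deviation for every x.
noise = rng.normal(0, 1.0, n)
y = 1.0 + 2.0 * x - 0.3 * x**2 + noise

# Linear in the parameters, not in the predictor: x**2 is just another column.
X = np.column_stack([np.ones(n), x, x**2, irrelevant])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical least-squares standard errors and t-statistics.
residuals = y - X @ beta
sigma2 = float(residuals @ residuals) / (n - X.shape[1])   # noise variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))

for name, t in zip(["intercept", "x", "x^2", "irrelevant"], t_stats):
    print(f"{name:>10}: t = {t:.1f}")
```

A t-statistic far from zero suggests that the predictor has an effect. The reading of these values as a formal test relies on the homoscedasticity and normality assumptions stated above.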
------
In addition to these specific difficulties, regression must solve, like any modeling process, the very important question of selecting the appropriate set of predictors for building the model. Because it is so important, we state again, as in other places of this Glossary, that :
which makes it easier to account for the available data.
So there is a trade-off that is hard to find (except with the simplest models like Simple Linear Regression). Overlooking this issue is the most frequent cause of "catastrophic model failure".
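The trade-off can be observed on a toy example. In the sketch below, successive powers of x play the role of additional predictors (an assumption made to keep the example one-dimensional), and the sine-shaped "true" function and noise level are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    # Hypothetical true regression function, chosen for illustration only.
    return np.sin(3 * x)

# A small sample used to fit the models, and a large set of fresh data.
x_train = rng.uniform(-1, 1, 15)
y_train = true_f(x_train) + rng.normal(0, 0.3, 15)
x_new = rng.uniform(-1, 1, 2000)
y_new = true_f(x_new) + rng.normal(0, 0.3, 2000)

train_mse, new_mse = {}, {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Mean squared error on the data used for fitting...
    train_mse[degree] = float(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
    # ...and on data the model has never seen.
    new_mse[degree] = float(np.mean((y_new - np.polyval(coeffs, x_new)) ** 2))
    print(degree, round(train_mse[degree], 3), round(new_mse[degree], 3))
```

The error on the fitting data can only go down as terms are added, because each model contains the previous one. The error on fresh data has no such guarantee and typically goes back up once the model starts fitting the noise.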
Once the functional form f *(.) of the model has been chosen, the parameters of the model have to be calculated in a way that will make f *(.) as close as possible to the true regression function f(.).
There are two main ways of calculating the parameters of a regression model :
The f *(.) model is built from random and finite data and therefore cannot be more than an estimate of the true regression function f(.), which remains forever unknown. As with any model, it is necessary to estimate the true performance of the model (that is, its performance on new data that was not used for calculating the parameters).
Linear Regression is a rare situation where, given a few restrictive hypotheses on the data, theory alone can estimate the true performance of the model.
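One common way of estimating the performance on data not used for calculating the parameters is k-fold cross-validation, sketched below. The choice of this method, the helper name `k_fold_mse` and the synthetic straight-line data are assumptions made for illustration.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average squared error on held-out folds for a polynomial model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        used = np.concatenate([folds[j] for j in range(k) if j != i])
        # Parameters are calculated without the held-out fold...
        coeffs = np.polyfit(x[used], y[used], degree)
        # ...and the error is measured only on that fold.
        pred = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

# Invented data: a straight line plus noise of variance 0.25.
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)

cv_mse = k_fold_mse(x, y, degree=1)
print(round(cv_mse, 3))
```

Because every prediction is made on a point the fitted model has not seen, `cv_mse` is an estimate of the error to be expected on new data, which here cannot fall much below the noise variance.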
But in the general case, it is necessary :
____________________________________________________________