Predictive modeling
Predictive Modeling is one of the two main branches of Data Modeling (the other one being Descriptive Modeling).
Its goal is to:
1) Identify strong links between the variables (columns) of a data table. Such a link translates into, for example, an equation between one variable y (the so-called "dependent" or "response" variable) and a group of other variables {xi} (the so-called "independent variables", or "predictors"):
y = f(x1, x2, ..., xn) + Small random noise
The discovery of such a link is an important piece of information by itself, especially if the discovered link turns out to be causal.
2) Then use this equation to predict the value of y for new individuals for whom y was not measured (and is therefore absent from the original data table).
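The two steps above can be sketched with the simplest possible case: one predictor and a straight-line f. This is only an illustration under stated assumptions. The data are simulated, and the names fit_line and predict are hypothetical helpers, not part of any library.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Apply the fitted equation to a new individual with predictor value x."""
    intercept, slope = model
    return intercept + slope * x

# Step 1: discover the link from a data table where y was measured.
# Simulated data (an assumption): y = 2 + 3*x + small Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.1) for x in xs]
model = fit_line(xs, ys)

# Step 2: predict y for a new individual whose y was never measured.
y_new = predict(model, 6.0)
print("intercept, slope:", model)
print("predicted y at x = 6.0:", y_new)
```

The fitted intercept and slope should land close to the values used in the simulation, and the prediction for the new individual follows from plugging its x into the fitted equation.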
Regression and Classification illustrate the general idea behind Predictive Modeling: there is a group of variables (the predictors) that contains all the information necessary to predict, up to some random noise, the value that another variable (the dependent variable) will take for any individual. This dependent variable therefore carries no information that is not already present in the group of predictors. It is redundant: removing it from the table would cause no loss of information about the population beyond the random noise.
This redundancy is detected by the modeling process, and formalized by an equation y = f(x1, x2, ..., xn) that can then be used for predicting the value of the dependent variable for individuals for which this value has not been measured.
Predictive modeling can therefore be understood as :
The equation y = f(x1, x2, ..., xn) is a predictive model. The function f(.) may take many different forms :
Calculating the values of the parameters relies on strong assumptions about the statistical distribution of the data (parametric models). If these assumptions are justified, theory alone provides a wealth of information about the credibility of the model (confidence intervals, tests, variable selection, prediction errors...).
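As a hedged illustration of what "by theory alone" means, the sketch below fits a straight line and derives an approximate 95% confidence interval for its slope directly from a formula, with no resampling. It assumes independent Gaussian noise of constant variance, uses simulated data, and replaces the exact Student t quantile by the normal value 1.96 (a simplification that is reasonable only for fairly large samples). The function name slope_with_interval is hypothetical.

```python
import math
import random

def slope_with_interval(xs, ys, z=1.96):
    """Least-squares slope of y = a + b*x with an approximate 95% interval.

    Assumes independent Gaussian noise with constant variance; z=1.96 is the
    normal approximation to the Student t quantile (an assumption).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the classical standard error of the slope.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope - z * std_err, slope + z * std_err

# Simulated data (an assumption): y = 2 + 3*x + Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.1) for x in xs]
slope, lower, upper = slope_with_interval(xs, ys)
print(f"slope = {slope:.3f}, approx. 95% interval = [{lower:.3f}, {upper:.3f}]")
```

The interval is only as trustworthy as the distributional assumptions behind the standard-error formula, which is exactly the trade-off described above.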
The two most common parametric predictive models are :
The function f(.) now has no explicit, interpretable mathematical form. Nonparametric models behave like "black boxes": they act (sometimes quite effectively) as "regressors" or "classifiers", but most of the theoretical results that make parametric models so attractive are lost. These results have to be replaced by cumbersome nonparametric validation techniques (cross validation, bootstrap).
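To make the validation idea concrete, here is a minimal sketch of cross validation applied to one nonparametric regressor, a k-nearest-neighbour average. The choice of regressor, the simulated data, and the names knn_predict and cross_val_mse are all assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Average y of the k training points whose x is closest to the query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cross_val_mse(data, k_neighbors, n_folds=5):
    """Mean squared prediction error estimated by n_folds-fold cross validation."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        for x, y in held_out:
            squared_errors.append((knn_predict(train, x, k_neighbors) - y) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Simulated data (an assumption): y = x^2 + Gaussian noise, x uniform on [-2, 2].
rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.uniform(-2, 2)
    data.append((x, x * x + rng.gauss(0, 0.2)))

mse_small_k = cross_val_mse(data, k_neighbors=5)
mse_large_k = cross_val_mse(data, k_neighbors=80)
print("estimated MSE with 5 neighbours: ", mse_small_k)
print("estimated MSE with 80 neighbours:", mse_large_k)
```

With 80 neighbours out of 80 training points, the regressor simply returns the overall mean of y, so its estimated error should be much larger than with 5 neighbours. No formula supplies these error figures; they come only from repeatedly refitting on part of the data and testing on the rest.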
Among the many difficulties that predictive modeling encounters, the two most important are:
* Choosing the independent variables (predictors). For fundamental reasons (bias-variance tradeoff), it is mandatory to carefully select the predictors to be retained in the final model.
* Choosing the predictive method (i.e. the above-mentioned f(.) function), a heavy responsibility for the analyst.
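One simple way to approach the first difficulty is to compare candidate predictors by their prediction error on data kept out of the fit. The sketch below is a deliberately small illustration, not a full selection procedure: it uses simulated data in which x1 drives y and x2 is pure noise, fits a one-predictor straight line for each, and ranks them by held-out error. The names fit_line and holdout_mse are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def holdout_mse(xs, ys, n_train):
    """Fit on the first n_train rows, return mean squared error on the rest."""
    intercept, slope = fit_line(xs[:n_train], ys[:n_train])
    errors = [(intercept + slope * x - y) ** 2
              for x, y in zip(xs[n_train:], ys[n_train:])]
    return sum(errors) / len(errors)

# Simulated data table (an assumption): y depends on x1 only; x2 is noise.
rng = random.Random(2)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + rng.gauss(0, 0.5) for a in x1]

scores = {"x1": holdout_mse(x1, y, 150), "x2": holdout_mse(x2, y, 150)}
best = min(scores, key=scores.get)
print("held-out MSE per candidate predictor:", scores)
print("retained predictor:", best)
```

The same comparison could be run over different choices of f(.) instead of different predictors, which is one practical way to approach the second difficulty.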