Predictive modeling

Predictive Modeling is one of the two main branches of Data Modeling (the other one being Descriptive Modeling).

# Predictive modeling

Its goal is to:

1) Identify strong links between the variables (columns) of a data table. Such a link translates into, for example, an equation between one variable y (the so-called "dependent" or "response" variable) and a group of other variables {xi} (the so-called "independent variables", or "predictors"):

y = f(x1, x2, ..., xn) + Small random noise

The discovery of such a link is an important piece of information by itself, especially if the discovered link turns out to be causal.

2) Use this equation to predict the value of y for new individuals whose value of y was not measured (and was therefore absent from the original data table).

• When the response variable is numerical, predictive modeling is called Regression.
• When the response variable is nominal, predictive modeling is called Classification. The values of the response variable are then categories ("modalities") that can be regarded as "class labels".
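The two steps above (find the link, then use it for prediction) can be sketched for both cases. This is a minimal illustration, not a reference implementation: the data are made up, f is assumed to be a straight line for the regression case, and the classifier assumed here is a simple nearest-centroid rule. All function names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def fit_centroids(xs, labels):
    """Mean of the predictor within each class label."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(values) for label, values in groups.items()}

def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: numerical response (made-up data, roughly y = 2x plus noise).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 6  # new individual with x = 6, y not measured

# Classification: nominal response (made-up data with two class labels).
xs_c = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
labels = ["low", "low", "low", "high", "high", "high"]
centroids = fit_centroids(xs_c, labels)
predicted_label = classify(centroids, 2.6)  # new individual with x = 2.6

print(round(slope, 2), round(prediction, 2), predicted_label)
```

In both cases the model is built on individuals whose response is known, then applied to an individual whose response is not.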

# Predictive modeling and redundancy

Regression and Classification illustrate the general idea behind Predictive Modeling: a group of variables (the predictors) contains all the information needed to predict, up to some random noise, the value that another variable (the dependent variable) takes on any individual. This dependent variable therefore carries no information that is not already present in the group of predictors. In that sense it is redundant: removing it from the table would cause no loss of information about the population.

This redundancy is detected by the modeling process and formalized by an equation y = f(x1, x2, ..., xn), which can then be used to predict the value of the dependent variable for individuals on whom it has not been measured.

Predictive modeling can therefore be understood as :

• The discovery of column redundancy in a data table,
• Followed by the use of this redundancy to predict the values of the "redundant" response variable y.
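One way to make the idea of a redundant column concrete is to measure how much of a column's variation a fitted equation reproduces. The sketch below assumes a straight-line link and uses the squared correlation coefficient (R squared) as the measure. Both choices are assumptions made for illustration, and a low R squared only rules out a linear link, not every possible f. The data and names are made up.

```python
from statistics import mean

def r_squared(xs, ys):
    """Share of the variance of ys reproduced by a least-squares line in xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # almost determined by x: a "redundant" column
z = [5, 1, 4, 2, 3]             # no visible linear link with x

r2_y = r_squared(x, y)
r2_z = r_squared(x, z)
print(round(r2_y, 3), round(r2_z, 3))
```

A value close to 1 for y says that the column could be dropped and rebuilt from x with little loss, while the value close to 0 for z says that x carries almost no linear information about z.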

# Parametric and nonparametric predictive modeling

The equation y = f(x1, x2, ..., xn) is a predictive model. The function f(.) may take many different forms:

• It may be an explicit mathematical expression containing numerical parameters. The values to be assigned to these parameters are usually calculated by one of two methods :

Calculating the values of the parameters relies on strong assumptions about the statistical distribution of the data (parametric models). If these assumptions are justified, theory alone yields a wealth of information about the credibility of the model (confidence intervals, tests, variable selection, prediction errors...).
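As an illustration of what "by theory alone" means, consider a linear model with independent Gaussian noise. This particular model is an assumption chosen for the example, not one named above. Here m is the number of individuals, n the number of predictors, and in the second and third formulas y stands for the vector of the m observed responses.

```latex
% Assumed model: linear in the predictors, independent Gaussian noise
y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

% Least-squares estimate; X is the m x (n+1) design matrix with a leading column of ones
\hat{\beta} = (X^\top X)^{-1} X^\top y,
\qquad \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}

% Confidence interval at level 1 - \alpha for the coefficient \beta_j
\hat{\beta}_j \pm t_{m-n-1,\,1-\alpha/2}\; \hat{\sigma} \sqrt{\left[(X^\top X)^{-1}\right]_{jj}},
\qquad \hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{m - n - 1}
```

The interval follows from the distributional assumption on the noise, with no resampling of the data. If that assumption fails, the interval loses its guarantee.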

The two most common parametric predictive models are :

• When no credible and useful assumption can be made about the data distribution, predictive modeling resorts to nonparametric models.

The function f(.) now has no explicit, interpretable mathematical form. Nonparametric models behave like "black boxes": they act (sometimes quite effectively) as "regressors" or "classifiers", but most of the theoretical results that make parametric models so attractive are lost. These results have to be replaced by computationally heavier nonparametric validation techniques (cross validation, bootstrap).
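Cross validation can be sketched as follows. The example assumes a k-nearest-neighbors regressor as the "black box" (one possible nonparametric model, chosen here for brevity) and made-up, noise-free data. The function names are hypothetical. Prediction error is estimated by holding out part of the data in turn, rather than derived from theory.

```python
def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def cross_val_mse(data, k, n_folds):
    """Mean squared prediction error estimated by n_folds-fold cross validation."""
    squared_errors = []
    for fold in range(n_folds):
        # Every n_folds-th point, starting at `fold`, is held out for testing.
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        for x, y in test:
            squared_errors.append((knn_predict(train, x, k) - y) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Made-up data: (x, y) pairs with y = 2x.
data = [(x, 2 * x) for x in range(10)]

# With as many folds as points, this is leave-one-out cross validation.
scores = {k: cross_val_mse(data, k, n_folds=len(data)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The same procedure is commonly used to choose a tuning value such as k, by keeping the value with the lowest estimated error.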

# Predictive modeling is difficult

Among the many difficulties that predictive modeling encounters, the two most important are :

* Choosing the independent variables (predictors). For fundamental reasons (bias-variance tradeoff), the predictors retained in the final model must be selected carefully: too few and the model misses part of the link, too many and its predictions become unstable.

* Choosing the predictive method (i.e. the above-mentioned function f(.)), a heavy responsibility for the analyst.
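The first difficulty can be illustrated with a toy experiment: the same predictive method is evaluated with and without an irrelevant predictor. Everything here is an assumption made for the example: the made-up table (x2 carries no information about y), the 1-nearest-neighbor predictor, and leave-one-out error as the comparison criterion. The function names are hypothetical.

```python
from math import dist

def nearest_neighbor_predict(train, point):
    """Return the response of the training row closest to `point`."""
    _, y = min(train, key=lambda row: dist(row[0], point))
    return y

def loo_mse(rows):
    """Leave-one-out mean squared error of the 1-nearest-neighbor predictor."""
    errors = []
    for i, (point, y) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        errors.append((nearest_neighbor_predict(train, point) - y) ** 2)
    return sum(errors) / len(errors)

def select_columns(table, columns):
    """Build (predictor tuple, response) rows from the chosen columns."""
    return [(tuple(row[c] for c in columns), row["y"]) for row in table]

# Made-up table: y depends on x1 only; x2 just alternates between 0 and 10.
table = [{"x1": i, "x2": 10 * (i % 2), "y": 2 * i} for i in range(8)]

scores = {
    columns: loo_mse(select_columns(table, columns))
    for columns in [("x1",), ("x1", "x2")]
}
print(scores)
```

In this contrived table, adding x2 distorts the distances between individuals and the estimated error goes up, which is why a candidate set of predictors is usually compared on held-out error rather than accepted wholesale.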

________________________________________________________________________

Related readings :

 Data modeling, Descriptive modeling, Regression, Classification