A model is built from the data available at the time of modeling. New data then arrive and are fed to the model in order to extract information. For example, a new loan applicant is "submitted" to a Decision Tree built on historical loan data. The Tree delivers its verdict as the probability that the applicant will repay the loan without problems.
Now assume that the Tree performs satisfactorily on the data used to build it. Should we believe that its behavior will be equally satisfactory on new loan applicants? In other words, should we believe that the Tree will generalize well?
This is one of the most important questions pertaining to modeling, and the answer is, unfortunately: "No, there is generally no reason to blindly believe that a model that performs well on the data used to build it will be equally satisfactory when it is most needed, that is, when fed with new data".
Estimating the performance of a model on new data is called validating the model. Validating a model is important, difficult, and all too often neglected by occasional modelers.
It should be clearly understood that a model with poor generalization capability does not do its job, which is to give an accurate picture of the real world. This is why bypassing the validation phase is so risky. Newcomers to modeling may not be aware of the risk, and may honestly believe that their model is "good" just because it performs well on the "construction data".
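The gap between performance on construction data and performance on new data can be made visible by holding some cases back before building the model. The sketch below is only an illustration under stated assumptions: the data are synthetic, the names (`make_case`, `score`, `repaid`) are hypothetical, and a model that simply memorizes its construction data (a 1-nearest-neighbour rule) stands in for the Tree.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_case():
    """One synthetic loan case: a score in [0, 1) and a noisy 'repaid' label."""
    score = rng.random()
    repaid = int(score > 0.5)
    if rng.random() < 0.2:  # 20% of the labels are flipped: irreducible noise
        repaid = 1 - repaid
    return score, repaid


cases = [make_case() for _ in range(400)]
construction, holdout = cases[:300], cases[300:]  # holdout is never used to build


def predict(score):
    """Memorizing model: copy the label of the closest construction case."""
    nearest = min(construction, key=lambda case: abs(case[0] - score))
    return nearest[1]


def accuracy(data):
    return sum(predict(score) == repaid for score, repaid in data) / len(data)


construction_accuracy = accuracy(construction)
holdout_accuracy = accuracy(holdout)
print(f"accuracy on construction data: {construction_accuracy:.2f}")
print(f"accuracy on held-out data:     {holdout_accuracy:.2f}")
```

The model is perfect on the construction data because each case is its own nearest neighbour, yet it has also memorized the flipped labels, so its accuracy on the held-out cases is lower. Only the second figure says anything about future loan applicants.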
-----
The most frequent cause of poor generalization is model overfitting: the model has been designed with too many parameters, and so reproduces accidental features of the construction data. This point is addressed in some detail in the entry about the bias-variance tradeoff.
Goodness-of-fit (test)
Given a sample, one often has to formulate a hypothesis about which distribution generated that sample. One usually has a favorite candidate distribution, and a goodness-of-fit test (or "test of fit") assesses how compatible the sample is with this candidate. In the illustration, a sample of numerical values is pitted against two candidate normal distributions. Clearly, the fit with the first one (top illustration) is rather poor, whereas it is much better with the second distribution (bottom illustration).
Parametric goodness-of-fit tests assume a specific mathematical form for the candidate distribution, leaving only the values of a few parameters to be tested against the sample. For example, a t-test assumes that the underlying distribution is normal, and tests the hypothesis that the mean of this distribution is equal to a given reference value m0.
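As a minimal sketch of the t-test mentioned above, the statistic is t = (sample mean - m0) / (s / sqrt(n)), where s is the sample standard deviation and n the sample size. The function name `one_sample_t` and the five sample values are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, m0):
    """Return the t statistic and its degrees of freedom for H0: mean == m0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - m0) / standard_error, n - 1


sample = [5.1, 4.9, 5.3, 5.0, 5.2]  # made-up measurements
t, dof = one_sample_t(sample, m0=5.0)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The statistic is then compared with the Student t distribution with n - 1 degrees of freedom. With 4 degrees of freedom, the two-sided 5% critical value is 2.776, so the value obtained here (about 1.414) gives no reason to reject the hypothesis that the mean is 5.0.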
Non-parametric goodness-of-fit tests are more general: they make no restrictive assumption about the mathematical form of the candidate distribution, as long as it can be calculated. The most popular non-parametric goodness-of-fit tests are the Kolmogorov test (also known as the Kolmogorov-Smirnov test) and the Goodness-of-fit Chi-square test.
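The statistics behind these two tests can be sketched in a few lines. The example below is an illustration, not a full test procedure: the sample is synthetic (drawn from a normal distribution with mean 10 and standard deviation 2), the two candidate distributions and the bin edges are arbitrary choices, and the function names are hypothetical. The Kolmogorov statistic is the largest vertical gap between the empirical distribution function of the sample and the candidate distribution function; the Chi-square statistic compares observed and expected counts over a set of bins.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def kolmogorov_statistic(sample, cdf):
    """Largest gap between the empirical CDF and the candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def chi_square_statistic(sample, cdf, edges):
    """Sum of (observed - expected)^2 / expected over the bins cut by `edges`."""
    n = len(sample)
    # Candidate probability of each bin, including the two open-ended tails.
    cumulative = [0.0] + [cdf(e) for e in edges] + [1.0]
    probs = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
    observed = [0] * len(probs)
    for x in sample:
        observed[bisect_right(edges, x)] += 1  # index of the bin holding x
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


rng = random.Random(0)
sample = [rng.gauss(10, 2) for _ in range(200)]  # synthetic sample

poor_candidate = NormalDist(mu=7, sigma=2)   # shifted away from the data
good_candidate = NormalDist(mu=10, sigma=2)  # the distribution actually used
edges = [7, 8.5, 10, 11.5, 13]               # arbitrary, sorted bin edges

ks_poor = kolmogorov_statistic(sample, poor_candidate.cdf)
ks_good = kolmogorov_statistic(sample, good_candidate.cdf)
chi2_poor = chi_square_statistic(sample, poor_candidate.cdf, edges)
chi2_good = chi_square_statistic(sample, good_candidate.cdf, edges)
print(f"Kolmogorov statistic: poor {ks_poor:.3f}, good {ks_good:.3f}")
print(f"Chi-square statistic: poor {chi2_poor:.1f}, good {chi2_good:.1f}")
```

For both tests, a larger statistic means a worse fit. Turning a statistic into a decision requires its reference distribution (the Kolmogorov distribution, or the Chi-square distribution with the appropriate number of degrees of freedom), which is not computed here. Neither function assumes the candidate is normal: any distribution function that can be evaluated may be passed as `cdf`.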