A model is built from the data available at the time of modelization. New data then arrive and are fed to the model for the purpose of information extraction. For example, a new candidate for a loan will be "submitted" to a Decision Tree built on historical loan data. The Tree will deliver its verdict as the probability that the candidate will repay the loan without problems.
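As a purely illustrative sketch of what "submitting" a new candidate looks like, the snippet below assumes a scikit-learn-style Decision Tree and invented applicant features (income, debt, age); neither the library nor the data come from this entry.

```python
# Sketch: submitting a new loan candidate to a trained Decision Tree
# (assumes scikit-learn; features and figures are purely hypothetical).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Historical loan data: rows = past applicants, columns = income, debt, age.
X_hist = np.array([[30_000, 5_000, 25], [80_000, 2_000, 40],
                   [45_000, 20_000, 33], [60_000, 1_000, 50]], dtype=float)
y_hist = np.array([0, 1, 0, 1])  # 1 = loan repaid without problem

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_hist, y_hist)

# A new candidate arrives after the model was built.
new_candidate = np.array([[55_000, 3_000, 37]], dtype=float)
p_repay = tree.predict_proba(new_candidate)[0, 1]  # probability of the "repaid" class
print("estimated probability of repayment:", p_repay)
```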
Now assume that the Tree performs satisfactorily on the data used to build it. Should we believe that its behavior will be equally satisfactory on new loan candidates? In other words, should we believe that the Tree will generalize well?
This is one of the most important questions pertaining to modelization, and the answer is, unfortunately: there is generally no reason to blindly believe that a model that performs well on the data used to build it will be equally satisfactory when it is most needed, that is, when it is fed with new data.
Estimating the performance of a model on new data is called validating the model. Validating a model is important, difficult, and all too often neglected by occasional modelers.
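The simplest form of validation is to keep part of the available data aside and treat it as "new" data. The sketch below illustrates this holdout approach; it assumes a scikit-learn-style workflow and synthetic data, and is only one of several possible validation schemes (cross-validation being another common one).

```python
# Minimal holdout-validation sketch (assumes scikit-learn is available;
# the synthetic "loan" data below is purely illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # 5 hypothetical applicant features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # 1 = repays

# Keep 30% of the data aside: it plays the role of "new" candidates.
X_build, X_new, y_build, y_new = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_build, y_build)

print("accuracy on construction data:", accuracy_score(y_build, tree.predict(X_build)))
print("estimated accuracy on new data:", accuracy_score(y_new, tree.predict(X_new)))
```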
It should be clearly understood that a model with poor generalization capability does not do its job, which is to give an accurate picture of the real world. This is why bypassing the validation phase of a model is so risky. Newcomers to the field of modelization may not be aware of this risk, and may honestly believe that their model is "good" just because it performs well on the "construction data".
The most frequent cause of poor generalization is model overfitting: the model has been designed with too many parameters. This point is addressed in some detail in the entry about the bias-variance tradeoff.
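A quick way to see the effect, sketched below under the same assumptions as the previous example (scikit-learn, synthetic data), is to compare a fully grown tree with a depth-limited one: the unconstrained tree can fit the construction data almost perfectly, yet it typically does worse on the held-out data.

```python
# Overfitting sketch: an unconstrained tree vs. a depth-limited tree
# (assumes scikit-learn; the synthetic data is illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)  # noisy target

X_build, X_new, y_build, y_new = train_test_split(X, y, test_size=0.5, random_state=1)

for name, tree in [("fully grown tree  ", DecisionTreeClassifier(random_state=1)),
                   ("depth-limited tree", DecisionTreeClassifier(max_depth=3, random_state=1))]:
    tree.fit(X_build, y_build)
    print(name,
          "| construction accuracy:", round(tree.score(X_build, y_build), 3),
          "| new-data accuracy:", round(tree.score(X_new, y_new), 3))
```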