Data
As the name implies, data is the true fuel of Data Modeling. A model will never be better than the data used to build it. GIGO (Garbage In, Garbage Out) is just as true in Data Modeling as it is in Computer Programming.
You will learn all too quickly to identify the virtues that you expected from your data, and that it probably lacks:
1) Completeness (no missing or erroneous values).
2) Homogeneous format across the various databases that contain your data.
3) Synchronism. Historical databases contain data that has been collected at different times, and may therefore exhibit bias.
4) Pertinence. Data used for modeling should ideally contain just the right kind of information needed to solve the problem at hand. But more often than not, you will have to use whatever data is available, whose pertinence is not guaranteed.
5) Volume. Scarce data does not contain enough information to build a good model. But too much data will overburden your computer, and you will have to sample it, which is a somewhat delicate issue.
6) Bias. Data often comes from various sources, but should have been collected in conditions as similar to one another as possible, a condition that is difficult to meet and to check.
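As an illustration of the first and fifth points, here is a minimal sketch of a data audit in Python, using only the standard library. Everything in it is hypothetical and chosen for the example: the records, the field names, the plausible range for `age`, and the helper names `missing_rate`, `out_of_range` and `random_sample`. It measures the fraction of missing values per field, flags values outside a plausible range, and draws a simple random sample. It is not a complete audit procedure, and simple random sampling is only one of several ways to reduce volume.

```python
import random

# Hypothetical records for illustration; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Lyon"},
    {"age": None, "income": 48000, "city": "Paris"},
    {"age": 29, "income": None, "city": "Paris"},
    {"age": 41, "income": 61000, "city": None},
    {"age": 230, "income": 39000, "city": "Nice"},  # implausible age
]


def missing_rate(rows, field):
    """Fraction of rows in which `field` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def out_of_range(rows, field, low, high):
    """Rows whose `field` is present but falls outside [low, high]."""
    return [
        row for row in rows
        if row.get(field) is not None and not (low <= row[field] <= high)
    ]


def random_sample(rows, k, seed=0):
    """Simple random sample without replacement, reproducible via `seed`."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


if __name__ == "__main__":
    for field in ("age", "income", "city"):
        print(field, "missing rate:", missing_rate(records, field))
    # The range 0..120 is an assumption made for this example.
    print("suspect ages:", out_of_range(records, "age", 0, 120))
    print("sample:", random_sample(records, 3))
```

A simple random sample keeps the code short, but it can under-represent rare categories; this is part of what makes sampling delicate.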
Poor data quality is a major cause of failure in Data Modeling, and most occasional practitioners take a casual attitude toward data quality, with detrimental consequences for their models. It is generally considered that collecting, auditing and conditioning data represents (or should represent) more than half the time devoted to a Data Modeling project.