Data

As the name implies, data is the true fuel of Data Modeling. A model will never be better than the data used to build it. GIGO (Garbage In, Garbage Out) is just as true in Data Modeling as it is in Computer Programming.

 

You will learn all too quickly to recognize the virtues that you expected from your data, and that it probably lacks:

    1) Completeness (no missing or erroneous values).
 

    2) Homogeneous format across the various databases that contain your data.
 

    3) Synchronism. Historical databases contain data that has been collected at different times, and may therefore exhibit bias.
 

    4) Pertinence. Data used for modeling should ideally contain just the right kind of information needed to solve the problem at hand. But more often than not, you will have to use whatever data is available, whose pertinence is not guaranteed.
 

    5) Volume. Scarce data does not contain enough information to build a good model. But too much data will overburden your computer, and you will have to sample it, a somewhat delicate issue.
 

    6) Bias: data often comes from various sources, but should have been collected under conditions as similar to one another as possible, a requirement that is difficult to meet and to check.
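
As an illustration of the first and fifth points above, here is a minimal Python sketch of two routine audit steps: measuring the share of missing values per field, and drawing a reproducible random sample when the data is too large to process in full. The records, the field names and the helper functions missing_rates and sample_records are hypothetical, invented for this example; a real audit would also need to check formats, collection dates and sources.

```python
import random


def missing_rates(records, fields):
    """Return, for each field, the fraction of records where it is absent, None or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def sample_records(records, k, seed=0):
    """Draw a simple random sample of at most k records; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Hypothetical records merged from two sources, A and B
records = [
    {"age": 34, "income": 52000, "source": "A"},
    {"age": None, "income": 48000, "source": "A"},
    {"age": 51, "income": "", "source": "B"},
    {"age": 29, "income": 61000, "source": "B"},
]

rates = missing_rates(records, ["age", "income"])
subset = sample_records(records, 2)
print(rates)
print(subset)
```

Simple random sampling is only the most basic option. If some categories are rare, a plain random draw may under-represent them, which is one reason why sampling is a delicate issue.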

 

Poor data quality is a major cause of failure in Data Modeling, and most occasional practitioners take a casual attitude toward data quality, with detrimental consequences for their models. It is generally considered that collecting, auditing and conditioning data represents (or should represent) more than half the time devoted to a Data Modeling project.

 
