Modeling

This term may be considered the entry point of the Glossary. Here we briefly describe:

* What Data Modeling is,
* The role that Statistics often plays in Data Modeling,
* And the main difficulties of Data Modeling.

_____________________________________

DATA MODELING

Data Modeling is the art of extracting useful information from data obtained by measuring properties of "objects", and casting this information into an operational model. All these terms deserve some explanation.

Measurements

This term comes from the technical world, but should be taken in a more general sense. In its straightforward meaning, "to measure" is to code a quantity by a number, as in: "To measure the length of a stick". But the outcome of a measurement may also be a rank ("This voter is very sure who his favorite candidate is") or membership in a group ("This animal is a mammal").

Data

* Usually, several characteristics of the same "object" are measured, so as to gather as much information as possible about the object. Traditionally, measurements pertaining to the same object are displayed on a single line.

* The measured properties are often named "variables", or "attributes", and the objects are usually called "cases" or "individuals".

* Ideally, measurements are made on all individuals of the same nature (the population). If this turns out to be impractical (as is very often the case), as many individuals are "scanned" as time, budget or computing power will allow, to make up a "sample" (see below).
* Once completed, measurements are displayed as a rectangular table whose rows are the cases, and whose columns are the variables.

 Case   Gender   Height   Weight   Age   Pressure
 1      F        1.68     49       48    14
 2      M        1.79     72       23    13
 3      M        1.67     69       65    19
 4      F        1.53     95       61    22
 5      M        1.82     85       35    15

In this (very, very small) data table, 5 cases are described by 5 variables (Gender, Height, Weight, Age and Blood Pressure).
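Such a table might be represented, for example, as a plain Python structure, with one dict per case and one key per variable (an illustrative sketch of the layout described above):

```python
# The toy data table: each dict is a case (row), each key a variable (column).
cases = [
    {"Case": 1, "Gender": "F", "Height": 1.68, "Weight": 49, "Age": 48, "Pressure": 14},
    {"Case": 2, "Gender": "M", "Height": 1.79, "Weight": 72, "Age": 23, "Pressure": 13},
    {"Case": 3, "Gender": "M", "Height": 1.67, "Weight": 69, "Age": 65, "Pressure": 19},
    {"Case": 4, "Gender": "F", "Height": 1.53, "Weight": 95, "Age": 61, "Pressure": 22},
    {"Case": 5, "Gender": "M", "Height": 1.82, "Weight": 85, "Age": 35, "Pressure": 15},
]

# A column (variable) is extracted from the rows (cases) when needed.
heights = [c["Height"] for c in cases]
print(len(cases), "cases,", len(cases[0]) - 1, "variables")
```

Any real data set would of course be far larger, but the row/column structure is the same.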

Information

All this data is being gathered for the purpose of answering some questions about the sample (or the population) as a whole, or about some specific individuals. There are countless such questions, but they tend to belong to a few very general categories.

At this point, we will only make the distinction between:

1) Questions whose answers describe some global features of the population. Data tables are usually very large, and although the content of any cell is easily understandable, the human brain is not tailored to extract global trends from such a huge amount of data.

These questions range from:

* Very simple ("What is the average age of the individuals in the population?"),

* to moderately difficult ("Can the population be described with fewer variables than in the original table, while losing as little information as possible in the process?"),

* to very difficult ("Can the population be partitioned into several homogeneous and quite distinct groups?").

These questions pertain to Descriptive Modeling (also called "Exploratory Analysis").
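The "very simple" kind of question reduces to a one-pass computation over a column; for instance, the average age of the five cases in the toy table above:

```python
from statistics import mean

ages = [48, 23, 65, 61, 35]   # the Age column of the toy table
average_age = mean(ages)
print(average_age)  # → 46.4
```

The moderately and very difficult questions (Dimensionality Reduction, clustering) require full-fledged techniques and are treated in their own entries.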

2) Questions that relate to potential links between the variables in the table.

* Women being, on the average, shorter and lighter than men, one could expect a relation between Gender, Height and Weight.

* One could also possibly ask if the data suggests a relation between Age, Weight and Blood pressure.

If such a relationship is indeed discovered, it will translate into an equation (or a set of logical rules) linking these quantities. For example, analysis of the data could possibly discover a relation like Blood Pressure = f(Gender, Age, Weight).
Of course, such a relation is not supposed to be exact, and plugging numbers from the table into the equation will never give a perfect match. But a reasonably close agreement would already be a quite satisfactory achievement.

Once discovered, a link provides two kinds of information:

* First, it hints at a possible causal relationship between variables, a major discovery if confirmed. Knowing the relation between causes and effects, it then becomes possible to modify the causes to produce a favorable change in the effects.

* Then, the equation representing the link has an operational usefulness. Suppose that a new individual, extracted from the same population, is of known Gender, Age and Weight, but of unknown Blood Pressure. Plugging these numbers into the equation will then provide a reasonably good estimate of his Blood Pressure. Like a soothsayer, the equation predicts what the value of the individual's Blood Pressure would turn out to be if measured.

For this last reason, this kind of Data Modeling is called Predictive Modeling.
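The relation suggested above can be sketched, for instance, by an ordinary least-squares fit of Pressure on Age alone, using the toy table's numbers (a minimal illustration of the idea; no serious conclusion can be drawn from five cases):

```python
# Least-squares fit of the toy relation: Pressure ≈ slope*Age + intercept.
ages     = [48, 23, 65, 61, 35]   # Age column of the toy table
pressure = [14, 13, 19, 22, 15]   # Blood Pressure column

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(pressure) / n

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, pressure))
var_x  = sum((x - mean_x) ** 2 for x in ages)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

def predict(age):
    """Predicted Blood Pressure for a new individual of known Age."""
    return slope * age + intercept

print(round(slope, 3), round(intercept, 3), round(predict(50), 1))
```

As the text says, plugging the table's numbers back into the fitted equation never gives a perfect match; the fit only captures the general trend.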

The distinction between Predictive and Descriptive Modeling is, to a certain extent, artificial. In fact, it is mostly based on the goal that the analyst has in mind. For the theoretician, both kinds of Modeling rely on detecting regularities in the data. These regularities mean that the data is not completely random, but somewhat structured. They translate into redundancies in the data, which Modeling will eliminate.

Model

Once the table has been analyzed, it becomes useless. Answers to questions about the population (or specific individuals) are provided by a set of equations or logical rules that embody the various regularities discovered in the data (see above). This set of equations or rules is the Model, which is therefore a compact and operational mathematical representation of the data.

Of course, a data set can be described by many different models, depending on the problem that the analyst is trying to solve. A model may be designed to capture only certain characteristics of the data, and ignore others.

Why do we have to build models ? The reason is that, more often than not, we do not know the reasons that make the population under study what it is. If these reasons were known, they could (hopefully) be described by equations elaborated from first principles, and no data modeling would then be needed.
Consider for example drug design. If it were possible to describe in their finest details the effects of a molecule on human biochemistry, and if the biochemical description of deseases were complete, then drug design would reduce to a (tremendous) reverse chemical engineering problem, and no experimentation would be necessary. But we are nowhere near such a state of understanding, and drug design is indeed a heavy user of data modeling.

Useful

Predictive Modeling provides a good example of the usefulness of a model (see above): it becomes possible to predict the value of an unmeasured quantity when a new individual turns up.

But all models, whether Predictive or Descriptive, are also useful in another way. The equations in the model contain parameters (usually numerical) whose values were determined during the course of the modeling process. For example:

Pressure = f(Gender, Age, Weight)

might, once the parameters have been estimated, become:

Pressure = 0.03*Gender + 0.18*Age + 0.12*Weight

in a given system of units.
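Such a fitted equation can be used directly; a minimal sketch, assuming Gender is coded numerically (0 for F, 1 for M, an assumption made here purely for illustration):

```python
# The fitted toy equation. Gender coding (0 = F, 1 = M) is an assumption
# made for this illustration; any consistent coding would do.
def pressure(gender, age, weight):
    return 0.03 * gender + 0.18 * age + 0.12 * weight

# A new individual of known Gender, Age and Weight but unknown Pressure:
estimate = pressure(gender=1, age=50, weight=70)
print(round(estimate, 2))  # 0.03 + 9.0 + 8.4 ≈ 17.43
```

Note how the coefficient 0.12 carries the interpretation discussed below: at fixed Gender and Age, each additional kilogram adds 0.12 to the predicted Pressure.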

Why do the parameters have these particular values? What information about the population is contained in these numbers? Providing an answer to this question is interpreting the model. Interpretation of a model is a crucial phase and requires close cooperation between the analyst and the specialist of the population.

For example, the analyst will tell the specialist that, within a subpopulation of fixed Gender and Age, every additional kilogram translates into a 0.12 increase in Blood Pressure. The specialist will then ponder this information within a medical context. His conclusions might suggest a new series of experiments. Which quantities will then have to be measured and which will not (because they provide little additional information), and how many cases will have to be studied, will be determined in close cooperation with the analyst.

Not all types of models can be readily interpreted. While Linear Regression or Decision Trees lend themselves to relatively straightforward interpretations, it is not so with Neural Networks, which operate as "black boxes". Unfortunately, there usually exists a tradeoff between the accuracy of a model and how thoroughly it can be interpreted. This fact is to be taken into account when choosing the type of model that will be used for studying the population.

________________________

DATA MODELING AND STATISTICS

In all of the above, an important question has been set aside:

* Are the individuals described in the table the only ones of interest? In other words, does the table contain the whole population under study?

* Or, for practical reasons, does the table contain only a fraction of the population (a "sample")?

This question is of utmost importance.

Population and sample

A sample is usually drawn at random from the population.

* If the draw is lucky, the distribution of the individuals in the sample is a reasonably faithful copy of the distribution of the individuals in the population as a whole. Any conclusion made about the sample will then apply to the population as well.

* But the draw may be unlucky, and the sample's distribution may be substantially different from that of the population. The sample is then said to be "biased". Conclusions drawn from analyzing the sample will then simply be wrong when applied to the complete population.

By the very nature of randomness, there is no way of knowing whether a sample is a fair and honest representation of the population. Uncertainty about the sample's representativeness translates into uncertainty about whether the model has captured the interesting properties of the population. Therefore, it is never possible to have absolute faith in a model. But it is often possible, with some additional hypotheses, to evaluate the credibility of the model (built from the sample) as a token of the properties of the population at large. This is what Statistics is all about.

Statistics has two main branches: Estimation and Tests.

Estimation

What is the average height of the individuals in the population represented by the sample? A layman's answer is that the average height of the individuals in the sample is our best guess. The theory of Estimation goes one step further, and states that not only is it a good guess, it is the best possible guess in a very precise sense (Point Estimation). No other quantity calculated on the sample can be trusted more than the sample average as an estimate of the population's average height. With some additional hypotheses, it is even possible to quantify the level of trust that this estimate can be granted (Confidence Interval).
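As a sketch of Point Estimation and Confidence Intervals, here is the sample mean of the toy table's Height column with a rough interval around it (a normal approximation with a hardcoded 1.96 multiplier; with only 5 cases a Student t multiplier would really be required, so this is only an illustration of the idea):

```python
from math import sqrt
from statistics import mean, stdev

heights = [1.68, 1.79, 1.67, 1.53, 1.82]  # the sample's Height column

point_estimate = mean(heights)             # best guess of the population mean
se = stdev(heights) / sqrt(len(heights))   # standard error of the mean

# Rough 95% confidence interval (normal approximation, illustrative only).
low, high = point_estimate - 1.96 * se, point_estimate + 1.96 * se
print(round(point_estimate, 3), (round(low, 3), round(high, 3)))
```

The width of the interval quantifies the trust granted to the estimate: more cases would shrink it.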

Any quantity pertaining to the population can be connected to similar quantities calculated from the sample. What Estimation theory does is:

* Identify the quantities that can be calculated from the sample that are most representative of their counterparts in the population. Results of these calculations are called estimates of the corresponding (and inaccessible) quantities in the population.

* Quantify the level of trust that can be placed in these estimates.

Now consider the parameters of a model. They are quantities calculated from the sample, and are therefore estimates of the parameters of the ideal model describing the population. Estimation theory provides means of calculating these estimates, that is, of assigning numerical values to the parameters of the model.

The same can be said for the predictions of a model : they are estimates of the true (and unknown) values of the quantity under consideration, and it is necessary to be able to assess the credibility of these predictions.

So, model construction is in fact Estimation Theory put to work.

Tests

Data Modeling often brings its contribution to a decision making process. Data is collected and analyzed for the purpose of shedding some light on a debated hypothesis about the population. Depending on whether the data corroborates or invalidates the hypothesis, the hypothesis will either be used as a starting point for further decisions, or discarded as incompatible with the data.

One of the simplest examples of such a situation comes from quality control of mass-produced items. Suppose a manufacturer of steel balls claims that the diameter of the balls is, on average, 10.0mm.
For the customer, measuring the diameter of each and every delivered ball is out of the question. So, in each delivered batch, he draws a small number of balls, accurately measures their diameters, and gets an average of, say, 9.9mm. Now:

* Should he believe the claim of the manufacturer, and accept the delivery, blaming the discrepancy on bad luck in drawing the control sample?

* Or should he consider that the 0.1mm discrepancy is too large to be accounted for by mere bad luck, decide that the data is incompatible with the claim, and reject the delivered batch?

Of course, there is no way to ever be certain about the answer. Yet, with the help of some additional assumptions, Tests Theory will say what the chances are of being wrong:

* by rejecting the delivery (when the manufacturer's claim is in fact correct),

* or by accepting the batch (when the manufacturer's claim is erroneous).

This question is solved by one of the very classical "t tests".
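The steel-ball decision can be sketched with a hand-computed one-sample t statistic (the ten diameters are invented for the demonstration, and the critical value is hardcoded from a t table rather than computed):

```python
from math import sqrt
from statistics import mean, stdev

# Diameters of 10 balls drawn from a delivered batch (illustrative numbers).
diameters = [9.92, 9.88, 9.95, 9.85, 9.90, 9.93, 9.87, 9.91, 9.89, 9.90]
claimed_mean = 10.0

# t statistic: how many standard errors the sample mean lies from the claim.
n = len(diameters)
t = (mean(diameters) - claimed_mean) / (stdev(diameters) / sqrt(n))

# Two-sided 5% critical value of Student's t with n-1 = 9 degrees of freedom
# (taken from a table; statistical software would provide it directly).
t_critical = 2.262
reject = abs(t) > t_critical
print(round(t, 2), reject)
```

Here the 0.1mm discrepancy is many standard errors wide, so bad luck in the draw is a very implausible explanation and the batch would be rejected.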

A model can be regarded as a hypothesis about the population. This hypothesis is that the data is not completely random, that there is an underlying structure, and that this structure is approximately accounted for by the model.

In the above example involving Gender, Age, Weight and Blood Pressure, the hypothesis is embodied in the equation linking these quantities.

With the help of some additional assumptions, it is often possible to submit the model to a test that will tell how likely it is that the structures discovered by the model are just artefacts caused by the random distribution of the sample (see for example the test on the overall validity of a Multiple Linear Regression model).

-----------

On several occasions, we referred to "additional assumptions" needed to draw conclusions about the credibility of the value of a parameter, or of the model itself. These assumptions usually concern the nature of the distribution of the population, or some parameters of this distribution.
Take for instance the test relative to the diameter of the steel balls. In order to run the "t test", it is necessary to assume that the population of balls is distributed according to the famous "normal" (or "gaussian") distribution.

Estimation and Tests very often need to make assumptions about the distribution of some quantities calculated from the sample. Sampling Theory, with the proper assumptions about the population, provides the distributions of these quantities.

________________________________________________

THE DIFFICULTIES OF DATA MODELING

Although the principles of Data Modeling are rather straightforward, its practice is fraught with difficulties. The most important ones are:

A clear definition of the problem

Data collecting, cleaning and conditioning, model building and validating, and model interpretation are long and costly endeavors. Time spent defining a goal, stating the criteria that will tell whether the goal has been attained, and identifying the nature and volume of the data needed to build the appropriate model is always time well spent, never wasted.

Data quality

More often than not, data is scarce, expensive, poorly conditioned, partially missing, unsynchronized, fraught with errors and mistakes, and biased. Yet, data is the fuel of Data Modeling. Enough time and effort should be dedicated to data "cleaning", lest further attempts to build a good model be ineffective.

Choosing the appropriate technique

The considerable development of Data Modeling provides the analyst with a large number of candidate techniques for reaching a given objective. Each technique has its pros and cons, and choosing the technique that is most appropriate for the job is one of the keys to building a successful model.
Unfortunately, except for a few general guidelines, choosing the "best" technique is very much a matter of experience. Yet the most popular criterion for choosing a technique is still "We are used to it here", which leaves very much to be desired.

Variable selection

For fundamental reasons lying deep in the basic principles of Statistical Data Modeling (but often ignored by practitioners), it is necessary to carefully select those variables that will be used to build the model. It might even be useful to first go through a preliminary Dimensionality Reduction step before building any model at all. If too many variables are used, the model becomes exceedingly sensitive to small variations in the sample, and the final model is therefore not credible (see "bias-variance tradeoff").

To make things worse, different goals, or even different techniques targeting the same goal, will require different optimal sets of variables.

Model selection

Any model contains parameters (even the so-called "non-parametric" models), whose values will be:

* Either tuned by the "learning" process so as to minimize some quantity calculated on the data,

* Or else defined arbitrarily (think of the "Number of bins" in histograms).

In both cases, choosing the number of parameters in the model is pretty much left to the practitioner. Even in Multiple Linear Regression, the number of parameters is dictated by the number of predictors retained in the model.

One quickly discovers that adding parameters to a model improves its flexibility and its ability to account for the design data. One also discovers that past a certain point, this improved performance on the design data translates into a degradation of the model performance on future incoming data, the only meaningful performance. This situation is akin to the variable selection issue previously mentioned and pertains to the same general problem ("bias-variance tradeoff").

Once the model is built, its performance may be assessed on the design data, but this "performance" is valueless, as it gives no clue as to how well (or poorly) the model will perform on new, incoming data.

So great care must be taken to estimate this true performance. This is done by submitting several candidate models, with different numbers of parameters, to validation procedures. The model that is ultimately retained is the one that best passes the validation procedures. It will probably not be the model that performs best on the design data set.
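The overfitting pattern just described can be made concrete with a deliberately extreme toy comparison (all numbers invented for the demonstration): a model that simply memorizes the design data (1-nearest-neighbor) against a plain least-squares line, each scored on the design set and on held-out validation data.

```python
# Design ("training") data: noisy observations of roughly y = 2x.
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5)]
# Held-out validation data drawn from the same underlying relation.
valid = [(0.4, 0.8), (1.6, 3.2), (2.4, 4.8), (3.6, 7.2)]

def mse(model, data):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Model 1: memorize the design data (predict the y of the nearest design x).
def nearest_neighbor(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Model 2: a 2-parameter least-squares line fitted on the design data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx

def line(x):
    return slope * x + intercept

print("design error:    ", mse(nearest_neighbor, train), "vs", round(mse(line, train), 3))
print("validation error:", round(mse(nearest_neighbor, valid), 3), "vs", round(mse(line, valid), 3))
```

The memorizing model wins on the design data (zero error by construction) but loses badly on the validation data, which is the only performance that matters.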

Model selection is time consuming, and all too often bypassed by occasional practitioners. This omission is one of the main causes of "catastrophic model failure".

Here, "Variable Selection" and "Model Selection" have been presented separately, because practice often commands that they be done separately. Specific techniques have been developed for each problem, so both cannot be addressed at the same time. Yet, they are the two sides of the same "bias-variance tradeoff" coin, and cannot (theoretically) be treated separately.

_______________________________________________

With all its algorithms, equations and software, Data Modeling is still very much an art. Using the techniques with no understanding of their limitations unavoidably leads to disaster, and nothing can replace the analyst's know-how, acquired over the years and many failures.

As Data Modeling becomes more mainstream, but also more complex, it becomes necessary for more than just the professional analyst to have a decent understanding of the basic principles behind this all-pervasive approach to extracting information from the real world. Data Modeling has become a team endeavor, with constant interactions between analysts and the specialists of the population under study. We hope that the latter will find some useful clues in this website, and that it will help them in their daily, but not always easy, interactions with statisticians.