Robustness
The data (the sample) contain all the information from which conclusions will be drawn about the process that generated them. A major concern of Data Modeling is therefore to assess the credibility of these conclusions. Several factors may cast doubt on their validity and thereby undermine the robustness of the analysis. Let's go over the most important ones in more detail.
Data is rarely irreproachable (see here). The question is then whether the errors in the data prevent any sensible analysis from being carried out. Of course, there is no single answer to this question, and it's all a matter of degree. This point is well illustrated by the concept of outlier in Linear Regression (or, more generally, in any model built by the Least Squares method). A single outlier may throw the model predictions considerably off target, even in regions far from the faulty data point.
There are often ways to reduce the sensitivity of a model to outliers. For example, the Least Squares method is very sensitive to outliers, but the minimization of the sum of the squares of the errors may be replaced by the minimization of the sum of the absolute values of the errors made by the linear model. Many important theoretical properties are lost by doing so, but the model predictions are then less sensitive to outliers than those of the "standard" Least Squares model.
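The contrast between the two criteria can be sketched on a toy data set. This is only an illustration under stated assumptions: the data (a noise-free line y = 2x + 1 with one corrupted point) and the function names `least_squares_line` and `least_absolute_line` are hypothetical, and the absolute-value fit uses a brute-force search that is practical only for very small samples. It relies on the fact that, for a straight line with an intercept, at least one minimizer of the sum of absolute errors passes through two of the data points.

```python
from itertools import combinations

def least_squares_line(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def least_absolute_line(xs, ys):
    """Line minimizing the sum of absolute errors; returns (slope, intercept).

    Brute force: try every line through two data points and keep the one
    with the smallest sum of absolute errors.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (slope * x + intercept)) for x, y in points)
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    return best[1], best[2]

# Hypothetical data: an exact line y = 2x + 1 with one faulty last point.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[-1] = 100.0  # the outlier (the uncorrupted value would be 19)

ls_slope, ls_intercept = least_squares_line(xs, ys)
lad_slope, lad_intercept = least_absolute_line(xs, ys)

print("Least Squares:          slope=%.3f intercept=%.3f" % (ls_slope, ls_intercept))
print("Least absolute errors:  slope=%.3f intercept=%.3f" % (lad_slope, lad_intercept))
```

On this toy data the absolute-value fit recovers the underlying line, while the Least Squares slope is roughly 6.4 and its intercept roughly -10.8, so its prediction is badly off even at x = 0, far from the faulty point.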
The sample is random by nature. So any model built from the data depends on the particular sample available at that time. A type of model whose conclusions depend heavily on the particular sample at hand will certainly not be considered robust.
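A small simulation can make this dependence on the sample visible. The sketch below is only illustrative: the generating process (y = 2x + 1 plus Gaussian noise), the sample size, the number of repetitions and the name `ols_slope` are all assumptions chosen for the example. It draws many samples from the same process, fits a Least Squares line to each, and looks at how much the estimated slope moves from sample to sample.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope of the Least Squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

rng = random.Random(0)      # fixed seed so the run is repeatable
xs = list(range(20))        # same design points in every sample

slopes = []
for _ in range(200):
    # One new sample from the same (hypothetical) process.
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_slope = statistics.mean(slopes)
slope_spread = statistics.stdev(slopes)

print("mean of the fitted slopes:   %.3f" % mean_slope)
print("spread of the fitted slopes: %.3f" % slope_spread)
```

The smaller the spread relative to the quantity of interest, the less the conclusions depend on the particular sample at hand.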
A model with too many parameters accounts for the design data very well, but performs poorly on new data. This overparametrization also makes the model unstable (both the values of the parameters and the predictions) in the face of small changes in the design data.
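An extreme case of overparametrization can be sketched as follows. The example is hypothetical throughout: eight design points on the line y = 2x + 1, perturbed by fixed, hand-picked errors, are fitted once with a straight line (2 parameters) and once with the polynomial that passes through every point (8 parameters, computed here by Lagrange interpolation). Both models are then compared at new x values, against the error-free values of the assumed process.

```python
def least_squares_line(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def interpolating_polynomial(xs, ys, x):
    """Value at x of the polynomial through all (xs, ys) (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical design data: y = 2x + 1 plus small fixed errors.
errors = [0.5, -0.8, 0.9, -0.6, 0.7, -0.9, 0.4, -0.5]
design_x = [float(x) for x in range(8)]
design_y = [2.0 * x + 1.0 + e for x, e in zip(design_x, errors)]

# New data: points between the design points, error-free for simplicity.
new_x = [x + 0.5 for x in range(7)]
new_y = [2.0 * x + 1.0 for x in new_x]

slope, intercept = least_squares_line(design_x, design_y)

line_design_mse = mean_squared_error(
    [slope * x + intercept for x in design_x], design_y)
line_new_mse = mean_squared_error(
    [slope * x + intercept for x in new_x], new_y)
poly_design_mse = mean_squared_error(
    [interpolating_polynomial(design_x, design_y, x) for x in design_x], design_y)
poly_new_mse = mean_squared_error(
    [interpolating_polynomial(design_x, design_y, x) for x in new_x], new_y)

print("straight line: design MSE=%.4f  new-data MSE=%.4f" % (line_design_mse, line_new_mse))
print("8-parameter:   design MSE=%.4f  new-data MSE=%.4f" % (poly_design_mse, poly_new_mse))
```

The 8-parameter model reproduces the design data exactly, yet it is the straight line that predicts the new points better, because the larger model has also fitted the errors.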
This situation is similar to the above-mentioned sensitivity to collinearity, but here the lack of robustness is caused by the analyst's poor appreciation of the situation, even though the type of model used may be intrinsically quite robust.
Inferences about the mechanism that generated the data often rely on a priori assumptions formulated by the analyst about some characteristics of this mechanism. In particular, many parametric tests assume that the data originated from a normal distribution. If this assumption is unjustified, the p-values of the tests may be far off target.
It is sometimes possible to use a nonparametric test as a substitute for a given parametric test (see for instance here). Of course, these nonparametric tests are generally less powerful than their parametric counterparts when the parametric assumptions hold, but they are also more robust, because they make much weaker assumptions about the population distribution.
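As one possible illustration, the sketch below compares the one-sample t statistic with the sign test, a simple nonparametric test of whether the median is zero. The sign test is chosen here only because it is easy to write down; it is not necessarily the substitute the text has in mind. The data and the function names `sign_test_p_value` and `t_statistic` are hypothetical. The value 2.262 is the two-sided 5% critical value of Student's t with 9 degrees of freedom.

```python
import math
import statistics

def sign_test_p_value(values):
    """Exact two-sided sign test of H0: the median is 0.

    Only the signs of the values are used, never their magnitudes.
    """
    nonzero = [v for v in values if v != 0]
    n = len(nonzero)
    positives = sum(1 for v in nonzero if v > 0)
    tail = min(positives, n - positives)
    # Probability of a split at least this lopsided under Binomial(n, 1/2).
    one_sided = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)

def t_statistic(values):
    """One-sample t statistic for H0: the mean is 0."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values) / standard_error

# Hypothetical sample: ten clearly positive measurements.
clean = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, 0.6]
# Same sample, but the last value was recorded as 60 instead of 0.6.
with_outlier = clean[:-1] + [60.0]

t_clean = t_statistic(clean)
t_outlier = t_statistic(with_outlier)
p_sign_clean = sign_test_p_value(clean)
p_sign_outlier = sign_test_p_value(with_outlier)

T_CRITICAL = 2.262  # two-sided 5% level, 9 degrees of freedom

print("t statistic: clean=%.2f  with outlier=%.2f  (critical %.3f)"
      % (t_clean, t_outlier, T_CRITICAL))
print("sign test p: clean=%.5f  with outlier=%.5f"
      % (p_sign_clean, p_sign_outlier))
```

A single gross error, which makes the sample strongly non-normal, pushes the t statistic below its critical value, whereas the sign test reaches the same conclusion on both samples. The price is that the sign test ignores the magnitudes, which is why it is less powerful when the data really are normal.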