Ordinal (variable)

A variable is said to be ordinal if:

1) It can take only a small number of values,
2) These values can be naturally ranked by increasing or decreasing values.

Typical examples of ordinal variables are:

• "Size", with values "Small", "Medium", "Large", "Extra large".
• "Temperature", with values "Cold", "Lukewarm", "Hot".
• "Satisfaction" : "Very satisfied", "Rather satisfied", "Moderately dissatisfied", "Very dissatisfied".

Ordinal variables often come as the result of discretization of a numerical variable.
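
As a sketch of that discretization step (in Python, with arbitrary, made-up cutoff values of 15 °C and 30 °C), a numerical temperature can be mapped to the ordered categories of the "Temperature" example above:

```python
# Minimal sketch: discretizing a numerical variable (temperature in °C)
# into an ordinal variable with three ranked categories.
# The cutoffs 15 and 30 are arbitrary, for illustration only.

def to_ordinal(temp_celsius):
    """Map a numerical temperature to an ordered category."""
    if temp_celsius < 15:
        return "Cold"
    elif temp_celsius < 30:
        return "Lukewarm"
    else:
        return "Hot"

readings = [4.2, 18.0, 35.5, 22.1]
categories = [to_ordinal(t) for t in readings]
print(categories)  # ['Cold', 'Lukewarm', 'Hot', 'Lukewarm']
```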

_______________________

Non parametric tests (see for instance Wilcoxon-Mann-Whitney, Kruskal-Wallis, Friedman) often merge numerical observations from various origins into a single sample, and then convert the numerical values into ranks. Ranks are, by their very nature, ordinal variables, although of a special kind, as each value of the variable is assigned to exactly one observation.
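
The merge-then-rank step can be sketched as follows (made-up samples; ties are ignored for simplicity, whereas real implementations use mid-ranks):

```python
# Sketch of the ranking step used by rank-based tests such as
# Wilcoxon-Mann-Whitney: merge two samples, then replace each value
# by its rank (1 = smallest) in the pooled sample.

sample_a = [3.1, 5.4, 2.2]
sample_b = [4.0, 6.7]

pooled = sample_a + sample_b
order = sorted(pooled)
ranks = {value: i + 1 for i, value in enumerate(order)}

ranks_a = [ranks[v] for v in sample_a]
ranks_b = [ranks[v] for v in sample_b]
print(ranks_a, ranks_b)  # [2, 4, 1] [3, 5]
```

Note that each rank from 1 to n is used exactly once, which is the "special kind" of ordinal variable mentioned above.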

_______________________

An "outlier" (or "extreme point") is an observation that is very different from the "average" observation in your data set.

An outlier may have two possible origins:

* It can be an ordinary observation, of which at least one attribute has been severely corrupted by a mistake, such as a copying or transmission error (think of a misplaced decimal point).

* It can also be a bona fide observation that simply turns out to be very unusual. Think for instance of the file of attributes of the "motor vehicles" sitting in a parking lot at any one time: nothing but ordinary cars, except for one eighteen-wheeler whose characteristics are very different from those of any ordinary car.

Outliers are a severe problem in just about every area of data modeling, because a single outlier may distort a model to the point where the conclusions drawn from it are simply wrong. For example, consider a bank's file of account balances. The numbers yield an average balance of $5,000. Are the bank's customers all well-to-do? Or are they all careless about managing their hard-earned money?

Upon closer scrutiny, one account shows a positive $1,000,000 balance: a lucky customer just won the lottery and deposited his gains into his account. Remove this one customer from the file, and the average balance suddenly drops to $500, a much more reasonable number (bottom illustration; notice the change of scale from linear to logarithmic).
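
The effect can be reproduced with made-up numbers: 199 ordinary accounts of $500 plus one $1,000,000 account give an average close to the one above, and removing the single outlier (or switching to the median) restores a sensible figure:

```python
# Illustration of how a single outlier distorts the mean.
# The balances are made up: 199 ordinary accounts of $500,
# plus one $1,000,000 account.

from statistics import mean, median

balances = [500.0] * 199 + [1_000_000.0]

print(mean(balances))       # 5497.5 -- pulled up by the single outlier
print(mean(balances[:-1]))  # 500.0  -- mean after removing the outlier
print(median(balances))     # 500.0  -- the median is barely affected
```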

This very simple example makes explicit the practical attitude with respect to outliers:

1) Detect outliers. Here, it was indeed quite simple (a histogram will do), but the problem of outlier detection is in general quite difficult.

2) Analyze the detected outlier to figure out if it is a "fraudulent" observation, or a bona fide observation.

3) Consider the possibility of building a new model with the outlier removed from the data set.

Consider now the problem of outliers in a somewhat more complex context: that of Simple Linear Regression (SLR). Every observation is now a pair of numbers (xi, yi), and we run an SLR of y on x (top illustration). Suppose now that yi is grossly wrong for observation i (we will say that the observation is a "y-outlier"). The new regression line is as shown on the bottom illustration: what was a downward going regression line has turned into an upward going one!
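
A minimal numerical sketch of this sign flip (made-up data, with the clean points following a downward trend):

```python
# Sketch: a single y-outlier can flip the sign of a regression slope.
# Data are made up; the clean points follow y = 10 - x exactly.

def slope(xs, ys):
    """Least squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5, 6]
ys_clean = [9, 8, 7, 6, 5, 4]   # downward trend, slope -1
ys_bad = [9, 8, 7, 6, 5, 100]   # last y grossly corrupted

print(slope(xs, ys_clean))  # -1.0
print(slope(xs, ys_bad))    # positive: the line now goes upward
```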

Notice that neither a histogram of x, nor a histogram of y will detect the outlier. Yet, detecting the outlier is still easy : a simple visual examination of the scatter plot will allow pinpointing it. Also, a histogram of the residuals will detect the faulty observation, as the residual for the y-outlier is much larger than for "regular" observations.

A somewhat more annoying example of outlier may also happen in SLR. This time, we assume that a serious mistake occurred on x (the independent variable) for one particular observation. It may very well happen that the outlier's residual is by no means large, so even a histogram of residuals will not detect it. To make things worse, the valid points look bad because of their high residuals compared to those of the outlier.
Within the context of regression, such an observation that sits far away from the bulk of the other observations is called a "leverage observation" (or an "x-outlier"). The name implies that it can potentially (but not necessarily) make the regression line "pivot" around the cloud of ordinary observations like a lever around its fulcrum.

Leverage observations may have high or low residuals: residuals are of no help for identifying them. In fact, because of their capacity to make a regression line "pivot", leverage observations commonly have low residuals, and make legitimate observations look bad (bottom illustration).
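
Leverage can be quantified directly. In simple linear regression, the leverage of observation i is h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which depends only on x, never on the residual. A sketch with made-up x values, the last of which sits far from the bulk:

```python
# Sketch: leverage values in simple linear regression,
# h_i = 1/n + (x_i - mean(x))**2 / sum of squared deviations.
# A point far from the bulk of the x values has high leverage
# even if its residual is small. Data are made up.

def leverages(xs):
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1 / n + (x - mx) ** 2 / sxx for x in xs]

xs = [1.0, 1.5, 2.0, 2.5, 3.0, 10.0]   # last x sits far from the others
hs = leverages(xs)
print([round(h, 2) for h in hs])   # the last leverage dwarfs the others
```

The leverages always sum to the number of fitted parameters (2 here, slope and intercept), so a single observation carrying almost all of that total is a red flag.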

The situation becomes much worse in Multiple Linear Regression, as it is now impossible to visualize the cloud of observations. Histograms are usually of no help for detecting leverage points: in this scatter plot of two of the independent variables, we are lucky to be able to spot the leverage point P visually, but neither x1 nor x2 has a "pathological" histogram. In general, pairwise scatter plots of the independent variables will show nothing abnormal.

So detecting leverage points appears to be both important and difficult. A battery of subtle techniques has been developed for that purpose. They belong to two main categories:

1) Direct analysis of the statistical properties of the cloud of observations involving robust location and spread estimators of the cloud.

2) Robust regression techniques, that are less sensitive to outliers than the ordinary Least Squares approach. Outliers (whether x- or y-) will then have high residuals, and will therefore be unambiguously identified.

"Professionnal" outlier hunting requires specialized software, and quite a bit of work. Casual outlier hunting will settle for simple tools often found in software, like the "Cook's distance", and the "Mahalanobis distance".

Parameters (of a model)

A model often comes as a numerical function of the input variables.

• The mathematical form of the function is fixed, and imposed by the modeling technique used (which, in turn, is more or less imposed by assumptions made about the data distribution),
• But the function also contains some numerical parameters that need to be adjusted to the data by some type of algorithm.

Take for instance the case of Linear Regression.

• The mathematical function for the model is a first degree polynomial (a straight line).
• The parameters of the model are the "Slope" of the line, and its "Intercept".
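
A minimal sketch of this separation between fixed form and adjusted parameters, using the closed-form least squares formulas on made-up data that follow y = 3 + 2x exactly:

```python
# Sketch: in simple linear regression the model form y = a + b*x is
# fixed; the fitting algorithm (least squares) only adjusts the two
# parameters a (intercept) and b (slope). Data are made up.

xs = [0, 1, 2, 3, 4]
ys = [3, 5, 7, 9, 11]   # exactly y = 3 + 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)   # slope
a = my - b * mx                        # intercept

print(a, b)  # 3.0 2.0
```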

Parametric (model)

Suppose there is a sample that is known to have been generated by a normal distribution. This distribution may easily be estimated: all there is to do is to estimate its mean and variance (using the method of Maximum Likelihood). So, only two numbers are needed to fully describe what is believed to be the distribution that generated the data. These numbers are the values of the parameters of the model.

But now, suppose one has no idea about the nature of the distribution that generated the sample. In order to have at least an approximate graphical representation of this distribution, a histogram of the sample is drawn (lower image of the above illustration). To capture the global shape of the distribution, the histogram should have at least 10 bins, or more if the sample is large enough. So now, a relatively large number of numerical values (the heights of the bins) is needed to specify the estimated distribution behind the data.
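
The contrast can be made concrete with a small sketch (simulated data; the bin count of 10 and the normal parameters are arbitrary): the parametric description needs two numbers, while the histogram needs one number per bin.

```python
# Sketch: parametric vs non parametric description of the same sample.
# The parametric model needs 2 numbers (MLE mean and variance); the
# histogram needs one height per bin. Sample simulated for illustration.

import random
from statistics import fmean, pvariance

random.seed(0)
sample = [random.gauss(10, 2) for _ in range(500)]

# Parametric model: two numbers
mu_hat = fmean(sample)
sigma2_hat = pvariance(sample)   # MLE of the variance (divides by n)

# Non parametric model: ten bin heights (plus start point and bin width)
lo, hi = min(sample), max(sample)
width = (hi - lo) / 10
heights = [0] * 10
for x in sample:
    i = min(int((x - lo) / width), 9)   # clamp the maximum into last bin
    heights[i] += 1

print(round(mu_hat, 1), round(sigma2_hat, 1))  # two parameters
print(heights)                                 # ten parameters
```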

Why such a large difference in the number of numerical values needed to specify what amounts to the same thing: an estimate of the distribution behind a sample?

The reason is that:

•  In the first case, a very strong assumption was made about this distribution. This assumption translates into an analytic form that defines a narrow family of distributions (the normal distributions) from which one distribution has to be selected as the most likely generator of the data.
• In the second case, no assumption was made about this distribution. This is the same as saying that the family of distributions considered as candidate generators of the data contains all the distributions you can think of (at least, all the distributions that are not 0 where the observations are).

The first model is said to be parametric. The term "parametric" refers to the fact that, once the analytic form of the model is decided upon (here, the normal distribution), only a small number of parameters have to be estimated. In addition, these parameters can be interpreted in terms of properties of the distribution (here, mean and variance).

The second model (the histogram) belongs to the family of non parametric models. This expression is misleading, as "non parametric" models do incorporate parameters (in fact, often a large number for "local" models such as RBF networks). For histograms, these parameters are:

• The starting point of the bins (not very important),  and
• The bin width, that acts as a "smoothing parameter" (very important).

It's just that these parameters cannot be interpreted in terms of global properties of the distribution.

The model is also said to be "ad hoc", or "black box", meaning that it does the job, but no knowledge about the distribution can be extracted from the values of its parameters.

___________________________

The two foregoing examples were drawn from descriptive modeling (probability density estimation), but the same distinction between "parametric" and "non parametric" also exists in predictive modeling.

• Classification
• Discriminant Analysis explicitly makes the (very restrictive) assumption that the classes have multinormal distributions. The parameters to be estimated are the class means and the elements of the covariance matrices, which are relatively few and readily interpreted.
• But when a Multilayer Perceptron is used in classification, its parameters (or "weights") carry no interpretable information about the shapes of the boundaries between the classes, or about the distributions of data within the classes.
• Decision Trees are non parametric as, again, they make no assumption about the distribution of the data.
• Regression
• Linear Regression is parametric. The assumptions about the data distribution are stringent, and the model parameters (slope, intercept and estimated variance in Simple Linear Regression) have direct interpretations in terms of properties of this distribution.
• But polynomial regression is not parametric, as the coefficients of the polynomial cannot be interpreted in terms of properties of the distribution. The same can be said about splines.
• Neural Networks are definitely non parametric regression models: they receive their share of criticism for being very efficient but uninterpretable models.

_________________________

Given a sample and a problem, should this problem be tackled with a parametric or a non parametric technique?

• If it is possible to formulate reliable assumptions about the distribution behind the data, then the parametric model is the way to go.
• Fitting the model to the data is easy.
• The interpretation of the parameters of the model will be of great value to the analyst.
• Confidence intervals and tests on the parameters of the model are available.
• The amount of information contained in the assumptions about data distribution is so large that only a relatively small number of observations are needed to estimate the values of the parameters.
• But real-life data distributions rarely match mathematically convenient distributions. If no assumption can be made about the data distribution, or if the sample fails the tests meant to check hypotheses about this distribution (like a normality test), then non parametric models offer a very useful alternative.
But :
• The values of the parameters will then bring no knowledge about the distribution.
• Confidence intervals and tests on the parameters of the model are lost.
• The large amount of information made available by the assumptions of parametric modeling is now missing. In order to attain the same level of model quality as with a (justified) parametric model, this missing information has to be replaced by information coming from somewhere else, that is, the data. So attaining the same level of quality with a non parametric model as with a (justified) parametric model will require larger samples. Equivalently, for a given sample size, non parametric models will make less accurate predictions than a justified parametric model.

Parametric (test)

A test is said to be parametric if the tested hypothesis bears on one or several parameters of the assumed underlying distribution(s). Here are two examples:

1) Given a set of numerical values {x1, x2, ..., xn}, and assuming that the underlying distribution is normal, one may wonder whether it is likely that the mean of this normal distribution is a given value m0. This is the classical "One sample t-test".

This test belongs to the family of "goodness-of-fit tests", because it bears on the question of assessing whether the sample originated from a given reference distribution.
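
A sketch of the test statistic, t = (x̄ − m0) / (s / √n), using only the standard library; the data and m0 are made up, and the 5% two-sided critical value 2.262 (for 9 degrees of freedom) is taken from a t table:

```python
# Sketch of the one-sample t-test statistic on made-up data.
# t = (sample mean - m0) / (sample std / sqrt(n)),
# compared with the tabulated 5% critical value for df = n - 1 = 9.

from math import sqrt
from statistics import mean, stdev

x = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.1]
m0 = 5.0   # hypothesized mean

n = len(x)
t = (mean(x) - m0) / (stdev(x) / sqrt(n))
print(round(t, 3))

reject = abs(t) > 2.262   # two-sided test at the 5% level, df = 9
print(reject)             # False: the data do not contradict m0
```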

2) Given two sets of numerical values {x1, x2, ..., xn} and {y1, y2, ..., ym}, and assuming that both samples originate from normal distributions, one may wonder whether it is likely that these two normal distributions have identical variances. This is the classical Fisher's F test.

This test belongs to the family of "identity tests", because it bears on the question of assessing whether two parameters of two distributions (or the distributions themselves) are identical.
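
The F statistic itself is simply the ratio of the two sample variances, conventionally with the larger one on top; a sketch with made-up samples:

```python
# Sketch of Fisher's F statistic for comparing two variances:
# F = larger sample variance / smaller sample variance.
# Both samples are made up; x is much more dispersed than y.

from statistics import variance

x = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4]
y = [10.0, 10.2, 9.9, 10.1, 10.0]

s2x, s2y = variance(x), variance(y)
f = max(s2x, s2y) / min(s2x, s2y)
print(round(f, 2))   # compare to an F table with (5, 4) degrees of freedom
```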

But many important tests cannot be expressed this way. Here are two examples :

1) Given a sample described by two categorical variables V1 and V2 , one may wonder how likely it is that the two variables are independent. The "Chi-square test of independence" answers this question.
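
A sketch of the Chi-square statistic on a made-up 2×2 contingency table: observed counts are compared with the counts expected under independence, expected = (row total × column total) / grand total.

```python
# Sketch of the Chi-square test of independence on a 2x2 contingency
# table (made-up counts of V1 x V2). The statistic sums
# (observed - expected)**2 / expected over all cells.

table = [[30, 10],
         [20, 40]]

row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / grand
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 2))   # 16.67 -- compare to a Chi-square table with 1 df
```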

2) Given a set of numerical values x1, x2, ..., xn  and a reference distribution, one may wonder whether it is likely that the sample originated from the reference distribution. Several goodness-of-fit tests (like the Kolmogorov test, or the Goodness-of-fit Chi-square test) answer this question.

Note that these last two examples make no reference to parameters of the actual distributions behind the samples. They are therefore called non parametric tests.

It is often the case that a question can be approached through either a parametric or a non parametric test. Which one to choose? There is no unique answer to this question, but there are some guidelines:

* Non parametric tests have the advantage of not relying on stringent assumptions (e.g. normality) about the underlying distributions, which may not be met in practical applications.

* But this tolerance has a price: when the assumptions of the parametric test are justified, the corresponding non parametric test usually requires more extreme data to reject the tested hypothesis; in other words, it has less power.

__________________________________________________________