Ordinal (variable)

A variable is said to be ordinal if:

 

    1) It can take only a small number of values,
    2) These values can be naturally ranked in increasing or decreasing order.

 

Typical examples of ordinal variables are : 

 Ordinal variables often come as the result of discretization of a numerical variable.

_______________________


Non parametric tests (see for instance Wilcoxon-Mann-Whitney, Kruskal-Wallis, Friedman) often merge numerical observations from various origins into a single sample, and then convert the numerical values into ranks. Ranks are, by their very nature, ordinal variables, although of a special kind, as each and every value of the variable is assigned to only one observation.
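
As an illustration of this conversion, here is a minimal sketch that pools two groups and replaces the values by their ranks. The data and the helper name average_ranks are hypothetical, and tied values are given their average rank, which is one common convention among several.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Hypothetical measurements from two origins (illustrative values only).
group_a = [12.1, 15.3, 9.8, 14.0]
group_b = [16.2, 18.5, 15.9, 13.4, 17.7]

pooled = group_a + group_b                # merge into a single sample
ranks = average_ranks(pooled)             # replace each value by its rank
rank_sum_a = sum(ranks[:len(group_a)])    # rank sum of the first group

print(ranks)        # [2.0, 5.0, 1.0, 4.0, 7.0, 9.0, 6.0, 3.0, 8.0]
print(rank_sum_a)   # 12.0
```

Rank-based tests such as Wilcoxon-Mann-Whitney then work on quantities like rank_sum_a instead of the original numerical values.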

_______________________

 

See also : Categorical, Numerical, Binary. 

 

 

 Outlier

An "outlier" (or "extreme point") is an observation that is very different from the "average" observation in your data set.

 

An outlier may have two different possible origins :

    * It can be an ordinary observation in which at least one attribute has been severely corrupted by a mistake, such as a copying or transmission error (think of a misplaced decimal point).

    * It can also be a bona fide observation that simply turns out to be very unusual. Think for instance of the file of attributes of the "motor vehicles" sitting in a parking lot at any one time, with nothing but ordinary cars, except for one eighteen-wheeler whose characteristics are very different from those of any ordinary car.

 

 

Outliers are a severe problem in just about every area of data modeling. The reason is that just one outlier may distort a model to the point where conclusions drawn from the model are simply wrong. For example, consider a file of bank accounts. The data yield an average account balance of 5000$. Are the bank's customers all well-to-do ? Or maybe they are all careless about managing their hard-earned money ?

Upon closer scrutiny, one account shows a positive balance of 1 000 000$ : a lucky customer just won the lottery and deposited the winnings in this account. Remove this one customer from the file, and all of a sudden the average balance drops to 500$, a more reasonable number (bottom illustration. Notice the change of scale from linear to logarithmic).
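
The arithmetic behind this example can be checked with a small sketch. The file below is hypothetical : 221 ordinary balances between 300$ and 700$ were chosen only so that the two averages come out close to the figures quoted above.

```python
from statistics import fmean

# Hypothetical file: 221 ordinary accounts with balances of 300$ to 700$ ...
ordinary = [300 + 100 * (i % 5) for i in range(221)]
# ... plus the account of the lottery winner.
balances = ordinary + [1_000_000]

mean_with = fmean(balances)      # average over all 222 accounts
mean_without = fmean(ordinary)   # average once the single outlier is removed

print(round(mean_with, 2))       # about 5001.35
print(round(mean_without, 2))    # about 499.10
```

A single observation out of 222 is enough to multiply the average by ten.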

 

This very simple example makes explicit the practical attitude with respect to outliers :

    1) Detect outliers. Here, it was indeed quite simple (a histogram will do), but the problem of outlier detection is in general quite difficult.

    2) Analyze the detected outlier to figure out if it is a "fraudulent" observation, or a bona fide observation.

    3) Consider the possibility of building a new model with the outlier removed from the data set.
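
The three steps above can be sketched on the same kind of bank file. The data are hypothetical, and the detection rule (flag any balance at least two powers of ten away from the most populated bin) is only an assumption made for this illustration, not a general-purpose detector.

```python
import math
from collections import Counter
from statistics import fmean

# Hypothetical file: 221 ordinary balances plus one lottery winner.
balances = [300 + 100 * (i % 5) for i in range(221)] + [1_000_000]

# Histogram on a logarithmic scale: bin k holds balances in [10**k, 10**(k+1)).
bins = Counter(math.floor(math.log10(b)) for b in balances)
for k in range(min(bins), max(bins) + 1):
    bar = "#" * math.ceil(bins[k] / 10)
    print(f"10^{k} to 10^{k + 1}: {bar} ({bins[k]})")

# Step 1 (detect): illustrative rule, two or more decades away from the main bin.
main_bin = bins.most_common(1)[0][0]
suspects = [b for b in balances
            if abs(math.floor(math.log10(b)) - main_bin) >= 2]

# Step 2 (analyze) is a human decision: mistake or bona fide observation ?
# Step 3 (rebuild): recompute the "model" (here, the average) without the suspects.
kept = [b for b in balances if b not in suspects]
print(suspects, round(fmean(kept), 2))
```

On this file the only suspect is the 1 000 000$ balance, which sits alone four decades above the bin that holds every other account.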

 

 

Consider now the problem of outliers in a somewhat more complex context : that of Simple Linear Regression (SLR). Every observation is now a pair of numbers (xi, yi), and we run an SLR of y on x (top illustration). Suppose now that yi is grossly wrong for observation i (we will say that the observation is a "y-outlier"). The new regression line is as shown on the bottom illustration. What was before a downward going regression line has turned into an upward going regression line !

 

Notice that neither a histogram of x, nor a histogram of y will detect the outlier. Yet, detecting the outlier is still easy : a simple visual examination of the scatter plot will allow pinpointing it. Also, a histogram of the residuals will detect the faulty observation, as the residual for the y-outlier is much larger than for "regular" observations.
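
The effect of a y-outlier can be reproduced with a short sketch. The ten observations are hypothetical (a downward trend with a small wiggle), the gross error of 60 on observation 7 is invented for the illustration, and fit_line is simply the textbook least squares formula.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: y decreases with x, with a small alternating wiggle.
xs = list(range(10))
ys_clean = [10 - 0.5 * x + (0.2 if x % 2 == 0 else -0.2) for x in xs]

# Same data with a gross error on y for observation 7 (a "y-outlier").
ys_bad = list(ys_clean)
ys_bad[7] = 60.0

_, slope_clean = fit_line(xs, ys_clean)
intercept_bad, slope_bad = fit_line(xs, ys_bad)

# Residuals of the distorted fit: the y-outlier has by far the largest one.
res_bad = [y - (intercept_bad + slope_bad * x) for x, y in zip(xs, ys_bad)]
worst = max(range(len(xs)), key=lambda i: abs(res_bad[i]))

print(round(slope_clean, 3), round(slope_bad, 3))  # negative, then positive
print(worst)                                       # 7
```

The slope changes sign, and the observation with the largest absolute residual is indeed the corrupted one.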

 

 

 

A somewhat more annoying kind of outlier may also occur in SLR. This time, we assume that a serious mistake occurred on x (the independent variable) for one particular observation. It may very well happen that the outlier's residual is by no means large, so even a histogram of residuals will not detect it. To make things worse, the valid points look bad because of their high residuals compared to that of the outlier.
Within the context of regression, such an observation that sits far away from the bulk of the other observations is called a "leverage observation" (or an "x-outlier"). The name implies that it can potentially (but not necessarily) make the regression line "pivot" around the cloud of ordinary observations like a lever around its fulcrum.

Leverage observations may have high or low residuals : residuals are of no help for identifying leverage observations. Actually, it is common that, because of their capacity to make a regression line "pivot", leverage observations have low residuals, and make legitimate observations look bad (bottom illustration).
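
Here is a minimal sketch of this behaviour, on hypothetical data where the x value 5 was recorded as 50 (a misplaced decimal point). The quantity called leverage below is the usual "hat value" of simple linear regression, 1/n + (xi - mean of x)^2 / Sxx ; it is introduced here as an assumption, as one simple way to measure how far an observation sits from the bulk along x.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: y decreases with x (true slope close to -0.5).
xs = list(range(10))
ys = [10 - 0.5 * x + (0.2 if x % 2 == 0 else -0.2) for x in xs]

# Mistake on x for observation 5: recorded as 50 instead of 5.
xs_bad = list(xs)
xs_bad[5] = 50

intercept, slope = fit_line(xs_bad, ys)
res = [y - (intercept + slope * x) for x, y in zip(xs_bad, ys)]

# Hat values: distance of each x from the mean of x, relative to the spread of x.
n = len(xs_bad)
x_bar = sum(xs_bad) / n
sxx = sum((x - x_bar) ** 2 for x in xs_bad)
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs_bad]

largest_residual = max(range(n), key=lambda i: abs(res[i]))
largest_leverage = max(range(n), key=lambda i: leverage[i])
print(round(slope, 3))                       # the line has pivoted: almost flat
print(largest_residual, largest_leverage)    # a valid point, then observation 5
```

The faulty observation does not have the largest residual (a legitimate observation does), yet its hat value is close to 1 while all the others stay below 0.15.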

 

 

 

 

 

 

The situation becomes very bad in Multiple Linear Regression, as it is now impossible to visualize the cloud of observations. Histograms are usually of no help in detecting leverage points : in this scatter plot of two of the independent variables, we are lucky to be able to visually detect the leverage point P, but neither x1 nor x2 has a "pathological" histogram. In general, pairwise scatter plots of independent variables will show nothing abnormal.

 


 

 

 

So detecting leverage points appears to be both important and difficult. A battery of subtle techniques has been developed for that purpose. They belong to two main categories :

    1) Direct analysis of the statistical properties of the cloud of observations involving robust location and spread estimators of the cloud.

    2) Robust regression techniques, that are less sensitive to outliers than the ordinary Least Squares approach. Outliers (whether x- or y-) will then have high residuals, and will therefore be unambiguously identified.
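
As a simple illustration of the first category, the sketch below computes the Mahalanobis distance of each observation to the center of a two-variable cloud. It uses the classical mean and covariance matrix, not the robust estimators mentioned above, which is a simplification made for this example. The data are hypothetical : ten points with x2 close to x1, plus a point P = (8, 3) whose coordinates are both unremarkable when taken one at a time.

```python
def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point to the sample mean (classical estimators)."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    # Sample covariance matrix [[c11, c12], [c12, c22]].
    c11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    c22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    det = c11 * c22 - c12 ** 2
    distances = []
    for a, b in points:
        d1, d2 = a - m1, b - m2
        # d' * inverse(C) * d, with the inverse of the 2x2 matrix written out.
        squared = (c22 * d1 * d1 - 2 * c12 * d1 * d2 + c11 * d2 * d2) / det
        distances.append(squared ** 0.5)
    return distances

# Hypothetical cloud: x2 follows x1 closely.
cloud = [(i, i + (0.3 if i % 2 == 0 else -0.3)) for i in range(1, 11)]
P = (8, 3)                      # inside the range of x1 and inside the range of x2
points = cloud + [P]

distances = mahalanobis_2d(points)
farthest = max(range(len(points)), key=lambda i: distances[i])
print(points[farthest], round(distances[farthest], 2))
```

P lies within the range of x1 and within the range of x2, so neither histogram singles it out, but its Mahalanobis distance is clearly the largest because it breaks the relationship between the two variables.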

 

"Professional" outlier hunting requires specialized software, and quite a bit of work. Casual outlier hunting will settle for simple tools often found in statistical software, like "Cook's distance" and the "Mahalanobis distance".

 

 

Parameters (of a model)

A model often comes as a numerical function of the input variables.

 

Take for instance the case of Linear Regression.

 

 

 

Parametric (model)

Suppose there is a sample that is known to have been generated by a normal distribution. This distribution may easily be estimated : all that is needed is to estimate its mean and variance (for example by the method of Maximum Likelihood). So, only two numbers are needed to fully describe what is believed to be the distribution that generated the data. These numbers are the values of the parameters of the model.
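
A minimal sketch of this estimation, on a hypothetical sample of ten values. For the normal distribution, the Maximum Likelihood estimates are the sample mean and the average squared deviation from it (divided by n, not by n - 1).

```python
from statistics import fmean

# Hypothetical sample, assumed to come from a normal distribution.
sample = [4.8, 5.1, 5.6, 4.3, 5.0, 5.9, 4.7, 5.2, 4.9, 5.5]

n = len(sample)
mu_hat = fmean(sample)                                 # ML estimate of the mean
var_hat = sum((x - mu_hat) ** 2 for x in sample) / n   # ML estimate of the variance

# These two numbers are the whole fitted model.
print(round(mu_hat, 4), round(var_hat, 4))             # 5.1 0.2
```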

 

 

But now, suppose one has no idea about the nature of the distribution that generated the sample. In order to have at least  an approximate graphical representation of this distribution, a histogram of the sample is drawn (lower image of the above illustration). To capture the global shape of the distribution, the histogram should have at least 10 bins, or more if the sample is large enough. So now, a relatively large number of numerical values (the heights of the bins) are needed to specify the estimated distribution behind the data.
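
By contrast, here is a sketch of the histogram estimate. The sample is hypothetical (a seeded mixture of two bumps, standing in for a distribution of unknown nature), and the helper histogram_heights is an illustrative name. With 10 equal-width bins, the estimated distribution is described by 10 heights.

```python
import random

def histogram_heights(sample, n_bins=10):
    """Return (bin_width, heights) of an equal-width histogram with total area 1."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in sample:
        # The maximum of the sample is placed in the last bin.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    heights = [c / (len(sample) * width) for c in counts]
    return width, heights

# Hypothetical sample from a distribution treated as unknown.
rng = random.Random(0)
sample = [rng.gauss(0, 1) if rng.random() < 0.5 else rng.gauss(4, 0.5)
          for _ in range(500)]

width, heights = histogram_heights(sample)
print(len(heights))   # 10 numbers are needed, instead of 2 for the normal model
```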

 

Why such a large difference in the number of numerical values needed for specifying what amounts to the same thing : an estimation of the distribution behind a sample ?

 

The reason is that :

 

The first model is said to be parametric. The term "parametric" refers to the fact that, once the analytic form of the model is decided upon (here, the normal distribution), only a small number of parameters have to be estimated. In addition, these parameters can be interpreted in terms of properties of the distribution (here, mean and variance).

The second model (the histogram) belongs to the family of non parametric models. This expression is misleading, as "non parametric" models do incorporate parameters (in fact, often a large number for "local" models such as RBF networks). For histograms, these parameters are the heights of the bins.

It's just that these parameters cannot be interpreted in terms of global properties of the distribution.

The model is also said to be "ad hoc", or "black box",  meaning that it does the job, but no knowledge about the distribution can be extracted from the values of its parameters.

___________________________

The two foregoing examples were drawn from descriptive modeling (probability density estimation), but the same distinction between "parametric" and "non parametric" also exists in predictive modeling.

_________________________

Given a sample and a problem, should this problem be tackled with a parametric or a non parametric technique ?

Parametric (test)

A test is said to be parametric if the tested hypothesis bears on one or several parameters of the assumed underlying distribution(s). Here are two examples :

    1) Given a set of numerical values {x1, x2, ..., xn}, and assuming that the underlying distribution is normal, one may wonder whether it is likely that the mean of this normal distribution is a given value m0. This is the classical "One sample t-test".


This test belongs to the family of "goodness-of-fit tests", because it bears on the question of assessing whether the sample originated from a given reference distribution.

    2) Given two sets of numerical values {x1, x2, ..., xn} and { y1, y2, ..., ym} and assuming that both samples originate from normal distributions, one may wonder whether it is likely that these two normal distributions have identical variances. This is the classical Fisher's F test.


This test belongs to the family of "identity tests", because it bears on the question of assessing whether two parameters of two distributions (or the distributions themselves) are identical.
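
The statistics of these two tests can be sketched as follows. The samples and the value m0 = 5.0 are hypothetical. Only the test statistics are computed here ; deciding whether to reject the hypothesis requires comparing them with quantiles of Student's t distribution (n - 1 degrees of freedom) and of Fisher's F distribution (n - 1 and m - 1 degrees of freedom), taken from tables or from statistical software.

```python
from math import sqrt
from statistics import fmean, variance

def one_sample_t(xs, m0):
    """t statistic for the hypothesis 'the mean of the distribution is m0'."""
    n = len(xs)
    return (fmean(xs) - m0) / sqrt(variance(xs) / n)

def variance_ratio_f(xs, ys):
    """F statistic for the hypothesis 'both distributions have the same variance'."""
    return variance(xs) / variance(ys)

# Hypothetical samples, both assumed to come from normal distributions.
xs = [4.8, 5.1, 5.6, 4.3, 5.0, 5.9, 4.7, 5.2, 4.9, 5.5]
ys = [6.1, 3.9, 5.0, 7.2, 2.8, 5.5, 4.4, 6.6]

t = one_sample_t(xs, m0=5.0)
f = variance_ratio_f(xs, ys)
print(round(t, 3))   # about 0.671
print(round(f, 3))   # about 0.104

# For 9 degrees of freedom, the two-sided 5% critical value of t is 2.262
# (table value): here |t| is far below it, so m0 = 5.0 is not rejected.
```

Both statistics are functions of estimated parameters (mean, variance) of the assumed normal distributions, which is what makes these tests parametric.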

 

But many important tests cannot be expressed this way. Here are two examples :

    1) Given a sample described by two categorical variables V1 and V2 , one may wonder how likely it is that the two variables are independent. The "Chi-square test of independence" answers this question.
 

    2) Given a set of numerical values x1, x2, ..., xn  and a reference distribution, one may wonder whether it is likely that the sample originated from the reference distribution. Several "goodness-of-fit" tests (or "tests of fit"), like the Kolmogorov test, or the "Goodness-of-fit Chi-square test" answer this question.

 

Note that these last two examples make no reference to parameters of the actual distributions behind the samples. They are therefore called non parametric tests.
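
The first of these two tests can be sketched as follows, on a hypothetical 2 x 2 table of counts (rows are the categories of V1, columns the categories of V2). The statistic compares the observed counts with the counts expected under independence ; no parameter of an underlying distribution appears anywhere.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if V1 and V2 were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts for two categorical variables with two categories each.
table = [[30, 10],
         [20, 40]]

stat, dof = chi_square_independence(table)
print(round(stat, 3), dof)   # 16.667 1

# With 1 degree of freedom, the 5% critical value of the Chi-square
# distribution is 3.841 (table value), so independence is rejected here.
```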

 

It is often the case that a question can be approached through either a parametric or a non parametric test. Which one to choose ? There is no unique answer to this question, but there are some guidelines :

    * Non parametric tests have the advantage of not relying on stringent assumptions (e.g. normality) about underlying distributions, that may not be met in practical applications.

    * But this tolerance has a price : when the use of the parametric test is justified, a non parametric test usually requires a more extreme situation than the corresponding parametric test to reject the tested hypothesis. In other words, it is less powerful.
