Dimensionality
Let P be a point on a sheet of paper. Its position can be identified by two numbers:
* x, its distance from the left-hand side of the sheet.
* y, its distance from the lower side of the sheet.
Given the pair (x, y), P is defined without ambiguity. The set of points on the sheet of paper is said to be two-dimensional.
Now, let's scatter 5 other points on the sheet, P1 to P5, and let's measure the distances from P to each of these 5 points (lower image of the above illustration). You'll easily convince yourself that P is the only point that can lead to the set of distances (d1, ..., d5).
Why is P now determined by 5 numbers when 2 numbers were enough just a few lines ago? The answer is that the amount of information in the set of 5 numbers (d1, ..., d5) is not greater than the amount of information we had in (x, y). Had we scattered 10 or 1000 reference points (instead of 5), the amount of information would still have been just what is needed to specify one point on the sheet. The five quantities (d1, ..., d5) are said to be redundant, meaning that they carry less information than it seems at first sight.
Faced with a large set of quintuplets (d1, ..., d5), each relative to a point on the sheet, we could envision replacing this bulky representation by the more compact (x, y) representation without losing any information at all. We would then say that although the points were described by 5 attributes, they in fact belong to a set whose genuine dimensionality is 2.
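The claim that the five distances carry no more information than (x, y) can be checked numerically. What follows is only a minimal sketch, not part of the original illustration: the coordinates of the reference points and of P are made-up values, and `locate` is a hypothetical helper that recovers (x, y) from the distances by linear least squares.

```python
import math

# Hypothetical coordinates, chosen only for illustration.
REFS = [(1.0, 1.0), (8.0, 2.0), (3.0, 9.0), (7.0, 7.0), (2.0, 5.0)]
P = (4.0, 3.0)


def distances_to(point, refs):
    """Distances (d1, ..., d5) from `point` to each reference point."""
    return [math.dist(point, ref) for ref in refs]


def locate(refs, dists):
    """Recover (x, y) from the distances to known reference points.

    Subtracting the circle equation of the first reference point from
    the others gives equations that are linear in x and y; they are
    solved here in the least-squares sense (2x2 normal equations).
    """
    (a0, b0), d0 = refs[0], dists[0]
    k0 = a0 * a0 + b0 * b0 - d0 * d0
    sxx = sxy = syy = sxr = syr = 0.0
    for (a, b), d in zip(refs[1:], dists[1:]):
        u, v = 2 * (a - a0), 2 * (b - b0)   # coefficients of x and y
        r = (a * a + b * b - d * d) - k0    # right-hand side
        sxx += u * u
        sxy += u * v
        syy += v * v
        sxr += u * r
        syr += v * r
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("reference points must not all lie on one line")
    return (sxr * syy - sxy * syr) / det, (sxx * syr - sxy * sxr) / det


d = distances_to(P, REFS)   # the bulky 5-number description
x, y = locate(REFS, d)      # the compact 2-number description
print(d)
print(x, y)
```

Up to rounding errors, the recovered pair is the P we started from: the 5-number description and the 2-number description are interchangeable.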
-----
Now, this example may sound rather academic, and you may object that your job does not consist in scattering dots on sheets. The fact of the matter is that the phenomenon we just described is ubiquitous, and is met in all kinds of real-world data files (physical measurements, opinion polls, commercial files...).
The redundancy is sometimes visible "with the naked eye". Consider for example the three variables:
* "Surface of the house",
* "Power of the car",
* "Income",
extracted from a commercial data base.
These three variables are not equivalent, but it is also clear that they are not independent (that is, they are somewhat redundant). In a realistic situation, knowing the values taken by these three variables certainly does not bring three times as much information about the individual as any one of them alone.
We are now convinced that it is quite likely that the total amount of information in your files is less than what could be expected given the number of variables (fields, or attributes). Another way of saying the same thing is to say that the data in your files could (and, as we'll see, should) be described by fewer variables than in its native form.
Just how many variables would be necessary to describe your data with little or no loss of information? Answering this question would be "estimating the intrinsic dimensionality of the data". Given how barbarous this expression is, it's reassuring that no software outside research labs addresses the question... for now.
But trying to reduce, to a certain extent, the number of variables that describe your data is a mandatory step of Data Modeling. This important question is addressed here.
Dimensionality reduction
The goal of dimensionality reduction is to create a small set of new variables that will describe the individuals in the data base nearly as well as the original variables do, which are usually quite numerous. The new variables will exhibit less redundancy than the original variables.
There are many ways to achieve this goal:
1) The simplest one is to discard some of the original variables if you think that they are not pertinent for the problem at hand. Conversely, you may extract from the original set of variables those which you feel carry a great deal of the information that will be needed to solve the problem. This exercise is far from simple, but it usually triggers debates about the actual usefulness of the variables, and these debates provide considerable insight into the problem.
2) A set of strongly redundant variables (e.g. strongly correlated) can be replaced by a subset of these variables.
3) A set of variables can be replaced by an appropriate function of these variables if there is enough expertise to attribute a particularly significant meaning to this function. For example, the twelve "Monthly income" variables might be replaced by the single and newly created "Average annual income".
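Method 2 above can be sketched in a few lines. This is only an illustration under stated assumptions: the Pearson correlation coefficient is used as the measure of redundancy, the 0.9 threshold is arbitrary, the helper names `pearson` and `drop_redundant` are hypothetical, and the six records are made up.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u)
    sv = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(su * sv)


def drop_redundant(columns, threshold=0.9):
    """Keep a variable only if it is not strongly correlated
    (in absolute value) with a variable that was already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up records, for illustration only.
data = {
    "income":  [20, 35, 50, 65, 80, 95],
    "surface": [60, 75, 95, 110, 135, 150],
    "age":     [30, 60, 25, 55, 40, 45],
}

kept = drop_redundant(data)
print(kept)   # prints ['income', 'age']
```

Note that the result depends on the order in which the variables are examined: here "surface" is dropped because "income" was seen first, and the reverse order would keep "surface" instead. Deciding which member of a redundant group to keep is exactly the kind of question that calls for knowledge of the problem.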
These methods are all necessary preliminaries. But it is usually not possible to dispense with more sophisticated statistical techniques such as factorial techniques (Principal Component Analysis, or PCA, and Multiple Correspondence Analysis, or MCA). Unsupervised neural networks (Kohonen maps) also do an excellent job at reducing the number of useful variables.
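To give an idea of what a factorial technique does, here is a minimal sketch of PCA restricted to two variables, for which the principal axis can be written in closed form. This restriction is a simplification made for the sake of a short example (real data sets need a general eigenvalue solver), the helper `pca_2d` is hypothetical, and the data is synthetic: the second variable is built as almost exactly twice the first one.

```python
import math
import random


def pca_2d(xs, ys):
    """First principal axis of two variables, the share of the total
    variance it captures, and the new variable (scores) it defines."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / n
    c = sum((y - my) ** 2 for y in ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    # Matching eigenvector; the uncorrelated case is handled separately.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    # Coordinate of each individual along the principal axis.
    scores = [(x - mx) * axis[0] + (y - my) * axis[1]
              for x, y in zip(xs, ys)]
    return axis, lam / (a + c), scores


rng = random.Random(0)                           # fixed seed, synthetic data
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]     # nearly redundant with xs

axis, explained, scores = pca_2d(xs, ys)
print(axis, explained)
```

Because the two variables are nearly redundant, the single new variable `scores` captures almost all of the variance of the original pair, and the principal axis has a slope close to 2.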
One can hope that in the near future, software vendors will introduce some of the truly remarkable dimensionality reduction techniques that have proven very effective for years (CCA, ICA).
-----
Now you may wonder why so much emphasis is put on dimensionality reduction, especially as it usually comes with a loss of information. There are several answers to this question.
1) The most obvious reason is to reduce the amount of data that algorithms will have to crunch, thus reducing calculation times and memory requirements.
2) If you can bring down the number of useful variables to 2, data can be displayed on a "sheet of paper" for visual inspection. You can then use the most fantastic Data Mining tool, the human eye, with its extremely efficient ability to detect patterns, clusters, alignments, tendencies, drifts, niches, etc.
3) The most important reason has to do with the credibility of the (predictive or descriptive) model you are going to build.
Consider two models that perform equally well on the data used to build them. Imagine also that one of the models incorporates many more variables than the other one. The model with more variables has more freedom to fit the accidental peculiarities of that particular data set, so it can generally be expected that the parsimonious model will perform better than the greedy one on new data, that is, data which was not available at the time when the models were built.
This is often considered counterintuitive, as more variables mean more information, and more information should mean better generalization, or so we think (but keep in mind that we considered models that performed equally well to start with).
A practical consequence of this is that one often comes across models deemed "good" by their creators, but that turn out to be rather poor when used on new real-world data.
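This effect is easy to reproduce on synthetic data. The sketch below is an illustration built on assumptions of our own, not a general proof: the "model" is a 1-nearest-neighbour classifier (chosen because it is always perfect on the data used to build it, so both models are on an equal footing there), only one variable actually determines the answer, and the 20 extra variables of the greedy model are pure noise.

```python
import random

N_NOISE = 20   # number of irrelevant variables (an arbitrary choice)


def make_data(rng, n):
    """Rows made of 1 useful variable followed by N_NOISE irrelevant ones."""
    rows = [[rng.uniform(-1, 1) for _ in range(1 + N_NOISE)]
            for _ in range(n)]
    labels = [int(row[0] > 0) for row in rows]   # only column 0 matters
    return rows, labels


def predict(query, rows, labels, cols):
    """1-nearest-neighbour prediction using only the columns in `cols`."""
    def sq_dist(row):
        return sum((row[c] - query[c]) ** 2 for c in cols)
    nearest = min(range(len(rows)), key=lambda i: sq_dist(rows[i]))
    return labels[nearest]


def accuracy(xs, ys, rows, labels, cols):
    """Fraction of the points in (xs, ys) that are correctly predicted."""
    hits = sum(predict(x, rows, labels, cols) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)                      # fixed seed, synthetic data
train_x, train_y = make_data(rng, 100)      # data used to build the models
new_x, new_y = make_data(rng, 500)          # data unseen at build time

models = {
    "parsimonious": [0],                    # the useful variable only
    "greedy": list(range(1 + N_NOISE)),     # useful + 20 noise variables
}

results = {}
for name, cols in models.items():
    results[name] = (
        accuracy(train_x, train_y, train_x, train_y, cols),   # on build data
        accuracy(new_x, new_y, train_x, train_y, cols),       # on new data
    )
    print(name, results[name])
```

Both models score perfectly on the data used to build them, but on the new data the greedy model is markedly less accurate: the noise variables dominate the distances and drown the one variable that matters.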
So dimensionality reduction is both indispensable and difficult. If you are serious about building models because important decisions will be based on what they have to say, then spend the necessary time on this crucial activity.