Standardization
Imagine that a bank's database contains two fields (among many others):
1) The customer's average monthly balance.
2) The value of the dream house that the customer wants to borrow money to buy.
Both fields are expressed in dollars, but chances are that the numbers in the second field will be considerably larger than the numbers in the first field.
It is generally considered that the model will then rely heavily on the second field and tend to neglect the first one, even if the first has more predictive power. This will show when an a posteriori analysis of the importance of individual variables is conducted.
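As a rough illustration of this effect, the sketch below assumes a model that compares customers through plain Euclidean distances on the raw dollar values (the text does not name a specific model, so this is only one possible case). The three customers and their figures are invented for the example.

```python
import math

# Hypothetical customers: (average monthly balance, value of the house), in dollars.
alice = (2_000.0, 300_000.0)
bob = (9_000.0, 310_000.0)    # very different balance, similar house value
carol = (2_100.0, 450_000.0)  # similar balance, very different house value

def euclidean(p, q):
    """Plain Euclidean distance between two customers on the raw values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d_alice_bob = euclidean(alice, bob)
d_alice_carol = euclidean(alice, carol)

# Share of each squared distance that comes from the house-value field alone.
house_share_bob = (alice[1] - bob[1]) ** 2 / d_alice_bob ** 2
house_share_carol = (alice[1] - carol[1]) ** 2 / d_alice_carol ** 2

print(f"Alice-Bob distance:   {d_alice_bob:,.0f} (house share {house_share_bob:.0%})")
print(f"Alice-Carol distance: {d_alice_carol:,.0f} (house share {house_share_carol:.0%})")
```

Even for Alice and Bob, whose balances differ by a factor of more than four while their house values differ by only a few percent, most of the squared distance comes from the house-value field.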
In order to put all (numerical) variables on the same footing, it is customary to apply a transformation to each numerical variable before building the model. After transformation, the new variables all have a mean equal to 0 and a variance equal to 1. These variables are said to have been standardized (or sometimes, and improperly, "normalized").
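The transformation is linear, and defined by:

$$z_i = \frac{x_i - \mu_i}{\sigma_i}$$

for each variable x_i, where \mu_i and \sigma_i denote the mean and the standard deviation of x_i, and z_i is the standardized variable.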
Note that each variable is standardized independently of the other variables: standardization is not a multivariate transformation. A more complex transformation can not only standardize every variable, but also turn the covariance matrix of the distribution into the identity matrix I_n (which ordinary standardization does not do). The distribution is then said to have been sphericized (the terms "sphered" and "whitened" are also used).
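The difference between the two transformations can be sketched on two correlated variables. The example below is only one possible construction: it uses the Cholesky factor of the 2 x 2 covariance matrix, which is one of several valid ways to obtain an identity covariance matrix. The data and the helper names (`standardize`, `covariance`, `sphere_2d`) are invented for the illustration.

```python
import math
import statistics

def standardize(values):
    """Standardize one variable on its own (population formulas, divide by n)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def sphere_2d(xs, ys):
    """Transform two variables so that their covariance matrix becomes I_2."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    # Cholesky factor of [[a, b], [b, c]]: [[l11, 0], [l21, l22]].
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    us, vs = [], []
    for x, y in zip(xs, ys):
        # Forward substitution: solve L (u, v) = (x - mx, y - my).
        u = (x - mx) / l11
        v = ((y - my) - l21 * u) / l22
        us.append(u)
        vs.append(v)
    return us, vs

# Made-up, positively correlated data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

zx, zy = standardize(xs), standardize(ys)
cov_after_standardizing = covariance(zx, zy)  # still far from 0 (about 0.83)

us, vs = sphere_2d(xs, ys)
cov_after_sphering = covariance(us, vs)       # 0 up to rounding error
```

After standardization each variable has unit variance but the two remain correlated; after the second transformation the variances are 1 and the covariance vanishes.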
The illustration below shows the effect of standardization on a variable distribution: the mean value is shifted to 0, and the distribution is squeezed (or expanded) so as to have unit variance.
The distribution that generated the sample is usually unknown. The analyst will therefore have to be satisfied with standardizing the sample, as illustrated below.
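Standardizing a sample can be sketched as follows: the unknown mean and standard deviation of the distribution are replaced by the values computed on the sample itself. The function name `standardize_sample` and the balances are hypothetical, and the choice of the divide-by-n formula (`statistics.pstdev`) rather than the divide-by-(n - 1) formula (`statistics.stdev`) is an assumption of this sketch.

```python
import statistics

def standardize_sample(sample):
    """Return the sample shifted and rescaled with its own mean and standard deviation."""
    mean = statistics.fmean(sample)
    # Assumption: population formula (divide by n); statistics.stdev divides by n - 1.
    std = statistics.pstdev(sample)
    if std == 0:
        raise ValueError("cannot standardize a constant sample")
    return [(x - mean) / std for x in sample]

balances = [1200.0, 850.0, 4300.0, 2750.0, 1900.0]  # made-up values, in dollars
z = standardize_sample(balances)

z_mean = statistics.fmean(z)   # 0 up to rounding error
z_std = statistics.pstdev(z)   # 1 up to rounding error
```

With the divide-by-(n - 1) convention the same code applies, provided the same formula is used both to standardize and to check the result.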
__________________________________________________________
The following interactive animation illustrates the standardization of a sample.
The upper sample (red points) is the original, non-standardized sample. Its mean is marked by a vertical blue line.
The lower sample (blue points) is the standardized sample:
* Its mean is always 0,
* Its standard deviation is always 1.
Move red points about with your mouse, and observe the corresponding changes of the standardized sample.
No scale is shown for the original sample, as any scale leads to the same standardized sample.
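This independence from the scale can be checked numerically. The sketch below standardizes the same hypothetical points expressed on two different scales (a positive factor plus a shift of origin, both chosen arbitrarily) and compares the results.

```python
import math
import statistics

def standardize(sample):
    """Shift to mean 0 and rescale to standard deviation 1 (divide-by-n formula)."""
    mean = statistics.fmean(sample)
    std = statistics.pstdev(sample)
    return [(x - mean) / std for x in sample]

positions = [0.2, 0.35, 0.5, 0.9]                 # hypothetical points, arbitrary units
rescaled = [100.0 * x + 40.0 for x in positions]  # same points, other scale and origin

z_original = standardize(positions)
z_rescaled = standardize(rescaled)

# The two standardized samples agree up to rounding error.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_original, z_rescaled))
```

The scale factor must be positive for the two results to coincide; a negative factor would mirror the standardized sample around 0.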
____________________________________________________________