
Standardization

Standardization of a random variable

Let X be a random variable (rv) with mean µ and variance σ². To standardize X is to transform it into another rv X ' by applying a linear transform

X ' = aX + b

such that X ' has mean 0 and unit variance.

By referring to the properties of elementary transformations of rvs, the reader will easily show that (taking a > 0) :

a = 1/σ and b = -µ/σ, that is, X ' = (X - µ) / σ
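As a minimal numerical sketch of this result, the snippet below applies the linear transform with a = 1/σ and b = -µ/σ to a small sample and checks the mean and standard deviation of the output. The sample values and the function name `standardize` are illustrative assumptions, and the population standard deviation (divisor n) is used for σ :

```python
import statistics

def standardize(sample):
    """Return a*x + b for each x, with a = 1/sigma and b = -mu/sigma."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)  # population SD (divisor n), an assumption of this sketch
    if sigma == 0:
        raise ValueError("a constant sample cannot be standardized")
    a = 1.0 / sigma
    b = -mu / sigma
    return [a * x + b for x in sample]

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
z = standardize(sample)

# The standardized sample should have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.fmean(z), statistics.pstdev(z))
```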

 

                                             

 

Why standardize ?

Confidence intervals and tests

The distribution of X ' does not depend on µ and σ anymore. These quantities are generally unknown, but it sometimes happens that replacing them by their estimators creates a new rv X '' whose distribution can be calculated : it is then possible to devise confidence intervals and tests about their sample values (see for example the t test).
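The one-sample t statistic is a familiar instance of this idea : the sample mean is centered on a hypothesized mean and divided by an estimate of its standard deviation. The sketch below (function name `t_statistic` and data are illustrative assumptions) also checks that the statistic is unchanged when the data and the hypothesized mean undergo the same change of origin and positive change of scale, which is the practical meaning of "does not depend on µ and σ" :

```python
import math
import statistics

def t_statistic(sample, mu0):
    """Sample mean, standardized with the estimated standard error s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample SD (divisor n - 1)
    return (mean - mu0) / (s / math.sqrt(n))

sample = [4.9, 5.1, 5.3, 4.7, 5.0, 5.4]  # illustrative values only
t_orig = t_statistic(sample, mu0=5.0)

# Same data and same hypothesis, expressed in another unit and origin.
scale, shift = 10.0, -3.0
rescaled = [scale * x + shift for x in sample]
t_rescaled = t_statistic(rescaled, mu0=scale * 5.0 + shift)

print(t_orig, t_rescaled)  # the two values agree up to rounding
```

If the sample comes from a normal distribution with mean mu0, this statistic follows a Student t distribution with n - 1 degrees of freedom, whatever the values of µ and σ.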

Convergence theorems

In loose terms, the Central Limit Theorem (CLT) states that the distribution of a sum of independent and identically distributed random variables becomes more and more "normal like" when the number of variables in the sum grows without limit. But at the same time the (absolute) value of the mean, and the variance of this distribution both tend to infinity, making it awkward to formalize this tendency to normality because the reference distribution (normal) keeps changing as the number of rvs in the sum increases.

Standardizing the sum makes its distribution converge to a fixed distribution (the standard normal distribution), thus making the property of "convergence to a normal distribution" completely unambiguous (see for example the animation about the binomial distribution).
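A small simulation can make this concrete. The sketch below assumes Uniform(0, 1) summands (mean 1/2, variance 1/12), a fixed seed, and arbitrary sample sizes; it standardizes each sum with its exact mean and variance and then compares a few summaries with those of the standard normal distribution :

```python
import math
import random
import statistics

def standardized_sum(n, rng):
    """Sum of n iid Uniform(0, 1) draws, standardized with mean n/2 and variance n/12."""
    s = sum(rng.random() for _ in range(n))
    return (s - n / 2) / math.sqrt(n / 12)

rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [standardized_sum(30, rng) for _ in range(20000)]

# For a standard normal variable, about 95% of the mass lies in [-1.96, 1.96].
inside = sum(abs(z) <= 1.96 for z in draws) / len(draws)
print(statistics.fmean(draws), statistics.pstdev(draws), inside)
```

Without the standardization step, the mean n/2 and the variance n/12 of the raw sum would both grow with n, and there would be no fixed distribution to compare with.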

Other results about convergence to a normal distribution are not direct consequences of the CLT, but also need a preliminary standardization step in order to be adequately formulated (see for example the convergence to normal of a Poisson distribution when the parameter λ tends to infinity).
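For the Poisson case, the standardized variable is (N - λ) / √λ, since a Poisson distribution has mean λ and variance λ. The following sketch (helper names are assumptions of this example) measures, at the integer points, the largest gap between the distribution function of the standardized Poisson variable and the standard normal distribution function, for increasing values of λ :

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_pmf(k, lam):
    """Poisson probability, computed on the log scale to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def max_cdf_gap(lam):
    """Largest gap over integers k between P(N <= k) and Phi((k - lam) / sqrt(lam))."""
    upper = int(lam + 10 * math.sqrt(lam)) + 10  # far enough into the right tail
    cdf, gap = 0.0, 0.0
    for k in range(upper + 1):
        cdf += poisson_pmf(k, lam)
        z = (k - lam) / math.sqrt(lam)  # standardized value of k
        gap = max(gap, abs(cdf - normal_cdf(z)))
    return gap

gaps = [max_cdf_gap(lam) for lam in (1, 10, 100, 1000)]
print(gaps)  # the gap shrinks as lam grows
```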

Standardization and data modeling

It is common to standardize the numerical data prior to building a model in order to deal with the following problem.

Suppose that two fields of a bank database are respectively :

    * The average monthly balance of a customer,

    * And the value of his house,

both expressed in the same currency.

Numbers in the first field will be much lower than numbers in the second field. It is then generally considered that such a large imbalance is detrimental to the quality of the model, because the large numbers of the second field will have a larger influence on the model than the numbers in the first field : a posteriori analysis of the importance of the model predictors will underestimate the true importance of the first field (average monthly balance).

After these two variables have been standardized, their true values are not available anymore, and only the general shape of their distributions and the level of their interactions (for example, correlation) will influence the model.
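The sketch below illustrates this on two made-up columns (the figures are invented for the example and do not come from any real data base). After standardization both columns are on the same scale, while their correlation is unchanged :

```python
import statistics

def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(xs, ys):
    """Pearson correlation computed from population moments."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = statistics.fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Hypothetical values, same currency: monthly balance and house value.
balance = [1200.0, 850.0, 4300.0, 2100.0, 640.0]
house_value = [210000.0, 180000.0, 520000.0, 340000.0, 150000.0]

z_balance = standardize(balance)
z_house = standardize(house_value)

# Both standardized columns now have unit spread...
print(statistics.pstdev(z_balance), statistics.pstdev(z_house))
# ...and the correlation between the two fields is preserved.
print(correlation(balance, house_value), correlation(z_balance, z_house))
```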

Most statistical software packages provide variable standardization.

Animation

The following interactive animation illustrates the standardization of a univariate sample.

 

 


 

 

The upper sample (red points) is the original, non-standardized sample. Its mean is marked by a vertical blue line.

The lower sample (blue points) is the standardized sample :

    * Its mean is always 0,

    * Its standard deviation is always 1.

 

Move the red points around with your mouse, and observe the corresponding changes in the standardized sample.


No scale is mentioned for the original sample, as any scale will produce the same standardized sample.
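This invariance is easy to check numerically. In the sketch below (sample values, scale factor and offset are arbitrary choices for the example), the same points are expressed on two different scales and standardized separately :

```python
import statistics

def standardize(sample):
    """Center on the mean and divide by the population standard deviation."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)
    return [(x - mu) / sigma for x in sample]

points = [0.3, 1.1, 1.4, 2.0, 3.7]            # arbitrary positions
rescaled = [100.0 * x + 25.0 for x in points]  # same points, other unit and origin

z_points = standardize(points)
z_rescaled = standardize(rescaled)

# Largest difference between the two standardized samples (rounding error only).
max_diff = max(abs(u - v) for u, v in zip(z_points, z_rescaled))
print(max_diff)
```

The scale factor must be positive : a negative factor would mirror the standardized sample around 0.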

 

Multivariate standardization, Mahalanobis transformation, "spherization"

Standardization generalizes to the multivariate case, that is, when X is a random vector.

The simplest approach is then to standardize each of the components of X individually. The weakness of this approach is that it does not take the couplings between these components into account. As a result, the components of the transformed vector have indeed 0 mean and unit variance, but they are still correlated, and the covariance matrix of X ' is not the identity matrix.

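Yet there exists a linear transformation that not only standardizes the components of X, but in addition makes the transformed components uncorrelated (the off-diagonal coefficients of the covariance matrix are 0) : this transformation is called the Mahalanobis transformation. If µ is the mean vector of X and Σ its covariance matrix (assumed to be invertible), it reads X ' = Σ^(-1/2) (X - µ).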

 

The vector is then said to be "spherized". Strictly speaking, this expression is correct only when the Mahalanobis transformation is applied to a vector following a multinormal distribution : the resulting distribution is then the spherically symmetric standard multinormal distribution with identity covariance matrix. Spherization of a multinormal distribution usually makes calculations quite a bit simpler, and the results can then be "carried over" back to the original distribution by the inverse Mahalanobis transformation.
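The sketch below compares the two approaches on a simulated two-dimensional sample, assuming the usual form of the Mahalanobis transformation, Σ^(-1/2) (X - µ), with µ and Σ replaced by the sample mean and sample covariance matrix. The helper names, the seed and the simulated data are assumptions of this example, and the closed-form square root used here is valid for 2x2 positive definite matrices only :

```python
import math
import random
import statistics

def covariance_2d(xs, ys):
    """Population covariance matrix of a 2-D sample, as ((sxx, sxy), (sxy, syy))."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.fmean([(x - mx) ** 2 for x in xs])
    syy = statistics.fmean([(y - my) ** 2 for y in ys])
    sxy = statistics.fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return (sxx, sxy), (sxy, syy)

def inverse_sqrt_2x2(cov):
    """Symmetric inverse square root of a 2x2 positive definite matrix."""
    (a, b), (_, d) = cov
    s = math.sqrt(a * d - b * b)      # square root of the determinant
    t = math.sqrt(a + d + 2.0 * s)    # trace of the matrix square root
    ra, rb, rd = (a + s) / t, b / t, (d + s) / t  # square root: (cov + s*I) / t
    det_r = ra * rd - rb * rb
    return (rd / det_r, -rb / det_r), (-rb / det_r, ra / det_r)

def mahalanobis_transform(xs, ys):
    """Apply Sigma^(-1/2) (X - mu) with the sample mean and covariance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    (w11, w12), (w21, w22) = inverse_sqrt_2x2(covariance_2d(xs, ys))
    us = [w11 * (x - mx) + w12 * (y - my) for x, y in zip(xs, ys)]
    vs = [w21 * (x - mx) + w22 * (y - my) for x, y in zip(xs, ys)]
    return us, vs

# Simulated correlated sample (illustrative only).
rng = random.Random(1)
g1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
g2 = [rng.gauss(0.0, 1.0) for _ in range(500)]
xs = [3.0 + 2.0 * a for a in g1]
ys = [-1.0 + 0.8 * a + 0.6 * b for a, b in zip(g1, g2)]

# Component-wise standardization: unit variances, but the coupling remains.
zx = [(x - statistics.fmean(xs)) / statistics.pstdev(xs) for x in xs]
zy = [(y - statistics.fmean(ys)) / statistics.pstdev(ys) for y in ys]
cov_componentwise = covariance_2d(zx, zy)

# Mahalanobis transformation: the covariance matrix becomes the identity.
us, vs = mahalanobis_transform(xs, ys)
cov_mahalanobis = covariance_2d(us, vs)

print(cov_componentwise)
print(cov_mahalanobis)
```

The off-diagonal term left by the component-wise approach is simply the correlation coefficient of the two original components.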

 

____________________________________________________________

 

Related readings :

Standard Deviation

Variance

Central Limit Theorem

Covariance matrix
