Histogram

Given a sample from an unknown probability distribution p(x), a histogram is a model intended to give a reasonably faithful graphical representation of that distribution.

It is built as follows:

• The analyst first decides on a bracket width Δx, and places points on the x axis so that consecutive points are Δx apart. The origin of this grid is more or less arbitrary and is not critical.
• The number of observations falling in each bracket is counted.
• A vertical rectangle, called a bin, is placed on top of each bracket, with a height equal to the number of observations in the bracket.

So a histogram is a series of bins placed side by side on the x axis, the height of each bin being the number of observations in the corresponding bracket.
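The three construction steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `build_histogram` and its parameters `bracket_width` and `origin` are hypothetical names chosen for this example, and NumPy is assumed to be available.

```python
import numpy as np

def build_histogram(sample, bracket_width, origin=0.0):
    """Count the observations in brackets of equal width.

    Hypothetical helper: brackets are [origin + k*width, origin + (k+1)*width).
    Returns the bracket edges and the bin heights (raw counts).
    """
    sample = np.asarray(sample, dtype=float)
    # Step 1: locate each observation on the grid of brackets.
    idx = np.floor((sample - origin) / bracket_width).astype(int)
    lo, hi = idx.min(), idx.max()
    # Step 2: count the observations in each bracket.
    counts = np.bincount(idx - lo, minlength=hi - lo + 1)
    # Step 3: the bin on top of bracket k has height counts[k].
    edges = origin + bracket_width * np.arange(lo, hi + 2)
    return edges, counts

edges, counts = build_histogram([0.1, 0.4, 0.6, 1.2, 2.9], bracket_width=1.0)
print(edges)   # bracket boundaries
print(counts)  # bin heights
```

Changing `origin` shifts the grid and can change the counts slightly, which is why the choice of origin is described above as arbitrary rather than as having no effect at all.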

# The histogram as a representation of p(x)

The sample is, as usual, hoped to be a faithful representation of p(x): many observations are expected in regions where p(x) is large, and few in regions where p(x) is small. Bins are therefore expected to be tall where p(x) is large, and short where p(x) is small. So the "skyline" of the (appropriately normalized) histogram is expected to be a faithful, discretized, staircase-like representation of p(x) itself (lower image of the illustration below).

# The histogram as a model

A histogram is:

• A descriptive model.
• A nonparametric model: it does not assume any analytical form for the underlying probability distribution.
• A local model: the height of a bin is decided by considering only a small region of the domain of the variable (the bracket).

A histogram is used mostly for its graphical value, as an illustration of what p(x) is supposed to look like. Yet, in more technical terms, its purpose is probability density estimation. Once the histogram is normalized so that the sum of the areas of the bins is 1 (for n observations and brackets of width Δx, each count is divided by nΔx), and given a number x0, the height of the bin that covers x0 is an estimate of p(x0). We now briefly develop this viewpoint.
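As a rough sketch of this density-estimation viewpoint, the code below normalizes the counts so that the total area of the bins is 1, then reads off the height of the bin covering a given x0. The function names `histogram_density` and `estimate_density` are hypothetical; the only library call assumed is NumPy's `np.histogram`, which returns the counts and the bracket edges.

```python
import numpy as np

def histogram_density(sample, n_bins):
    """Return bracket edges and bin heights normalized to total area 1."""
    counts, edges = np.histogram(sample, bins=n_bins)
    widths = np.diff(edges)
    # Dividing each count by n * width makes the bin areas sum to 1.
    heights = counts / (counts.sum() * widths)
    return edges, heights

def estimate_density(x0, edges, heights):
    """Estimate p(x0) as the height of the bin that covers x0."""
    if x0 < edges[0] or x0 > edges[-1]:
        return 0.0  # no bin covers x0
    i = np.searchsorted(edges, x0, side="right") - 1
    i = min(i, len(heights) - 1)  # x0 equal to the last edge belongs to the last bin
    return float(heights[i])

edges, heights = histogram_density([0.0, 0.2, 0.4, 1.0, 2.0], n_bins=2)
print(np.sum(heights * np.diff(edges)))       # total area of the bins
print(estimate_density(0.5, edges, heights))  # estimate of p(0.5)
```

The estimate is piecewise constant: every x0 inside the same bracket receives the same value, which is the "staircase" character of the histogram.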

# The histogram and the bias-variance tradeoff

Simple as it is, the histogram provides a very good illustration of one of the most fundamental aspects of practical, down-to-earth statistical data modeling: the so-called bias-variance tradeoff. For histograms, the bias-variance tradeoff reads as follows:

• Histograms with too few bins are of little use because they average out the details of the distribution (high bias).
• Histograms with too many bins are of little use because they merely show where the observations are (which we already know), but say nothing about their distribution (high variance; lower image in the illustration below).
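The two extremes are easy to reproduce numerically. The sketch below is only an illustration under stated assumptions: the sample is simulated from a standard normal distribution, the bin numbers 2 and 2000 are arbitrary choices, and `bin_counts` is a hypothetical helper built on NumPy's `np.histogram`.

```python
import numpy as np

def bin_counts(sample, n_bins):
    """Bin heights (raw counts) for a histogram with n_bins equal-width bins."""
    counts, _ = np.histogram(sample, bins=n_bins)
    return counts

rng = np.random.default_rng(0)
sample = rng.normal(size=200)  # simulated data, for illustration only

# Too few bins: two tall bins that hide the shape of the distribution.
coarse = bin_counts(sample, 2)

# Too many bins: with 200 observations spread over 2000 bins,
# most bins are necessarily empty and the rest hold very few points.
fine = bin_counts(sample, 2000)
empty_fraction = np.mean(fine == 0)

print(coarse)
print(fine.max(), empty_fraction)
```

With far more bins than observations, the histogram degenerates into a set of isolated spikes marking individual data points, which is the situation described in the second bullet.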

So a good histogram must have just the right number of bins. What is this number? The answer is disappointing: it cannot be calculated exactly, because the best choice depends on the unknown distribution p(x) itself. It can, however, be roughly estimated from the sample by validation techniques.
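One common validation technique, shown here only as an example and not necessarily the one the author has in mind, is leave-one-out cross-validation of the integrated squared error. For a histogram with equal bracket width h, n observations, and bin proportions p̂_j = n_j / n, the cross-validation score has the closed form J(h) = 2 / ((n-1)h) - (n+1) / ((n-1)h) · Σ p̂_j², and the number of bins with the smallest score is retained. The function names, the simulated sample, and the candidate range below are all assumptions made for the sketch.

```python
import numpy as np

def loo_cv_score(sample, n_bins):
    """Leave-one-out cross-validation score of an equal-width histogram.

    Lower is better. Estimates the integrated squared error up to a
    constant that does not depend on the number of bins.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    counts, edges = np.histogram(sample, bins=n_bins)
    h = edges[1] - edges[0]   # common bracket width
    p_hat = counts / n        # proportion of observations in each bracket
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

def choose_n_bins(sample, candidates):
    """Return the candidate number of bins with the smallest score."""
    scores = [loo_cv_score(sample, m) for m in candidates]
    return candidates[int(np.argmin(scores))]

rng = np.random.default_rng(0)
sample = rng.normal(size=500)        # simulated data, for illustration only
candidates = list(range(2, 101))     # arbitrary search range
best = choose_n_bins(sample, candidates)
print(best)
```

The selected value is itself a random quantity: a different sample from the same distribution will generally give a somewhat different number of bins, which is why the result is best read as a rough estimate.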

This question, which is of great practical importance, is developed in more detail on the next page. It is also illustrated in an interactive animation that you'll find here.
