INTERACTIVE ANIMATION: HISTOGRAM

This animation illustrates the histogram, and more particularly the bias-variance dilemma.


Frame

In the frame are:

1) A horizontal green line that represents a uniform distribution.

2) A sample drawn from this distribution. The sample size is shown in the "Nb Points" display and may be adjusted with the "Nb Points" buttons. A new sample is drawn each time the sample size changes.

3) The histogram (yellow) of the sample. The number of bins is shown in the "Nb Bins" display. You may change the number of bins with the "Nb Bins" buttons.

You may change the shape of the density function. Click several times inside the frame, either above or below the green line. For every click, a new sample is drawn, and its histogram is built.

Animation

Build a density function, select a sample size and a number of bins, and click on "Go". A series of samples is then drawn from the same density, and for each sample, the corresponding histogram is displayed.
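What the animation does at each step can be imitated with a minimal sketch. It assumes the default uniform density on [0, 1); the function names draw_sample and histogram are illustrative and are not taken from the animation itself.

```python
import random

def draw_sample(n_points, rng):
    """Draw n_points observations from the uniform density on [0, 1)."""
    return [rng.random() for _ in range(n_points)]

def histogram(sample, n_bins):
    """Return the bin heights, scaled so that the histogram has total area 1."""
    counts = [0] * n_bins
    for x in sample:
        # Guard against x * n_bins rounding up to n_bins.
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    width = 1.0 / n_bins
    return [c / (len(sample) * width) for c in counts]

# Same density, same settings, a new sample at each step (as with "Step").
rng = random.Random(0)
for step in range(3):
    heights = histogram(draw_sample(200, rng), n_bins=5)
    print(step, [round(h, 2) for h in heights])
```

Each printed line is one histogram. The heights scatter around 1, the height of the green line, and differ from one sample to the next.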

In the "Pause" mode, click on "Step" to draw new samples.

The purpose of this animation is to illustrate the "bias-variance dilemma".

1) Keep the sample size fixed, and change the number of bins (this can be done while the animation is running).

• When the number of bins is increased (and the bins are therefore made narrower), the positions of the observations are identified more and more accurately. The histogram then gives a reasonably faithful image of the sample, at the expense of the faithfulness of the representation of the density itself. In addition, the histogram becomes very unstable: its shape varies considerably from one sample to the next.
If the number of bins is further increased, gaps will appear between bins, even though the density is continuous.
In extreme situations, there will be at most one observation in each bin. The histogram is then nothing more than a useless representation of the sample.

In statistical terms, increasing the number of bins increases the variance of the histogram.
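This instability can be measured with a small simulation. The sketch below assumes a uniform density on [0, 1) and a fixed sample size of 200; the helper names and the number of repetitions are illustrative choices. For this particular density, the variance of a bin height with k bins and n points is (k - 1) / n, so the simulated values should grow roughly in proportion to the number of bins.

```python
import random
import statistics

def histogram(sample, n_bins):
    """Bin heights scaled so that the histogram has total area 1 on [0, 1)."""
    counts = [0] * n_bins
    for x in sample:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return [c * n_bins / len(sample) for c in counts]

def mean_height_variance(n_points, n_bins, n_samples=300, seed=0):
    """Variance of each bin height across repeated samples, averaged over bins."""
    rng = random.Random(seed)
    runs = [
        histogram([rng.random() for _ in range(n_points)], n_bins)
        for _ in range(n_samples)
    ]
    per_bin = [
        statistics.pvariance([run[b] for run in runs]) for b in range(n_bins)
    ]
    return statistics.mean(per_bin)

# Fixed sample size, increasing number of bins.
for k in (5, 20, 80):
    print(k, round(mean_height_variance(200, k), 3))
```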

• When there are too few bins (and their width is therefore too large), substantial variations in the positions of the observations will not change the number of observations in each bin. The histogram then becomes more stable. Unfortunately, this also means that densities with the same global shape but different smaller-scale structures will produce histograms with essentially the same shape: the low-resolution image of the density is correct, but fine details are lost because the image of the density is smoothed out.

In statistical terms, increasing the width of the bins increases the bias of the histogram.
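The loss of fine detail can be checked without drawing any sample, because the expected height of a histogram bin is simply the average of the density over that bin. The sketch below assumes a hypothetical density on [0, 1] that is flat on a large scale but carries a sinusoidal ripple; the amplitude and number of ripples are arbitrary illustrative values.

```python
import math

AMPLITUDE = 0.5   # size of the small-scale ripple (illustrative value)
CYCLES = 4        # number of ripples across [0, 1] (illustrative value)

def density(x):
    """Flat density with a small-scale ripple; integrates to 1 on [0, 1]."""
    return 1.0 + AMPLITUDE * math.sin(2 * math.pi * CYCLES * x)

def cdf(x):
    """Integral of density from 0 to x."""
    w = 2 * math.pi * CYCLES
    return x + AMPLITUDE * (1.0 - math.cos(w * x)) / w

def expected_histogram(n_bins):
    """Expected bin heights: the average of the density over each bin."""
    return [
        (cdf((j + 1) / n_bins) - cdf(j / n_bins)) * n_bins
        for j in range(n_bins)
    ]

def integrated_squared_bias(n_bins, grid=6400):
    """Midpoint-rule integral of (density - expected histogram) squared."""
    heights = expected_histogram(n_bins)
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) / grid
        j = min(int(x * n_bins), n_bins - 1)
        total += (density(x) - heights[j]) ** 2
    return total / grid

for k in (2, 8, 64):
    print(k, round(integrated_squared_bias(k), 4))
```

With 2 bins, the expected histogram is exactly flat: it cannot be told apart from the expected histogram of a uniform density, and the ripple is entirely lost. The integrated squared bias shrinks as the bins get narrower.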

So there has to be an "optimal" bin width that reaches a reasonable trade-off between:

• A stable, and therefore credible low resolution image of the density,
• And a more detailed, but also less stable, and therefore less credible image of this density.

Unfortunately, there is no clear-cut definition of what "optimal" means. Several definitions may be imagined, for example based on the Kullback-Leibler divergence, but none has an absolutely compelling status. At any rate, the practitioner pays little heed to such criteria, as their effectiveness depends on the actual shape of the density, which is of course unknown in practice.

2) For a given bin width, change the number of points (this can be done while the animation is running). Note that the histogram becomes more stable as the number of points gets larger. For a given level of stability (and therefore, of credibility), it then becomes possible to reduce the bin width, and therefore improve the "sharpness" of the image of the density. The "optimal" bin width gets smaller when samples become larger.
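The shrinking of the "optimal" bin width can be illustrated under one particular, and by no means compelling, definition of optimality: the mean integrated squared error, that is, the integrated squared bias plus the integrated variance. The sketch below assumes a hypothetical known density (flat with a sinusoidal ripple), which is precisely what is not available in practice. All names and parameter values are illustrative.

```python
import math

AMPLITUDE = 0.5   # size of the ripple (illustrative value)
CYCLES = 4        # whole number of ripples across [0, 1] (illustrative value)

def cdf(x):
    """Integral from 0 to x of f(t) = 1 + AMPLITUDE * sin(2*pi*CYCLES*t)."""
    w = 2 * math.pi * CYCLES
    return x + AMPLITUDE * (1.0 - math.cos(w * x)) / w

def mise(n_points, n_bins):
    """Exact mean integrated squared error of the histogram on [0, 1]."""
    h = 1.0 / n_bins
    # Probability that an observation falls in each bin.
    probs = [cdf((j + 1) * h) - cdf(j * h) for j in range(n_bins)]
    # Integral of f squared over [0, 1]; valid for a whole number of ripples.
    integral_f_squared = 1.0 + AMPLITUDE ** 2 / 2
    squared_bias = integral_f_squared - sum(p * p for p in probs) / h
    variance = sum(p * (1.0 - p) for p in probs) / (n_points * h)
    return squared_bias + variance

def best_bin_count(n_points, max_bins=200):
    """Number of bins that minimizes the error for this density."""
    return min(range(1, max_bins + 1), key=lambda k: mise(n_points, k))

for n in (100, 1000, 10000):
    print(n, best_bin_count(n))
```

Under these assumptions, the minimizing number of bins is larger for 10000 points than for 100 points, that is, the corresponding bin width is smaller.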