HISTOGRAMS AND THE BIAS-VARIANCE TRADEOFF

 

 

Please make sure that you have first read the entry on histograms.


What follows addresses the bias-variance tradeoff issue in the case of histograms. This text has little practical value as histograms are generally used for their graphical interpretation only. But because histograms are simple, we use them as an illustration of the universal bias-variance tradeoff problem.

Normalization of the histogram

If ni observations are in bin i, the area of this bin is ai = ni.Δx, where Δx is the bin width. The total area of the histogram is:

A = Σi ai = Σi ni.Δx = Δx.Σi ni = Δx.n

where n is the sample size.

But we want to compare the histogram to p(x), whose integral is 1, so we normalize the histogram by dividing all bin heights by A. The area of the normalized histogram is now 1.
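
As an illustration of this normalization, here is a minimal Python sketch. The function name `normalized_histogram`, the range starting at `x_min`, and the standard normal sample are assumptions made for the example only. Observations falling outside the chosen range are ignored, so n is taken here as the number of observations inside the range.

```python
import random

def normalized_histogram(sample, x_min, dx, n_bins):
    """Return bin heights scaled so that the total area of the histogram is 1."""
    counts = [0] * n_bins
    for x in sample:
        i = int((x - x_min) // dx)      # index of the bin covering x
        if 0 <= i < n_bins:
            counts[i] += 1
    n = sum(counts)                     # observations inside the range
    area = dx * n                       # A = Δx.n
    return [c / area for c in counts]   # heights ni / A

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # assumed p(x): standard normal
dx = 0.5
heights = normalized_histogram(sample, x_min=-5.0, dx=dx, n_bins=20)

total_area = sum(h * dx for h in heights)
print(total_area)
```

The printed total area should be 1 up to floating-point rounding, whatever the bin width.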

The histogram as an estimator

We mentioned that we can use the height of a bin as an estimator of p(x0) for any x0 that is covered by the bin. It is therefore appropriate to ask about the bias and the variance of this estimator.

 

For this purpose, we now go a bit into the details of how a histogram works.

Distribution of the number of observations in a bin

Should another sample be drawn from p(x), but the brackets kept as they were for the first sample, the exact positions of the observations will probably be different, and so will the heights of the bins. The height of a bin is therefore a random variable. What is its distribution?

Denote by Pi the area under the curve p(x) in the region delimited by bracket i. Any new observation will fall in bracket i with probability Pi by definition of a probability density function (or pdf).

 

 

 

In the above image, Pi is the green shaded area.

If n observations are drawn, the number of observations in bin i will follow the binomial distribution B(n, Pi).
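
The binomial claim can be checked by simulation. The sketch below assumes that p(x) is the standard normal density and that bracket i is [0, 0.5); both choices, and all names, are illustrative only. It draws many samples of size n, counts the observations falling in the bracket, and compares the empirical mean and variance of that count with the binomial values nPi and nPi(1 - Pi).

```python
import math
import random

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = 0.0, 0.5                        # assumed bracket i
P_i = normal_cdf(b) - normal_cdf(a)    # area under p(x) over the bracket
n = 200                                # sample size
n_samples = 2000                       # number of independent samples drawn
rng = random.Random(1)

counts = []
for _ in range(n_samples):
    draws = (rng.gauss(0.0, 1.0) for _ in range(n))
    counts.append(sum(1 for x in draws if a <= x < b))

mean_count = sum(counts) / n_samples
var_count = sum((c - mean_count) ** 2 for c in counts) / n_samples

print("mean count:", mean_count, " binomial mean:", n * P_i)
print("variance  :", var_count, " binomial variance:", n * P_i * (1 - P_i))
```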

Bias of the histogram

We now calculate the bias of the height of bin i considered as an estimator of p(x0), with x0 in bracket i (see above image). By definition:

Biasi(x0) = Expectation[Estimate - True Value] = E[Heighti - p(x0)] = E[Heighti] - p(x0)

The expectation of B(n, Pi) is nPi, so the average number of observations in bin i is nPi, and the average height of the normalized bin is nPi/A. So, for any x0 covered by bin i, the bias of the estimator Heighti is :

Biasi = nPi/A - p(x0) = Pi/Δx - p(x0)

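Is this bias large or small? Note that Pi/Δx is the average value of p(x) over bin i. If the bin is wide, p(x) may vary a lot within the bin, its average may be far from p(x0), and the bias may be large. If the bin is narrow, p(x) is nearly constant over the bin, so Pi ~ p(x0).Δx and the bias is close to 0.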

 

In summary:

"Histograms with large bins exhibit large biases, but histograms with narrow bins are almost unbiased."

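The effect of the bin width on the bias can be computed exactly when p(x) is known. The sketch below assumes a standard normal p(x), takes x0 = 1, and places the left edge of the bin at x0; these are illustrative assumptions, as are the function names. It evaluates Pi/Δx - p(x0) for several bin widths.

```python
import math

def normal_pdf(x):
    """Density of the standard normal (the assumed p(x))."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bin_bias(x0, dx):
    """Bias Pi/Δx - p(x0) for the bin [x0, x0 + dx]."""
    P_i = normal_cdf(x0 + dx) - normal_cdf(x0)
    return P_i / dx - normal_pdf(x0)

x0 = 1.0
biases = {dx: bin_bias(x0, dx) for dx in (2.0, 1.0, 0.5, 0.1, 0.01)}
for dx, b in biases.items():
    print(f"dx = {dx:5.2f}   bias = {b:+.5f}")
```

Under these assumptions the absolute bias shrinks steadily as the bin gets narrower.
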
Variance of the histogram

What is the variance of the estimate of p(x0)? The variance of B(n, Pi) is nPi(1 - Pi), and so the variance of the height of the normalized bin i is:

Var(Heighti) = nPi(1 - Pi) / (Δx.n)² = Pi(1 - Pi) / (n.(Δx)²)

Is this variance large or small? For wide bins, the denominator n.(Δx)² is large, so the variance is small. For narrow bins, Pi ~ p(x0).Δx, and so:

Var(Heighti) ~ {p(x0).Δx.(1 - p(x0).Δx)} / (n.(Δx)²) = {p(x0).(1 - p(x0).Δx)} / (n.Δx) ~ p(x0) / (n.Δx)

The variance is then high, and indeed tends to infinity when Δx tends to 0 (the heights of the normalized non-empty bins tend to infinity when Δx tends to 0, to keep the total area of the histogram equal to 1).

 

In summary:

"Histograms with large bins have a low variance, but histograms with narrow bins have a large variance."

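The variance formula can be evaluated in the same way. The sketch below again assumes a standard normal p(x), x0 = 1, a bin with its left edge at x0, and n = 1000; all of these are illustrative choices. It compares the exact variance Pi(1 - Pi) / (n.(Δx)²) with the narrow-bin approximation p(x0) / (n.Δx).

```python
import math

def normal_pdf(x):
    """Density of the standard normal (the assumed p(x))."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def height_variance(x0, dx, n):
    """Variance of the normalized height of the bin [x0, x0 + dx]."""
    P_i = normal_cdf(x0 + dx) - normal_cdf(x0)
    return P_i * (1.0 - P_i) / (n * dx ** 2)

n = 1000
x0 = 1.0
widths = (2.0, 1.0, 0.5, 0.1, 0.01)
variances = [height_variance(x0, dx, n) for dx in widths]
approximations = [normal_pdf(x0) / (n * dx) for dx in widths]

for dx, v, a in zip(widths, variances, approximations):
    print(f"dx = {dx:5.2f}   variance = {v:.6f}   p(x0)/(n.dx) = {a:.6f}")
```

The approximation is only meant for narrow bins; for wide bins the two columns differ noticeably.
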
The bias-variance tradeoff

So it appears that as the bin width Δx changes, the bias of the estimator Heighti and its variance always go in opposite directions. The analyst can therefore use Δx as a "tuning device" to adjust the tradeoff between bias and variance as desired. This is but one example of a ubiquitous and fundamental aspect of data modeling: the bias-variance tradeoff.

 

Now, how should the bin width Δx be chosen?

Mean Square error of the histogram

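If the histogram is built just for the purpose of illustrating an approximation of p(x) known through a sample, trial and error will show the following: with very wide bins, the histogram is too coarse to show the shape of p(x), and with very narrow bins, it is too irregular, because the bin heights fluctuate strongly from one bin to the next.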

 

So, even without any theoretical considerations, it is clear that the most faithful image of p(x) is obtained for a certain value of Δx, no smaller, no larger.

-----

Now if we insist on using the histogram as a probability density estimator, we want the estimates made by this model to be as accurate as possible. The classical measure of the quality of an estimator is its Mean Square Error, the mean value of the squared difference between the estimate and the true value:

Mean Square Error = E[(Estimate - True Value)²]

It can easily be shown that:

Mean Square Error = Bias² + Variance

and we could consider applying this general expression to the particular case of the bin height of a histogram.
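
For reference, the decomposition can be derived in a few lines. In this sketch, \hat{\theta} denotes the estimate and \theta the true value; both symbols are introduced here for illustration only, and in the histogram case they stand for Heighti and p(x0). The cross term vanishes because E[\hat{\theta} - E[\hat{\theta}]] = 0.

```latex
\begin{align*}
\mathrm{MSE}
  &= E\big[(\hat{\theta} - \theta)^2\big] \\
  &= E\big[\big((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta)\big)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta)^2 \\
  &= \mathrm{Variance} + 0 + \mathrm{Bias}^2
\end{align*}
```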

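There is no need to carry out the calculation explicitly to reach the following conclusions: the Mean Square Error is large both for very wide bins (large bias) and for very narrow bins (large variance), so there is an optimal bin width somewhere in between, but this optimal width depends on p(x), which is unknown.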

 

We are therefore in a vicious circle, for finding the best estimator of p(x) would require knowing p(x) and if we did, there would be no need to estimate it.
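
To make this circularity concrete, the sketch below does what cannot be done in practice: it assumes that p(x) is known (a standard normal density) and computes Bias² + Variance of the bin height at a single point x0 = 1, for a bin with its left edge at x0 and a sample size n = 1000. All of these choices, and the function names, are assumptions made for illustration. The bin width with the smallest Mean Square Error is then read off a grid of candidate widths.

```python
import math

def normal_pdf(x):
    """Density of the standard normal (the p(x) we pretend to know)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mean_square_error(x0, dx, n):
    """Bias² + Variance of the normalized height of the bin [x0, x0 + dx]."""
    P_i = normal_cdf(x0 + dx) - normal_cdf(x0)
    bias = P_i / dx - normal_pdf(x0)
    variance = P_i * (1.0 - P_i) / (n * dx ** 2)
    return bias ** 2 + variance

n = 1000
x0 = 1.0
widths = [k / 100 for k in range(1, 201)]            # 0.01, 0.02, ..., 2.00
errors = [mean_square_error(x0, dx, n) for dx in widths]
best_dx = widths[errors.index(min(errors))]

print("bin width with the smallest MSE on the grid:", best_dx)
print("MSE at dx = 0.01:", errors[0])
print("MSE at best dx  :", min(errors))
print("MSE at dx = 2.00:", errors[-1])
```

The minimum lies strictly inside the grid, but finding it required evaluating p(x), which is exactly what is not available in a real estimation problem.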

_____________________________

In conclusion, the histogram, considered as a model, displays the universal phenomenon known as the bias-variance tradeoff. The architecture of a histogram is specified by the number of its bins (within a certain range of the variable x), and a fully constructed histogram is specified by the list of the values of the heights of its bins.


Theory shows that there is an optimal bin width, but that it cannot be calculated. It could be roughly estimated by validation techniques (although nobody would spend any time doing that for the benefit of the humble histogram).
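
As a rough sketch of what such a validation could look like, the code below uses one possible technique among many (it is not taken from this entry): the histogram is built on a training sample, and each candidate bin width is scored on a separate validation sample by Σ (height² × width) - 2 × (mean height at the validation points). This score estimates the integrated squared error between the histogram and p(x) up to a constant that does not depend on the bin width. The standard normal data, the range [-6, 6], the candidate widths and all names are assumptions made for the example.

```python
import random

def fit_histogram(train, x_min, x_max, dx):
    """Return (heights, width) of a normalized histogram on [x_min, x_max)."""
    n_bins = max(1, round((x_max - x_min) / dx))
    width = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for x in train:
        if x_min <= x < x_max:
            i = min(int((x - x_min) / width), n_bins - 1)
            counts[i] += 1
    n_inside = sum(counts)
    return [c / (n_inside * width) for c in counts], width

def validation_score(heights, width, x_min, x_max, valid):
    """Estimated integrated squared error, up to a constant (lower is better)."""
    def density(x):
        if not (x_min <= x < x_max):
            return 0.0
        return heights[min(int((x - x_min) / width), len(heights) - 1)]
    integral_of_square = sum(h * h * width for h in heights)
    mean_density = sum(density(x) for x in valid) / len(valid)
    return integral_of_square - 2.0 * mean_density

rng = random.Random(2)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
valid = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x_min, x_max = -6.0, 6.0

scores = {}
for dx in (0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 3.0, 6.0):
    heights, width = fit_histogram(train, x_min, x_max, dx)
    scores[dx] = validation_score(heights, width, x_min, x_max, valid)

best_dx = min(scores, key=scores.get)
for dx, s in scores.items():
    print(f"dx = {dx:4.2f}   score = {s:+.5f}")
print("selected bin width:", best_dx)
```

The selected width depends on the sample and on the candidate grid, so it is only a rough estimate of the optimal bin width.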

 

The same pattern occurs for any kind of model.
