HISTOGRAMS AND THE BIAS-VARIANCE TRADEOFF
Please read the entry on histograms first.
What follows addresses the bias-variance tradeoff in the case of histograms. This text has little practical value, as histograms are generally used only for their graphical interpretation. But because histograms are simple, we use them to illustrate the universal bias-variance tradeoff problem.
If ni observations are in bin i and Δx is the bin width, the area of this bin is ai = ni.Δx. The total area of the histogram is:
A = Σi ai = Σi ni.Δx = Δx.Σi ni = Δx.n
where n is the sample size.
But we want to compare the histogram to p(x), whose integral is 1, so we normalize the histogram by dividing all bin heights by A. The area of the normalized histogram is now 1.
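The normalization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalized_heights`, the standard normal sample, and the bin layout (20 bins of width 0.5 starting at -5) are all hypothetical choices, not part of the original text.

```python
import random

def normalized_heights(sample, x_min, bin_width, n_bins):
    """Bin the sample and return heights scaled so the total area is 1."""
    counts = [0] * n_bins
    for x in sample:
        i = int((x - x_min) // bin_width)  # index of the bin covering x
        if 0 <= i < n_bins:
            counts[i] += 1
    n = sum(counts)              # observations that fell inside the bins
    total_area = bin_width * n   # A = Δx.n
    return [c / total_area for c in counts]

random.seed(0)
# Assumption: p(x) is the standard normal density.
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
bin_width = 0.5
heights = normalized_heights(sample, x_min=-5.0, bin_width=bin_width, n_bins=20)

# Area of the normalized histogram: sum of (height x width) over all bins.
area = sum(h * bin_width for h in heights)
print(area)
```

Note that the sketch takes n to be the number of observations that actually fall inside the bins, so the area is 1 even if a few observations lie outside the covered range.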
We mentioned that we can use the height of a bin as an estimator of p(x0) for any x0 that is covered by the bin. It is therefore appropriate to ask about the bias and the variance of this estimator.
For this purpose, we now look in more detail at how a histogram works.
Should another sample be drawn from p(x), but the brackets kept as they were for the first sample, the exact positions of the observations will probably be different, and so will the heights of the bins. The height of a bin is therefore a random variable. What is its distribution ?
Denote by Pi the area under the curve p(x) in the region delimited by bracket i. Any new observation will fall in bracket i with probability Pi by definition of a probability density function (or pdf).

In the above image, Pi is the green shaded area.
If n observations are drawn, the number of observations in bin i will follow the binomial distribution B(n, Pi).
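The binomial behavior of a bin count can be checked by simulation. The sketch below assumes, purely for illustration, that p(x) is the standard normal density and that bracket i is [0, 0.5); the sample size, the number of repetitions, and all names are hypothetical choices.

```python
import math
import random

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 200                  # sample size
left, right = 0.0, 0.5   # bracket i (assumed for this example)
P_i = normal_cdf(right) - normal_cdf(left)  # area under p(x) over the bracket

random.seed(1)
trials = 2000            # number of independent samples drawn from p(x)
counts = []
for _ in range(trials):
    sample = (random.gauss(0.0, 1.0) for _ in range(n))
    counts.append(sum(1 for x in sample if left <= x < right))

mean_count = sum(counts) / trials
var_count = sum((c - mean_count) ** 2 for c in counts) / trials

# Compare the simulated moments with those of B(n, Pi).
print(mean_count, n * P_i)
print(var_count, n * P_i * (1.0 - P_i))
```

The simulated mean and variance of the count should be close to nPi and nPi(1 - Pi), up to Monte Carlo noise.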
We now calculate the bias of the height of bin i considered as an estimator of p(x0), with x0 in bracket i (see above image). By definition :
Biasi(x0) = Expectation[Estimate - True Value] = E[Heighti - p(x0)] = E[Heighti] - p(x0)
The expectation of B(n, Pi) is nPi, so the average number of observations in bin i is nPi, and the average height of the normalized bin is nPi/A. So, for any x0 covered by bin i, the bias of the estimator Heighti is :
Biasi = nPi/A - p(x0) = Pi/Δx - p(x0)
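Is this bias large or small?
The ratio Pi/Δx is the average value of p(x) over bracket i. If the bin is narrow, p(x) hardly varies within it, so this average is close to p(x0) and the bias is small. If the bin is wide, p(x) may vary substantially across it, and its average can differ markedly from p(x0).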
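The expression Pi/Δx - p(x0) can be evaluated exactly when p(x) is known. The sketch below assumes a standard normal p(x), takes x0 = 1, and places the bin so that its left edge is at x0; these choices, and the function names, are illustrative assumptions only.

```python
import math

def normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bin_bias(x0, left, width):
    """Bias Pi/Δx - p(x0) for the bin [left, left + width) covering x0."""
    P_i = normal_cdf(left + width) - normal_cdf(left)
    return P_i / width - normal_pdf(x0)

x0 = 1.0
widths = [2.0, 1.0, 0.5, 0.1, 0.01]
biases = [bin_bias(x0, x0, w) for w in widths]
for w, b in zip(widths, biases):
    print(f"bin width = {w:5.2f}   bias = {b:+.5f}")
```

With this setup the magnitude of the bias shrinks steadily as the bin gets narrower.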
In summary :
"Histograms with large bins exhibit large biases, but histograms with narrow bins are almost unbiased."
What is the variance of the estimate of p(x0)? The variance of B(n, Pi) is nPi(1 - Pi), and so the variance of the height of the normalized bin i is:
Var(Heighti) = nPi(1 - Pi) / (Δx.n)² = Pi(1 - Pi) / (n.(Δx)²)
Is this variance large or small?
If the bin is narrow, Pi ~ p(x0).Δx, and so:
Var(Heighti) ~ {p(x0).Δx.(1 - p(x0).Δx)} / (n.(Δx)²) = {p(x0).(1 - p(x0).Δx)} / (n.Δx) ~ p(x0) / (n.Δx)
For narrow bins the variance is therefore high, and indeed tends to infinity when Δx tends to 0 (the heights of the normalized non-empty bins tend to infinity when Δx tends to 0, to keep the total area of the histogram equal to 1).
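The growth of the variance as the bin narrows can be illustrated numerically. The sketch below again assumes a standard normal p(x), x0 = 1 at the left edge of the bin, and n = 1000; it compares the exact variance of the normalized height with the narrow-bin approximation p(x0)/(n.Δx). All names and values are hypothetical.

```python
import math

def normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def height_variance(left, width, n):
    """Exact variance Pi(1 - Pi) / (n.Δx²) of the normalized bin height."""
    P_i = normal_cdf(left + width) - normal_cdf(left)
    return P_i * (1.0 - P_i) / (n * width ** 2)

x0, n = 1.0, 1000
widths = [2.0, 1.0, 0.5, 0.1, 0.01]
variances = [height_variance(x0, w, n) for w in widths]
approximations = [normal_pdf(x0) / (n * w) for w in widths]
for w, v, a in zip(widths, variances, approximations):
    print(f"bin width = {w:5.2f}   exact = {v:.6f}   approx = {a:.6f}")
```

The approximation is poor for wide bins, where Pi is far from p(x0).Δx, and becomes accurate as Δx shrinks.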
In summary :
"Histograms with large bins have a low variance, but histograms with narrow bins have a large variance."
So it appears that as the bin width Δx changes, the bias of the estimator Heighti and its variance move in opposite directions. The analyst can therefore use Δx as a "tuning device" to adjust the tradeoff between bias and variance as desired. This is but one example of a ubiquitous and fundamental aspect of data modeling: the bias-variance tradeoff.
Now, how should the bin width Δx be chosen ?
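If the histogram is built just for the purpose of illustrating an approximation of p(x) known through a sample, trial and error will show that bins that are too wide blur the shape of p(x), while bins that are too narrow produce an erratic picture dominated by sampling noise.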
So, even without any theoretical considerations, it is clear that the most faithful image of p(x) is obtained for a certain value of Δx, no smaller, no larger.
-----
Now if we insist on using the histogram as a probability density estimator, we want the estimates made by this model to be as accurate as possible. The classical measure of the quality of an estimator is its Mean Square Error, the mean value of the squared difference between the estimate and the true value:
Mean Square Error = E[(Estimate - True Value)²]
It can easily be shown that :
Mean Square Error = Bias² + Variance
and we could consider applying this general expression to the particular case of the bin height of a histogram.
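As an illustration of what this application looks like, the sketch below adds the squared bias Pi/Δx - p(x0) and the variance Pi(1 - Pi)/(n.Δx²) obtained earlier, and scans a grid of bin widths. It assumes that p(x) is known (standard normal), which is exactly what is not available in practice; x0 = 1, n = 1000, the bin placement, and the grid of widths are hypothetical choices.

```python
import math

def normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def height_mse(x0, left, width, n):
    """Bias² + Variance of the normalized height of bin [left, left + width)."""
    P_i = normal_cdf(left + width) - normal_cdf(left)
    bias = P_i / width - normal_pdf(x0)
    variance = P_i * (1.0 - P_i) / (n * width ** 2)
    return bias ** 2 + variance

x0, n = 1.0, 1000
widths = [0.01 * k for k in range(1, 201)]          # Δx from 0.01 to 2.0
mses = [height_mse(x0, x0, w, n) for w in widths]   # bin's left edge at x0
best_width = widths[mses.index(min(mses))]
print("width with the smallest Mean Square Error on the grid:", best_width)
```

In this setup the smallest Mean Square Error is reached at an intermediate width, neither the narrowest nor the widest bin on the grid.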
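There is no need to carry out the calculation explicitly to reach the following conclusions: there is a value of Δx that minimizes the Mean Square Error, and this value depends on the unknown density p(x).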
We are therefore in a vicious circle, for finding the best estimator of p(x) would require knowing p(x) and if we did, there would be no need to estimate it.
_____________________________
In conclusion, the histogram, considered as a model, displays the universal phenomenon known as the bias-variance tradeoff. The architecture of a histogram is specified by the number of its bins (within a certain range of the variable x), and a fully constructed histogram is specified by the list of the values of the heights of its bins.
Theory shows that there is an optimal bin width, but that it cannot be calculated. It could be roughly estimated by validation techniques (although nobody would spend any time doing that for the benefit of the humble histogram).
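Purely as an illustration of what such a validation technique might look like, the sketch below uses one possible choice that the text does not name: the standard leave-one-out cross-validation score for histograms, J(Δx) = 2/((n-1).Δx) - (n+1)/((n-1).Δx) . Σi (ni/n)², which estimates the integrated squared error up to a constant that does not depend on Δx. The sample, the candidate numbers of bins, and the function name `cv_risk` are assumptions.

```python
import random

def cv_risk(sample, n_bins):
    """Leave-one-out cross-validation score of a histogram with n_bins bins."""
    n = len(sample)
    lo, hi = min(sample), max(sample)
    h = (hi - lo) / n_bins                       # bin width Δx
    counts = [0] * n_bins
    for x in sample:
        i = min(int((x - lo) / h), n_bins - 1)   # the maximum goes in the last bin
        counts[i] += 1
    sum_sq = sum((c / n) ** 2 for c in counts)   # Σi (ni/n)²
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum_sq

random.seed(2)
# Assumption: the sample comes from a standard normal distribution.
sample = [random.gauss(0.0, 1.0) for _ in range(500)]

candidates = list(range(2, 101))                 # numbers of bins to try
risks = [cv_risk(sample, m) for m in candidates]
best_bins = candidates[risks.index(min(risks))]
print("number of bins with the lowest cross-validation score:", best_bins)
```

The score uses only the sample, not p(x), which is what makes it usable in practice; the selected number of bins is itself a random quantity and only a rough estimate of the optimum.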
The same pattern occurs for any kind of model.