The Kullback-Leibler distance (K-L) is a measure of the dissimilarity between two fully specified probability distributions.
Let p1(x) and p2(x) be two continuous probability distributions. By definition, the Kullback-Leibler distance D(p1, p2) between p1(x) and p2(x) is :

D(p1, p2) = ∫ p1(x) log[ p1(x) / p2(x) ] dx
with a similar expression in the discrete case.
If E denotes the expectation with respect to the distribution p1, this expression may also be written :

D(p1, p2) = E[ log( p1(x) / p2(x) ) ]
It is common to encounter the symmetric version of the K-L distance between p1 and p2 :
Ds(p1, p2) = [D(p1, p2) + D(p2, p1)] / 2
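As a concrete illustration of these definitions (a minimal sketch; the function names are ours, not part of the text), here is how the asymmetric and symmetric K-L distances between two discrete distributions over the same support can be computed :

```python
import math

def kl(p, q):
    """Asymmetric K-L distance D(p, q) between two discrete distributions.

    p and q are sequences of probabilities over the same support,
    with q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_symmetric(p, q):
    """Symmetric version: Ds(p, q) = [D(p, q) + D(q, p)] / 2."""
    return (kl(p, q) + kl(q, p)) / 2

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))   # the two asymmetric distances differ in general
print(kl_symmetric(p, q))
```

Note that kl(p, q) and kl(q, p) are generally different, which is why the symmetric version is often preferred in practice.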
We analyze here the relation between "Kullback-Leibler distance" and "Parameter estimation by the method of Maximum Likelihood".
We establish here the analytical form of the K-L distance (both asymmetric and symmetric) between two normal distributions p1 and p2.
Software packages occasionally use the K-L distance in the context of model validation. The easiest way to split a sample into a training set and a validation set is simply to draw (without replacement) the validation set at random from the complete sample. The problem is that you may be unlucky and draw a biased validation set, that is, a set whose distribution is substantially different from the sample's distribution. The estimation of the performance of the model based on this biased validation set is then itself biased, a soft word for "wrong".
Drawing a subset from a sample is fairly fast, so it is sometimes proposed to generate a large number of candidate validation sets and to retain, as the final validation set, the one whose K-L distance to the remaining training set is smallest.
Because the K-L distance makes sense only for distributions (as opposed to finite samples), one has to assume that both the candidate validation set and the remaining training set are normally distributed. The corresponding normal distributions are then determined by Maximum Likelihood, and their (symmetric) K-L distance is computed using the closed-form expression for normal distributions.
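The whole selection procedure might be sketched in Python as follows (an illustrative sketch, not the page's implementation; the function names and the number of candidate splits are our own choices) :

```python
import math
import random

def kl_normal(m1, s1, m2, s2):
    """Closed-form D(p1, p2) for p1 = N(m1, s1^2) and p2 = N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def symmetric_kl_normal(m1, s1, m2, s2):
    """Symmetric version Ds = [D(p1, p2) + D(p2, p1)] / 2."""
    return (kl_normal(m1, s1, m2, s2) + kl_normal(m2, s2, m1, s1)) / 2

def ml_normal(sample):
    """Maximum Likelihood estimates: sample mean and (biased, 1/n) std."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m)**2 for x in sample) / n)
    return m, s

def best_validation_split(sample, n_val, n_candidates=200, seed=0):
    """Among random splits, keep the one whose validation set is 'closest'
    (symmetric K-L distance between the fitted normals) to the training set."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_candidates):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        val, train = shuffled[:n_val], shuffled[n_val:]
        d = symmetric_kl_normal(*ml_normal(val), *ml_normal(train))
        if best is None or d < best[0]:
            best = (d, train, val)
    return best  # (distance, training set, validation set)
```

The sketch assumes both subsets have a strictly positive standard deviation, which holds as soon as each contains at least two distinct values.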
The use of the K-L as a measure of the dissimilarity between samples is illustrated in the following interactive illustration.
Because of the normality assumption, the K-L distance as a measure of the dissimilarity between samples uses only the first two moments (mean and variance) of the samples. Therefore, two samples may have fairly different distributions and yet exhibit a low value for their K-L distance. All it takes is for the samples to have essentially identical means and standard deviations, but largely different higher-order moments (especially skewness and kurtosis).
You may experiment with this idea in the interactive illustration.
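The same idea can be checked numerically (an illustrative Python sketch, not part of the page): both samples below are standardized to mean 0 and standard deviation 1, so their Maximum Likelihood normal fits coincide and the K-L distance is essentially 0, even though one sample is strongly skewed.

```python
import math

def ml_normal(sample):
    """Maximum Likelihood estimates: sample mean and (biased, 1/n) std."""
    n = len(sample)
    m = sum(sample) / n
    return m, math.sqrt(sum((x - m)**2 for x in sample) / n)

def kl_normal(m1, s1, m2, s2):
    """Closed-form D(p1, p2) for two normals."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def standardized(sample):
    """Rescale a sample to mean 0 and standard deviation 1."""
    m, s = ml_normal(sample)
    return [(x - m) / s for x in sample]

a = [-2.0, -1.0, 0.0, 1.0, 2.0]       # symmetric sample
b = [-1.0, -1.0, -1.0, -1.0, 4.0]     # strongly skewed sample
a_std = standardized(a)
b_std = standardized(b)

d = kl_normal(*ml_normal(a_std), *ml_normal(b_std))
# d is numerically 0: the fitted normals are identical,
# although the two samples have very different shapes (skewness).
```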
The following figure illustrates the Kullback-Leibler distance.
The illustration has two modes of operation :
1) The "Gaussians" mode (selected by default), displaying the K-L distances (both symmetric and asymmetric) between two normal distributions.
2) Or the "Samples" mode (click on the "Samples" button), displaying the K-L distances (both symmetric and asymmetric) between two arbitrary samples, together with the corresponding Maximum Likelihood normal distributions.
In the "Gaussian mode" :
* Change the means and standard deviations of the gaussians, and observe the variations of the K-L distances (expressed in arbitrary units).
* Make the two gaussians have the same mean. Observe that the gaussian with the smaller standard deviation is "closer" to the other one than the reverse. More generally, observe that the narrower gaussian is always "closer" to the other one than the reverse.
* Observe that the two asymmetric distances are equal only when the two standard deviations are equal.
* Change the standard deviation of one of the gaussians, while keeping its mean constant. Observe that the symmetric distance has a minimum : there is an optimal value of the standard deviation, all the other parameters being held constant.
* The same is true for the asymmetric distance. What do you notice when the standard deviation reaches this optimal value ?
In the "Samples" mode :
* Drag the points with your mouse, and observe the variations of the K-L distances between the two samples.
* Try to construct two samples with "0" K-L distance, and yet that have substantially different distributions. Conclude that this is possible only if at least one of the two samples severely departs from normality.
Points may occasionally refuse to be dragged. This is to avoid generating samples with too small a standard deviation, which would make the corresponding gaussian curves arbitrarily tall.
The definition of the Kullback-Leibler distance may seem rather arbitrary. In fact, it is deeply rooted in Information Theory, but we show here that a simple line of reasoning about the likelihood of a distribution p1(x) with respect to a sample drawn from another distribution p2(x) leads quite naturally to this definition.
The K-L distance is always positive, or equal to 0 when the two distributions are identical. This is not obvious from the definition, but it is a simple consequence of Jensen's inequality.
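The argument can be written in a few lines: since -log is convex, Jensen's inequality applied to the expectation with respect to p1 gives

```latex
D(p_1, p_2)
= -\int p_1(x)\,\log\frac{p_2(x)}{p_1(x)}\,dx
\;\ge\; -\log\!\int p_1(x)\,\frac{p_2(x)}{p_1(x)}\,dx
= -\log\!\int p_2(x)\,dx
= -\log 1 = 0,
```

with equality if and only if p2(x)/p1(x) is constant, that is, if and only if p1 and p2 are identical.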
We finally confirm the relation between "Kullback-Leibler distance" and "Likelihood" by showing that the method of Maximum Likelihood is equivalent to minimizing the Kullback-Leibler distance between :
* the candidate distribution qθ(x)
* and an approximation p*(x) of the unknown distribution p(x) estimated from the available (large) sample.
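In symbols, with p*(x) the empirical approximation built from the sample {x_1, ..., x_n}, the equivalence reads: the term E[log p*(x)] does not depend on θ, so maximizing the average log-likelihood is the same as minimizing the K-L distance,

```latex
\hat{\theta}_{\mathrm{ML}}
= \arg\max_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \log q_{\theta}(x_i)
= \arg\min_{\theta}\; \Big[\, E_{p^*}\!\big[\log p^*(x)\big]
      - \frac{1}{n}\sum_{i=1}^{n} \log q_{\theta}(x_i) \,\Big]
= \arg\min_{\theta}\; D(p^*, q_{\theta}).
```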
Justification of the definition of the Kullback-Leibler distance
Limit of the likelihood for large samples
Entropy and minimum value of this limit
Definition of the Kullback-Leibler distance
The Kullback-Leibler distance is nonnegative
The Kullback-Leibler distance is not symmetric
Distance between two normal distributions
Distance between two uniform distributions
Maximum Likelihood and Kullback-Leibler distance
The Kullback-Leibler distance is usually difficult to calculate explicitly. But the calculation can be carried through when both distributions are normal.
The calculation is a bit long, but quite instructive.
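For reference, the calculation yields the standard closed form D(p1, p2) = log(σ2/σ1) + [σ1² + (μ1 - μ2)²] / (2σ2²) - 1/2 for p1 = N(μ1, σ1²) and p2 = N(μ2, σ2²), which can be coded directly (a Python sketch; the function names are ours) :

```python
import math

def kl_normal(m1, s1, m2, s2):
    """Asymmetric K-L distance D(p1, p2) between p1 = N(m1, s1^2)
    and p2 = N(m2, s2^2):
        D = log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
    """
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_normal_symmetric(m1, s1, m2, s2):
    """Symmetric version Ds(p1, p2) = [D(p1, p2) + D(p2, p1)] / 2."""
    return (kl_normal(m1, s1, m2, s2) + kl_normal(m2, s2, m1, s1)) / 2
```

For instance, kl_normal(0, 1, 0, 2) is smaller than kl_normal(0, 2, 0, 1): the narrower gaussian is "closer" to the wider one than the reverse, in agreement with what the interactive illustration shows.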
Kullback-Leibler distance between two normal distributions
Related readings :