Kullback-Leibler distance

The Kullback-Leibler distance (K-L) is a measure of the dissimilarity between two fully specified probability distributions.

# Definition of the Kullback-Leibler distance

Let p1(x) and p2(x) be two continuous probability distributions. By definition, the Kullback-Leibler distance D(p1, p2) between p1(x) and p2(x) is:

D(p1, p2) = ∫ p1(x) log[ p1(x)/p2(x) ] dx

with a similar expression in the discrete case.

If E denotes the expectation with respect to the distribution p1, this expression may also be written:

D(p1, p2) = E[ log( p1(x)/p2(x) ) ]
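As an illustration, the discrete form of the definition can be computed directly. This is a minimal Python sketch; the two distributions are made up for the example:

```python
import math

def kl_distance(p1, p2):
    """Discrete K-L distance: the mean, under p1, of log(p1/p2)."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))

p = [0.5, 0.3, 0.2]          # reference distribution p1
q = [0.4, 0.4, 0.2]          # compared distribution p2
print(kl_distance(p, q))     # ≈ 0.0253
print(kl_distance(p, p))     # 0.0 : identical distributions
```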

# Basic properties

• D(p1, p2) is the mean of the quantity log[ p1(x)/p2(x) ], with p1(x) being the reference distribution. This definition is justified here.
• The Kullback-Leibler distance is always nonnegative. It is zero only when the two distributions are identical.
• The Kullback-Leibler distance is not symmetric in p1(x) and p2(x); why this should be so is explained here. It is therefore not a "distance" in the mathematical sense of the word.

It is common to encounter the symmetric version of the K-L distance between p1 and p2:

Ds(p1, p2) = [D(p1, p2) + D(p2, p1)] / 2
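In code, the symmetric version is simply the average of the two asymmetric distances. A self-contained Python sketch, using a small discrete K-L helper:

```python
import math

def kl(p1, p2):
    """Asymmetric discrete K-L distance D(p1, p2)."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))

def kl_sym(p1, p2):
    """Symmetric K-L distance Ds(p1, p2) = [D(p1, p2) + D(p2, p1)] / 2."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
# kl(p, q) and kl(q, p) differ, but kl_sym(p, q) == kl_sym(q, p)
```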

# Kullback-Leibler distance and Maximum Likelihood

We analyze here the relation between "Kullback-Leibler distance" and "Parameter estimation by the method of Maximum Likelihood".

# Special case: normal distributions

We establish here the analytical form of the K-L distance (both asymmetric and symmetric) between two normal distributions p1 and p2.
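The result of that calculation can be turned into a few lines of code. The formula used below is the standard closed form for the K-L distance between N(μ1, σ1²) and N(μ2, σ2²); the numerical example illustrates the asymmetry between a narrow and a wide gaussian with the same mean:

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """D(p1, p2) with p1 = N(mu1, s1^2) and p2 = N(mu2, s2^2):
    log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def kl_normal_sym(mu1, s1, mu2, s2):
    """Symmetric version: average of the two asymmetric distances."""
    return 0.5 * (kl_normal(mu1, s1, mu2, s2) + kl_normal(mu2, s2, mu1, s1))

# Same mean: the narrower gaussian is "closer" to the wider one than the reverse
print(kl_normal(0, 0.5, 0, 1))   # ≈ 0.318
print(kl_normal(0, 1, 0, 0.5))   # ≈ 0.807
```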

Software packages occasionally use the K-L distance in the context of model validation. The easiest way to split a sample into a training set and a validation set is simply to draw (without replacement) the validation set at random from the complete sample. The problem is that you may be unlucky and draw a biased validation set, that is, a set whose distribution is substantially different from the sample's distribution. The estimate of the performance of the model based on this biased validation set is then itself biased, a soft word for "wrong".

Drawing a subset from a sample is fairly fast, so it is sometimes proposed, as an option, to generate a large number of candidate validation sets and to retain as the final validation set the one whose K-L distance to the remaining training set is smallest.

Because the K-L distance makes sense only for distributions (as opposed to finite samples), one has to assume that both the candidate validation set and the remaining training set are normally distributed. The corresponding normal distributions are then determined by Maximum Likelihood, and their (symmetric) K-L distance is then computed as described here.
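Under these assumptions, the whole procedure (fit a normal to each set by Maximum Likelihood, compute the symmetric K-L distance between the two fitted normals, repeat over many candidate splits and keep the best one) can be sketched as follows. This is a hypothetical illustration, not the code of any particular package; the sample, `n_candidates` and the seeds are arbitrary choices:

```python
import math
import random

def ml_normal(sample):
    """Maximum Likelihood normal fit (the ML variance divides by n, not n - 1)."""
    n = len(sample)
    mu = sum(sample) / n
    var = sum((x - mu) ** 2 for x in sample) / n
    return mu, math.sqrt(var)

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form K-L distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def sample_kl(a, b):
    """Symmetric K-L distance between the ML normal fits of two samples."""
    mu1, s1 = ml_normal(a)
    mu2, s2 = ml_normal(b)
    return 0.5 * (kl_normal(mu1, s1, mu2, s2) + kl_normal(mu2, s2, mu1, s1))

def best_validation_split(sample, k, n_candidates=200, seed=0):
    """Draw many candidate splits at random and keep the one whose
    validation set is closest (in symmetric K-L) to its training set."""
    rng = random.Random(seed)
    idx = list(range(len(sample)))
    best = None
    for _ in range(n_candidates):
        rng.shuffle(idx)
        val = [sample[i] for i in idx[:k]]
        train = [sample[i] for i in idx[k:]]
        d = sample_kl(val, train)
        if best is None or d < best[0]:
            best = (d, train, val)
    return best

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
d, train, val = best_validation_split(sample, k=20)
```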

The use of the K-L as a measure of the dissimilarity between samples is illustrated in the following interactive illustration.

Caveat

The K-L distance as a measure of the dissimilarity between samples uses only the first two centered moments (mean and variance) of the samples, because of the normality assumption. Therefore, two samples may have fairly different distributions and yet exhibit a low value of their K-L distance. All it takes is for the samples to have essentially identical means and standard deviations, but markedly different higher-order moments (especially skewness and kurtosis).
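A tiny numerical example of this caveat, with hand-picked artificial samples (a Python sketch): one sample is peaked at 0 with heavy tails, the other is bimodal, yet their first two moments coincide.

```python
import math

def ml_normal(sample):
    """Maximum Likelihood normal fit of a sample (ML variance divides by n)."""
    n = len(sample)
    mu = sum(sample) / n
    return mu, math.sqrt(sum((x - mu) ** 2 for x in sample) / n)

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form K-L distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

a = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]  # peaked at 0, heavy tails
b = [-1.0, -1.0, 1.0, 1.0]                   # bimodal, no mass at 0
# Both samples have mean 0 and ML standard deviation 1, so the K-L distance
# between their fitted normals vanishes despite the very different shapes.
```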

You may experiment with this idea in the interactive illustration.

__________________________________________________

The following figure illustrates the Kullback-Leibler distance.

The illustration has two modes of operation:

1) The "Gaussians" mode (selected by default), displaying the K-L distances (both symmetric and asymmetric) between two normal distributions.

2) The "Samples" mode (click on the "Samples" button), displaying the K-L distances (both symmetric and asymmetric) between two arbitrary samples, together with the corresponding Maximum Likelihood normal distributions.

In the "Gaussians" mode:

* Change the means and standard deviations of the gaussians, and observe the variations of the K-L distances (expressed in arbitrary units).
* Make the two gaussians have the same mean. Observe that the gaussian with the smaller standard deviation is "closer" to the other one than the reverse. More generally, observe that the narrower gaussian is always "closer" to the other one than the reverse.
* Observe that the two asymmetric distances are equal only when the two standard deviations are equal.
* Change the standard deviation of one of the gaussians while keeping its mean constant. Observe that the symmetric distance has a minimum: there is an optimal value of the standard deviation, all the other parameters being held constant.
* The same is true for the asymmetric distance. What do you notice when the standard deviation reaches this optimal value?

In the "Samples" mode:

* Drag the points with your mouse, and observe the variations of the K-L distances between the two samples.
* Try to construct two samples with "0" K-L distance that nonetheless have substantially different distributions. Conclude that this is possible only if at least one of the two samples severely departs from normality.

Points may occasionally refuse to be dragged. This is in order to avoid generating samples with too small a standard deviation, and it guarantees that the corresponding gaussian curves are never taller than the frame.

________________________________________________________________________________________________

 Tutorial 1

The definition of the Kullback-Leibler distance may seem rather arbitrary. In fact, it is deeply rooted in Information Theory, but we show here that a simple line of reasoning about the likelihood of a distribution p1(x) with respect to a sample drawn from another distribution p2(x) leads quite naturally to this definition.

A K-L distance is always nonnegative, and equal to 0 only when the two distributions are identical. This is not obvious from the definition, but it is a simple consequence of Jensen's inequality.
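In outline, the argument is the standard one-line application of Jensen's inequality to the convex function −log:

```latex
D(p_1, p_2)
= \mathbb{E}_{p_1}\!\left[-\log\frac{p_2(x)}{p_1(x)}\right]
\;\ge\; -\log \mathbb{E}_{p_1}\!\left[\frac{p_2(x)}{p_1(x)}\right]
= -\log \int p_2(x)\,dx
= -\log 1 = 0,
```

with equality, since −log is strictly convex, only when p2(x)/p1(x) is constant, that is, when p1 = p2.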

We finally confirm the relation between "Kullback-Leibler distance" and "Likelihood" by showing that the method of Maximum Likelihood is equivalent to minimizing the Kullback-Leibler distance between :

* the candidate distribution qθ(x),

* and an approximation p*(x) of the unknown distribution p(x) estimated from the available (large) sample.
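This equivalence can be written compactly: maximizing the average log-likelihood of qθ over a large sample amounts to minimizing the K-L distance from p* to qθ, because the term involving p* alone does not depend on θ (a standard identity, with p* the distribution estimated from the sample):

```latex
\arg\max_{\theta} \frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i)
\;=\; \arg\min_{\theta}\left[ \frac{1}{n}\sum_{i=1}^{n} \log p^*(x_i)
      \;-\; \frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i) \right]
\;\approx\; \arg\min_{\theta} \; D(p^*, q_\theta).
```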

BASIC PROPERTIES OF THE KULLBACK-LEIBLER DISTANCE (TUTORIAL)

* Justification of the definition of the Kullback-Leibler distance
* Limit of the likelihood for large samples
* Entropy and minimum value of this limit
* Definition of the Kullback-Leibler distance
* The Kullback-Leibler distance is nonnegative
* The Kullback-Leibler distance is not symmetric
* Distance between two normal distributions
* Distance between two uniform distributions
* Maximum Likelihood and Kullback-Leibler distance

________________________________________________________

 Tutorial 2

The Kullback-Leibler distance is usually difficult to calculate explicitly. But the calculation can be carried through when both distributions are normal.

The calculation is a bit long, but quite instructive.
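For reference, the calculation ends with the standard closed form, for p1 = N(μ1, σ1²) and p2 = N(μ2, σ2²):

```latex
D(p_1, p_2) \;=\; \log\frac{\sigma_2}{\sigma_1}
\;+\; \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
\;-\; \frac{1}{2}.
```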

KULLBACK-LEIBLER DISTANCE BETWEEN TWO NORMAL DISTRIBUTIONS (TUTORIAL)

* Kullback-Leibler distance between two normal distributions

____________________________________________________

See also: Jensen's inequality, Maximum Likelihood