Likelihood Ratio Test  (LRT)

A general and powerful method for building tests.

-----

Likelihood was invented for the purpose of quantifying the fit between a probability distribution and a sample : the larger the likelihood, the better the fit.

On the other hand, tests are meant to discriminate between two nonoverlapping groups of distributions that both claim to contain the distribution that generated the sample.

One may therefore expect that the likelihoods of the sample under the various competing distributions can be used to decide which group is more likely to contain the distribution that generated the sample.

# Vocabulary and notations

Let's first clarify some notations commonly used when dealing with Likelihood Ratio tests.

* First, all the distributions we'll consider belong to the same family, the various distributions in the family differing only through the value of a parameter θ (which may be a vector parameter, i.e. a set of several scalar parameters). For example, we may consider the family of normal distributions N(µ, σ²), of which each member is fully characterized by the values of µ and σ².

* The two groups of distributions are then defined respectively by the null hypothesis and the alternative hypothesis. For example, we might want to test :

- H0 : µ = µ0

against

- H1 : µ ≠ µ0

The hypotheses we'll consider can be indifferently simple or composite (whereas the Neyman-Pearson lemma applies only to two simple hypotheses).

In what follows, it will be convenient to consider that :

* H0 does not just denote the null hypothesis, but the set of values of the parameter θ defined by H0 as well and, by extension, the set of distributions defined by this set of values of the parameter.

* H1 does not just denote the alternative hypothesis, but the set of values of the parameter θ defined by H1 as well and, by extension, the set of distributions defined by this set of values of the parameter.

So the notation H0 U H1 will designate the set of the values of the parameter θ defined by either H0 or H1 and, by extension, the set of distributions defined by this set of values of the parameter.

# Principle of Likelihood Ratio tests

The Likelihood Ratio Test (LRT) approach reasons as follows.

1) Suppose that H0 is true : the distribution that generated the sample belongs indeed to H0. We certainly expect the sample to exhibit a large likelihood for the distribution that generated it, and consequently we expect this likelihood to be close to the largest likelihood encountered when scanning through all the distributions in H0.

Considering the distributions in H1 will probably change nothing : none of these distributions generated the sample, so none of these distributions is expected to display a large likelihood for the sample.

Consequently the largest likelihood in H0 is not anticipated to be substantially smaller than the largest likelihood observed over the complete set of distributions H0 U H1.

This can be formalized as follows. Denote :

- max_{θ ∈ H0} L(x; θ) the largest likelihood observed over H0.

- max_{θ ∈ H0 U H1} L(x; θ) the largest likelihood observed over H0 U H1.

Then the ratio

λ = max_{θ ∈ H0} L(x; θ) / max_{θ ∈ H0 U H1} L(x; θ)

which cannot exceed 1 (H0 being a subset of H0 U H1), is expected to be close to 1.

2) Conversely, suppose that H0 is false (and therefore that H1 is true). The distribution that generated the sample belongs to H1, and not to H0. None of the distributions in H0 is anticipated to exhibit a large likelihood. The largest likelihood of all is anticipated to be found for a distribution in H1 because the distribution that generated the sample is in H1.

Consequently, the ratio

λ = max_{θ ∈ H0} L(x; θ) / max_{θ ∈ H0 U H1} L(x; θ)

although always positive, is expected to have a small value (close to 0).
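The two situations can be illustrated numerically. The sketch below is not taken from the original text: it assumes the family N(µ, σ²) with σ known, H0 : µ = µ0 against H1 : µ ≠ µ0, and uses made-up data and illustrative function names.

```python
import math

def normal_log_likelihood(x, mu, sigma):
    """Log-likelihood of the sample x under N(mu, sigma^2)."""
    n = len(x)
    ss = sum((xi - mu) ** 2 for xi in x)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - ss / (2 * sigma ** 2)

def likelihood_ratio_known_variance(x, mu0, sigma):
    """lambda = (max over H0) / (max over H0 U H1) for H0: mu = mu0, sigma known."""
    x_bar = sum(x) / len(x)
    # H0 contains a single distribution, so there is nothing to maximize.
    log_num = normal_log_likelihood(x, mu0, sigma)
    # Over H0 U H1, mu is unconstrained and the likelihood peaks at the sample mean.
    log_den = normal_log_likelihood(x, x_bar, sigma)
    return math.exp(log_num - log_den)

sample = [4.9, 5.3, 5.1, 4.7, 5.4, 5.0]  # made-up data, sample mean about 5.07

# H0 plausible (mu0 near the sample mean): the ratio is fairly close to 1.
print(likelihood_ratio_known_variance(sample, mu0=5.0, sigma=0.3))
# H0 implausible (mu0 far from the sample mean): the ratio is very close to 0.
print(likelihood_ratio_known_variance(sample, mu0=6.0, sigma=0.3))
```

Working on the log scale and exponentiating only the difference avoids underflow when the likelihoods themselves are tiny.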

# The test

So it appears that the statistic :

 Λ = max_{θ ∈ H0} L(X; θ) / max_{θ ∈ H0 U H1} L(X; θ)

is a plausible choice for testing H0 against H1 : small values (close to 0) of Λ favor H1, while large values (close to 1) of Λ favor H0.

Suppose that, if H0 is true, the probability distribution g(λ) of the random variable Λ is known and that the integral

∫_0^c g(λ) dλ = P(Λ ≤ c | H0)

can be calculated (or at least tabulated) and the result inverted so that, for a given α, the value of c for which this integral equals α can be determined explicitly.

We then have all the ingredients needed for testing H0 against H1 at the α significance level.

The decision made by the test is then :

* If λ > c then accept H0 and reject H1.

* If λ ≤ c then accept H1 and reject H0.
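As a hedged illustration of this decision rule, here is a sketch for one of the rare cases where the distribution of Λ under H0 is known exactly: the mean of N(µ, σ²) with σ known, for which -2.logΛ = n(x̄ - µ0)²/σ² is the square of a standard normal variable when H0 is true. The function names and the data are assumptions made for the example.

```python
import math
from statistics import NormalDist

def critical_value(alpha):
    """c such that P(Lambda <= c | H0) = alpha (normal mean, known variance)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Under H0, Lambda = exp(-Z^2 / 2) with Z standard normal,
    # so Lambda <= c is the same event as |Z| >= z.
    return math.exp(-z * z / 2)

def lrt_decision(x, mu0, sigma, alpha=0.05):
    """Return (lambda, c, decision) for H0: mu = mu0 against H1: mu != mu0."""
    n = len(x)
    x_bar = sum(x) / n
    lam = math.exp(-n * (x_bar - mu0) ** 2 / (2 * sigma ** 2))
    c = critical_value(alpha)
    decision = "accept H0" if lam > c else "reject H0"
    return lam, c, decision

sample = [4.9, 5.3, 5.1, 4.7, 5.4, 5.0]  # made-up data
print(lrt_decision(sample, mu0=5.0, sigma=0.3))
print(lrt_decision(sample, mu0=6.0, sigma=0.3))
```

For α = 0.05 the critical value is about 0.1465, so H0 is rejected only when the best distribution in H0 is roughly seven times less likely than the best distribution overall.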

# Distribution of the test statistic

Unfortunately, the test statistic Λ and its distribution g(λ) are usually very complicated (see for example Bartlett's test).

## Functions of the statistic

It is sometimes possible to identify a function f(.) (typically monotonic) such that f(Λ) is more tractable. One can then carry out a test equivalent to the original test by using f(Λ) instead of Λ as a test statistic.

The transformation may be :

* Reasonably intuitive,

* Or require a good deal of effort to be identified.

## Asymptotic distribution

It can be shown that, if H0 is true and under some regularity conditions, the r.v. -2.logΛ is asymptotically (i.e. for large samples) χ² distributed. The number of degrees of freedom of the distribution is equal to the difference between :

- The number of free parameters in H0 U H1,

- And the number of free parameters in H0.

This deserves a little explanation. Suppose that an LRT is designed for the purpose of testing :

- H0 : µ = µ0

against

- H1 : µ ≠ µ0

for normal distributions where the variance σ² is assumed to be known.

* In H0, the likelihood has a certain value determined by the sample, and no parameter is available to make this likelihood change. So H0 has no (i.e. 0) free parameter.

* In H0 U H1 the likelihood will be maximized over the range of µ, which is unconstrained over ]-∞, +∞[. Therefore, H0 U H1 has one free parameter (that is, µ).

The difference between the numbers of free parameters of H0 U H1 and H0 is therefore 1 - 0 = 1.

Similarly, if the same hypotheses are tested in a setting where the variance is not assumed to be known :

- H0 has one free parameter, namely σ²,

- While H0 U H1 has two free parameters, namely µ and σ²,

and the difference between the numbers of free parameters of H0 U H1 and H0 is therefore 2 - 1 = 1.
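The second setting (variance unknown, one degree of freedom) lends itself to a small simulation. The sketch below is an illustration under stated assumptions, not the original author's experiment: samples are drawn from N(0, 1) so that H0 : µ = 0 is true, and we count how often -2.logΛ exceeds the upper α quantile of χ² with 1 degree of freedom (obtained here as the square of a normal quantile). The function names, sample size, seed and number of replications are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def minus_two_log_lambda(x, mu0):
    """-2 log(Lambda) for H0: mu = mu0 against H1: mu != mu0, variance unknown."""
    n = len(x)
    x_bar = sum(x) / n
    var_full = sum((xi - x_bar) ** 2 for xi in x) / n  # MLE of sigma^2 over H0 U H1
    var_null = sum((xi - mu0) ** 2 for xi in x) / n    # MLE of sigma^2 over H0
    # Lambda = (var_full / var_null) ** (n / 2)
    return n * math.log(var_null / var_full)

def rejection_rate(n, alpha=0.05, replications=2000, seed=0):
    """Fraction of samples generated under H0 for which the chi-square rule rejects."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-square(1) quantile
    rejections = 0
    for _ in range(replications):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if minus_two_log_lambda(x, mu0=0.0) > threshold:
            rejections += 1
    return rejections / replications

print(rejection_rate(n=50))  # expected to be in the vicinity of alpha = 0.05
```

If the χ² approximation with 1 degree of freedom is adequate, the observed rate should be close to α; one would expect it to sit somewhat above α for small n and to get closer as n grows.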

-----

The term "free" makes reference to the fact that the effective number of parameters (or "number of free parameters") may be smaller than the actual number of parameters. For example, suppose that we consider the sub-family of the family of normal distributions N(µ, σ²) defined by µ² = σ². Although both the mean and the variance of the distributions are free to vary across their entire respective ranges, the sub-family is described by a single "free parameter".

The number of free parameters may be regarded as the intrinsic dimension of the manifold bearing all the allowed values of the parameters linked by constraining relationships.

The "regularity conditions" referred to earlier in this paragraph say that the manifold should have no singularity (like self-crossing) and no boundary (like a half plane). We'll see an example where this second condition is not met, with, as a result, the asymptotic distribution of -2.logΛ being clearly not χ².

-----

The quality of the fit between the distribution of -2.logΛ for finite samples and the corresponding χ² limit distribution is of course of great concern to the analyst. A first experimental approach to this important question is given as an interactive animation in one of the Tutorials below.

-----

Note that once it is known that the asymptotic distribution of -2.logΛ is χ², the asymptotic distribution of Λ can easily be inferred (see animation). But because tables of quantiles exist for the χ² distribution only, it is customary to base a Likelihood Ratio Test on the value of -2.logΛ and not on that of Λ.

___________________

Results concerning the χ² nature of the asymptotic distribution of -2.logΛ are difficult and are stated without proof. We'll only demonstrate the simplest of them, which is as follows :

* For the pair of hypotheses H0 : θ = θ0 against H1 : θ ≠ θ0

* -2logΛ is asymptotically distributed as χ²₁ (χ² with 1 degree of freedom),

and that alone will require some effort on our part.

# Likelihood Ratio Test and Maximum Likelihood Estimation

There is a clear connection between LRTs and Maximum Likelihood Estimation (MLE).

* The denominator of the test statistic is

max_{θ ∈ H0 U H1} L(x; θ)

We are looking for the value of θ that will make the likelihood of the sample largest over the complete domain of θ. This value is, by definition, the Maximum Likelihood Estimate of θ. Denote θ* the Maximum Likelihood Estimator of θ : the denominator now can be written as L(x; θ*).

* Similarly, the numerator of the test statistic is

max_{θ ∈ H0} L(x; θ)

Again, we are looking for the Maximum Likelihood Estimation of θ over the sub-domain of θ defined by H0. Denote θ*0  the corresponding estimator. The numerator is now L(x; θ*0).

-----

The LRT test statistic Λ is therefore given by

 Λ = L(X; θ*0) / L(X; θ*)
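Here is a minimal sketch of this construction for a case where both estimators are available in closed form: n Bernoulli(p) trials with H0 : p = p0 against H1 : p ≠ p0. The example is ours, not the original text's. The unrestricted estimate θ* is the observed proportion, and the restricted estimate θ*0 is simply p0 since H0 leaves no freedom. The statistic is the likelihood at the restricted estimate divided by the likelihood at the unrestricted estimate.

```python
import math

def bernoulli_log_likelihood(successes, n, p):
    """Log-likelihood of n Bernoulli(p) trials with the given number of successes."""
    failures = n - successes
    # Convention 0 * log(0) = 0, so that an estimate equal to 0 or 1 is handled.
    term_s = successes * math.log(p) if successes > 0 else 0.0
    term_f = failures * math.log(1 - p) if failures > 0 else 0.0
    return term_s + term_f

def bernoulli_lrt(x, p0):
    """Return (theta_star, Lambda) for H0: p = p0, with 0 < p0 < 1."""
    n = len(x)
    s = sum(x)
    theta_star = s / n   # Maximum Likelihood Estimate over the whole domain
    theta_star0 = p0     # Maximum Likelihood Estimate over H0
    log_lambda = (bernoulli_log_likelihood(s, n, theta_star0)
                  - bernoulli_log_likelihood(s, n, theta_star))
    return theta_star, math.exp(log_lambda)

trials = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up data: 6 successes out of 8
print(bernoulli_lrt(trials, p0=0.75))  # the estimate equals p0, so Lambda is 1
print(bernoulli_lrt(trials, p0=0.20))  # p0 far from the estimate, Lambda is small
```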

# Likelihood Ratio Test and Sufficiency

Since a statistic that is sufficient for θ contains all the information in the sample pertaining to the estimation of θ, it should be considered as a distinct possibility that a LRT can be expressed not just in terms of the sample X, but also in the seemingly more restrictive terms of a sufficient statistic T(X).

It is indeed the case and we'll show that, given any sufficient statistic T(X), the Λ(X) statistic of any LRT can be expressed as a function of T only. In other words, there is always a function Λ*(.) such that

Λ(X) = Λ*(T(X))

for any sample X.

As a consequence, we'll see that when a sufficient statistic for θ is available, LRTs can be built using this sufficient statistic only.
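The claim can be checked numerically on a simple case. The sketch below assumes N(µ, σ²) with σ known and H0 : µ = µ0, for which T(X) = sum of the observations is sufficient for µ. The function names and the two samples are illustrative assumptions.

```python
import math

def lambda_from_sample(x, mu0, sigma):
    """Lambda computed from the full sample, as a ratio of two likelihoods."""
    x_bar = sum(x) / len(x)

    def log_lik(mu):
        # Terms that do not depend on mu cancel in the ratio and are omitted.
        return -sum((xi - mu) ** 2 for xi in x) / (2 * sigma ** 2)

    return math.exp(log_lik(mu0) - log_lik(x_bar))

def lambda_from_sufficient_statistic(t, n, mu0, sigma):
    """Lambda*(t): the same quantity written in terms of t = T(x) only."""
    return math.exp(-n * (t / n - mu0) ** 2 / (2 * sigma ** 2))

a = [1.0, 2.0, 3.0, 6.0]
b = [2.5, 3.5, 2.0, 4.0]  # a different sample with the same size and the same sum

print(lambda_from_sample(a, mu0=2.5, sigma=1.5))
print(lambda_from_sample(b, mu0=2.5, sigma=1.5))
print(lambda_from_sufficient_statistic(sum(a), len(a), mu0=2.5, sigma=1.5))
```

The three printed values agree up to rounding: two samples that share the value of the sufficient statistic cannot be told apart by the test.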

# Likelihood Ratio Test and Neyman-Pearson lemma

We defined a LRT by considering the ratio of the maximum likelihood over H0 to the maximum likelihood over the complete parameter space H0 U H1. We might as well (and perhaps more naturally) have considered the ratio of the maximum likelihood over H0 to the maximum likelihood over H1 only, with a similar line of reasoning and similar results (except, of course, the asymptotic distribution of the test statistic).

This alternative presentation of Likelihood Ratio Tests has the advantage of making obvious the connection between LRTs and the Neyman-Pearson lemma. For suppose that H0 is the simple hypothesis θ = θ0 and that H1 is the simple hypothesis θ = θ1. The "sets" of distributions H0 and H1 then both contain a single distribution, the "maximization" procedures at the numerator and denominator of the test statistic are trivial, and we are now within the framework of the Neyman-Pearson lemma. Recall that the best critical region of a N-P test is defined by

L(x; θ1 ) / L(x; θ0 )  >  kα

which is, up to taking the reciprocal of the ratio, exactly the expression of the critical region of an LRT in the special case where both the null and the alternative hypotheses are simple.

So it appears that Likelihood Ratio Tests may be regarded as generalizations of the Neyman-Pearson test to situations where at least one hypothesis is composite.
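The correspondence for two simple hypotheses can be made concrete with a short sketch. It assumes N(µ, 1) with H0 : µ = µ0 and H1 : µ = µ1; the function names and the data are illustrative.

```python
import math

def log_lik_normal(x, mu, sigma=1.0):
    """Log-likelihood of the sample x under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (xi - mu) ** 2 / (2 * sigma ** 2) for xi in x)

def neyman_pearson_ratio(x, mu0, mu1, sigma=1.0):
    """L(x; theta1) / L(x; theta0): large values fall in the N-P critical region."""
    return math.exp(log_lik_normal(x, mu1, sigma) - log_lik_normal(x, mu0, sigma))

def lrt_ratio_h0_over_h1(x, mu0, mu1, sigma=1.0):
    """(max over H0) / (max over H1): both maxima are trivial for simple hypotheses."""
    return math.exp(log_lik_normal(x, mu0, sigma) - log_lik_normal(x, mu1, sigma))

x = [0.8, 1.4, 0.3, 1.1]  # made-up data
np_ratio = neyman_pearson_ratio(x, mu0=0.0, mu1=1.0)
lrt_ratio = lrt_ratio_h0_over_h1(x, mu0=0.0, mu1=1.0)
print(np_ratio, lrt_ratio, np_ratio * lrt_ratio)  # the product is 1 up to rounding
```

Since the two ratios are reciprocal, "N-P ratio larger than k" and "LRT ratio smaller than 1/k" describe the same critical region.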

-----

Note that whereas a Neyman-Pearson test is always a Uniformly Most Powerful (UMP) test, this is in general not the case for Likelihood Ratio Tests.

# Likelihood Ratio Tests and nested models

A model M1 is said to be nested inside another model M2 if the set of parameters of M1 is a subset of the set of parameters of M2 :

* M2 may be regarded as M1 to which some additional parameters have been added for the purpose of attempting to improve its predictive capacity.

* Conversely, M1 may be regarded as M2 from which some parameters have been deleted for the purpose of making it easier to interpret and more stable (see bias-variance tradeoff).

The question is then : "Are the two models significantly different ?". The Likelihood Ratio paradigm is a natural approach for devising a test meant to answer the question.

Denote (β1, β2, ..., βp) the set of parameters of M1, and (β1, β2, ..., βp, βp+1, ..., βp+q) the set of parameters of M2. Model M1 may be regarded as model M2 whose last q parameters are constrained to be equal to 0.

Using the same convention as before, we have :

* H0 : βp+1 = ... = βp+q = 0    (the additional parameters are useless, M1 is sufficient)

* H1 : At least one of these parameters is different from 0 (the additional parameters of M2 are needed).

The test statistic Λ will be the ratio of the maximized likelihoods :

* Of model M1 (numerator),

* And of model M2 (denominator).

Since the maximized likelihood of M2 is always at least as large as that of M1, the value of this ratio is always between 0 and 1. A value of Λ close to 0 means that the likelihood of M2 is substantially larger than that of M1, and therefore that the models are significantly different.
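As a hedged sketch of such a test, consider the smallest possible pair of nested regression models with Gaussian noise of unknown variance: M1 is y = β0 + noise and M2 is y = β0 + β1.x + noise, so that q = 1. This example and its names are ours. With Gaussian noise, Λ reduces to a ratio of residual sums of squares, and -2.logΛ is compared to a χ² quantile with q = 1 degree of freedom.

```python
import math
from statistics import NormalDist

def nested_regression_statistic(x, y):
    """-2 log(Lambda) for M1 (intercept only) nested in M2 (intercept and slope)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    rss_m1 = sum((yi - y_bar) ** 2 for yi in y)  # residual sum of squares of M1
    rss_m2 = rss_m1 - sxy ** 2 / sxx             # residual sum of squares of M2
    # Lambda = (rss_m2 / rss_m1) ** (n / 2); assumes M2 does not fit perfectly.
    return n * math.log(rss_m1 / rss_m2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]  # made-up data with an obvious trend

statistic = nested_regression_statistic(x, y)
threshold = NormalDist().inv_cdf(0.975) ** 2  # 5% quantile of chi-square, 1 d.f.
print(statistic, threshold, statistic > threshold)
```

With only five observations the asymptotic χ² threshold is a rough guide at best; the point of the sketch is the structure of the statistic, not the accuracy of the approximation.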

-----

The simplest example of a Likelihood Ratio Test on nested models is to be found in Multiple Linear Regression within the context of variable selection. But the reader is warned that we balk at the complexity of the calculations, and elaborate the test from simpler (but logically equivalent) geometric considerations instead.

# Caveat

Likelihood Ratio Tests are popular because :

* They are intuitively appealing,

* They often coincide with classical tests that were originally designed by more direct methods (see Tutorial).

* The standard asymptotic probability distribution of the test statistic Λ is a bonus, as the exact probability distribution of a LRT statistic is often intractable.

Yet, none of these reasons is compelling. Further investigations of the properties of LRTs show that a LRT may behave poorly (low power) compared to a "hand-crafted" test specially designed for the problem at hand. This point is well illustrated by the comparative study of :

* The goodness-of-fit Likelihood Ratio Test for the multinomial distribution,

* And the more conventional Chi-square test serving the same purpose.

So, just as Maximum Likelihood Estimation should not systematically be considered as the default method for parameter estimation, Likelihood Ratio Tests are only to be considered as a convenient but not universal method for building a test.

________________________________________________________________________

 Tutorial 1

In this Tutorial, we use the Likelihood Ratio paradigm to :

* First build two tests about the value of the mean of the normal distribution. The first test assumes that the variance is known, while the second test will not assume the variance to be known.

* Then build a test about the value of the variance of the normal distribution with unknown mean.

The test statistics obtained by respecting a strict compliance to the LRT methodology are usually a bit complicated and have messy distributions. But we'll be able to transform these statistics into new random variables whose distributions could more easily be calculated.

But we won't even have to calculate these distributions for some reasoning about these new variables will allow us to establish that these three tests are in fact equivalent respectively to :

* The one sample t-test when the variance is known (also known as "z-test"),

* The one sample t-test when the variance is unknown.

* The F-test when the mean is unknown.

EXAMPLES OF LIKELIHOOD RATIO TESTS

* Mean of normal, variance known
  - The parameter space
  - Maximum Likelihood under H0
  - Maximum Likelihood under H0 U H1
  - Likelihood ratio
  - Transformation of the Λ test statistic
  - Equivalence with the z-test
* Mean of normal, variance unknown
  - The parameter space
  - Maximum Likelihood under H0
  - Maximum Likelihood under H0 U H1
  - Likelihood ratio
  - Transformation of the Λ test statistic
  - Equivalence with the t-test
* Variance of normal, mean unknown
  - The parameter space
  - Maximum Likelihood under H0
  - Maximum Likelihood under H0 U H1
  - Likelihood ratio
  - Transformation of the Λ test statistic
  - Equivalence with the F-test

_________________________________________

 Tutorial 2

The results of the preceding Tutorial are both encouraging and disappointing :

* Using only the basic principles of the Likelihood Ratio Test, we were able to build important tests in a somewhat mechanical way which did not require a deep understanding of the problem at hand.

* But these tests turned out to be equivalent to very classical tests (z, t, F), so that the LRTs seemed in the final analysis to bring nothing new except cumbersome calculations.

So in order to confirm the usefulness of the concept of LRT, it remains to build a LRT that has no classical equivalent. This is what we do now.

In this Tutorial, we again consider the normal distribution N(µ, σ²) with µ and σ² both unknown, and we want to test :

* H0 : µ ≤ µ0

against

* H1 : µ > µ0

No modification of a classical test can handle this pair of hypotheses, and we therefore devise a LRT from the ground up. This endeavor is difficult enough to justify a Tutorial of its own. In particular, Λ will at first seem to be hopelessly beyond tractability, and only a very smart transformation of this statistic will allow us to successfully conclude the building of the test.

-----

We'll also see that we are in a situation where the asymptotic distribution of -2logΛ is definitely not a χ² distribution, but we'll only touch upon the reasons why it is so.
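Before any transformation, the raw statistic Λ of this one-sided test can be sketched as follows. This is an illustration under our own naming, not the Tutorial's derivation. The key observation is that the maximum over H0 depends on which side of µ0 the sample mean falls.

```python
def one_sided_lambda(x, mu0):
    """Lambda for H0: mu <= mu0 against H1: mu > mu0, mu and sigma^2 both unknown."""
    n = len(x)
    x_bar = sum(x) / n
    if x_bar <= mu0:
        # The unrestricted maximum is reached inside H0: both maxima coincide.
        return 1.0
    # Otherwise the maximum over H0 is reached on the boundary mu = mu0.
    var_full = sum((xi - x_bar) ** 2 for xi in x) / n  # MLE of sigma^2 over H0 U H1
    var_null = sum((xi - mu0) ** 2 for xi in x) / n    # MLE of sigma^2 when mu = mu0
    return (var_full / var_null) ** (n / 2)

print(one_sided_lambda([1.0, 2.0, 3.0], mu0=5.0))  # sample mean below mu0
print(one_sided_lambda([4.0, 5.0, 6.0], mu0=3.0))  # sample mean above mu0
```

The sketch also hints at why the usual asymptotic result fails here: when µ = µ0, the sample mean falls below µ0 with probability 1/2, so -2logΛ is exactly 0 about half of the time, which no χ² distribution allows.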

AN ORIGINAL LIKELIHOOD RATIO TEST

* The hypothesis
* The test statistic
* Denominator
* Numerator
* The test statistic
* First case
* Second case
* Transformation of the test statistic
* Summary of the test
* Asymptotic behavior

_________________________________________

 Tutorial 3

We now move on to some more LRTs.

* First, we elaborate a test meant to test the null hypothesis θ = θ0 vs the alternative hypothesis θ ≠ θ0 for the uniform distribution U[0, θ].

* We then address a somewhat similar problem pertaining to the value of the parameter θ of the shifted exponential distribution exp[-(x - θ)].

In both problems the likelihood function is discontinuous, and maximization of the likelihood cannot be achieved by differentiation.
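For the uniform case, the maximization can be done by inspection, as the following sketch suggests. It is an illustration with our own function name, assuming non-negative observations.

```python
def uniform_lambda(x, theta0):
    """Lambda for H0: theta = theta0 against H1: theta != theta0 under U[0, theta]."""
    n = len(x)
    m = max(x)
    # L(x; theta) = theta ** (-n) if theta >= max(x) and 0 otherwise:
    # the likelihood jumps at theta = max(x), so differentiation is of no help.
    if m > theta0:
        return 0.0  # the sample is impossible under H0
    # Denominator: the likelihood is largest at theta = max(x).
    # Numerator: the likelihood at theta0.
    return (m / theta0) ** n

print(uniform_lambda([0.2, 0.9, 0.5], theta0=1.0))  # (0.9 / 1.0) ** 3
print(uniform_lambda([0.2, 1.3], theta0=1.0))       # an observation exceeds theta0
```

In this sketch the exact distribution of Λ under H0 is easy to obtain: P(Λ ≤ c) = P(max(x) ≤ θ0.c^(1/n)) = c for 0 ≤ c ≤ 1, so Λ is uniformly distributed on [0, 1] and the exact test at level α rejects H0 when Λ ≤ α.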

________________

We finally develop three versions of the LRT identity test for the exponential distribution Exp(λ) : samples are drawn from several different exponential populations, and the question is whether these populations are identically distributed (same value of the parameter λ).

* We first treat the basic problem as we just stated it. We take this opportunity to explore experimentally the quality of the fit between a -2.logΛ statistic and its asymptotic χ² distribution with an interactive animation.

The animation uses 2, 3 or 4 samples (whose sizes are adjustable) so that the three basic shapes of the χ² distribution are examined in turn :

- χ²₁ with a vertical asymptote,

- χ²₂ which is just an exponential distribution,

- χ²₃ with its well known asymmetric bell-shaped curve.

We'll see that in all three cases, the basic geometry of the limit distribution is respected by the empirical distribution of the statistic, and that the quality of the fit visibly improves when more and more observations are drawn from the exponential distribution.

-----

* We then introduce a realistic difficulty : observations smaller than a certain threshold γ cannot be measured. The distributions are therefore "truncated" and the data "censored".

We'll therefore have to first calculate the distribution of the censored data. Only then shall we be able to generalize the preceding identity test to truncated exponential distributions.

-----

* In fact, we'll show that a (weaker) identity test can still be devised if γ is not known provided that it is the same for all the distributions.
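The basic version of the identity test (no censoring) can be sketched as follows. This is our own illustration, assuming the rate parametrization of Exp(λ), for which the Maximum Likelihood Estimate of λ is the reciprocal of the sample mean; the function name and the data are made up. With k samples, H0 U H1 has k free parameters and H0 has one, hence k - 1 degrees of freedom for the limit distribution.

```python
import math

def exponential_identity_statistic(samples):
    """-2 log(Lambda) for H0: all the samples share the same rate lambda."""
    sizes = [len(s) for s in samples]
    means = [sum(s) / len(s) for s in samples]
    n_total = sum(sizes)
    pooled_mean = sum(sum(s) for s in samples) / n_total
    # The maximized log-likelihood of an exponential sample of size n is
    # -n * log(mean) - n; the "- n" terms cancel between numerator and denominator.
    log_num = -n_total * math.log(pooled_mean)                     # one common rate
    log_den = -sum(n * math.log(m) for n, m in zip(sizes, means))  # one rate per sample
    return -2.0 * (log_num - log_den)

same = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
different = [[1.0, 1.0, 1.0], [10.0, 10.0, 10.0]]
print(exponential_identity_statistic(same))       # equal sample means: statistic is 0
print(exponential_identity_statistic(different))  # very different means: large value
```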

MORE EXAMPLES OF LIKELIHOOD RATIO TESTS

* Uniform distribution
  - Denominator of the test statistic
  - Numerator of the test statistic
  - The test statistic
* Shifted exponential distribution
  - Denominator of the test statistic
  - Numerator of the test statistic
  - The test statistic
* Identity test for the exponential distribution
  - The null hypothesis
  - The alternative hypothesis
  - The LRT test
  - Numerator
  - Denominator
  - The test statistic
  - Interactive animation (asymptotic distribution)
* Identity test for the exponential distribution : cut-off point is known
  - Pdf of the truncated exponential distribution
  - The identity test for truncated exponential distributions
* Identity test for the exponential distribution : cut-off point is unknown

__________________________________________________

 Tutorial 4

Most Likelihood Ratio Tests are salvaged from intractability only by the fact that -2logΛ is asymptotically χ² distributed. This life-saver is rather weak, though, as is any asymptotic result used as an approximation of an exact but unknown result for finite samples (see animation in the preceding Tutorial). Besides, establishing this asymptotic behavior is no easy matter.

We'll first give an example where it can be shown directly that -2logΛ is asymptotically χ²₁ distributed. We'll build the LRT statistic Λ of H0 : λ = λ0 against H1 : λ ≠ λ0 for the Poisson(λ) distribution, then calculate the asymptotic distribution of -2logΛ and show that this distribution is indeed χ²₁.
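A hedged numerical sketch of the Poisson case follows; the closed form and the names are ours, derived from the Poisson log-likelihood and not quoted from the Tutorial. Since the Maximum Likelihood Estimate of λ is the sample mean x̄, one gets -2logΛ = 2n [ x̄ log(x̄ / λ0) - (x̄ - λ0) ]. A second-order Taylor expansion around x̄ = λ0 turns this into n(x̄ - λ0)² / λ0, the square of the standardized sample mean, which is where the χ² limit with 1 degree of freedom comes from.

```python
import math

def poisson_minus_two_log_lambda(x, lam0):
    """-2 log(Lambda) for H0: lambda = lam0 against H1: lambda != lam0."""
    n = len(x)
    x_bar = sum(x) / n  # Maximum Likelihood Estimate of lambda
    if x_bar == 0:
        return 2.0 * n * lam0  # limit of the expression below when x_bar tends to 0
    return 2.0 * n * (x_bar * math.log(x_bar / lam0) - (x_bar - lam0))

def squared_standardized_mean(x, lam0):
    """Quadratic approximation of -2 log(Lambda) near x_bar = lam0."""
    n = len(x)
    x_bar = sum(x) / n
    return n * (x_bar - lam0) ** 2 / lam0

counts = [3, 5, 4, 6, 2, 4, 5, 3]  # made-up counts, sample mean 4
for lam0 in (4.0, 4.2, 6.0):
    print(lam0, poisson_minus_two_log_lambda(counts, lam0),
          squared_standardized_mean(counts, lam0))
```

The two columns agree closely when λ0 is near the sample mean and drift apart as λ0 moves away, which is the expected behavior of a second-order approximation.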

We'll then generalize this result and show that for any distribution p(x; θ), if :

* H0 : θ = θ0

* H1 : θ ≠ θ0

then -2logΛ is asymptotically χ²₁ distributed.

It can be shown that the result extends to any compatible pair of hypotheses (H0, H1), provided that the regularity conditions mentioned earlier are respected.

The proof will be given for a single scalar parameter θ only. It can be shown (difficult and not treated) that a similar result also holds for vector parameters θ. The important and difficult part is to establish that the number of degrees of freedom of the limit χ² distribution is equal to the difference between :

* The number of free parameters of the denominator of the test statistic Λ,

* And the number of free parameters of the numerator of this statistic.

____________________

We conclude this Tutorial by showing that the statistic Λ of a Likelihood Ratio Test can always be expressed as a function of any sufficient statistic T(X). This allows building a LRT by using only a sufficient statistic T(X) when one is available. The calculations are then usually simpler than those of the standard LRT, as we'll show by revisiting two examples of LRTs developed in the preceding Tutorials.

ASYMPTOTIC RESULTS
LRT AND SUFFICIENCY

* LRT on the parameter of the Poisson distribution
  - The test
  - Asymptotic distribution
* Asymptotic distribution of -2logΛ
  - Scalar parameter
  - Vector parameter
* Likelihood Ratio Test and sufficiency
  - Test statistic is a function of any sufficient statistic
  - Mean of the normal distribution
  - Shifted exponential distribution

__________________________________________________

 Tutorial 5

We conclude with yet another Likelihood Ratio Test : the goodness-of-fit test for a multinomial distribution. This test is not the most complicated or most important we've built along these Tutorials, but it enjoys a somewhat special status.

Recall that the so-called "Chi-square tests" (goodness-of-fit, identity, independence) are very important nonparametric tests. They all rely on the "Pearson's Chi-square" statistic whose exact distribution under H0 is usually unknown, but whose asymptotic distribution is χ², hence the generic name of the tests. The statistic also finds its origin in the basic goodness-of-fit problem for the multinomial distribution, but, contrary to the LRT statistic described in this Tutorial, it was specifically designed for this purpose. The -2logΛ statistic of the multinomial goodness-of-fit LRT is sometimes called "Wilks' G²".

From a theoretical standpoint, the two tests are asymptotically equivalent : the general results about the χ² asymptotic distribution of the -2logΛ statistic of a LRT will show us that this distribution is the same as the asymptotic distribution of Pearson's Chi-square. In addition, we'll even show directly that -2logΛ is asymptotically equivalent to Pearson's Chi-square statistic.

But the analyst is more interested in another question : "For finite samples, which one of the two tests is more powerful ?". To our knowledge, this question is intractable, but it can be experimentally addressed by a simulation that we present at the end of this Tutorial. This simulation compares the behaviors and performances of the Chi-square and G² statistics under various conditions.
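The two statistics are easy to compute side by side. The sketch below is ours, with made-up counts. It uses the standard expressions X² = Σ (O - E)² / E and G² = 2 Σ O log(O / E), where O denotes an observed count and E = n.p the corresponding expected count under H0.

```python
import math

def pearson_chi_square(observed, probs):
    """Pearson's Chi-square statistic for multinomial counts against H0 probabilities."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def g_squared(observed, probs):
    """G^2 = -2 log(Lambda) of the multinomial goodness-of-fit LRT."""
    n = sum(observed)
    # Cells with a zero count contribute nothing (0 * log 0 = 0 by convention).
    return 2.0 * sum(o * math.log(o / (n * p))
                     for o, p in zip(observed, probs) if o > 0)

observed = [30, 20, 28, 22]       # made-up counts, n = 100
probs = [0.25, 0.25, 0.25, 0.25]  # H0: the four categories are equally likely

print(pearson_chi_square(observed, probs))
print(g_squared(observed, probs))
```

On these counts the two statistics differ only in the second decimal place, in line with their asymptotic equivalence; the simulation mentioned above is concerned with the situations (small expected counts in particular) where they part ways.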

MULTINOMIAL GOODNESS-OF-FIT LRT

* The multinomial goodness-of-fit LRT
* The test statistic
* Numerator
* Denominator
* The statistic
* Asymptotic distribution
* Interpretation of the statistic
* Number of degrees of freedom
* Asymptotic equivalence with Pearson's Chi-square statistic
* A little-known Taylor expansion
* Asymptotic equivalence of LRT and Chi-square statistics
* Interactive animation

______________________________________________________