A measure of how likely it is that a given "candidate" probability distribution generated a given sample.
Let :
* S be a set of n numerical observations {x1, x2,..., xn} (the sample).
* X be a numerical random variable with probability density function (p.d.f.) p(x).
The likelihood of p(x) (with respect to S) is a measure of how "likely" it is that the data set S was generated as the outcome of n independent draws of X.
In the top illustration, it is very unlikely (although not strictly impossible) that the sample was generated by the normal distribution D1(x), because all data points sit in regions where the probability density is very low.
On the other hand, the bottom illustration depicts a situation where it is much more likely that the data set was generated by the distribution D2(x).
The mathematical expression of the likelihood L is just the product of the values of the probability density function (p.d.f.) taken at all data points :
$$L(p(x)) = \prod_{i=1}^{n} p(x_i)$$
Clearly, L will take substantial values if
and only if the p.d.f. of X takes substantial values for all data points.
Therefore L appears as a somewhat arbitrary, but reasonable
choice for measuring the match between a p.d.f. and a sample.
In fact,
the properties
of the likelihood make it the most important estimator of the match between
a probability distribution and a sample.
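As a small illustration of the product formula, the sketch below evaluates the likelihood of one sample under two candidate normal distributions. The sample values and the two candidates are hypothetical (they are not read off the illustrations), and the helper names `normal_pdf` and `likelihood` are chosen here for convenience.

```python
import math

def normal_pdf(x, m, sigma):
    """Density of the normal distribution N(m, sigma^2) at x."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(sample, pdf):
    """Product of the p.d.f. values taken at all data points."""
    result = 1.0
    for x in sample:
        result *= pdf(x)
    return result

# Hypothetical sample, clustered around 5.
sample = [4.1, 4.8, 5.0, 5.3, 5.9]

# Candidate centred far from the data: every point sits in a low-density tail.
L_far = likelihood(sample, lambda x: normal_pdf(x, 0.0, 1.0))
# Candidate centred on the data: every point sits where the density is high.
L_near = likelihood(sample, lambda x: normal_pdf(x, 5.0, 1.0))

print(L_far, L_near)
```

The first value is many orders of magnitude smaller than the second, which is the numerical counterpart of the "unlikely" and "likely" situations described above.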
_______________________________________
The
likelihood can also be interpreted as follows. Let X1, X2
, ..., Xn be n independent variables, all
with the same p.d.f. p(x) . Then the likelihood
of p(x)
is the joint p.d.f. of the set of variables (X1,
X2 , ..., Xn).
Equivalently, the sample S may be represented as a point x = (x1, x2, ..., xn) in an n-dimensional space. The likelihood of p(x) is then the value, at this point, of the p.d.f. of (X1, X2, ..., Xn) considered as an n-dimensional random variable.
_______________________________________
The
concept of likelihood generalizes to the case where the variable X is categorical
(or "nominal") rather than numerical. Let M1, M2, ..., Mk be the modalities of the variable, and n1, n2, ..., nk (with $\sum_i n_i = n$) be the observed frequencies of the modalities in the sample.
The probability distribution of X is defined by the probability pi of the modality Mi for each i :
pi = Probability of Mi
Then the likelihood of this set of probabilities is defined as :
$$L(p_1, \ldots, p_k) = \prod_{i=1}^{k} p_i^{n_i}$$
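The categorical case can be sketched in a few lines. The sketch assumes the likelihood is the product of the pi raised to the observed counts ni, that is, the probability of the observed sequence of independent draws. The counts and the two candidate distributions are hypothetical.

```python
def categorical_likelihood(probs, counts):
    """Product over modalities of p_i raised to the observed frequency n_i."""
    result = 1.0
    for p, n in zip(probs, counts):
        result *= p ** n
    return result

# Hypothetical observed frequencies n1, n2, n3 (so n = 10).
counts = [6, 3, 1]

# Candidate 1: all modalities equally probable.
L_uniform = categorical_likelihood([1 / 3, 1 / 3, 1 / 3], counts)
# Candidate 2: probabilities equal to the observed proportions.
L_matched = categorical_likelihood([0.6, 0.3, 0.1], counts)

print(L_uniform, L_matched)
```

The candidate whose probabilities match the observed proportions gets the larger likelihood.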
_______________________________________
In many circumstances, it will be computationally more convenient to use the opposite of the logarithm of the likelihood, -log(L), rather than the likelihood itself. This quantity is called the log-likelihood here (many texts call it the negative log-likelihood, and reserve the name "log-likelihood" for log(L) itself). For example, in the case of a continuous numerical variable :
$$\text{log-likelihood} = -\log(L) = -\log\Big(\prod_{i} p(x_i)\Big) = -\sum_{i} \log(p(x_i))$$
Why the "-" sign ?
For an example of the use of the log-likelihood, see "Kullback-Leibler distance".
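One reason the logarithm is computationally convenient is that a product of many small density values quickly falls below what floating-point numbers can represent, whereas a sum of logarithms does not. The sketch below shows this on a hypothetical sample of 2000 points and a normal candidate N(0.5, 1); both are assumptions made for the example.

```python
import math

def normal_pdf(x, m, sigma):
    """Density of the normal distribution N(m, sigma^2) at x."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical sample: 2000 points cycling through 0.0, 0.1, ..., 0.9.
sample = [0.1 * (i % 10) for i in range(2000)]

# Direct product: each factor is below 0.4, so the result underflows.
product = 1.0
for x in sample:
    product *= normal_pdf(x, 0.5, 1.0)

# Sum of logarithms: the same information, but numerically stable.
neg_log_lik = -sum(math.log(normal_pdf(x, 0.5, 1.0)) for x in sample)

print(product, neg_log_lik)
```

The product is reported as zero, although the true likelihood is a tiny positive number, while -log(L) remains an ordinary finite value.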
Likelihood (Method of maximum)
The concept of likelihood plays a central role in estimating the parameters of a model (either predictive or descriptive).
For example, let S be a data set, and suppose we know with certainty that S was generated by a normal distribution N(m, σ²). Suppose further that neither the mean m nor the variance σ² of this normal distribution is known. How can the values of these two parameters be estimated ? Although several approaches can be envisioned, the most important one is to decide in favor of those values for m and σ² that will make the likelihood of N(m, σ²) largest (hence the name "Method of Maximum Likelihood").
We therefore consider now the likelihood L as a function of m and σ, and set out to find the largest value of L when m and σ vary. For this purpose, we should set the partial derivatives of L with respect to m and σ to 0. In this particular case, it is more convenient to work with the logarithm of the likelihood, log(L), which reaches its largest value for the same values of m and σ as the likelihood does, because "log" is a monotonically increasing function. So we set :
$$\frac{\partial \log(L)}{\partial m} = 0 \quad \text{and} \quad \frac{\partial \log(L)}{\partial \sigma} = 0$$
1) These equations only determine the position of an extremum of the likelihood. It is of course appropriate to check that this extremum is indeed a maximum, and not a minimum. This is done by checking that the second derivatives of the likelihood take negative values at the (single) extremum.
2) The "derivation" approach is not always applicable, as L is not always a differentiable function of the parameters. Can you identify a very simple probability density function for which the likelihood is not differentiable with respect to its parameters ?
The actual calculation, though straightforward, is a bit tedious. The very simple (and intuitive) result is :
* mMax = x̄
* σMax = s
where x̄ is the sample average, and s is the sample standard deviation (computed with divisor n rather than n - 1).
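For readers who want to see the main steps, the calculation can be sketched as follows. It assumes the usual normal density with mean m and standard deviation σ, and it writes x̄ for the sample average. This is an outline, not a full proof (the check that the extremum is a maximum is omitted).

```latex
\log L(m, \sigma)
  = \sum_{i=1}^{n} \log p(x_i)
  = -n \log \sigma - \frac{n}{2} \log(2\pi)
    - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (x_i - m)^{2}

\frac{\partial \log L}{\partial m}
  = \frac{1}{\sigma^{2}} \sum_{i=1}^{n} (x_i - m) = 0
  \;\Longrightarrow\;
  m_{Max} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}

\frac{\partial \log L}{\partial \sigma}
  = -\frac{n}{\sigma} + \frac{1}{\sigma^{3}} \sum_{i=1}^{n} (x_i - m)^{2} = 0
  \;\Longrightarrow\;
  \sigma_{Max}^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^{2}
```

The second equation is solved after substituting the value of m obtained from the first, which is why the sum of squares is taken around x̄.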
So, when the candidate distribution is normal, the Maximum Likelihood method of parameter estimation leads to the following conclusion :
* Estimate the mean m by the sample average.
* Estimate the standard deviation by the sample standard deviation.
Warning : this result is not valid for
all types of candidate distributions.
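The result for the normal case can be checked numerically. The sketch below uses a hypothetical sample, computes the sample average and the standard deviation with divisor n, and verifies that moving either parameter slightly away from these values lowers log(L). The function name `normal_log_likelihood` and the perturbation size 0.1 are choices made for this example.

```python
import math

def normal_log_likelihood(sample, m, sigma):
    """log(L) of the candidate N(m, sigma^2) with respect to the sample."""
    n = len(sample)
    ss = sum((x - m) ** 2 for x in sample)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi)
            - ss / (2.0 * sigma ** 2))

# Hypothetical sample.
sample = [2.1, 3.4, 1.9, 4.2, 3.0, 2.7]
n = len(sample)

# Candidate estimates: sample average, and standard deviation with divisor n.
m_hat = sum(sample) / n
sigma_hat = math.sqrt(sum((x - m_hat) ** 2 for x in sample) / n)
best = normal_log_likelihood(sample, m_hat, sigma_hat)

# log(L) at the eight neighbouring parameter values on a small grid.
steps = (-0.1, 0.0, 0.1)
neighbours = [
    normal_log_likelihood(sample, m_hat + dm, sigma_hat + ds)
    for dm in steps
    for ds in steps
    if (dm, ds) != (0.0, 0.0)
]
all_lower = all(value < best for value in neighbours)

print(m_hat, sigma_hat, all_lower)
```

This is only a local check around the estimates; it does not replace the analytical argument.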
This simple example generalizes (under very general conditions) to the case of an arbitrary candidate distribution that is known up to one (or several) parameter(s) :
$$p(x, \theta)$$
The Maximum Likelihood estimation of θ is obtained by solving the equation(s) :
$$\frac{\partial L(x, \theta)}{\partial \theta} = 0$$
The solution may not be unique (it is in the case of a normal candidate distribution). It is also necessary to check that the solution corresponds to a maximum of L, and not to a minimum.
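As a one-parameter illustration, consider a candidate that is not discussed above: the exponential density p(x, θ) = θ·exp(-θx) for x ≥ 0. This choice, the sample values and the helper names are all assumptions made for the example. The sketch works with log(L), which has the same maximizer as L, and solves the derivative equation numerically by bisection. For this density, the derivative of log(L) is n/θ minus the sum of the xi, so the numerical root can be compared with the closed form n divided by the sum of the xi.

```python
def score(theta, sample):
    """Derivative of log(L) with respect to theta for p(x, theta) = theta * exp(-theta * x)."""
    return len(sample) / theta - sum(sample)

def solve_by_bisection(f, lo, hi, tol=1e-10):
    """Root of f in [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

# Hypothetical sample of positive observations.
sample = [0.8, 1.7, 0.3, 2.4, 1.1]

theta_hat = solve_by_bisection(lambda t: score(t, sample), 0.01, 100.0)
closed_form = len(sample) / sum(sample)

print(theta_hat, closed_form)
```

Here the second derivative of log(L), equal to -n/θ², is negative everywhere, so the solution is indeed a maximum.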
________________________________________
In Simple
Linear Regression, parameters are usually
estimated by the Least Squares approach, but it can be shown that
(under the standard assumptions of SLR) the results
thus obtained are the same as those obtained by the Maximum Likelihood
method.
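This equivalence can be illustrated numerically. The sketch below assumes the standard SLR model with independent normal errors of fixed standard deviation, in which log(L) equals a constant minus the sum of squared residuals divided by 2σ². The data, the value of σ and the function names are hypothetical.

```python
import math

def least_squares_fit(xs, ys):
    """Intercept a and slope b minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def gaussian_log_likelihood(xs, ys, a, b, sigma):
    """log(L) of the model y = a + b*x + normal error with std sigma."""
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi)
            - sse / (2.0 * sigma ** 2))

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 1.0  # assumed known and fixed for this comparison

a_ls, b_ls = least_squares_fit(xs, ys)
ll_at_ls = gaussian_log_likelihood(xs, ys, a_ls, b_ls, sigma)

# log(L) for coefficients slightly away from the least squares solution.
steps = (-0.05, 0.0, 0.05)
ll_nearby = [
    gaussian_log_likelihood(xs, ys, a_ls + da, b_ls + db, sigma)
    for da in steps
    for db in steps
    if (da, db) != (0.0, 0.0)
]
ls_is_best = all(value < ll_at_ls for value in ll_nearby)

print(a_ls, b_ls, ls_is_best)
```

Every perturbation of the least squares coefficients increases the sum of squares and therefore decreases the likelihood.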
But it is not so in Logistic
Regression : the parameters of the model are estimated by Maximum
Likelihood, and the model does not minimize the sum of the squares of
the residuals (as it does in SLR).
In Neural Networks, parameters are estimated by the Least Squares approach when the networks are used as regressors. But when they are used as classifiers, the proper parameter estimation method should be (and sometimes is!) Maximum Likelihood. This amounts to minimizing an error function that is not the sum of squares of the residuals.
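For a two-class classifier, one common form of this error function is the cross-entropy, which is simply -log(L) when each label is treated as an independent draw with the predicted class probability. The sketch below assumes that setting; the labels, the predicted probabilities and the function names are hypothetical.

```python
import math

def bernoulli_likelihood(probs, labels):
    """Probability of the observed 0/1 labels, given predicted P(label = 1)."""
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p if y == 1 else (1.0 - p)
    return result

def cross_entropy(probs, labels):
    """Sum over observations of -log(probability assigned to the true label)."""
    return -sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(probs, labels)
    )

def sum_of_squares(probs, labels):
    """Sum of squared residuals between labels and predicted probabilities."""
    return sum((y - p) ** 2 for p, y in zip(probs, labels))

labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.7, 0.6, 0.1]  # hypothetical classifier outputs

nll = -math.log(bernoulli_likelihood(probs, labels))
ce = cross_entropy(probs, labels)
sse = sum_of_squares(probs, labels)

print(nll, ce, sse)
```

The first two values coincide, while the sum of squares is a different quantity.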
________________________________________
The central property of the Maximum Likelihood method of parameter estimation is as follows :
Under very general conditions, a parameter estimator generated by Maximum Likelihood is asymptotically efficient.
In simple terms, this means that as n, the number of observations in the sample, grows without limit, a Maximum Likelihood estimator ultimately becomes "better" (in a precise technical sense) than any other estimator.
For small samples, some estimators may turn out to be better than Maximum Likelihood estimators.
_________________________________________________________________
The following animation illustrates the concept of likelihood, and the Method of Maximum Likelihood.
The likelihood is the product of the heights of all the green connections from the sample points to the Gaussian curve.
The posted value is the ratio of the current likelihood to the largest possible likelihood.
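A quantity of this kind can be computed as sketched below. This is not the animation's own code; the function name `posted_value` and the sample are hypothetical, and the largest possible likelihood is taken to be the one reached at the sample average and the standard deviation with divisor n.

```python
import math

def normal_log_likelihood(sample, m, sigma):
    """log(L) of the candidate N(m, sigma^2) with respect to the sample."""
    return sum(
        -math.log(sigma * math.sqrt(2.0 * math.pi))
        - (x - m) ** 2 / (2.0 * sigma ** 2)
        for x in sample
    )

def ml_estimates(sample):
    """Sample average and standard deviation (divisor n)."""
    n = len(sample)
    m_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - m_hat) ** 2 for x in sample) / n)
    return m_hat, sigma_hat

def posted_value(sample, m, sigma):
    """Ratio of the likelihood at (m, sigma) to the largest possible likelihood."""
    m_hat, sigma_hat = ml_estimates(sample)
    # Work with differences of logs to avoid underflow in the ratio.
    return math.exp(
        normal_log_likelihood(sample, m, sigma)
        - normal_log_likelihood(sample, m_hat, sigma_hat)
    )

sample = [-0.4, 0.3, 0.9, 1.6, 2.1]  # hypothetical sample
m_hat, sigma_hat = ml_estimates(sample)

value_at_best_fit = posted_value(sample, m_hat, sigma_hat)
value_off_centre = posted_value(sample, 3.0, 0.5)

print(value_at_best_fit, value_off_centre)
```

The ratio lies between 0 and 1 and reaches 1 only when the curve is fitted as well as possible.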
To fit the candidate normal distribution to the sample :
* Translate it by translating the top of the curve with your mouse,
* Change its width (standard deviation) by translating either side of the curve with your mouse.
Fine-tune the position and width of the curve by clicking and keeping your mouse button down :
* Above the top of the curve to make it taller (and therefore narrower),
* In the area below the curve to make it shorter (and therefore wider),
* On either side of the curve
to translate it.
____________________________
A little teaser to conclude. We found the values of the parameters of the normal distribution that lead to the largest possible value of the likelihood. Now, what about this value ?
* Without any calculation, convince yourself that the largest value of the likelihood does not change if the sample is translated by an arbitrary quantity.
* Without any calculation, find out how this largest value changes when the x-scale is changed. For example, if every xi is changed into 2.xi, how does the maximum value of the likelihood change ?
* Recall that the probability density function of a normal distribution is :
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$
With a very simple calculation, find the largest
value of the likelihood for any sample (x1, x2 ,
..., xn ). Conclude that if the sample has unit
standard deviation, then the largest value of the likelihood depends only
on the number of points n, and not on the actual positions of the points.
Because of the previous remark, conclude that, in the general case, the largest value of the likelihood
depends only on the standard deviation of the sample and the number of
observations.
Warning : this result is not valid for all types of candidate distributions.
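Once you have worked out your answers, one way to test them is to compute the largest likelihood for a sample, for a translated copy and for a rescaled copy, and compare the three numbers. The sketch below does this for a hypothetical sample; the function name `max_normal_likelihood`, the shift of 10 and the scale factor of 2 are choices made for the example.

```python
import math

def max_normal_likelihood(sample):
    """Largest likelihood over all normal candidates N(m, sigma^2)."""
    n = len(sample)
    # Maximum Likelihood estimates: sample average and std with divisor n.
    m_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - m_hat) ** 2 for x in sample) / n)
    log_l = sum(
        -math.log(sigma_hat * math.sqrt(2.0 * math.pi))
        - (x - m_hat) ** 2 / (2.0 * sigma_hat ** 2)
        for x in sample
    )
    return math.exp(log_l)

sample = [2.1, 3.4, 1.9, 4.2, 3.0, 2.7]  # hypothetical sample

L_original = max_normal_likelihood(sample)
L_shifted = max_normal_likelihood([x + 10.0 for x in sample])
L_scaled = max_normal_likelihood([2.0 * x for x in sample])

print(L_original, L_shifted, L_scaled)
```

Trying other shifts, scale factors and sample sizes should agree with the conclusions you reached by reasoning.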