Point estimation
Estimation and Tests are the two main branches of what is known as "inferential Statistics".
-----
Point estimation (or often, just "estimation") is one of the central activities of the statistician. Probability distributions are usually known only through random samples, and estimation is the art of extracting valuable information from a sample about the probability distribution that generated it.
The term "Estimation" covers several different but closely related viewpoints, which we analyze here.
The above mentioned task may be made simpler by narrowing down the question. This can be done in two ways :
* One may give up the idea of completely identifying the distribution, and be satisfied with a reasonably accurate guess of some particular aspect of the distribution, like its mean, its variance, its mode, or any other quantity defined from it.
* One may also assume that the distribution belongs to a family of distributions described by a mathematical expression containing one or several numerical parameters. Identifying the distribution then amounts to guessing the values of a few parameters.
This is called "estimating the parameters of a distribution", or simply parameter estimation.
The distinction between "probability distribution estimation" and "parameter estimation" is pointless when the parameters in the mathematical expression of the family are themselves properties of the distribution, as is the case, for example, for the normal distribution (mean and variance).
A model, whether predictive or descriptive, may be perceived as the outcome of a particular type of probability distribution estimation. For example, regression assumes that the data was generated by a probability distribution described by the sum of :
* A deterministic part (the true regression line),
* And a random part (the error term).
"Fitting" the regression model then consists in estimating :
* The values of the parameters of the true regression line.
* And the statistical properties of the error terms.
The result (the regression model) may then be regarded as an estimation of the probability distribution that generated the data.
More generally, a model contains parameters whose values are calculated from the sample. These parameters are therefore random variables that have distributions of their own. Identifying these distributions is an important task in Data Modeling, because analyzing them tells us how reliable the model is : in a nutshell, models whose parameters are strongly biased (see below) or have broad distributions are unreliable.
This is called estimating the parameters of a model, or "fitting the model to the data".
A model has two main purposes :
* The discovery of valuable information about the probability distribution that generated the data.
* Making predictions about new incoming data. For example, a regression model will be used for predicting the value of the response variable y for any new value of the predictor x. Even descriptive models make predictions : a histogram predicts the value of the density of a distribution where a new observation happens to fall.
Because the model is random, so are its predictions. For the practitioner, it is of utmost importance to be able to figure out how reliable these predictions are. This is achieved by considering a model prediction as a random variable used as an estimator of the quantity to be predicted, and then evaluating the quality (e.g. bias and variance) of this estimator. Note that the quantity to be predicted may itself be a random variable : for example, a regression model will deliver an estimate of the value of the response variable for a new predictor x, but in the real world, the mechanism that generated the data delivers a random value of y for any given x. One then tries to estimate the mean of the true distribution of y.
Let θ be a parameter of a distribution, whose true (and unknown) value is θ0.
An estimator is a function of the observations in the sample (a "statistic") whose value will be used as a guess of the true but unknown value θ0 of the parameter θ. The value taken by an estimator on a given sample is called an estimate (of θ0).
We will denote by θ* both an estimator of θ and the estimate it produces on a given sample.
So :
θ* = θ*(sample)
Estimates are expected to be close to the true value θ0. But the sample being random, an estimator is a random variable. So it can never be said with certainty that an estimate is close to the true value of a parameter of the distribution (or of the model).
So Estimation theory focuses not on individual estimates, which taken alone say nothing about their own accuracy, but on the properties of estimators considered as random variables, that is, their probability distributions, or some restricted aspects thereof (mostly mean, variance and asymptotic properties).
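To make the idea of an estimator as a random variable more concrete, here is a minimal simulation sketch. It assumes, purely for illustration, a normal distribution with mean 5 and standard deviation 2, and uses the sample mean as the estimator. The function name sample_mean_estimates and all numerical values are hypothetical choices, not part of the original entry.

```python
import random
import statistics

def sample_mean_estimates(n, n_samples, mu=5.0, sigma=2.0, seed=0):
    """Draw n_samples samples of size n from N(mu, sigma) and return
    the estimate (sample mean) obtained on each of them."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_samples)
    ]

# Each sample yields a different estimate: the estimator has a distribution.
estimates = sample_mean_estimates(n=25, n_samples=2000)
print("mean of the estimates         :", statistics.fmean(estimates))
print("std deviation of the estimates:", statistics.stdev(estimates))
```

The two printed numbers summarize the distribution of the estimator (its mean and its spread), which is what Estimation theory studies, rather than any single estimate.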
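We will see that a given parameter of a distribution may have several different estimators to choose from. So the two central questions of Estimation theory are :
* What properties make an estimator a good one ?
* How can estimators with these properties be constructed ?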
Nothing in the structure of the equation defining a statistic tells whether it is, or is not, an estimator of anything. It can be said that there is no such thing as a definition of an estimator. But a particular statistic will be used to estimate a parameter if it possesses certain desirable properties, which we briefly review here.
A central idea of Statistics is that very large samples are (unless you're terribly unlucky) a reasonably faithful image of the distribution itself. In other words, the empirical distribution function is hoped to be pretty close to the real distribution function for large samples. For example, in the case of a continuous distribution, this means that there should be many observations in regions where the probability density is high, and few where the probability density is low (see the Fundamental Theorem of Statistics).
So you would certainly expect from a "good" estimator θ* of a parameter θ that it produces estimates that are closer and closer to the true value θ0 as you consider larger and larger samples.
Yet, again, because an estimator is a random variable, this convergence of the estimates towards θ0 as the sample size n grows without limit cannot be expected to happen in a deterministic way. It can happen only in a probabilistic way, and we will have to be satisfied with the following weaker property : however small the width ε and however close to 1 the probability P, all you have to do is consider large enough samples for the probability of the estimate θ* being within a bracket of width ε centered on θ0 to be larger than P.
In other words, beyond a certain sample size, at least 100·P % of the estimates will fall inside this narrow bracket around θ0.
A statistic with such a property is called a consistent estimator of the parameter θ.
Consistency is the least you can demand from a statistic to qualify as an estimator.
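The bracket property can be observed numerically. The sketch below is only an illustration under stated assumptions : the distribution is normal with mean θ0 = 0 and standard deviation 1, the estimator is the sample mean, and the bracket has width ε = 0.2. The function name fraction_within is hypothetical.

```python
import random
import statistics

def fraction_within(n, epsilon, theta0=0.0, sigma=1.0, n_samples=500, seed=1):
    """Fraction of estimates (sample means) falling inside the bracket
    of width epsilon centered on theta0, for samples of size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        estimate = statistics.fmean(rng.gauss(theta0, sigma) for _ in range(n))
        if abs(estimate - theta0) <= epsilon / 2:
            hits += 1
    return hits / n_samples

# The fraction of estimates inside the bracket grows with the sample size.
for n in (10, 100, 1000):
    print(n, fraction_within(n, epsilon=0.2))
```

For a consistent estimator, the printed fraction should approach 1 as n grows, whatever the (fixed) width of the bracket.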
Except for some rare and pathological exceptions, an estimator's distribution (like that of any other non-trivial statistic) becomes narrower and narrower, and more and more normal-like, as larger and larger samples are considered. If we take for granted the fact that the variance of the estimator tends to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to θ0 as the sample size grows without limit, as shown in the upper and lower images of the accompanying illustration.
1) In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converges in probability to θ0.
2) The Weak Law of Large Numbers is an example of identification of a consistent estimator : it states that the sample mean converges in probability to the distribution mean.
3) For an exception to the rule of the "narrowing of the distribution of a statistic for larger and larger samples", see the Cauchy distribution. It is shown there that, despite the symmetry of the distribution, the sample mean is not a consistent estimator of the distribution median (the distribution has no mean). The sample median, though, is a consistent estimator of the distribution median.
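The Cauchy exception can be illustrated by simulation. The following sketch assumes a standard Cauchy distribution (median 0), generated by the inverse-transform formula tan(π(U - 1/2)) with U uniform on (0, 1). Because the variance does not exist, the spread of each estimator is measured by its interquartile range. The helper names cauchy_sample, iqr and spread_of_estimator are hypothetical.

```python
import math
import random
import statistics

def cauchy_sample(rng, n, location=0.0):
    """Sample of size n from a Cauchy distribution (inverse-transform method)."""
    return [location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def iqr(values):
    """Interquartile range, used as a measure of spread that always exists."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def spread_of_estimator(estimator, n, n_samples=400, seed=2):
    """Spread of the estimates produced by `estimator` on samples of size n."""
    rng = random.Random(seed)
    return iqr([estimator(cauchy_sample(rng, n)) for _ in range(n_samples)])

for n in (10, 100, 1000):
    print(n,
          "mean:", spread_of_estimator(statistics.fmean, n),
          "median:", spread_of_estimator(statistics.median, n))
```

In such a run, the spread of the sample median should shrink as n grows, while the spread of the sample mean should stay roughly the same, in line with the fact that the mean of n Cauchy observations has the same distribution as a single observation.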
Consistency is an asymptotic property : defining consistency requires considering arbitrarily large samples. In real life, sample size will be limited by time or budget constraints. So it is natural to consider what quality should be expected from an estimator based on samples of a fixed size n.
Then you would certainly hope the central region of the distribution of the estimator to be close to the true value θ0 of the parameter. One way of expressing this idea is to consider estimators whose distribution mean is equal to θ0, the true value of the parameter θ, for any value of n. Such an estimator is said to be unbiased, and unbiasedness translates into :
E[θ*] = θ0 for any sample size n
This definition is somewhat arbitrary, as the mean has no magic virtue other than its mathematical convenience. Any other measure of central tendency, like the median or the mode, would have provided other adequate embodiments of the idea of "lack of bias", except for the fact that further calculations might then prove intractable.
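For a given n, an estimator whose expectation is not equal to θ0 is said to be biased. For example :
* The sample variance computed with the divisor n is a biased estimator of the distribution variance.
* The sample correlation coefficient is a biased estimator of the distribution correlation coefficient.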
Yet, these two estimators are consistent, as their means tend to the true values (resp. of the variance and the correlation coefficient) as larger and larger samples are considered. So a consistent estimator may be biased for all values of n.
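The bias of the variance estimate computed with the divisor n can be checked by simulation. For independent observations with variance σ², the expectation of that estimate is (n - 1)/n · σ², whereas the divisor n - 1 gives an unbiased estimate. The sketch below assumes a standard normal distribution (σ² = 1) and samples of size n = 5; the function name average_variance_estimates is hypothetical.

```python
import random
import statistics

def average_variance_estimates(n, n_samples=20000, sigma=1.0, seed=3):
    """Average, over many samples of size n, of the two usual variance estimates."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_div_n += statistics.pvariance(sample)          # divisor n
        total_div_n_minus_1 += statistics.variance(sample)   # divisor n - 1
    return total_div_n / n_samples, total_div_n_minus_1 / n_samples

biased_avg, unbiased_avg = average_variance_estimates(n=5)
print("divisor n     :", biased_avg)    # theory: (n - 1) / n = 0.8
print("divisor n - 1 :", unbiased_avg)  # theory: 1.0
```

As n grows, the factor (n - 1)/n tends to 1, which is why the biased estimator is nevertheless consistent.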
As we will see, bias does not necessarily make an estimator useless, even for small samples.
A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both :
* The sample mean
and
* The sample median
are unbiased estimators of the distribution mean (when it exists). Which one should we choose ?
Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value θ0 than estimates generated by the other one. One way to do that is to select the estimator with the lower variance.
This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators θ*1 and θ*2 of the same parameter θ, one defines the efficiency of θ*2 with respect to θ*1 (for a given sample size n) as the ratio of their variances :
Relative efficiency (θ*2 with respect to θ*1) = Var(θ*1) / Var(θ*2), both variances being taken for sample size n
A value larger than 1 means that θ*2 has the lower variance, and is therefore the more efficient of the two.
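As an illustration of this ratio, the sketch below estimates by simulation the efficiency of the sample median with respect to the sample mean. It assumes a standard normal distribution and samples of size n = 25; for the normal distribution, this ratio is known to tend to 2/π (about 0.64) for large samples. The function name relative_efficiency is hypothetical.

```python
import random
import statistics

def relative_efficiency(n, n_samples=4000, seed=4):
    """Efficiency of the sample median with respect to the sample mean,
    i.e. Var(sample mean) / Var(sample median), estimated by simulation."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.variance(means) / statistics.variance(medians)

efficiency = relative_efficiency(n=25)
print("efficiency of the median with respect to the mean:", efficiency)
```

A value below 1 indicates that, under these assumptions, the sample mean is the more efficient of the two estimators.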
One might then wonder if, for a given parameter θ, there exists an unbiased estimator which is more efficient than any other unbiased estimator for all sample sizes. The answer is "In general, no". But it is sometimes possible to identify an unbiased estimator θE with a similar, albeit weaker property :
For any unbiased estimator θ*, the limit of the relative efficiency of θ* with respect to θE, as the sample size n grows without limit, is at most 1.
So θE is indeed the "most efficient estimator", but only in an asymptotic sense. For any given sample size n, there might very well be an unbiased estimator θ* more efficient than θE.
An estimator θE with this property is called an efficient estimator.
The question :
* "What is the smallest possible variance of an unbiased estimator ?"
or equivalently :
* "Is there a lower bound to the variance of an unbiased estimator ?"
is both important and difficult. It is addressed in the entry about the Cramér-Rao inequality.
The practitioner is not particularly keen on unbiasedness. What is really important to him is that, on the average, the estimate θ* be close to the true value θ0. So he will tend to favor estimators such that the mean-square error :
E[(θ* - θ0)²]
be as low as possible, whether θ* is biased or not. Such an estimator is called a minimum mean-square-error estimator.
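The mean-square error can be split into a variance term and a squared bias term, which shows why a slightly biased estimator with a small variance may be preferable to an unbiased one with a large variance. The standard decomposition, written in the notation of this entry (θ* the estimator, θ0 the true value), is obtained by writing θ* - θ0 = (θ* - E[θ*]) + (E[θ*] - θ0) :

```latex
\begin{aligned}
E\big[(\theta^* - \theta_0)^2\big]
  &= E\big[(\theta^* - E[\theta^*])^2\big]
     + 2\,\big(E[\theta^*] - \theta_0\big)\,E\big[\theta^* - E[\theta^*]\big]
     + \big(E[\theta^*] - \theta_0\big)^2 \\
  &= \operatorname{Var}(\theta^*) + \big(E[\theta^*] - \theta_0\big)^2
\end{aligned}
```

The middle term vanishes because E[θ* - E[θ*]] = 0. The last term is the square of the bias, so for an unbiased estimator the mean-square error reduces to the variance.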
Given two estimators :
* θ*1, unbiased but with a large variance,
* θ*2, slightly biased but with a small variance,
θ*2 might prove a better estimator than θ*1 in practice (lower image in the illustration below).
Yet, identifying minimum mean-square-error estimators is not an easy task, and most commonly encountered estimators are simply unbiased estimators.
As noted above, there is no way to tell a priori if a particular statistic will be useful as an estimator of a certain parameter. Given a statistic, only its behavior in terms of :
* Bias with respect to the true value of a parameter,
* Narrowing of its distribution for large samples,
will tell if this statistic is worth being considered for the purpose of estimating the value of this parameter.
So the question of how to devise a statistic that will have some of the desirable properties of a good estimator remains open. We now briefly mention three popular methods used for constructing estimators of a given parameter : the method of moments, Maximum Likelihood and Least Squares.
The first method is the most natural. Because a large sample is (hoped to be) a faithful image of the unknown distribution D, having this sample is almost as good as having the distribution itself. So we proceed as if the empirical distribution function D* were the true distribution. If the parameter θ is defined in terms of a function of D :
θ = f(D)
then we use :
θ*= f(D*)
as an estimate of θ.
This is the "common sense" attitude that makes us use the sample mean as an estimate of the mean of the population without thinking twice about it.
This "plug-in" method is called the method of moments. In the early days of Statistics, the method of moments was the only one available, and was mostly used for estimating the moments of a distribution (mean, variance and higher order moments), hence its name.
An estimator constructed by the method of moments is usually consistent (although this has to be proved in each case), but may suffer from severe bias for small samples.
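As a sketch of how the method works for a parametric family, consider a Gamma distribution with shape k and scale s, whose mean is k·s and whose variance is k·s². Replacing the distribution mean and variance by their sample counterparts and solving for k and s gives the estimates below. The choice of the Gamma family, the true values (k = 3, s = 2) and the function name gamma_method_of_moments are assumptions made for illustration only.

```python
import random
import statistics

def gamma_method_of_moments(sample):
    """Method-of-moments estimates of the shape and scale of a Gamma
    distribution: solve mean = k * s and variance = k * s**2 for k and s,
    using the sample mean and the plug-in (divisor n) sample variance."""
    m = statistics.fmean(sample)
    v = statistics.pvariance(sample, mu=m)
    shape = m * m / v
    scale = v / m
    return shape, scale

rng = random.Random(6)
# random.gammavariate(alpha, beta): alpha is the shape, beta is the scale.
sample = [rng.gammavariate(3.0, 2.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_method_of_moments(sample)
print("estimated shape:", shape_hat)
print("estimated scale:", scale_hat)
```

With a large sample, both estimates should land close to the values used to generate the data.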
Given a sample and a candidate distribution, the Likelihood is a measure of "how likely" it is that the sample was generated by that particular distribution.
Given a family of distributions (usually summarized by a mathematical expression containing a few numerical parameters), the Method of Maximum Likelihood (ML) selects that particular distribution in the family that makes the Likelihood largest. The values of the parameters thus obtained are called the Maximum Likelihood estimates of the parameters of the (unknown) underlying distribution.
The Method of Maximum Likelihood is established on much more solid theoretical grounds than the Method of Moments. In particular, it can be shown that, under very general conditions, a ML estimator is :
* Consistent,
* Asymptotically normal,
* Asymptotically efficient.
ML estimation is the most widely used method of estimation.
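A minimal sketch of Maximum Likelihood estimation, assuming an exponential family with density λ·exp(-λx) (an illustrative choice, not taken from this entry). The log-likelihood of a sample is n·log(λ) - λ·Σxi, and it is maximal for λ = n / Σxi, the reciprocal of the sample mean. The code compares this closed form with a brute-force search over a grid of candidate values; all function names are hypothetical.

```python
import math
import random

def exponential_log_likelihood(rate, sample):
    """Log-likelihood of an exponential distribution with the given rate."""
    return len(sample) * math.log(rate) - rate * sum(sample)

def ml_rate_closed_form(sample):
    """Maximum Likelihood estimate of the rate: reciprocal of the sample mean."""
    return len(sample) / sum(sample)

def ml_rate_grid(sample, grid):
    """Candidate rate in `grid` that makes the log-likelihood largest."""
    return max(grid, key=lambda rate: exponential_log_likelihood(rate, sample))

rng = random.Random(8)
sample = [rng.expovariate(1.5) for _ in range(5000)]  # true rate: 1.5

closed_rate = ml_rate_closed_form(sample)
grid_rate = ml_rate_grid(sample, [i / 100 for i in range(1, 501)])
print("closed-form ML estimate:", closed_rate)
print("grid-search ML estimate:", grid_rate)
```

The two estimates should agree up to the resolution of the grid. In families where no closed form exists, the maximization is carried out numerically.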
A natural estimate of the mean of a random variable is the sample mean m, which also has the property of minimizing the sum :
S = Σi (xi - a)²
where a is an adjustable parameter : S is minimal for a = m.
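This minimizing property of the sample mean can be verified in one line. Writing xi - a = (xi - m) + (m - a), with n the sample size and m the sample mean, the cross term disappears because the deviations xi - m sum to zero :

```latex
S(a) = \sum_{i=1}^{n} (x_i - a)^2
     = \sum_{i=1}^{n} (x_i - m)^2
       + 2\,(m - a)\sum_{i=1}^{n} (x_i - m)
       + n\,(m - a)^2
     = \sum_{i=1}^{n} (x_i - m)^2 + n\,(m - a)^2
```

The first term does not depend on a, and the second is never negative and vanishes only for a = m, so S is minimal exactly there.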
The function y = f(x) is the regression function of y on x if, for any x, f(x) is the mean of the values of y for this value of x. Therefore, regression may be perceived as simultaneously estimating the means of an infinity of random variables, one for each value of x.
Least Squares estimation is an extension of the above mentioned property of the sample mean. The parameters of a regression model y = f(x) are usually calculated by imposing that the sum of the squares of the differences between :
* The observed values yi of the response variable,
* And the values f(xi) predicted by the model,
be minimal.
Least Squares is the technique used for calculating the models in Simple and Multiple Linear Regression.
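For Simple Linear Regression, the Least Squares criterion has a well-known closed-form solution : the slope is the ratio of the sum of cross-products of deviations to the sum of squared deviations of x, and the intercept follows from the two sample means. The sketch below implements these formulas; the function name fit_line and the simulated data (true line y = 1 + 2x with normal errors) are hypothetical.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Least Squares estimates (intercept, slope) of the line y = a + b * x."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Noise-free data lying exactly on y = 1 + 2x are recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)

# Noisy data: the estimates are close to, but not equal to, the true values.
rng = random.Random(5)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
intercept_hat, slope_hat = fit_line(xs, ys)
print("estimated intercept:", intercept_hat)
print("estimated slope    :", slope_hat)
```

Running the noisy part with different seeds produces different estimates, which is another way of seeing that the parameters of a fitted model are random variables.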
For more on Least Squares estimation, please see here.
What we described in this page is known as "point estimation", the reason being that the action of estimating produces just a number, the estimate. The weakness of point estimation is that this estimate comes with no clue about how credible it is.
It is sometimes possible to bring about some additional information concerning this credibility. This is the goal of a different kind of estimation known as "interval estimation".
In a nutshell, given a sample, interval estimation builds a segment such that it is possible to calculate the probability for this segment (known as "confidence interval") to contain θ0, the true value of the parameter. For a given probability (known as "confidence level"), the shorter the confidence interval, the better the precision with which θ0 has been localized.
You may read more about interval estimation here.
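As a hedged illustration of the idea, the sketch below builds the usual large-sample interval for a mean, m ± 1.96 · s / √n (an approximate 95 % confidence level), and checks by simulation how often it contains the true value. The normal distribution, the true mean θ0 = 10, the sample size and the function names normal_approx_interval and coverage are assumptions made for this example.

```python
import math
import random
import statistics

def normal_approx_interval(sample, z=1.96):
    """Approximate 95 % confidence interval for the mean: m +/- z * s / sqrt(n)."""
    n = len(sample)
    m = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return m - half_width, m + half_width

def coverage(theta0=10.0, sigma=3.0, n=50, n_samples=2000, seed=7):
    """Fraction of simulated samples whose interval contains the true mean theta0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(theta0, sigma) for _ in range(n)]
        low, high = normal_approx_interval(sample)
        if low <= theta0 <= high:
            hits += 1
    return hits / n_samples

print("fraction of intervals containing the true mean:", coverage())
```

The printed fraction should be close to the nominal confidence level. Note that it is the interval, not θ0, that changes from one sample to the next.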
____________________________________________________________
In summary, the goal of Estimation is to extract complete or partial information about a probability distribution from a sample generated by this distribution. This information is necessarily probabilistic, and is embodied in an estimate (of a parameter of the distribution). An estimator is a statistic whose properties as a random variable let us expect that its value for the sample at hand (the estimate) is close to the true value of the parameter that is being estimated.
A few general techniques are available for constructing useful estimators (moments, maximum likelihood, least-squares).
It is often possible to associate with a point estimate a confidence interval and a confidence level for this interval.