Point estimation

Estimation and Tests are the two main branches of what is known as "inferential Statistics".

-----

Point estimation (or often, just "estimation") is one of the central activities of the statistician. Probability distributions are usually known only through random samples, and estimation is the art of extracting valuable information from a sample about the probability distribution that generated it.

 

The term "Estimation" covers several different but closely related viewpoints, which we analyze here.

What is to be estimated ?

Probability distribution estimation

             The ultimate dream of Statistics is to unambiguously identify the probability distribution behind a random phenomenon. But this distribution is known only through a sample that is both finite and random, and can therefore never be identified with certainty. Yet attempts can be made at "guessing" the nature of the distribution that generated this particular sample. This question is relatively simple for discrete probability distributions, but may prove formidable for continuous distributions.

Guessing a complete probability distribution is called probability distribution estimation, and more specifically, probability density estimation in the case of continuous distributions.

Parameter estimation (of a distribution)

           The above mentioned task may be made simpler by narrowing down the question. This can be done in two ways :

    1. You may give up the idea of completely identifying the distribution, and be satisfied with a reasonably accurate guess of some particular aspect of the distribution, like its mean, its variance, its mode, or any other quantity defined from it.
    2. You may also assume that the distribution belongs to a family of distributions described by a mathematical expression containing one or several numerical parameters. Identifying the distribution then amounts to guessing the values of a few parameters.

 

This is called estimating the parameters of a distribution, or simply parameter estimation.


The distinction between these two types of estimation is moot when the parameters in the mathematical expression of the family are themselves properties of the distribution, as is the case, for example, for the normal distribution (mean and variance).

Parameter estimation (of a model)

            Finally, a model, whether predictive or descriptive, may be perceived as a particular type of description of a probability distribution. A parametric model contains parameters, whose values are calculated from the sample. These parameters are therefore random variables that have distributions of their own, and identifying these distributions is an important task in Data Modeling, for analyzing these distributions will tell us how reliable the model is (in a nutshell, models whose parameters have broad distributions are unreliable).

This is called estimating the parameters of a model.

 

We now review some aspects of parameter estimation.

Estimators and estimates

            Let q  be a parameter of a distribution, whose true (and unknown) value is q0.

            An estimator is a function of the observations in the sample (a "statistic") whose value will be used as a guess of the true value q0 of the parameter q. The value taken by an estimator on a given sample is called an estimate (of q0).

We will denote by q* both an estimator of q and the estimate it produces on a given sample. So :

q* = q*(sample)

Estimates are expected to be close to the true value q0. But the sample being random, an estimator is a random variable. So it can never be said with certainty that an estimate is close to the true value of a parameter of the distribution (or of the model).

Estimation theory therefore focuses not on individual estimates, which are meaningless when taken in isolation, but on the properties of estimators considered as random variables, that is, their probability distributions, or some restricted aspects thereof (mostly mean and variance).
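To make this concrete, the sketch below (Python, standard library only) applies one and the same estimator, the sample mean, to many samples drawn from the same distribution. The distribution (normal, mean 10, standard deviation 2), the sample size and the seed are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)           # fixed seed so the run is reproducible
TRUE_MEAN, TRUE_SD = 10.0, 2.0   # assumed distribution, "unknown" to the statistician
N = 25                           # sample size

def sample_mean_estimate(n):
    """Draw one sample of size n and return the sample mean (one estimate)."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    return statistics.fmean(sample)

# Each new sample gives a different estimate: the estimator is a random variable.
estimates = [sample_mean_estimate(N) for _ in range(2000)]

print("first five estimates:", [round(e, 3) for e in estimates[:5]])
print("mean of the estimates:", round(statistics.fmean(estimates), 3))
print("std. dev. of the estimates:", round(statistics.stdev(estimates), 3))
```

No single estimate says how close it is to the true value. What can be studied is the distribution of the 2000 estimates, here summarized by its mean and standard deviation.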
 

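We will see that a given parameter of a distribution may have several different estimators to choose from. So the two central questions of Estimation theory are :

   * What properties make an estimator a "good" one ?
   * How can an estimator with these properties be constructed ?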

Desirable properties of a "good" estimator

        Nothing in the structure of the equation defining a statistic tells whether it is, or is not, an estimator of something. It can be said that there is no such thing as a definition of an estimator. But a particular statistic will be used to estimate a parameter if it possesses certain desirable properties, which we briefly review here.

Consistency

             A central idea of Statistics is that very large samples are (unless you're terribly unlucky) a reasonably faithful image of the distribution itself. In other words, the empirical distribution function is hoped to be pretty close to the real distribution function for large samples. For example, in the case of a continuous distribution, this means that there should be many observations in regions where the probability density is high, and few where the probability density is low.

So you would certainly expect a "good" estimator q* of a parameter q to produce estimates that are closer and closer to the true value q0 as you consider larger and larger samples.
Yet, again, because an estimator is a r.v., this convergence of the estimates towards q0 as the sample size n grows without limit cannot be expected to happen in a deterministic way. It can happen only in a probabilistic way, and we will have to be satisfied with the following weaker property :


However small the width e and however close to 1 the probability P, all you have to do is consider large enough samples for the probability that the estimate q* falls within a bracket of width e centered on q0 to be larger than P.
In other words, beyond a certain sample size, at least 100·P % of the estimates will fall inside this narrow bracket around q0.

 

A statistic with such a property is called a consistent estimator of the parameter q.

Consistency is the least you can demand from a statistic to qualify as an estimator.
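The property can be observed in a small simulation. The sketch below assumes an exponential distribution with mean 3 and uses the sample mean as the estimator of that mean. The bracket width, the number of trials and the seed are arbitrary choices. For each sample size it counts the fraction of estimates that fall inside the bracket.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed for reproducibility
Q0 = 3.0                 # true value of the parameter (the mean of the distribution)
E = 0.5                  # total width of the bracket centered on Q0
TRIALS = 1000            # number of samples drawn for each sample size

def fraction_inside(n):
    """Fraction of sample means (samples of size n) lying within E/2 of Q0."""
    hits = 0
    for _ in range(TRIALS):
        # Exponential distribution with mean Q0, i.e. rate 1/Q0 (an assumption).
        sample = [rng.expovariate(1.0 / Q0) for _ in range(n)]
        if abs(statistics.fmean(sample) - Q0) < E / 2:
            hits += 1
    return hits / TRIALS

fractions = {n: fraction_inside(n) for n in (10, 100, 1000)}
for n, frac in fractions.items():
    print(f"n = {n:5d}   fraction inside the bracket = {frac:.3f}")
```

The fraction grows towards 1 as n increases. For any target probability P < 1, a large enough sample size pushes the fraction above P.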

 

Except for some rare and pathological exceptions, an estimator's distribution (like that of any other non trivial statistic) becomes narrower and narrower, and more and more normal-like as larger and larger samples are considered. If we take for granted the fact that the variance of the estimator will tend to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to q0 as the sample size grows without limit, as shown in the upper and lower images below :


1) In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converges in probability to q0.

2) The Weak Law of Large Numbers is an example of identification of a consistent estimator.

3) For an exception to the rule of the "narrowing of the distribution of a statistic for larger and larger samples", see the Cauchy distribution. There we show that, despite the symmetry of the distribution, the sample mean is not a consistent estimator of the distribution median (the distribution has no mean). The sample median, though, is a consistent estimator of the distribution median.
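The Cauchy exception in remark 3 can be observed numerically. The sketch below draws standard Cauchy samples (median 0) by the inverse-CDF method and measures the spread of the sample mean and of the sample median over repeated samples. The interquartile range is used as the spread measure, an illustrative choice made because the sample mean has no variance here. Sample sizes, trial count and seed are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(2)   # fixed seed for reproducibility

def cauchy_sample(n):
    """n draws from the standard Cauchy distribution (median 0), by inverse CDF."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def iqr(values):
    """Interquartile range: a spread measure that exists even without a variance."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def spread(estimator, n, trials=400):
    """IQR of the estimates produced by `estimator` over `trials` samples of size n."""
    return iqr([estimator(cauchy_sample(n)) for _ in range(trials)])

# results[n] = (IQR of the sample means, IQR of the sample medians)
results = {n: (spread(statistics.fmean, n), spread(statistics.median, n))
           for n in (10, 100, 1000)}

for n, (iqr_mean, iqr_median) in results.items():
    print(f"n = {n:4d}   IQR of sample mean = {iqr_mean:7.3f}"
          f"   IQR of sample median = {iqr_median:7.3f}")
```

The spread of the sample median shrinks as n grows, while that of the sample mean does not. The mean of n standard Cauchy observations is itself standard Cauchy, whatever n.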

 Unbiasedness

             Consistency is an asymptotic property : defining consistency requires considering arbitrarily large samples. In real life, sample size will be limited by time or budget constraints. So it is natural to consider what quality should be expected from an estimator based on samples of a fixed size n.
Then you would certainly hope the central region of the distribution of the estimator to be close to the true value q0 of the parameter. One way of expressing this idea is to consider estimators whose distribution mean is equal to q0, the true value of the parameter q, for any value of n. Such an estimator is said to be unbiased, and unbiasedness translates into :

E[q*]n = q0    for any  n


This definition is somewhat arbitrary, as the mean has no magic virtue other than its mathematical convenience. Any other measure of central tendency, like the median or the mode, would have provided other adequate embodiments of the idea of "lack of bias", except for the fact that further calculations might then prove intractable.

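For a given n, an estimator whose expectation is not equal to q0 is said to be biased. For example :

   * The sample variance (computed with divisor n) is a biased estimator of the distribution variance.
   * The sample correlation coefficient is a biased estimator of the distribution correlation coefficient.
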

Yet, these two estimators are consistent, as their means tend to the true values (resp. of the variance and the correlation coefficient) as larger and larger samples are considered. So a consistent estimator may be biased for all values of n.
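The variance case can be checked by simulation. The sketch below averages, over many samples, the variance estimate computed with divisor n and the one computed with divisor n - 1. The normal distribution with variance 4, the sample sizes, the trial count and the seed are assumptions made for illustration. For the divisor-n estimator the theoretical expectation is (n - 1)/n times the true variance.

```python
import random
import statistics

rng = random.Random(3)   # fixed seed for reproducibility
TRUE_SD = 2.0
TRUE_VAR = TRUE_SD ** 2  # 4.0
TRIALS = 10000

def average_estimates(n):
    """Average of the divisor-n and divisor-(n-1) variance estimates over many samples."""
    total_n = total_n1 = 0.0
    for _ in range(TRIALS):
        xs = [rng.gauss(0.0, TRUE_SD) for _ in range(n)]
        m = statistics.fmean(xs)
        ss = sum((x - m) ** 2 for x in xs)   # sum of squared deviations from the sample mean
        total_n += ss / n
        total_n1 += ss / (n - 1)
    return total_n / TRIALS, total_n1 / TRIALS

# averages[n] = (average with divisor n, average with divisor n - 1)
averages = {n: average_estimates(n) for n in (5, 50)}

for n, (avg_n, avg_n1) in averages.items():
    print(f"n = {n:3d}   divisor n: {avg_n:.3f}   divisor n-1: {avg_n1:.3f}"
          f"   theory for divisor n: {TRUE_VAR * (n - 1) / n:.3f}")
```

The divisor-n estimator falls short of 4 on average for every n, but the shortfall shrinks as n grows. This is the behavior of an estimator that is biased yet consistent.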


As we will see, bias does not necessarily make an estimator useless, even for small samples.

Efficiency

           A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both :

   * the sample mean,
   * the sample median,

are unbiased estimators of the distribution mean (when it exists). Which one should we choose ?


Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value q0 than estimates generated by the other one. One way to do that is to select the estimator with the lower variance.


This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators q*1 and q*2 of the same parameter q, one defines the relative efficiency of q*2 with respect to q*1 (for a given sample size n) as the ratio of their variances :

Relative efficiency (q*2 with respect to q*1)n = Var(q*1)n / Var(q*2)n

A value below 1 means that q*2 has the larger variance, and is therefore less efficient than q*1.
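As an illustration of this ratio, the sketch below estimates by simulation the relative efficiency of the sample median with respect to the sample mean, both used as estimators of the center of a normal distribution. The standard normal distribution, the sample size of 25, the trial count and the seed are assumptions. For normal data the ratio is known to approach 2/π (about 0.64) for large samples.

```python
import random
import statistics

rng = random.Random(4)   # fixed seed for reproducibility
N = 25                   # sample size
TRIALS = 5000            # number of simulated samples

means, medians = [], []
for _ in range(TRIALS):
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
    means.append(statistics.fmean(xs))
    medians.append(statistics.median(xs))

var_mean = statistics.variance(means)       # close to 1/N in theory
var_median = statistics.variance(medians)

# Relative efficiency of the sample median with respect to the sample mean:
# Var(sample mean) / Var(sample median).
rel_eff = var_mean / var_median

print(f"Var(sample mean)   = {var_mean:.4f}")
print(f"Var(sample median) = {var_median:.4f}")
print(f"relative efficiency of the median w.r.t. the mean = {rel_eff:.3f}")
```

Under these assumptions the ratio comes out well below 1, so the sample mean is the more efficient of the two for normal data. The ranking depends on the distribution and can reverse for heavy-tailed ones.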

One might then wonder if, for a given parameter q, there exists an unbiased estimator which is more efficient than any other unbiased estimator for all sample sizes. The answer is "In general, no". But it is sometimes possible to identify an unbiased estimator qE with a similar, albeit weaker property :

   * for any other unbiased estimator q*, the relative efficiency of q* with respect to qE has a limit no larger than 1 as the sample size n grows without limit.

So qE is indeed the "most efficient estimator", but only in an asymptotic sense. For any given sample size n, there might very well be an unbiased estimator q* more efficient than qE.

Such an estimator is called an efficient estimator.


The question : 
   * "What is the smallest possible variance of an unbiased estimator ?"
or equivalently : 
   * "Is there a lower bound to the variance of an unbiased estimator ?"
is both important and difficult.
Should this Glossary live long enough, it will be addressed in due time.

Minimum mean-square-error

            The practitioner is not particularly keen on unbiasedness. What really matters in practice is that, on the average, the estimate q* be close to the true value q0. So the practitioner will tend to favor estimators such that the mean-square error :

E[(q* - q0)²]

be as low as possible, whether q * is biased or not. Such an estimator is called a minimum mean-square-error estimator.

Given two estimators :

   * q*1, unbiased but with a large variance,
   * q*2, slightly biased but with a small variance,

q*2 might prove a better estimator than q*1 in practice (lower image in the illustration below).

 

 

Yet, identifying minimum mean-square-error estimators is not an easy task, and most commonly encountered estimators are simply unbiased estimators.
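The mean-square error of an estimator equals its variance plus the square of its bias, so a small bias can be worth accepting if it buys a large reduction in variance. The sketch below illustrates this on variance estimators of the form "sum of squared deviations / divisor", with divisors n - 1 (unbiased), n and n + 1 (both biased). The standard normal distribution, the sample size of 5, the trial count and the seed are assumptions.

```python
import random
import statistics

rng = random.Random(5)   # fixed seed for reproducibility
N = 5                    # sample size
TRUE_VAR = 1.0           # variance of the assumed standard normal distribution
TRIALS = 40000

def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean."""
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs)

# Accumulate the squared error of each estimator on the same samples.
sq_errors = {N - 1: 0.0, N: 0.0, N + 1: 0.0}
for _ in range(TRIALS):
    ss = sum_sq_dev([rng.gauss(0.0, 1.0) for _ in range(N)])
    for divisor in sq_errors:
        sq_errors[divisor] += (ss / divisor - TRUE_VAR) ** 2

mse = {divisor: total / TRIALS for divisor, total in sq_errors.items()}
for divisor, value in mse.items():
    print(f"divisor {divisor}: estimated mean-square error = {value:.3f}")
```

Under these assumptions the unbiased estimator (divisor n - 1) has the largest mean-square error of the three.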

Constructing an estimator

            As noted above, there is no way to tell a priori if a particular statistic will be useful as an estimator of a certain parameter. Given a statistic, only its behavior in terms of :

   * consistency,
   * bias,
   * efficiency,
   * mean-square error,

will tell if this statistic is worth being considered for the purpose of estimating the value of this parameter.
So the question of how to construct a statistic that will have some of the desirable properties of a good estimator remains open. We now briefly mention three popular methods used for constructing estimators of a given parameter :
 

The method of moments

                It is the most natural method. Because a large sample is (hoped to be) a faithful image of the unknown distribution D, having this sample is just as good as having the distribution itself. So we proceed as if the empirical distribution function D* was the true distribution. If the parameter q  is defined in terms of a function of D :

q = f(D)

then we use :

q*= f(D*)

as an estimate of q.

This is the "common sense" attitude that makes us use the sample mean as an estimate of the mean of the population without thinking twice about it.

This "plug in" method is called the method of moments. In the early days of Statistics, the method of moments was the only one available, and was mostly used for estimating the moments of a distribution (mean, variance and higher order moments), hence its name.

 

An estimator constructed by the method of moments is generally consistent (although this requires demonstration), but often suffers from substantial bias for small samples.
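As a minimal sketch of the plug-in idea, consider a gamma distribution, whose mean equals shape × scale and whose variance equals shape × scale². Replacing the mean and variance by those of the empirical distribution and solving for the two parameters gives the estimates below. The gamma family, the "true" parameter values, the sample size and the seed are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(6)               # fixed seed for reproducibility
TRUE_SHAPE, TRUE_SCALE = 3.0, 2.0    # assumed parameters, "unknown" to the statistician

def moments_estimate(sample):
    """Method-of-moments estimates (shape, scale) of a gamma distribution.

    Solves  mean = shape * scale  and  variance = shape * scale**2
    with the mean and variance of the empirical distribution plugged in.
    """
    m = statistics.fmean(sample)
    v = statistics.pvariance(sample, mu=m)   # variance of the empirical distribution
    return m * m / v, v / m

# random.gammavariate(alpha, beta) takes the shape alpha and the scale beta.
sample = [rng.gammavariate(TRUE_SHAPE, TRUE_SCALE) for _ in range(20000)]
shape_hat, scale_hat = moments_estimate(sample)

print(f"shape estimate = {shape_hat:.3f}   (true value {TRUE_SHAPE})")
print(f"scale estimate = {scale_hat:.3f}   (true value {TRUE_SCALE})")
```

With a sample this large the estimates land close to the assumed values. Repeating the experiment with a few dozen observations shows much larger departures.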

Method of Maximum Likelihood

            Given a sample and a candidate distribution, the Likelihood is a measure of "how likely" it is that the sample was generated by that particular distribution.
Given a family of distributions (usually summarized by a mathematical expression containing a few numerical parameters), the Method of Maximum Likelihood (ML) selects that particular distribution in the family that makes the Likelihood maximum. The values of the parameters thus obtained are called the Maximum Likelihood estimates of the parameters of the (unknown) underlying distribution.

 

The Method of Maximum Likelihood is established on much more solid theoretical grounds than the Method of Moments. In particular, it can be shown that, under very general conditions, a ML estimator is :

   * consistent,
   * asymptotically normally distributed,
   * asymptotically efficient.


ML estimation is the most widely used method of estimation.
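A minimal sketch of the method, assuming the family of exponential distributions with unknown rate: the log-likelihood of a sample x1, ..., xn is n·log(rate) - rate·(x1 + ... + xn), and setting its derivative to zero gives the ML estimate 1 / (sample mean). The code compares this closed form with a brute-force search over a grid of candidate rates. The true rate, the sample size, the grid and the seed are arbitrary assumptions.

```python
import math
import random
import statistics

rng = random.Random(7)   # fixed seed for reproducibility
TRUE_RATE = 0.5          # assumed rate, "unknown" to the statistician
sample = [rng.expovariate(TRUE_RATE) for _ in range(500)]
n, total = len(sample), sum(sample)

def log_likelihood(rate):
    """Log-likelihood of the sample under an exponential distribution with this rate."""
    return n * math.log(rate) - rate * total

# Closed-form ML estimate: the rate at which the derivative of the log-likelihood vanishes.
rate_closed_form = 1.0 / statistics.fmean(sample)

# Brute-force check: pick the candidate rate with the highest log-likelihood.
grid = [k / 1000 for k in range(1, 3001)]   # 0.001, 0.002, ..., 3.000
rate_grid = max(grid, key=log_likelihood)

print(f"closed-form ML estimate = {rate_closed_form:.4f}")
print(f"grid-search ML estimate = {rate_grid:.3f}")
```

The two values agree to within the grid step. When no closed form exists, a numerical maximization plays the role of the grid search.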

Least Squares estimation

            The usual estimate of the mean of a random variable is the sample mean m, which also has the property of minimizing the sum :

S = Σi (xi - a)²

where a is an adjustable parameter. S is minimal for a = m.
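This minimizing property can be checked with a short derivation. It writes x_i for the n observations and m for their mean (an indexing convention assumed here), and uses the fact that the deviations x_i - m sum to zero:

```latex
\begin{aligned}
S(a) &= \sum_{i=1}^{n} (x_i - a)^2
      = \sum_{i=1}^{n} \bigl[(x_i - m) + (m - a)\bigr]^2 \\
     &= \sum_{i=1}^{n} (x_i - m)^2
        + 2\,(m - a)\sum_{i=1}^{n} (x_i - m)
        + n\,(m - a)^2 \\
     &= \sum_{i=1}^{n} (x_i - m)^2 + n\,(m - a)^2 .
\end{aligned}
```

The first term does not depend on a and the second is never negative, so S is minimal exactly when a = m.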
 

The function y = f(x) is the regression function of y on x if, for any x, f(x) is the mean of the values of y for this value of x. Therefore, regression may be perceived as simultaneously estimating the means of an infinity of random variables, one for each value of x.


Least Squares estimation is an extension of the above mentioned property of the sample mean. The parameters of a regression model y = f(x) are usually calculated by imposing that the sum of the squares of the differences between :

   * the observed values of y,
   * the values of y predicted by the model,

be minimal.

 

Least Squares is the technique used for calculating the models in Simple and Multiple Linear Regression.
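For Simple Linear Regression the minimization has a well-known closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies it to simulated data. The "true" line, the noise level and the seed are assumptions, and the helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(8)   # fixed seed for reproducibility
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0   # assumed model y = 1 + 2x + noise

xs = [i / 10 for i in range(100)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, 0.5) for x in xs]

def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared residuals."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def sum_of_squares(slope, intercept):
    """Sum of squared differences between observed y and the y predicted by the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

slope_hat, intercept_hat = least_squares_line(xs, ys)
print(f"slope = {slope_hat:.3f}   intercept = {intercept_hat:.3f}")
print(f"sum of squares at the solution: {sum_of_squares(slope_hat, intercept_hat):.3f}")
print(f"after nudging the slope:        {sum_of_squares(slope_hat + 0.05, intercept_hat):.3f}")
```

Moving either parameter away from the least-squares solution increases the sum of squares.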

 

For more on Least Squares estimation, please see here.

Interval estimation

        What we have described on this page is known as "point estimation", the reason being that the action of estimating produces just a number, the estimate. The weakness of point estimation is that this estimate comes with no clue about how credible it is.

It is sometimes possible to bring about some additional information concerning this credibility. This is the goal of a different kind of estimation known as "interval estimation".

In a nutshell, given a sample, interval estimation builds a segment such that it is possible to calculate the probability for this segment (known as "confidence interval") to contain q0, the true value of the parameter. For a given probability (known as "confidence level"), the shorter the confidence interval, the better the precision with which q0 has been localized.
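The meaning of the confidence level can be illustrated by simulation in the simplest textbook setting: a normal distribution whose standard deviation is known, with the interval "sample mean ± z·sigma/√n" for the mean. The true mean, the standard deviation, the sample size, the 95 % level and the seed are assumptions. The sketch counts how often the interval built from a fresh sample contains the true value.

```python
import math
import random
import statistics

rng = random.Random(9)   # fixed seed for reproducibility
Q0, SIGMA = 5.0, 2.0     # true mean (to be estimated) and known standard deviation
N = 30                   # sample size
LEVEL = 0.95             # confidence level
TRIALS = 5000

# Standard normal quantile leaving (1 - LEVEL)/2 in each tail, about 1.96 here.
z = statistics.NormalDist().inv_cdf(0.5 + LEVEL / 2)
half_width = z * SIGMA / math.sqrt(N)

covered = 0
for _ in range(TRIALS):
    m = statistics.fmean([rng.gauss(Q0, SIGMA) for _ in range(N)])
    if m - half_width <= Q0 <= m + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"half-width of the interval = {half_width:.3f}")
print(f"fraction of intervals containing the true mean = {coverage:.3f}")
```

The observed fraction is close to the chosen confidence level. A larger n shortens the interval without changing that fraction.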

You may read more about interval estimation here.

____________________________________________________________

 

In summary, the goal of Estimation is to extract complete or partial information about a probability distribution from a sample generated by this distribution. This information is necessarily probabilistic, and is embodied in an estimate (of a parameter of the distribution). An estimator is a statistic whose properties as a random variable let us expect that its value for the sample at hand (the estimate) is close to the true value of the parameter that is being estimated.

A few general techniques are available for constructing useful estimators (moments, maximum likelihood, least-squares).

It is often possible to associate to a point estimate a confidence interval and a confidence level for this interval.

 _______________________________________________________

 

Related readings :

Data Modeling

Parameter of a model

Likelihood

Interval estimation

Monte-Carlo simulation
