Point estimation

Estimation and Tests are the two main branches of what is known as "inferential Statistics".

-----

Point estimation (or often, just "estimation") is one of the central activities of the statistician. Probability distributions are usually known only through random samples, and estimation is the art of extracting valuable information from a sample about the distribution of probability that generated it.

The term "Estimation" covers several different but closely related viewpoints, that we analyze here.

## Probability distribution estimation

Guessing a complete probability distribution is called probability distribution estimation, and more specifically, probability density estimation in the case of continuous distributions.

## Parameter estimation (of a distribution)

The above mentioned task may be made simpler by narrowing down the question. This can be done in two ways :

* One may give up the idea of completely identifying the distribution, and be satisfied with a reasonably accurate guess of some particular aspect of the distribution, like its mean, its variance, its mode, or any other quantity defined from it.

* One may also assume that the distribution belongs to a family of distributions described by a mathematical expression containing one or several numerical parameters. Identifying the distribution then amounts to guessing the values of a few parameters.

This is called "estimating the parameters of a distribution", or simply parameter estimation.

The distinction between "probability distribution estimation" and "parameter estimation" is pointless when the parameters in the mathematical expression of the family are themselves properties of the distribution, as is the case, for example, with the normal distribution (mean and variance).

## Fitting a model

A model, whether predictive or descriptive, may be perceived as the outcome of a particular type of probability distribution estimation. For example, regression assumes that the data was generated by a probability distribution described by the sum of :

* A deterministic part (the true regression line),

* And a random part (the error term).

"Fitting" the regression model then consists in estimating :

* The values of the parameters of the true regression line.

* And the statistical properties of the error terms.

The result (the regression model) may then be regarded as an estimation of the probability distribution that generated the data.

More generally, a model contains parameters, whose values are calculated from the sample. These parameters are therefore random variables that have distributions of their own, and identifying these distributions is an important task in Data Modeling, for analyzing these distributions will tell us how reliable the model is (in a nutshell, models whose parameters are strongly biased (see below) or have broad distributions are unreliable).

This is called estimating the parameters of a model, or "fitting the model to the data".

## Predictions of a model

A model has two main purposes :

* The discovery of valuable information about the probability distribution that generated the data.

* Making predictions about new incoming data. For example, a regression model will be used for predicting the value of the response variable y for any new value of the predictor x. Even descriptive models make predictions : a histogram predicts the value of the density of a distribution where a new observation happens to fall.

Because the model is random, so are its predictions. For the practitioner, it is of utmost importance to be able to figure out how reliable these predictions are. This is achieved by considering a model prediction as a random variable used as an estimator of the quantity to be predicted, and then evaluating the quality (e.g. bias and variance) of this estimator. Note that the quantity to be predicted may itself be a random variable : for example, a regression model will deliver an estimate of the value of the response variable for a new predictor x, but in the real world, the mechanism that generated the data delivers a random value of y for any given x. One then tries to estimate the mean of the true distribution of y.

# Estimators and estimates

Let θ be a parameter of a distribution, whose true (and unknown) value is θ0.

An estimator is a function of the observations in the sample (a "statistic") whose value will be used as a guess of the true but unknown value θ0 of the parameter θ. The value taken by an estimator on a given sample is called an estimate (of θ0).

We will denote:

• θ* an estimator of θ, and
• θ* the value generated by θ* from the sample (the estimate).

So :

θ* = θ*(sample)

Estimates are expected to be close to the true value θ0. But the sample being random, an estimator is a random variable. So it can never be said with certainty that an estimate is close to the true value of a parameter of the distribution (or of the model).

So Estimation theory focuses not on individual estimates, which are meaningless in isolation, but on the properties of estimators considered as random variables, that is, on their probability distributions, or some restricted aspects thereof (mostly mean, variance and asymptotic properties).

We will see that a given parameter of a distribution may have several different estimators to choose from. So the two central questions of Estimation theory are :

• What are the properties that make a statistic a "good" estimator of a parameter ?
• How can we design a good estimator for a given parameter ?

# Desirable properties of a "good" estimator

Nothing in the structure of the equation defining a statistic tells whether it is, or is not, an estimator of anything. Strictly speaking, there is no such thing as a definition of an estimator. But a particular statistic will be used to estimate a parameter if it possesses certain desirable properties, that we briefly review here.

## Consistency

A central idea of Statistics is that very large samples are (unless you're terribly unlucky) a reasonably faithful image of the distribution itself. In other words, the empirical distribution function is hoped to be pretty close to the real distribution function for large samples. For example, in the case of a continuous distribution, this means that there should be many observations in regions where the probability density is high, and few where the probability density is low (see the Fundamental Theorem of Statistics).

So you would certainly expect from a "good" estimator θ* of a parameter θ that it produces estimates θ* that are closer and closer to the true value θ0 as you consider larger and larger samples.
Yet, again, because an estimator is a random variable, this convergence of the estimates towards θ0 as the sample size n grows without limit cannot be expected to happen in a deterministic way. It can happen only in a probabilistic way, and we will have to be satisfied with the following weaker property :

• Let θ be the parameter to be estimated, and let θ0 be its real value. Then :
• however close to 1 the arbitrary probability P, and
• however small the arbitrary positive number ε,

all you have to do is consider large enough samples for the probability of the estimate θ* falling within a bracket of width ε centered on θ0 to be larger than P.
In other words, beyond a certain sample size, at least (100 × P) % of the estimates will fall inside this narrow bracket around θ0.

A statistic with such a property is called a consistent estimator of the parameter θ.

Consistency is the least you can demand from a statistic to qualify as an estimator.

Except for some rare and pathological exceptions, an estimator's distribution (like that of any other non trivial statistic) becomes narrower and narrower, and more and more normal-like, as larger and larger samples are considered. If we take for granted that the variance of the estimator tends to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to θ0 as the sample size grows without limit, as shown in the upper and lower images below :

1) In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converge in probability to θ0.
2) The Weak Law of Large Numbers is an example of identification of a consistent estimator.
3) For an exception to the rule of the "narrowing of the distribution of a statistic for larger and larger samples", see the Cauchy distribution, where it is shown that, despite the symmetry of the distribution, the sample mean is not a consistent estimator of the distribution median (the distribution has no mean). The sample median, though, is a consistent estimator of the distribution median.
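As an illustration of the defining property of consistency, here is a minimal simulation sketch in Python, assuming (arbitrarily) an exponential distribution with true mean θ0 = 2 : it estimates, for increasing sample sizes, the probability that the sample mean falls within a bracket of half-width ε = 0.1 around θ0.

```python
import random

random.seed(0)

theta0 = 2.0      # true mean of the (hypothetical) exponential distribution
eps = 0.1         # half-width of the bracket around theta0
n_trials = 1000   # number of independent samples per sample size

def coverage(n):
    """Fraction of samples whose sample mean lands within eps of theta0."""
    hits = 0
    for _ in range(n_trials):
        sample = [random.expovariate(1 / theta0) for _ in range(n)]
        estimate = sum(sample) / n   # the sample mean, used as estimator
        if abs(estimate - theta0) < eps:
            hits += 1
    return hits / n_trials

for n in (10, 100, 1000):
    print(n, coverage(n))   # this probability grows towards 1 with n
```

For a consistent estimator, this empirical probability climbs towards 1 as n grows, whatever the choice of ε.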

## Unbiasedness

Consistency is an asymptotic property : defining consistency requires considering arbitrarily large samples. In real life, sample size will be limited by time or budget constraints. So it is natural to consider what quality should be expected from an estimator based on samples of a fixed size n.
Then you would certainly hope that the central region of the distribution of the estimator is close to the true value θ0 of the parameter. One way of expressing this idea is to consider estimators whose distribution mean is equal to θ0, the true value of the parameter θ, for any value of n. Such an estimator is said to be unbiased, and unbiasedness translates into :

E[θ*]n = θ0    for any n

This definition is somewhat arbitrary, as the mean has no magic virtue other than its mathematical convenience. Any other measure of central tendency, like the median or the mode, would have provided other adequate embodiments of the idea of "lack of bias", except for the fact that further calculations might then prove intractable.

For a given n, an estimator whose expectation is not equal to θ0 is said to be biased. For example :

• The sample variance is a biased estimator of the distribution variance.
• In a binormal distribution, the empirical correlation coefficient is a biased estimator of the coefficient of correlation of the distribution.

Yet, these two estimators are consistent, as their means tend to the true values (of the variance and of the correlation coefficient, respectively) as larger and larger samples are considered. So a consistent estimator may be biased for all values of n.
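The bias of the sample variance (division by n) can be seen in a small simulation. The Python sketch below assumes a normal distribution with variance 4 and samples of size 5 (both values arbitrary) :

```python
import random

random.seed(0)

mu, sigma2 = 0.0, 4.0    # hypothetical normal distribution, variance 4
n, n_trials = 5, 20000

biased = unbiased = 0.0
for _ in range(n_trials):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    biased += ss / n          # sample variance : divides by n
    unbiased += ss / (n - 1)  # corrected estimator : divides by n - 1

print(biased / n_trials)      # close to 4 * (n - 1) / n = 3.2, not 4
print(unbiased / n_trials)    # close to the true variance, 4
```

Dividing by n − 1 instead of n removes the bias, which is why this corrected form is the usual estimator of the distribution variance.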

As we will see, bias does not necessarily make an estimator useless, even for small samples.

## Efficiency

A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both :

* The sample mean

and

* The sample median

are unbiased estimators of the distribution mean (when it exists). Which one should we choose ?

Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value θ0 than the estimates generated by the other one. One way to do that is to select the estimator with the lower variance.

This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators θ*1 and θ*2 of the same parameter θ, one defines the efficiency of θ*2 with respect to θ*1 (for a given sample size n) as the ratio of their variances :

Relative efficiency (θ*2 with respect to θ*1)n = Var(θ*1)n / Var(θ*2)n
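As an illustration, the relative efficiency of the sample median with respect to the sample mean can be estimated by simulation. The Python sketch below assumes standard normal samples of size 101 (an arbitrary choice); the known asymptotic value of this efficiency for the normal distribution is 2/π ≈ 0.64 :

```python
import random
import statistics

random.seed(0)

n, n_trials = 101, 5000
means, medians = [], []
for _ in range(n_trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    means.append(sum(x) / n)              # first unbiased estimator of the mean
    medians.append(statistics.median(x))  # second unbiased estimator (by symmetry)

# Efficiency of the median with respect to the mean : Var(mean) / Var(median)
print(statistics.pvariance(means) / statistics.pvariance(medians))
```

The ratio comes out well below 1 : for normal data, the sample mean is the more efficient of the two estimators.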

One might then wonder if, for a given parameter θ, there exists an unbiased estimator which is more efficient than any other unbiased estimator for all sample sizes. The answer is "In general, no". But it is sometimes possible to identify an unbiased estimator θE with a similar, albeit weaker, property :

• For any other unbiased estimator θ* of θ, θE will eventually be more efficient than θ* as larger and larger samples are considered.

So θE is indeed the "most efficient estimator", but only in an asymptotic sense. For any given sample size n, there might very well be an unbiased estimator θ* more efficient than θE.

Such an estimator θE is called an efficient estimator.

The question :
* "What is the smallest possible variance of an unbiased estimator ?"
or equivalently :
* "Is there a lower bound to the variance of an unbiased estimator ?"
is both important and difficult.
It is addressed in the entry about the Cramér-Rao inequality.

## Minimum mean-square-error

The practitioner is not particularly keen on unbiasedness per se. What is really important to him is that, on average, the estimate θ* be close to the true value θ0. So he will tend to favor estimators such that the mean-square error :

E[(θ* - θ0)²]

be as low as possible, whether θ* is biased or not. Such an estimator is called a minimum mean-square-error estimator.

Given two estimators :

• θ*1 that is unbiased, but with a large variance,
• θ*2 that is somewhat biased, but with a small variance,

θ*2 might prove a better estimator than θ*1 in practice (lower image in the illustration below).

Yet, identifying minimum mean-square-error estimators is not an easy task, and most commonly encountered estimators are simply unbiased estimators.
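As an illustration of this trade-off, for normal samples the estimator of the variance that divides the sum of squared deviations by n + 1 is biased, but has a smaller mean-square error than the unbiased estimator dividing by n − 1. A Python simulation sketch, with an arbitrary sample size of 5 :

```python
import random

random.seed(0)

sigma2 = 1.0               # true variance of the hypothetical normal distribution
n, n_trials = 5, 30000

mse = {n - 1: 0.0, n + 1: 0.0}   # divisor -> accumulated squared error
for _ in range(n_trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    for d in mse:
        mse[d] += (ss / d - sigma2) ** 2

for d, total in mse.items():
    print(d, total / n_trials)   # the biased divisor n + 1 gives the lower MSE
```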

# Devising an estimator

As noted above, there is no way to tell a priori whether a particular statistic will be useful as an estimator of a certain parameter. Given a statistic, only its behavior in terms of :

* Bias with respect to the true value of the parameter,

* Narrowing of its distribution for large samples,

will tell whether this statistic is worth considering for the purpose of estimating the value of this parameter.

So the question of how to devise a statistic that will have some of the desirable properties of a good estimator remains open. We now briefly mention three popular methods used for constructing estimators of a given parameter :

## The method of moments

It is the most natural method. Because a large sample is (hoped to be) a faithful image of the unknown distribution D, having this sample is just as good as having the distribution itself. So we proceed as if the empirical distribution function D* were the true distribution. If the parameter θ is defined in terms of a function of D :

θ = f(D)

then we use :

θ*= f(D*)

as an estimate of θ.

This is the "common sense" attitude that makes us use the sample mean as an estimate of the mean of the population without thinking twice about it.

This "plug in" method is called the method of moments. In the early days of Statistics, the method of moments was the only one available, and was mostly used for estimating the moments of a distribution (mean, variance and higher order moments), hence its name.

An estimator constructed by the method of moments is generally consistent (although this still has to be demonstrated in each case), but often suffers from severe bias for small samples.
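A minimal Python sketch of the method of moments, assuming (arbitrarily) an exponential distribution with true rate λ0 = 0.5 : since the mean of this distribution is 1/λ, equating it to the sample mean and solving for λ gives the estimate.

```python
import random

random.seed(0)

lam0 = 0.5   # true rate of the hypothetical exponential distribution
sample = [random.expovariate(lam0) for _ in range(10000)]

m = sum(sample) / len(sample)   # empirical first moment (sample mean)
lam_mm = 1 / m                  # method-of-moments estimate : solves mean = 1/lam
print(lam_mm)                   # close to lam0 = 0.5 for a large sample
```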

## Method of Maximum Likelihood

Given a sample and a candidate distribution, the Likelihood is a measure of "how likely" it is that the sample was generated by that particular distribution.
Given a family of distributions (usually summarized by a mathematical expression containing a few numerical parameters), the Method of Maximum Likelihood (ML) selects that particular distribution in the family that makes the Likelihood largest. The values of the parameters thus obtained are called the Maximum Likelihood estimates of the parameters of the (unknown) underlying distribution.

The Method of Maximum Likelihood is established on much more solid theoretical grounds than the Method of Moments. In particular, it can be shown that, under very general conditions, a ML estimator is :

• Consistent,
• Asymptotically normal,
• Efficient (see above).

ML estimation is the most widely used method of estimation.
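A minimal Python sketch of ML estimation, assuming an exponential sample with true rate λ0 = 2 (an arbitrary choice) : the log-likelihood is maximized over a grid of candidate rates, and the result is compared with the closed-form MLE, 1 / sample mean.

```python
import math
import random

random.seed(0)

lam0 = 2.0
x = [random.expovariate(lam0) for _ in range(2000)]
n, sx = len(x), sum(x)

def log_likelihood(lam):
    # log of  prod_i lam * exp(-lam * x_i)  =  n log(lam) - lam * sum(x_i)
    return n * math.log(lam) - lam * sx

# Crude grid search, a numerical stand-in for setting the derivative to zero
grid = [0.01 * k for k in range(1, 1000)]
lam_ml = max(grid, key=log_likelihood)

print(lam_ml)    # grid maximizer of the likelihood
print(n / sx)    # closed-form MLE : the two agree to within the grid step
```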

## Least Squares estimation

The best estimate of the mean of a random variable is the sample mean m, which also has the property of minimizing the sum :

S = Σi (xi - a)²

where a is an adjustable parameter : S is minimal for a = m.
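This property of the sample mean is easy to check numerically; a quick Python sketch with arbitrary data :

```python
xs = [1.0, 4.0, 2.5, 7.0]        # arbitrary observations
m = sum(xs) / len(xs)            # the sample mean

def S(a):
    """Sum of squared deviations from an adjustable value a."""
    return sum((x - a) ** 2 for x in xs)

# S is smallest at a = m : every nearby candidate gives a larger sum
assert all(S(m) <= S(m + da) for da in (-1, -0.1, 0.1, 1))
```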

The function y = f(x) is the regression function of y on x if, for any x, f(x) is the mean of the values of y for this value of x. Therefore, regression may be perceived as simultaneously estimating the means of an infinity of random variables, one for each value of x.

Least Squares estimation is an extension of the above mentioned property of the sample mean. The parameters of a regression model y = f(x) are usually calculated by imposing that the sum of the squares of the differences between :

•  the model predictions f(x),
• and the measured values of the response variable (the y values in the data table),

be minimal.

Least Squares is the technique used for calculating the models in Simple and Multiple Linear Regression.
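A minimal Python sketch of Least Squares fitting for simple linear regression, with hypothetical data generated from the line y = 3 + 2x plus noise; the closed-form formulas used below are the standard solutions of the minimization problem :

```python
import random

random.seed(0)

# Hypothetical data : y = 3 + 2x + gaussian noise
xs = [i / 10 for i in range(100)]
ys = [3 + 2 * x + random.gauss(0, 0.5) for x in xs]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least Squares estimates : minimize the sum of squared residuals (y - a - b*x)^2
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(a, b)   # close to the true intercept 3 and slope 2
```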

For more on Least Squares estimation, please see here.

# Interval estimation

What we have described on this page is known as "point estimation", because the act of estimating produces just a number, the estimate. The weakness of point estimation is that this estimate comes with no clue about how credible it is.

It is sometimes possible to provide some additional information concerning this credibility. This is the goal of a different kind of estimation known as "interval estimation".

In a nutshell, given a sample, interval estimation builds a segment such that it is possible to calculate the probability for this segment (known as a "confidence interval") to contain θ0, the true value of the parameter. For a given probability (known as the "confidence level"), the shorter the confidence interval, the better the precision with which θ0 has been localized.
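As a sketch of what interval estimation delivers, the Python simulation below builds the usual normal-approximation confidence interval for a mean (true value θ0 = 5, sample size 50, both arbitrary) and checks that its empirical coverage is close to the announced 95 % confidence level :

```python
import random

random.seed(0)

theta0 = 5.0          # true mean of the hypothetical distribution
n, n_trials = 50, 2000
z = 1.96              # standard normal quantile for a 95 % confidence level

covered = 0
for _ in range(n_trials):
    x = [random.gauss(theta0, 2.0) for _ in range(n)]
    m = sum(x) / n
    s = (sum((xi - m) ** 2 for xi in x) / (n - 1)) ** 0.5
    half = z * s / n ** 0.5           # half-width of the confidence interval
    if m - half <= theta0 <= m + half:
        covered += 1

print(covered / n_trials)   # empirical coverage, close to 0.95
```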

You may read more about interval estimation here.

____________________________________________________________

In summary, the goal of Estimation is to extract complete or partial information about a probability distribution from a sample generated by this distribution. This information is necessarily probabilistic, and is embodied in an estimate (of a parameter of the distribution). An estimator is a statistic whose properties as a random variable let us expect that its value for the sample at hand (the estimate) is close to the true value of the parameter that is being estimated.

A few general techniques are available for constructing useful estimators (moments, maximum likelihood, least-squares).

It is often possible to associate to a point estimate a confidence interval and a confidence level for this interval.

_______________________________________________________

Related readings :

* Data Modeling
* Parameter of a model
* Likelihood
* Interval estimation
* Monte-Carlo simulation