Exponential family

Many classical distributions p(x; θ), whether discrete or continuous, have probability density functions (or probability mass functions) that share a common mathematical structure, described below and known collectively as the exponential families.

A distribution (or rather, a family of distributions) whose pdf or pmf can be written in the "exponential family" form is guaranteed to have "good" properties :

* There exists a complete (and therefore minimal sufficient) statistic for the parameter θ (whether scalar or vector), and this statistic is easily identified,

* There exists a function of the parameter θ that can be efficiently estimated, and this function as well as the efficient estimator are also easily identified (recall that a probability distribution has at most one function of the parameter θ that can be efficiently estimated).

So the set of distributions belonging to exponential families may be regarded as the "core" of the set of all probability distributions, where the most "regular" distributions are to be found. Most of the usual probability distributions (with the exception of the uniform distribution) belong to some exponential family.

Definition of an exponential family

General (or "standard") form

Let p(x; θ) be a family of distributions, where θ is assumed to be scalar for the time being. This family is said to be an exponential family if it can be written as :

 p(x; θ) = exp[A(x)B(θ) + C(x) + D(θ)]

* The distribution can be continuous or discrete.

* The range of x over which p(x; θ) is not equal to 0 (the support of the distribution) must not depend on θ. In particular, any family of uniform distributions is excluded from this definition.

Consider for example the Bernoulli distribution b(π). Its probability mass function is

p(x; π) = π^x (1 - π)^(1 - x)            x = 0, 1.

which can be written as

p(x; π) = exp[x.logπ + (1 - x).log(1 - π)] = exp[x.logπ + log(1 - π) - x.log(1 - π)]

or

p(x; π) = exp[x.log(π/(1 - π)) + 0 + log(1 - π)]

By identification, we have

* A(x) = x

* B(π) = log(π/(1 - π))

* C(x) = 0

* D(π) = log(1 - π)

so the Bernoulli distributions are an exponential family.
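This identification can be checked numerically. The following Python sketch (illustrative code with helper names of our own choosing, not part of the original text) compares the direct pmf with its exponential-family form:

```python
import math

def bernoulli_pmf(x, pi):
    """Direct Bernoulli pmf: pi^x * (1 - pi)^(1 - x)."""
    return pi ** x * (1 - pi) ** (1 - x)

def bernoulli_exp_form(x, pi):
    """Same pmf written as exp[A(x)B(pi) + C(x) + D(pi)]."""
    A = x                           # A(x) = x
    B = math.log(pi / (1 - pi))     # B(pi) = log(pi/(1 - pi))
    C = 0.0                         # C(x) = 0
    D = math.log(1 - pi)            # D(pi) = log(1 - pi)
    return math.exp(A * B + C + D)

# The two expressions agree for every x in {0, 1} and every pi in (0, 1)
for pi in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_pmf(x, pi) - bernoulli_exp_form(x, pi)) < 1e-12
```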

Canonical form

In practice, an exponential family is almost never described by its general form. It is customary (and convenient) to consider that the parameter is not θ, but rather B(θ). The new parameter η = B(θ) is then called the natural parameter (or sometimes "canonical parameter"). We then have

p(x; η) = exp[A(x).η + C(x) + D(η)]

where η is a function of θ. Of course, the function D(.) is not the same as in the general form.

For the Bernoulli distribution, define

η = log(π/(1 - π))

Then π = e^η/(1 + e^η) and log(1 - π) = - log(1 + e^η). The pmf is now written in the canonical form as

p(x; η) = exp[x.η + 0 - log(1 + e^η)]

and now D(η) = - log(1 + e^η).
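As a numerical illustration (a Python sketch, with function names of our own choosing), the canonical form can be checked to define the same, properly normalized, distribution:

```python
import math

def bernoulli_canonical(x, eta):
    """Bernoulli pmf in canonical form: exp[x.eta - log(1 + e^eta)]."""
    return math.exp(x * eta - math.log(1 + math.exp(eta)))

pi = 0.3
eta = math.log(pi / (1 - pi))                     # natural parameter

# eta -> pi is inverted by the logistic function
assert abs(math.exp(eta) / (1 + math.exp(eta)) - pi) < 1e-12
# The canonical form is normalized and reproduces the original pmf
assert abs(bernoulli_canonical(0, eta) + bernoulli_canonical(1, eta) - 1.0) < 1e-12
assert abs(bernoulli_canonical(1, eta) - pi) < 1e-12
```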

-----

Depending on authors and contexts, the canonical form may be written in various equivalent forms, the most common ones being

p(x; η) = g(η).exp[A(x).η].h(x)

and

p(x; η) = exp[A(x).η - B(η) ].h(x)

Origins of the exponential family

Many classical distributions can indeed be written under the general exponential form, but this is not enough to justify the importance of this somewhat artificial expression. In fact, the concept of  "exponential family" has two different but converging origins that we now describe (for a scalar parameter).

Sufficient statistic and exponential family

We previously established characterizations of a sufficient statistic that relied on the properties of the sample likelihood, or on its relationship with the probability distribution p(x; θ), but not on p(x; θ) itself. Yet it seems clear that the existence of a sufficient statistic for θ should somehow constrain the form of p(x; θ).

This is indeed true, and we'll show that a necessary and sufficient condition for the existence of a sufficient statistic for θ is that the family p(x; θ) can be written as above, and is therefore an exponential family.

In fact, we'll do more than that, and identify a particular sufficient statistic for θ. We'll show that

 T = Σi A(xi)

called the canonical statistic of the family :

* Is sufficient for θ in the general form,

* Is sufficient for η in the canonical form.

-----

In addition, it can be shown that T is not just sufficient, but is minimally sufficient (difficult) and complete (very difficult).
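Sufficiency of T can be illustrated numerically (an illustrative Python sketch, again on the Bernoulli family): two samples sharing the same value of T = Σi A(xi) have likelihoods whose ratio does not depend on the parameter. For the Bernoulli, where C(x) = 0, the two likelihoods are even identical:

```python
import math

def loglik(sample, pi):
    """Bernoulli log-likelihood of an i.i.d. sample."""
    return sum(x * math.log(pi) + (1 - x) * math.log(1 - pi) for x in sample)

# Two different samples sharing the same canonical statistic T = sum of the x_i
s1 = [1, 1, 0, 0, 0]
s2 = [0, 0, 1, 0, 1]
assert sum(s1) == sum(s2)

# For every value of pi, the likelihoods coincide: the data enter only through T
for pi in (0.1, 0.4, 0.8):
    assert abs(loglik(s1, pi) - loglik(s2, pi)) < 1e-12
```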

Cramér-Rao lower bound and exponential family

Recall that the Cramér-Rao inequality establishes a lower bound on the variance of an unbiased estimator of a function g(θ) of the parameter θ, but says nothing about whether or not there exists an estimator whose variance is indeed equal to this bound (efficient estimator).

We'll show that a necessary and sufficient condition for the existence of a function g(θ) that can be efficiently estimated is that the family p(x; θ) be an exponential family. For the general form

p(x; θ) = exp[A(x)B(θ) + C(x) + D(θ)]

this unique function is

 g(θ) = - D'(θ)/B'(θ)

where " ' " denotes differentiation with respect to θ.

If g(.) is the identity function, θ itself admits an efficient estimator; if g(.) is any other function, then θ has no efficient estimator.

-----

In addition, we'll show that the efficient estimator of g(θ) is 1/n.Σi A(xi) = 1/n.T, where T is the canonical statistic that we stated to be sufficient for θ in the preceding paragraph.

This result requires regularity conditions on p(x; θ) that are stronger than those needed for establishing the Cramér-Rao lower bound. We won't state these difficult regularity conditions, and only mention that if they are relaxed, there exist "exotic" distributions that do not belong to the exponential family and yet have efficient estimators for some g(θ).
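A small Monte Carlo sketch (Python, with arbitrary illustrative settings) makes this concrete for the Bernoulli family, where g(π) = π and the Cramér-Rao bound for unbiased estimators of π is π(1 - π)/n:

```python
import random

random.seed(0)
pi, n, reps = 0.3, 50, 20000

# The efficient estimator (1/n).T = sample mean, replicated many times
estimates = []
for _ in range(reps):
    sample = [1 if random.random() < pi else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)

mean = sum(estimates) / reps
var = sum((e - mean) ** 2 for e in estimates) / reps
bound = pi * (1 - pi) / n                 # Cramér-Rao lower bound for pi

assert abs(mean - pi) < 0.01              # unbiased for g(pi) = pi
assert abs(var - bound) < 0.0005          # variance attains the bound
```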

________________________

So we see that under some regularity conditions that are satisfied by all usual distributions (with the exception of the uniform distribution), we have the following equivalence scheme :

 Sufficient statistic for θ  ⇔  EXPONENTIAL FAMILY  ⇔  Efficient estimation of some g(θ)

We can therefore assert that given a family of distributions p(x; θ), there exists a function g(θ) of the parameter θ that can be efficiently estimated if and only if there exists a sufficient statistic for θ. In addition, we can identify this function as well as its efficient estimator by writing the family in its exponential form.

Vector parameter

Many classical distributions do not depend just on a single scalar parameter, but on several scalar parameters (i.e. a vector parameter). For example, the normal distribution N(µ, σ²) depends on the vector parameter θ = (µ, σ²).

By definition, the canonical form then becomes

 p(x; η) = exp[Σi Ai(x).ηi + C(x) + D(η)]

The (multidimensional) canonical statistic is now

T = {Σj A1(xj), Σj A2(xj), ..., Σj Ak(xj)}

where k is the dimension of the vector parameter η = {η1, η2 , ..., ηk}.

For example, the canonical representation of the N(µ, σ²) family is

 p(x; η) = exp[η1.x + η2.x² - µ²/(2σ²) - log(σ√(2π))]

whose natural parameter is η = {η1, η2} with η1 = µ/σ² and η2 = -1/(2σ²).
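This identification can again be verified numerically (a Python sketch; the function names are ours):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Direct N(mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_canonical(x, mu, sigma2):
    """Same density written as exp[eta1.x + eta2.x^2 + D]."""
    eta1 = mu / sigma2
    eta2 = -1 / (2 * sigma2)
    D = -mu ** 2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    return math.exp(eta1 * x + eta2 * x ** 2 + D)

for x in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(x, 1.0, 4.0) - normal_canonical(x, 1.0, 4.0)) < 1e-12
```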

Natural parameter space

All distributions in a given exponential family can be written under the canonical form

p(x; η) = exp[Σi Ai(x).ηi + C(x) + D(η)]

Yet, not all the values of η = {η1, η2 , ..., ηk} plugged into the above expression define a probability distribution. This will be the case only if the integral over x of p(x; η) (or the sum in the discrete case) is equal to 1 (normalization).

It can be shown that the set of values of η = {η1, η2, ..., ηk} for which p(x; η) = exp[Σi Ai(x).ηi + C(x) + D(η)] is indeed a probability distribution is a convex set : if η and η' are two admissible values of the natural parameter, then for all λ with 0 < λ < 1 :

η'' = λη + (1 - λ)η'

is also an admissible value of the parameter.
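A standard one-line argument for this convexity (sketched here, not proved in this text) uses Hölder's inequality with exponents 1/λ and 1/(1 - λ): if the normalizing integral is finite at η and at η', it is finite at η'' as well.

```latex
\int e^{\langle A(x),\,\lambda\eta+(1-\lambda)\eta'\rangle + C(x)}\,dx
  = \int \Bigl(e^{\langle A(x),\,\eta\rangle + C(x)}\Bigr)^{\lambda}
         \Bigl(e^{\langle A(x),\,\eta'\rangle + C(x)}\Bigr)^{1-\lambda}\,dx
  \;\le\; \Bigl(\int e^{\langle A(x),\,\eta\rangle + C(x)}\,dx\Bigr)^{\lambda}
          \Bigl(\int e^{\langle A(x),\,\eta'\rangle + C(x)}\,dx\Bigr)^{1-\lambda}
  \;<\; \infty .
```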

Regular family

Non identifiability of the parameters, minimal representation

Because of the linear nature of the expression Σi Ai(x).ηi, two kinds of difficulties may be encountered :

1) The Ai(x) may be linked by one or several linear relations. Then several values of the parameter η define the same distribution p(x; η). The parameter then loses any statistical interpretation, and is said to be unidentifiable. This situation is similar to that met in Multiple Linear Regression when several regressors are linked by a linear relation : the regression coefficients cannot be identified (collinearity).

It is then appropriate to transform the k expressions Ai(x) into k' < k new linearly independent expressions A'i(x). In the process, a new and identifiable natural parameter η' is created, whose dimension is k'.

2) The ηi may be linked by one or several linear relations, and η is then again unidentifiable. The natural parameter space is then a convex set lying in a linear k'-dimensional subspace of the complete space. This situation is similar to that of a degenerate multivariate normal distribution whose covariance matrix is not strictly positive definite, and which consequently lies in a linear subspace of the complete space.

Again, it is then appropriate to calculate k' independent linear combinations of the ηis (which, of course, defines k' new A'i(x)s).

-----

When both the Ai(x)s and the ηis are linearly independent, the representation of the exponential family is said to be minimal.

Curved family

Even when the components of η are linearly independent, there may exist non linear relations between these components. A classical example is the family of normal distributions N(µ, µ²), whose variance increases as the square of the mean. Clearly this exponential family (show that it is exponential!) depends on one parameter only, and the family is then said to be curved. The parameter space of this example is the parabola σ² = µ², whose intrinsic dimension is 1 in the 2D plane spanned by (µ, σ²).
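As a sketch of the suggested exercise (assuming µ > 0), expanding the exponent of the N(µ, µ²) density exhibits the exponential form; the two natural components are tied by the nonlinear constraint η2 = -η1²/2:

```latex
p(x;\mu) = \frac{1}{\mu\sqrt{2\pi}}\,
           \exp\!\Bigl[-\frac{(x-\mu)^2}{2\mu^2}\Bigr]
         = \exp\!\Bigl[\underbrace{\tfrac{1}{\mu}}_{\eta_1}\,x
                       + \underbrace{\bigl(-\tfrac{1}{2\mu^2}\bigr)}_{\eta_2}\,x^2
                       - \tfrac{1}{2} - \log\mu - \tfrac{1}{2}\log 2\pi\Bigr],
\qquad \eta_2 = -\frac{\eta_1^2}{2}.
```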

This kind of situation can be avoided by imposing that the natural parameter space contain a hypercube whose dimension is equal to (i.e. not less than) that of the parameter η.

Regular family

If an exponential family admits a minimal representation, and if the natural parameter space contains a hypercube whose dimension is equal to that of the parameter η, the family is said to be regular.

It can then be shown that the k-dimensional statistic

T = {Σj A1(xj), Σj A2(xj), ..., Σj Ak(xj)}

is not just minimally sufficient but also complete for η (whereas completeness is not guaranteed for curved families).

This result is difficult and is not demonstrated in this Glossary.

_________________

Consider for example the normal distribution N(µ, σ²). Clearly :

1) There is no linear relation between η1 = µ/σ² and η2 = -1/(2σ²).

2) There is no linear relation between A1(x) = x and A2(x) = x².

Besides, the natural parameter space is (-∞, +∞)×(-∞, 0), which contains a 2D square.

The N(µ, σ²) family is therefore regular, and we can assert that (Σi Xi, Σi Xi²) is complete for (µ/σ², -1/(2σ²)), and therefore for (µ, σ²), which is in one-to-one correspondence with (µ/σ², -1/(2σ²)).
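The one-to-one correspondence invoked in the last step is easy to check (an illustrative Python sketch):

```python
def to_natural(mu, sigma2):
    """(mu, sigma^2) -> natural parameters (eta1, eta2)."""
    return mu / sigma2, -1 / (2 * sigma2)

def from_natural(eta1, eta2):
    """Inverse map: sigma^2 = -1/(2.eta2), then mu = eta1.sigma^2."""
    sigma2 = -1 / (2 * eta2)
    return eta1 * sigma2, sigma2

mu, sigma2 = 1.5, 4.0
eta1, eta2 = to_natural(mu, sigma2)
mu_back, sigma2_back = from_natural(eta1, eta2)
assert abs(mu_back - mu) < 1e-12 and abs(sigma2_back - sigma2) < 1e-12
```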

Natural exponential family

Definition of the "natural exponential family"

In practice, many classical distributions can be cast into a restrictive form that defines a sub-class of the canonical exponential family, and which is known as the natural exponential family. This form is :

 p(x; θ, Φ) = exp{(xθ - b(θ))/Φ + c(x, Φ)}

* We'll see that the parameter θ of the natural form is not the same as the parameter θ of the canonical form.

* The new term Φ is usually known. We'll see that it contributes to the spread of the distribution. For this reason, it is called the dispersion parameter of the distribution.

* c(x, Φ) is a "catch all" term that does not play any important role. It is only required not to depend on θ.

Mean and variance of the natural exponential family

Because the natural exponential family is simpler than the general exponential family, the mean and the variance of one of its distributions can be expressed by remarkably simple functions of the components of the natural form.

* Mean :

We'll show that :

 µ = b'(θ)

where " ' " denotes the differentiation with respect to θ.

The definition of the natural exponential family demands that b'(θ) be one-to-one. The parameter θ is then a function of the mean µ :

θ = (b')⁻¹(µ) = τ(µ)

and the distribution can then be expressed using µ as the parameter : it is then said to be expressed in the mean value parametrisation.
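The standard route to the identity µ = b'(θ) (sketched here; the detailed proof belongs to the Tutorial) is to differentiate the normalization condition with respect to θ, assuming differentiation under the integral sign is permitted:

```latex
\int \exp\!\Bigl\{\frac{x\theta - b(\theta)}{\Phi} + c(x,\Phi)\Bigr\}\,dx = 1
\;\Longrightarrow\;
\int \frac{x - b'(\theta)}{\Phi}\;p(x;\theta,\Phi)\,dx = 0
\;\Longrightarrow\;
E[X] = b'(\theta).
```

A second differentiation of the same identity leads to the variance formula σ² = Φ.b''(θ).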

* Variance :

We'll show that :

 σ² = Φ.b''(θ)

This last expression justifies the name "dispersion parameter" given to Φ.

Variance function

The above expressions show that, within a natural exponential family, the variance is a function of the mean :

 σ² = Φ.b''(τ(µ)) = V(µ)

The function V is called the variance function of the distribution.

It can be shown that under some regularity conditions, the variance function fully characterizes the distribution once this distribution is known to belong to the natural exponential family.
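The variance function can be recovered numerically from b alone (a Python sketch using finite differences, with Φ = 1; the cumulant functions b used here are the standard ones for these two families):

```python
import math

def num_deriv(f, t, h=1e-4):
    """Central first difference."""
    return (f(t + h) - f(t - h)) / (2 * h)

def num_deriv2(f, t, h=1e-4):
    """Central second difference."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

# (cumulant function b, expected variance function V) for two natural families
cases = {
    "Bernoulli": (lambda t: math.log(1 + math.exp(t)), lambda m: m * (1 - m)),
    "Poisson":   (lambda t: math.exp(t),               lambda m: m),
}
for name, (b, V) in cases.items():
    theta = 0.4                       # arbitrary natural-parameter value
    mu = num_deriv(b, theta)          # mu = b'(theta)
    var = num_deriv2(b, theta)        # sigma^2 = b''(theta), Phi = 1
    assert abs(var - V(mu)) < 1e-6    # sigma^2 = V(mu)
```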

__________________________________________________

 Tutorial 1

In this Tutorial, we describe the two origins of the exponential family.

* Sufficient statistic

- A necessary and sufficient condition for p(x; θ) to have a sufficient statistic for θ is that it belongs to the exponential family.

- We then identify a particular sufficient statistic for θ.

* Cramér-Rao lower bound

- With some reservations, a necessary and sufficient condition for the existence of a function g(θ) that can be efficiently estimated is that p(x; θ) belongs to the exponential family.

- We'll identify g(θ) as well as its efficient estimator. If g(θ) is not the identity function, then θ has no efficient estimator.

____________

* Natural exponential family

- We then show that the mean and the variance of a distribution belonging to the natural exponential family can be calculated by indirect but simple and elegant methods.


 Exponential family and Sufficient statistic
   - The condition is necessary
   - The condition is sufficient
   - A particular sufficient statistic

 Exponential family and the Cramér-Rao lower bound
   - The condition is necessary
   - The condition is sufficient
   - Score and exponential family
   - Score and efficient estimation
   - The efficient estimator and the estimated quantity

 The natural exponential family
   - Mean
   - Variance
   - Variance function

________________________________________________________

 Tutorial 2

We then review some classical distributions and show that they belong to exponential families. For each of these families, we'll identify :

* The general form, from which we'll deduce :

- A complete statistic for the parameter (scalar or vector),

- A function g(θ) that is efficiently estimated by this statistic (scalar case only). This function is often easily interpreted, but we'll see that this is not always the case. For example, consider the Gamma distribution Γ(α, β). When the dispersion parameter β is kept constant, the one-parameter family Γ(α, β) is an exponential family whose efficiently estimated function g(α) seems to be out of reach of interpretation. The same can be said of the Beta distribution Beta(α, β) when either one of its two parameters is kept constant.

* We'll follow the same scheme for the canonical form, which is the most widely used form in practice.

* We'll see that many of these families can be written in the natural form, but not all of them. For example, the three distributions mentioned above have no natural form.

We'll calculate the mean and the variance for those distributions with a natural form, as well as their variance function, which is a characteristic of the family among the natural families.

* We'll finally discover that although the normal distribution N(µ, σ²) generates a 2-parameter exponential family, the sub-family obtained by keeping the mean µ constant is not an exponential family, whereas the sub-family obtained by keeping the variance σ² constant is.

Each of the two sub-families obtained by keeping constant one of the two natural parameters η1 or η2 is an exponential family.

_______________

Recall that :

* The exponential and Chi-square distributions are special cases of the Gamma distribution.

* The Bernoulli distribution is a special case of the binomial distribution.

* The geometric distribution is a special case of the negative binomial distribution.

EXAMPLES OF EXPONENTIAL FAMILIES

 Binomial distribution
 Negative binomial distribution
 Poisson distribution
 Gamma distribution
 Beta distribution
 Normal distribution

_____________________________________________________
