In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including enabling the user to calculate expectations and covariances by differentiation, based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", as is the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: properly, ''an'' exponential family is a ''set'' of distributions, where the specific distribution varies with the parameter; however, a parametric ''family'' of distributions is often referred to as "''a'' distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family. Exponential families are distinguished by a variety of desirable properties, most importantly the existence of a sufficient statistic.

The concept of exponential families is credited to E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935–1936. Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.


Definition

Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.


Examples of exponential family distributions

Exponential families include many of the most common distributions. Among many others, exponential families include the following:

* normal
* exponential
* gamma
* chi-squared
* beta
* Dirichlet
* Bernoulli
* categorical
* Poisson
* Wishart
* inverse Wishart
* geometric

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:

* binomial (with fixed number of trials)
* multinomial (with fixed number of trials)
* negative binomial (with fixed number of failures)

Notice that in each case, the parameters which must be fixed determine a limit on the size of observation values.

Examples of common distributions that are ''not'' exponential families are Student's ''t'', most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.


Scalar parameter

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

: f_X(x\mid\theta) = h(x)\,\exp\bigl[\,\eta(\theta) \cdot T(x) - A(\theta)\,\bigr]

where ''T''(''x''), ''h''(''x''), ''η''(''θ''), and ''A''(''θ'') are known functions. The function ''h''(''x'') must of course be non-negative. An alternative, equivalent form often given is

: f_X(x\mid\theta) = h(x)\,g(\theta)\,\exp\bigl[\,\eta(\theta) \cdot T(x)\,\bigr]

or equivalently

: f_X(x\mid\theta) = \exp\bigl[\,\eta(\theta) \cdot T(x) - A(\theta) + B(x)\,\bigr]

The value ''θ'' is called the parameter of the family. In addition, the support of f_X\left( x \mid \theta \right) (i.e. the set of all x for which f_X\left( x \mid \theta \right) is greater than 0) does not depend on \theta. This can be used to exclude a parametric family distribution from being an exponential family. For example, the Pareto distribution has a pdf which is defined for x \geq x_m (x_m being the scale parameter) and its support therefore has a lower limit of x_m. Since the support of f_{x_m}(x) is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when x_m is unknown).

Often ''x'' is a vector of measurements, in which case ''T''(''x'') may be a function from the space of possible values of ''x'' to the real numbers. More generally, ''η''(''θ'') and ''T''(''x'') can each be vector-valued such that \eta(\theta) \cdot T(x) is real-valued. However, see the discussion below on vector parameters, regarding the curved exponential family.

If ''η''(''θ'') = ''θ'', then the exponential family is said to be in ''canonical form''. By defining a transformed parameter ''η'' = ''η''(''θ''), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since ''η''(''θ'') can be multiplied by any nonzero constant, provided that ''T''(''x'') is multiplied by that constant's reciprocal, or a constant ''c'' can be added to ''η''(''θ'') and ''h''(''x'') multiplied by \exp\bigl[\,-c \cdot T(x)\,\bigr] to offset it. In the special case that ''η''(''θ'') = ''θ'' and ''T''(''x'') = ''x'' then the family is called a natural exponential family.

Even when ''x'' is a scalar, and there is only a single parameter, the functions ''η''(''θ'') and ''T''(''x'') can still be vectors, as described below. The function ''A''(''θ''), or equivalently ''g''(''θ''), is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of ''η'', even when ''η''(''θ'') is not a one-to-one function, i.e. two or more different values of ''θ'' map to the same value of ''η''(''θ''), and hence ''η''(''θ'') cannot be inverted. In such a case, all values of ''θ'' mapping to the same ''η''(''θ'') will also have the same value for ''A''(''θ'') and ''g''(''θ'').
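To make the scalar definition concrete, here is a minimal sketch (assuming Python with NumPy and SciPy available; not part of the original article) that writes the Poisson family in the form h(x)\,\exp[\eta(\theta) T(x) - A(\theta)], with h(x) = 1/x!, T(x) = x, \eta(\theta) = \log\theta and A(\theta) = \theta, and checks it against a library pmf:

```python
# Minimal sketch: the Poisson family in exponential-family form,
# f(x | lam) = h(x) * exp(eta(lam) * T(x) - A(lam)),
# checked numerically against scipy's pmf.
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

def h(x):            # base measure: 1/x!
    return np.exp(-gammaln(x + 1))

def T(x):            # sufficient statistic
    return x

def eta(lam):        # natural parameter: log(lambda)
    return np.log(lam)

def A(lam):          # log-partition function: lambda itself
    return lam

lam = 3.7
x = np.arange(20)
pmf = h(x) * np.exp(eta(lam) * T(x) - A(lam))
assert np.allclose(pmf, poisson.pmf(x, lam))
```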


Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

: f(x),\quad g(\theta),\quad c^{f(x)},\quad c^{g(\theta)},\quad {[f(x)]}^c,\quad {[g(\theta)]}^c,\quad {[f(x)]}^{g(\theta)},\quad {[g(\theta)]}^{f(x)},\quad {[f(x)]}^{h(x) g(\theta)}, \text{ or } {[g(\theta)]}^{h(x) j(\theta)},

where ''f'' and ''h'' are arbitrary functions of ''x''; ''g'' and ''j'' are arbitrary functions of ''θ''; and ''c'' is an arbitrary "constant" expression (i.e. an expression not involving ''x'' or ''θ'').

There are further restrictions on how many such factors can occur. For example, the two expressions:

: {[f(x) g(\theta)]}^{h(x) j(\theta)}, \qquad {[f(x)]}^{h(x) j(\theta)} {[g(\theta)]}^{h(x) j(\theta)},

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

: {[f(x) g(\theta)]}^{h(x) j(\theta)} = {[f(x)]}^{h(x) j(\theta)} {[g(\theta)]}^{h(x) j(\theta)} = e^{[h(x)\log f(x)] j(\theta) + h(x) [j(\theta)\log g(\theta)]},

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a ''curved exponential family'', which allows multiple factorized terms in the exponent.)

To see why an expression of the form

: {[f(x)]}^{g(\theta)}

qualifies,

: {[f(x)]}^{g(\theta)} = e^{g(\theta) \log f(x)}

and hence factorizes inside of the exponent. Similarly,

: {[f(x)]}^{h(x) g(\theta)} = e^{[h(x) \log f(x)] g(\theta)}

and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form 1+f(x)g(\theta)) cannot be factorized in this fashion (except in some cases where it occurs directly in an exponent); this is why, for example, the Cauchy distribution and Student's ''t'' distribution are not exponential families.
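For instance (an illustrative decomposition, not in the original text), the Poisson mass function splits entirely into allowed factors:

: \frac{\lambda^x e^{-\lambda}}{x!} = \underbrace{\frac{1}{x!}}_{f(x)} \cdot \underbrace{e^{-\lambda}}_{g(\theta)} \cdot \underbrace{\lambda^x}_{{[g(\theta)]}^{f(x)}}\,,

where the last factor qualifies because \lambda^x = e^{x \log \lambda} factorizes inside the exponent.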


Vector parameter

The definition in terms of one ''real-number'' parameter can be extended to one ''real-vector'' parameter

: \boldsymbol\theta \equiv \left[\,\theta_1,\,\theta_2,\,\ldots,\,\theta_s\,\right]^{\mathsf T}~.

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

: f_X(x\mid\boldsymbol\theta) = h(x)\,\exp\left(\sum_{i=1}^s \eta_i(\boldsymbol\theta) T_i(x) - A(\boldsymbol\theta) \right)~,

or in a more compact form,

: f_X(x\mid\boldsymbol\theta) = h(x)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x) - A(\boldsymbol\theta) \Big)

This form writes the sum as a dot product of vector-valued functions \boldsymbol\eta(\boldsymbol\theta) and \mathbf{T}(x)\,. An alternative, equivalent form often seen is

: f_X(x\mid\boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x)\Big)

As in the scalar-valued case, the exponential family is said to be in ''canonical form'' if

: \eta_i(\boldsymbol\theta) = \theta_i \quad \forall i\,.

A vector exponential family is said to be ''curved'' if the dimension of

: \boldsymbol\theta \equiv \left[\,\theta_1,\,\theta_2,\,\ldots,\,\theta_d\,\right]^{\mathsf T}

is less than the dimension of the vector

: \boldsymbol\eta(\boldsymbol\theta) \equiv \left[\,\eta_1(\boldsymbol\theta),\,\eta_2(\boldsymbol\theta),\,\ldots,\,\eta_s(\boldsymbol\theta)\,\right]^{\mathsf T}~.

That is, if the ''dimension'' ''d'' of the parameter vector is less than the ''number of functions'' ''s'' of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are ''not'' curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.

As in the above case of a scalar-valued parameter, the function A(\boldsymbol\theta), or equivalently g(\boldsymbol\theta), is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of \boldsymbol\eta, regardless of the form of the transformation that generates \boldsymbol\eta from \boldsymbol\theta\,. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like

: f_X(x\mid\boldsymbol\eta) = h(x)\,\exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x) - A(\boldsymbol\eta)\Big)

or equivalently

: f_X(x\mid\boldsymbol\eta) = h(x)\,g(\boldsymbol\eta)\,\exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x)\Big)

The above forms may sometimes be seen with \boldsymbol\eta^{\mathsf T} \mathbf{T}(x) in place of \boldsymbol\eta \cdot \mathbf{T}(x)\,. These are exactly equivalent formulations, merely using different notation for the dot product.
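As a sketch of the vector-parameter form (assuming NumPy and SciPy; the parameter values are arbitrary), the gamma family can be evaluated through its natural parameters \boldsymbol\eta = (\alpha-1, -\beta), sufficient statistic \mathbf{T}(x) = (\log x, x), h(x) = 1 and A(\boldsymbol\eta) = \log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2):

```python
# Minimal sketch: the gamma density evaluated in natural form,
# log f(x | eta) = eta . T(x) - A(eta)   (h(x) = 1, so log h(x) = 0).
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma

alpha, beta = 2.5, 1.3                   # shape and rate
eta = np.array([alpha - 1.0, -beta])     # natural parameters

def log_pdf(x, eta):
    T = np.array([np.log(x), x])         # sufficient statistic
    A = gammaln(eta[0] + 1) - (eta[0] + 1) * np.log(-eta[1])
    return eta @ T - A

x = 0.8
assert np.isclose(log_pdf(x, eta), gamma.logpdf(x, a=alpha, scale=1/beta))
```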


Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar ''x'' replaced by the vector

: \mathbf{x} = \left( x_1, x_2, \cdots, x_k \right)^{\mathsf T}~.

The dimension ''k'' of the random variable need not match the dimension ''d'' of the parameter vector, nor (in the case of a curved exponential function) the dimension ''s'' of the natural parameter \boldsymbol\eta and sufficient statistic \mathbf{T}(\mathbf{x})\,. The distribution in this case is written as

: f_X\left(\mathbf{x}\mid\boldsymbol\theta\right) = h(\mathbf{x})\,\exp\left(\,\sum_{i=1}^s \eta_i(\boldsymbol\theta) T_i(\mathbf{x}) - A(\boldsymbol\theta)\,\right)

Or more compactly as

: f_X\left(\,\mathbf{x}\mid\boldsymbol\theta\,\right) = h(\mathbf{x}) \, \exp\Big(\,\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\,\Big)

Or alternatively as

: f_X\left(\,\mathbf{x}\mid\boldsymbol\theta\,\right) = g(\boldsymbol\theta) \; h(\mathbf{x}) \, \exp\Big(\,\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x})\,\Big)


Measure-theoretic formulation

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions. Suppose ''H'' is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to \mathrm{d}H(\mathbf{x}) are integrals with respect to the ''reference measure'' of the exponential family generated by ''H''. Any member of that exponential family has cumulative distribution function

: \mathrm{d}F\left(\,\mathbf{x}\mid\boldsymbol\theta\,\right) = \exp\bigl(\,\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x})\,-\,A(\boldsymbol\theta)\,\bigr) ~ \mathrm{d}H(\mathbf{x})~.

''H''(''x'') is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and ''H'' is actually the cumulative distribution function of a probability distribution. If ''F'' is absolutely continuous with a density f(x) with respect to a reference measure \mathrm{d}x (typically Lebesgue measure), one can write \mathrm{d}F(x) = f(x)\,\mathrm{d}x. In this case, ''H'' is also absolutely continuous and can be written \mathrm{d}H(x) = h(x)\,\mathrm{d}x so the formulas reduce to that of the previous paragraphs. If ''F'' is discrete, then ''H'' is a step function (with steps on the support of ''F''). Alternatively, we can write the probability measure directly as

: P\left(\,\mathrm{d}\mathbf{x}\mid\boldsymbol\theta\,\right) = \exp\bigl(\,\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\,\bigr) ~ \mu(\mathrm{d}\mathbf{x})~.

for some reference measure \mu\,.


Interpretation

In the definitions above, the functions ''T''(''x''), ''η''(''θ''), and ''A''(''η'') were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.

* ''T''(''x'') is a ''sufficient statistic'' of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data ''x'' provides with regard to the unknown parameter values. This means that, for any data sets x and y, the likelihood ratio is the same, that is \frac{f(x;\theta_1)}{f(x;\theta_2)} = \frac{f(y;\theta_1)}{f(y;\theta_2)} if T(x) = T(y). This is true even if ''x'' and ''y'' are quite distinct – that is, even if d(x,y) > 0\,. The dimension of ''T''(''x'') equals the number of parameters of ''θ'' and encompasses all of the information regarding the data related to the parameter ''θ''. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)
* ''η'' is called the ''natural parameter''. The set of values of ''η'' for which the function f_X(x;\eta) is integrable is called the ''natural parameter space''. It can be shown that the natural parameter space is always convex.
* ''A''(''η'') is called the ''log-partition function'' because it is the logarithm of a normalization factor, without which f_X(x;\theta) would not be a probability distribution:

:: A(\eta) = \log\left( \int_X h(x)\,\exp (\eta(\theta) \cdot T(x)) \, \mathrm{d}x \right)

The function ''A'' is important in its own right, because the mean, variance and other moments of the sufficient statistic ''T''(''x'') can be derived simply by differentiating ''A''(''η''). For example, because \log x is one of the components of the sufficient statistic of the gamma distribution, \operatorname{E}[\log x] can be easily determined for this distribution using A(\eta). Technically, this is true because

:: K\left( u\mid\eta \right) = A(\eta+u) - A(\eta)\,,

is the cumulant generating function of the sufficient statistic.
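As a numerical check of the moment property (a minimal sketch assuming NumPy; not from the original article): for the Poisson family A(\eta) = e^\eta, so the first two derivatives of ''A'' at \eta = \log\lambda should both equal \lambda.

```python
# Minimal sketch: mean and variance of the sufficient statistic T(x) = x
# obtained by (numerically) differentiating the log-partition function.
import numpy as np

A = np.exp               # Poisson log-partition function: A(eta) = e^eta
lam = 3.7
eta = np.log(lam)        # natural parameter
eps = 1e-5

mean = (A(eta + eps) - A(eta - eps)) / (2 * eps)            # A'(eta)
var = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2   # A''(eta)

assert np.isclose(mean, lam, atol=1e-3)   # E[x]  = lambda
assert np.isclose(var, lam, atol=1e-3)    # Var[x] = lambda
```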


Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that ''only'' exponential families have these properties. Examples:

* Exponential families are the only families with sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values. (Pitman–Koopman–Darmois theorem)
* Exponential families have conjugate priors, an important property in Bayesian statistics.
* The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential-family distribution can itself be written in closed form).
* In the mean-field approximation in variational Bayes (used for approximating the posterior distribution in large Bayesian networks), the best approximating posterior distribution of an exponential-family node (a node is a random variable in the context of Bayesian networks) with a conjugate prior is in the same family as the node.

Given an exponential family defined by f_X(x\mid\theta) = h(x)\,\exp\bigl[\,\theta \cdot T(x) - A(\theta)\,\bigr], where \Theta is the parameter space such that \theta\in\Theta\subset\R^k, the following hold:

* If \Theta has nonempty interior in \R^k, then given any IID samples X_1,\ldots,X_n\sim f_X, the statistic T(X_1,\ldots,X_n) := \sum_{i=1}^n T(X_i) is a complete statistic for \theta.
* T is a minimal statistic for \theta iff for all \theta_1, \theta_2\in\Theta, and x_1, x_2 in the support of X, if (\theta_1 - \theta_2)\cdot(T(x_1) - T(x_2)) = 0, then \theta_1 = \theta_2 or x_1 = x_2.

A small numerical illustration of sufficiency, using the summed statistic for iid data, follows below.
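The promised illustration (a minimal sketch assuming NumPy and SciPy; the sample values are arbitrary): for the normal family, the summed sufficient statistic is (\sum_i x_i, \sum_i x_i^2), so two different samples with equal sums and sums of squares have identical likelihoods at every parameter value.

```python
# Minimal sketch: iid likelihood depends on the data only through the
# summed sufficient statistic (sum x_i, sum x_i^2) for the normal family.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0])
s = np.sqrt(3.25)
y = np.array([(3.5 - s) / 2, 2.5, (3.5 + s) / 2])   # different sample, same T

assert np.isclose(x.sum(), y.sum())
assert np.isclose((x**2).sum(), (y**2).sum())

for mu, sigma in [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]:
    assert np.isclose(norm.logpdf(x, mu, sigma).sum(),
                      norm.logpdf(y, mu, sigma).sum())
```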


Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises and von Mises–Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound ''x''m form an exponential family. The families of binomial and multinomial distributions with fixed number of trials ''n'' but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) ''r'' is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in all of these cases, the parameter in question affects the support (particularly, changing the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor the continuous uniform distribution are exponential families as one or both bounds vary.

The Weibull distribution with fixed shape parameter ''k'' is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (''k'' appears in the exponent of an exponent).

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are ''not'' exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's ''t''-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distributions as exponential families.


Normal distribution: unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean ''μ'' and ''known'' variance ''σ''². The probability density function is then

: f_\sigma(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}.

This is a single-parameter exponential family, as can be seen by setting

: \begin{align} h_\sigma(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)} \\ T_\sigma(x) &= \frac{x}{\sigma} \\ A_\sigma(\mu) &= \frac{\mu^2}{2\sigma^2} \\ \eta_\sigma(\mu) &= \frac{\mu}{\sigma}. \end{align}

If ''σ'' = 1 this is in canonical form, as then ''η''(''μ'') = ''μ''.
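A quick numerical confirmation of this factorization (a minimal sketch assuming NumPy and SciPy; the parameter values are arbitrary):

```python
# Minimal sketch: h(x) * exp(eta * T(x) - A(mu)) reproduces the
# N(mu, sigma^2) density for known sigma.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.4, 2.0
x = np.linspace(-5, 5, 101)

h = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
T = x / sigma
eta = mu / sigma
A = mu**2 / (2 * sigma**2)

assert np.allclose(h * np.exp(eta * T - A), norm.pdf(x, mu, sigma))
```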


Normal distribution: unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

: f(y;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-\mu)^2/(2\sigma^2)}.

This is an exponential family which can be written in canonical form by defining

: \begin{align} \boldsymbol\eta &= \left[\,\frac{\mu}{\sigma^2},~-\frac{1}{2\sigma^2}\,\right] \\ h(y) &= \frac{1}{\sqrt{2\pi}} \\ T(y) &= \left( y, y^2 \right)^{\mathsf T} \\ A(\boldsymbol\eta) &= \frac{\mu^2}{2\sigma^2} + \log|\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log\left|\frac{1}{2\eta_2}\right| \end{align}
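The mapping between (\mu, \sigma^2) and \boldsymbol\eta is invertible, as the following sketch checks (assuming NumPy; values arbitrary):

```python
# Minimal sketch: recovering (mu, sigma^2) from the natural parameters
# eta = (mu / sigma^2, -1 / (2 sigma^2)) of the two-parameter normal family.
import numpy as np

mu, sigma2 = -0.7, 3.24
eta = np.array([mu / sigma2, -1 / (2 * sigma2)])

mu_back = -eta[0] / (2 * eta[1])
sigma2_back = -1 / (2 * eta[1])
assert np.isclose(mu_back, mu) and np.isclose(sigma2_back, sigma2)
```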


Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with ''known'' number of trials ''n''. The probability mass function for this distribution is

: f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.

This can equivalently be written as

: f(x) = \binom{n}{x} \exp\left(x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right),

which shows that the binomial distribution is an exponential family, whose natural parameter is

: \eta = \log\frac{p}{1-p}.

This function of ''p'' is known as the logit.
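Rebuilding the pmf from this form (a minimal sketch assuming NumPy and SciPy; values arbitrary):

```python
# Minimal sketch: the binomial pmf reconstructed from its exponential-family
# form with natural parameter eta = logit(p) and fixed number of trials n.
import numpy as np
from scipy.special import comb
from scipy.stats import binom

n, p = 10, 0.3
eta = np.log(p / (1 - p))        # natural parameter (logit of p)
x = np.arange(n + 1)

pmf = comb(n, x) * np.exp(x * eta + n * np.log(1 - p))
assert np.allclose(pmf, binom.pmf(x, n, p))
```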


Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards for main exponential families. For a scalar variable and scalar parameter, the form is as follows: : f_X(x\mid \theta) = h(x) \exp\Big(\eta() T(x) - A()\Big) For a scalar variable and vector parameter: : f_X(x\mid\boldsymbol \theta) = h(x) \exp\Big(\boldsymbol\eta() \cdot \mathbf(x) - A()\Big) : f_X(x\mid\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\boldsymbol\eta() \cdot \mathbf(x)\Big) For a vector variable and vector parameter: : f_X(\mathbf\mid\boldsymbol \theta) = h(\mathbf) \exp\Big(\boldsymbol\eta() \cdot \mathbf(\mathbf) - A()\Big) The above formulas choose the functional form of the exponential-family with a log-partition function A(). The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter \boldsymbol\theta instead of the natural parameter, and/or using a factor g(\boldsymbol\eta) outside of the exponential. The relation between the latter and the former is: :A(\boldsymbol\eta) = -\log g(\boldsymbol\eta) :g(\boldsymbol\eta) = e^ To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other. {, class="wikitable" ! Distribution ! Parameter(s) \boldsymbol\theta ! Natural parameter(s) \boldsymbol\eta ! Inverse parameter mapping ! Base measure h(x) ! Sufficient statistic T(x) ! Log-partition A(\boldsymbol\eta) ! Log-partition A(\boldsymbol\theta) , - ,
Bernoulli distribution In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli,James Victor Uspensky: ''Introduction to Mathematical Probability'', McGraw-Hill, New York 1937, page 45 is the discrete probabil ...
, , p , \log\frac{p}{1-p} *This is the
logit function In statistics, the logit ( ) function is the quantile function associated with the standard logistic distribution. It has many uses in data analysis and machine learning, especially in data transformations. Mathematically, the logit is the in ...
. , \frac{1}{1+e^{-\eta = \frac{e^\eta}{1+e^{\eta *This is the
logistic function A logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with equation f(x) = \frac, where For values of x in the domain of real numbers from -\infty to +\infty, the S-curve shown on the right is obtained, with the ...
. , 1 , x , \log (1+e^{\eta}) , -\log (1-p) , - , binomial distribution
with known number of trials n , , p , \log\frac{p}{1-p} , \frac{1}{1+e^{-\eta = \frac{e^\eta}{1+e^{\eta , {n \choose x} , x , n \log (1+e^{\eta}) , -n \log (1-p) , - ,
Poisson distribution In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known co ...
, , \lambda , \log\lambda , e^\eta , \frac{1}{x!} , x , e^{\eta} , \lambda , - , negative binomial distribution
with known number of failures r , , p , \log p , e^\eta , {x+r-1 \choose x} , x , -r \log (1-e^{\eta}) , -r \log (1-p) , - ,
exponential distribution In probability theory and statistics, the exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average ...
, , \lambda , -\lambda , -\eta , 1 , x , -\log(-\eta) , -\log\lambda , - , Pareto distribution
with known minimum value x_m , , \alpha , -\alpha-1 , -1-\eta , 1 , \log x , -\log (-1-\eta) + (1+\eta) \log x_{\mathrm m} , -\log \alpha - \alpha \log x_{\mathrm m} , - ,
Weibull distribution In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It is named after Swedish mathematician Waloddi Weibull, who described it in detail in 1951, although it was first identified by Maurice Re ...

with known shape , , \lambda , -\frac{1}{\lambda^k} , (-\eta)^{-\frac{1}{k , x^{k-1} , x^k , -\log(-\eta) -\log k , k\log\lambda -\log k , - , Laplace distribution
with known mean \mu , , b , -\frac{1}{b} , -\frac{1}{\eta} , 1 , , x-\mu, , \log\left(-\frac{2}{\eta}\right) , \log 2b , - ,
chi-squared distribution In probability theory and statistics, the chi-squared distribution (also chi-square or \chi^2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. The chi-squar ...
, , \nu , \frac{\nu}{2}-1 , 2(\eta+1) , e^{-\frac{x}{2 , \log x , \log \Gamma(\eta+1)+(\eta+1)\log 2 , \log \Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\log 2 , - ,
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...

known variance , , \mu , \frac{\mu}{\sigma} , \sigma\eta , \frac{e^{-\frac{x^2}{2\sigma^2}{\sqrt{2\pi}\sigma} , \frac{x}{\sigma} , \frac{\eta^2}{2} , \frac{\mu^2}{2\sigma^2} , - ,
continuous Bernoulli distribution In probability theory, statistics, and machine learning, the continuous Bernoulli distribution is a family of continuous probability distributions parameterized by a single shape parameter \lambda \in (0, 1), defined on the unit interval x \in ...
, , \lambda , \log\frac{\lambda}{1-\lambda} , \frac{e^\eta}{1+e^{\eta , 1 , x , \log\frac{e^\eta - 1}{\eta} , \log\left( \frac{1 - 2\lambda}{(1-\lambda)\log\left(\frac{1-\lambda}{\lambda}\right)} \right) , - ,
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
, , \mu,\ \sigma^2 , \begin{bmatrix} \dfrac{\mu}{\sigma^2} \\
0pt PT, Pt, or pt may refer to: Arts and entertainment * ''P.T.'' (video game), acronym for ''Playable Teaser'', a short video game released to promote the cancelled video game ''Silent Hills'' * Porcupine Tree, a British progressive rock group ...
-\dfrac{1}{2\sigma^2} \end{bmatrix} , \begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\ 5pt-\dfrac{1}{2\eta_2} \end{bmatrix} , \frac{1}{\sqrt{2\pi , \begin{bmatrix} x \\ x^2 \end{bmatrix} , -\frac{\eta_1^2}{4\eta_2} - \frac12\log(-2\eta_2) , \frac{\mu^2}{2\sigma^2} + \log \sigma , - ,
log-normal distribution In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable is log-normally distributed, then has a norma ...
, , \mu,\ \sigma^2 , \begin{bmatrix} \dfrac{\mu}{\sigma^2} \\
0pt PT, Pt, or pt may refer to: Arts and entertainment * ''P.T.'' (video game), acronym for ''Playable Teaser'', a short video game released to promote the cancelled video game ''Silent Hills'' * Porcupine Tree, a British progressive rock group ...
-\dfrac{1}{2\sigma^2} \end{bmatrix} , \begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\ 5pt-\dfrac{1}{2\eta_2} \end{bmatrix} , \frac{1}{\sqrt{2\pi}x} , \begin{bmatrix} \log x \\ (\log x)^2 \end{bmatrix} , -\frac{\eta_1^2}{4\eta_2} - \frac12\log(-2\eta_2) , \frac{\mu^2}{2\sigma^2} + \log \sigma , - ,
inverse Gaussian distribution In probability theory, the inverse Gaussian distribution (also known as the Wald distribution) is a two-parameter family of continuous probability distributions with support on (0,∞). Its probability density function is given by : f(x;\mu, ...
, , \mu,\ \lambda , \begin{bmatrix} -\dfrac{\lambda}{2\mu^2} \\ 5pt-\dfrac{\lambda}{2} \end{bmatrix} , \begin{bmatrix} \sqrt{\dfrac{\eta_2}{\eta_1 \\ 5pt-2\eta_2 \end{bmatrix} , \frac{1}{\sqrt{2\pi}x^{\frac{3}{2} , \begin{bmatrix} x \\ pt\dfrac{1}{x} \end{bmatrix} , -2\sqrt{\eta_1\eta_2} -\frac12\log(-2\eta_2) , -\frac{\lambda}{\mu} -\frac12\log\lambda , - , rowspan=2,
gamma distribution In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-square distribution are special cases of the gamma d ...
, , \alpha,\ \beta , \begin{bmatrix} \alpha-1 \\ -\beta \end{bmatrix} , \begin{bmatrix} \eta_1+1 \\ -\eta_2 \end{bmatrix} , rowspan=2, 1 , rowspan=2, \begin{bmatrix} \log x \\ x \end{bmatrix} , rowspan=2, \log \Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2) , \log \Gamma(\alpha)-\alpha\log\beta , - , k,\ \theta , \begin{bmatrix} k-1 \\ pt-\dfrac{1}{\theta} \end{bmatrix} , \begin{bmatrix} \eta_1+1 \\ pt-\dfrac{1}{\eta_2} \end{bmatrix} , \log \Gamma(k)+k\log\theta , - ,
inverse gamma distribution In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed according ...
, , \alpha,\ \beta , \begin{bmatrix} -\alpha-1 \\ -\beta \end{bmatrix} , \begin{bmatrix} -\eta_1-1 \\ -\eta_2 \end{bmatrix} , 1 , \begin{bmatrix} \log x \\ \frac{1}{x} \end{bmatrix} , \log \Gamma(-\eta_1-1)-(-\eta_1-1)\log(-\eta_2) , \log \Gamma(\alpha)-\alpha\log\beta , - ,
generalized inverse Gaussian distribution In probability theory and statistics, the generalized inverse Gaussian distribution (GIG) is a three-parameter family of continuous probability distributions with probability density function :f(x) = \frac x^ e^,\qquad x>0, where ''Kp'' is a mo ...
, , p,\ a,\ b , \begin{bmatrix} p-1 \\ -a/2 \\ -b/2 \end{bmatrix} , \begin{bmatrix} \eta_1+1 \\ -2\eta_2\\ -2\eta_3 \end{bmatrix} , 1 , \begin{bmatrix} \log x \\ x \\ \frac{1}{x} \end{bmatrix} , \log 2 K_{\eta_1+1}(\sqrt{4\eta_2\eta_3}) - \frac{\eta_1+1}{2}\log\frac{\eta_2}{\eta_3} , \log 2 K_{p}(\sqrt{ab}) - \frac{p}{2}\log\frac{a}{b} , - ,
scaled inverse chi-squared distribution The scaled inverse chi-squared distribution is the distribution for ''x'' = 1/''s''2, where ''s''2 is a sample mean of the squares of ν independent normal random variables that have mean 0 and inverse variance 1/σ2 = τ2. The distribu ...
, , \nu,\ \sigma^2 , \begin{bmatrix} -\dfrac{\nu}{2}-1 \\
0pt PT, Pt, or pt may refer to: Arts and entertainment * ''P.T.'' (video game), acronym for ''Playable Teaser'', a short video game released to promote the cancelled video game ''Silent Hills'' * Porcupine Tree, a British progressive rock group ...
-\dfrac{\nu\sigma^2}{2} \end{bmatrix} , \begin{bmatrix} -2(\eta_1+1) \\
0pt PT, Pt, or pt may refer to: Arts and entertainment * ''P.T.'' (video game), acronym for ''Playable Teaser'', a short video game released to promote the cancelled video game ''Silent Hills'' * Porcupine Tree, a British progressive rock group ...
\dfrac{\eta_2}{\eta_1+1} \end{bmatrix} , 1 , \begin{bmatrix} \log x \\ \frac{1}{x} \end{bmatrix} , \log \Gamma(-\eta_1-1)-(-\eta_1-1)\log(-\eta_2) , \log \Gamma\left(\frac{\nu}{2}\right)-\frac{\nu}{2}\log\frac{\nu\sigma^2}{2} , - ,
beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...


(variant 1) , , \alpha,\ \beta , \begin{bmatrix} \alpha \\ \beta \end{bmatrix} , \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} , \frac{1}{x(1-x)} , \begin{bmatrix} \log x \\ \log (1-x) \end{bmatrix} , \log \Gamma(\eta_1) + \log \Gamma(\eta_2) - \log \Gamma(\eta_1+\eta_2) , \log \Gamma(\alpha) + \log \Gamma(\beta) - \log \Gamma(\alpha+\beta) , - ,
beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...


(variant 2) , , \alpha,\ \beta , \begin{bmatrix} \alpha - 1 \\ \beta - 1 \end{bmatrix} , \begin{bmatrix} \eta_1 + 1 \\ \eta_2 + 1 \end{bmatrix} , 1 , \begin{bmatrix} \log x \\ \log (1-x) \end{bmatrix} , \log \Gamma(\eta_1 + 1) + \log \Gamma(\eta_2 + 1) - \log \Gamma(\eta_1 + \eta_2 + 2) , \log \Gamma(\alpha) + \log \Gamma(\beta) - \log \Gamma(\alpha+\beta) , - ,
multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One ...
, , \boldsymbol\mu,\ \boldsymbol\Sigma , \begin{bmatrix} \boldsymbol\Sigma^{-1}\boldsymbol\mu \\ pt-\frac12\boldsymbol\Sigma^{-1} \end{bmatrix} , \begin{bmatrix} -\frac12\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 \\ pt-\frac12\boldsymbol\eta_2^{-1} \end{bmatrix} , (2\pi)^{-\frac{k}{2 , \begin{bmatrix} \mathbf{x} \\ pt\mathbf{x}\mathbf{x}^{\mathsf T} \end{bmatrix} , -\frac{1}{4}\boldsymbol\eta_1^{\mathsf T}\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 - \frac12\log\left, -2\boldsymbol\eta_2\ , \frac12\boldsymbol\mu^{\mathsf T}\boldsymbol\Sigma^{-1}\boldsymbol\mu + \frac12 \log , \boldsymbol\Sigma, , - ,
categorical distribution In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can ...


(variant 1) , , p_1,\ \ldots,\,p_k

where \textstyle\sum_{i=1}^k p_i=1 , \begin{bmatrix} \log p_1 \\ \vdots \\ \log p_k \end{bmatrix} , \begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}

where \textstyle\sum_{i=1}^k e^{\eta_i}=1 , 1 , \begin{bmatrix} =1\\ \vdots \\ { =k \end{bmatrix} * =i/math> is the Iverson bracket* , 0 , 0 , - ,
categorical distribution In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can ...


(variant 2) , , p_1,\ \ldots,\,p_k

where \textstyle\sum_{i=1}^k p_i=1 , \begin{bmatrix} \log p_1+C \\ \vdots \\ \log p_k+C \end{bmatrix} , \begin{bmatrix} \dfrac{1}{C}e^{\eta_1} \\ \vdots \\ \dfrac{1}{C}e^{\eta_k} \end{bmatrix} =
\begin{bmatrix} \dfrac{e^{\eta_1{\sum_{i=1}^{k}e^{\eta_i \\
0pt PT, Pt, or pt may refer to: Arts and entertainment * ''P.T.'' (video game), acronym for ''Playable Teaser'', a short video game released to promote the cancelled video game ''Silent Hills'' * Porcupine Tree, a British progressive rock group ...
\vdots \\ pt\dfrac{e^{\eta_k{\sum_{i=1}^{k}e^{\eta_i \end{bmatrix} where \textstyle\sum_{i=1}^k e^{\eta_i}=C , 1 , \begin{bmatrix} =1\\ \vdots \\ { =k \end{bmatrix} * =i/math> is the Iverson bracket* , 0 , 0 , - ,
categorical distribution (variant 3)
:Parameters: p_1,\ \ldots,\ p_k, where p_k = 1 - \textstyle\sum_{i=1}^{k-1} p_i
:Natural parameters: \begin{bmatrix} \log \dfrac{p_1}{p_k} \\ \vdots \\ \log \dfrac{p_{k-1}}{p_k} \\ 0 \end{bmatrix} = \begin{bmatrix} \log \dfrac{p_1}{1-\sum_{i=1}^{k-1} p_i} \\ \vdots \\ \log \dfrac{p_{k-1}}{1-\sum_{i=1}^{k-1} p_i} \\ 0 \end{bmatrix} (this is the inverse softmax function, a generalization of the logit function)
:Inverse mapping: \begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_k}}{\sum_{i=1}^{k} e^{\eta_i}} \end{bmatrix} = \begin{bmatrix} \dfrac{e^{\eta_1}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_{k-1}}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \dfrac{1}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \end{bmatrix} (this is the softmax function, a generalization of the logistic function)
:Base measure: h(x) = 1
:Sufficient statistic: T(x) = \begin{bmatrix} [x=1] \\ \vdots \\ [x=k] \end{bmatrix}, where [x=i] is the Iverson bracket
:Log-partition: A(\boldsymbol\eta) = \log \left(\sum_{i=1}^{k} e^{\eta_i}\right) = \log \left(1+\sum_{i=1}^{k-1} e^{\eta_i}\right), \qquad A(\mathbf{p}) = -\log p_k = -\log \left(1 - \sum_{i=1}^{k-1} p_i\right)

multinomial distribution (variant 1), with known number of trials n
:Parameters: p_1,\ \ldots,\ p_k, where \textstyle\sum_{i=1}^k p_i = 1
:Natural parameters: \begin{bmatrix} \log p_1 \\ \vdots \\ \log p_k \end{bmatrix}
:Inverse mapping: \begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}, where \textstyle\sum_{i=1}^k e^{\eta_i} = 1
:Base measure: h(\mathbf{x}) = \dfrac{n!}{\prod_{i=1}^{k} x_i!}
:Sufficient statistic: T(\mathbf{x}) = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = 0, \qquad A(\mathbf{p}) = 0

multinomial distribution (variant 2), with known number of trials n
:Parameters: p_1,\ \ldots,\ p_k, where \textstyle\sum_{i=1}^k p_i = 1
:Natural parameters: \begin{bmatrix} \log p_1 + C \\ \vdots \\ \log p_k + C \end{bmatrix}
:Inverse mapping: \begin{bmatrix} \dfrac{1}{C} e^{\eta_1} \\ \vdots \\ \dfrac{1}{C} e^{\eta_k} \end{bmatrix} = \begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_k}}{\sum_{i=1}^{k} e^{\eta_i}} \end{bmatrix}, where \textstyle\sum_{i=1}^k e^{\eta_i} = C
:Base measure: h(\mathbf{x}) = \dfrac{n!}{\prod_{i=1}^{k} x_i!}
:Sufficient statistic: T(\mathbf{x}) = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = 0, \qquad A(\mathbf{p}) = 0

multinomial distribution (variant 3), with known number of trials n
:Parameters: p_1,\ \ldots,\ p_k, where p_k = 1 - \textstyle\sum_{i=1}^{k-1} p_i
:Natural parameters: \begin{bmatrix} \log \dfrac{p_1}{p_k} \\ \vdots \\ \log \dfrac{p_{k-1}}{p_k} \\ 0 \end{bmatrix} = \begin{bmatrix} \log \dfrac{p_1}{1-\sum_{i=1}^{k-1} p_i} \\ \vdots \\ \log \dfrac{p_{k-1}}{1-\sum_{i=1}^{k-1} p_i} \\ 0 \end{bmatrix}
:Inverse mapping: \begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_k}}{\sum_{i=1}^{k} e^{\eta_i}} \end{bmatrix} = \begin{bmatrix} \dfrac{e^{\eta_1}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_{k-1}}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \dfrac{1}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \end{bmatrix}
:Base measure: h(\mathbf{x}) = \dfrac{n!}{\prod_{i=1}^{k} x_i!}
:Sufficient statistic: T(\mathbf{x}) = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = n\log \left(\sum_{i=1}^{k} e^{\eta_i}\right) = n\log \left(1+\sum_{i=1}^{k-1} e^{\eta_i}\right), \qquad A(\mathbf{p}) = -n\log p_k = -n\log \left(1 - \sum_{i=1}^{k-1} p_i\right)

Dirichlet distribution (variant 1)
:Parameters: \alpha_1,\ \ldots,\ \alpha_k
:Natural parameters: \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}
:Inverse mapping: \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_k \end{bmatrix}
:Base measure: h(\mathbf{x}) = \dfrac{1}{\prod_{i=1}^k x_i}
:Sufficient statistic: T(\mathbf{x}) = \begin{bmatrix} \log x_1 \\ \vdots \\ \log x_k \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = \sum_{i=1}^k \log \Gamma(\eta_i) - \log \Gamma\left(\sum_{i=1}^k \eta_i \right), \qquad A(\boldsymbol\alpha) = \sum_{i=1}^k \log \Gamma(\alpha_i) - \log \Gamma\left(\sum_{i=1}^k \alpha_i\right)

Dirichlet distribution (variant 2)
:Parameters: \alpha_1,\ \ldots,\ \alpha_k
:Natural parameters: \begin{bmatrix} \alpha_1 - 1 \\ \vdots \\ \alpha_k - 1 \end{bmatrix}
:Inverse mapping: \begin{bmatrix} \eta_1 + 1 \\ \vdots \\ \eta_k + 1 \end{bmatrix}
:Base measure: h(\mathbf{x}) = 1
:Sufficient statistic: T(\mathbf{x}) = \begin{bmatrix} \log x_1 \\ \vdots \\ \log x_k \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = \sum_{i=1}^k \log \Gamma(\eta_i + 1) - \log \Gamma\left(\sum_{i=1}^k (\eta_i + 1) \right), \qquad A(\boldsymbol\alpha) = \sum_{i=1}^k \log \Gamma(\alpha_i) - \log \Gamma\left(\sum_{i=1}^k \alpha_i\right)

Wishart distribution
:Parameters: \mathbf{V},\ n
:Natural parameters: \begin{bmatrix} -\frac12\mathbf{V}^{-1} \\ \dfrac{n-p-1}{2} \end{bmatrix}
:Inverse mapping: \begin{bmatrix} -\frac12{\boldsymbol\eta_1}^{-1} \\ 2\eta_2+p+1 \end{bmatrix}
:Base measure: h(\mathbf{X}) = 1
:Sufficient statistic: T(\mathbf{X}) = \begin{bmatrix} \mathbf{X} \\ \log|\mathbf{X}| \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = -\left(\eta_2+\frac{p+1}{2}\right)\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right) = -\frac{n}{2}\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(\frac{n}{2}\right) = \left(\eta_2+\frac{p+1}{2}\right)(p\log 2 + \log|\mathbf{V}|) + \log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right), \qquad A(\mathbf{V}, n) = \frac{n}{2}(p\log 2 + \log|\mathbf{V}|) + \log\Gamma_p\left(\frac{n}{2}\right)
*Three variants of the log-partition function, with different parameterizations, are given, to facilitate computing moments of the sufficient statistics.
*Note: This uses the fact that {\rm tr}(\mathbf{A}^{\mathsf T}\mathbf{B}) = \operatorname{vec}(\mathbf{A}) \cdot \operatorname{vec}(\mathbf{B}), i.e. the trace of a matrix product is much like a dot product. The matrix parameters are assumed to be vectorized (laid out in a vector) when inserted into the exponential form. Also, \mathbf{V} and \mathbf{X} are symmetric, so e.g. \mathbf{V}^{\mathsf T} = \mathbf{V}.

inverse Wishart distribution
:Parameters: \boldsymbol\Psi,\ m
:Natural parameters: \begin{bmatrix} -\frac12\boldsymbol\Psi \\ -\dfrac{m+p+1}{2} \end{bmatrix}
:Inverse mapping: \begin{bmatrix} -2\boldsymbol\eta_1 \\ -(2\eta_2+p+1) \end{bmatrix}
:Base measure: h(\mathbf{X}) = 1
:Sufficient statistic: T(\mathbf{X}) = \begin{bmatrix} \mathbf{X}^{-1} \\ \log|\mathbf{X}| \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = \left(\eta_2 + \frac{p+1}{2}\right)\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(-\Big(\eta_2 + \frac{p+1}{2}\Big)\right) = -\frac{m}{2}\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(\frac{m}{2}\right) = -\left(\eta_2 + \frac{p+1}{2}\right)(p\log 2 - \log|\boldsymbol\Psi|) + \log\Gamma_p\left(-\Big(\eta_2 + \frac{p+1}{2}\Big)\right), \qquad A(\boldsymbol\Psi, m) = \frac{m}{2}(p\log 2 - \log|\boldsymbol\Psi|) + \log\Gamma_p\left(\frac{m}{2}\right)

normal-gamma distribution
:Parameters: \alpha,\ \beta,\ \mu,\ \lambda
:Natural parameters: \begin{bmatrix} \alpha-\frac12 \\ -\beta-\dfrac{\lambda\mu^2}{2} \\ \lambda\mu \\ -\dfrac{\lambda}{2} \end{bmatrix}
:Inverse mapping: \begin{bmatrix} \eta_1+\frac12 \\ -\eta_2 + \dfrac{\eta_3^2}{4\eta_4} \\ -\dfrac{\eta_3}{2\eta_4} \\ -2\eta_4 \end{bmatrix}
:Base measure: h(x,\tau) = \dfrac{1}{\sqrt{2\pi}}
:Sufficient statistic: T(x,\tau) = \begin{bmatrix} \log \tau \\ \tau \\ \tau x \\ \tau x^2 \end{bmatrix}
:Log-partition: A(\boldsymbol\eta) = \log \Gamma\left(\eta_1+\frac12\right) - \frac12\log\left(-2\eta_4\right) - \left(\eta_1+\frac12\right)\log\left(-\eta_2 + \dfrac{\eta_3^2}{4\eta_4}\right), \qquad A(\alpha,\beta,\mu,\lambda) = \log \Gamma\left(\alpha\right) - \alpha\log\beta - \frac12\log\lambda

*The Iverson bracket [x=i] is a generalization of the discrete delta-function: if the bracketed expression is true, the bracket has value 1; if the enclosed statement is false, the Iverson bracket is zero. There are many variant notations, e.g. wavy brackets, which are equivalent to the notation used above.

The three variants of the categorical distribution and multinomial distribution are due to the fact that the parameters p_i are constrained, such that
:\sum_{i=1}^{k} p_i = 1~.
Thus, there are only k-1 independent parameters.
*Variant 1 uses k natural parameters with a simple relation between the standard and natural parameters; however, only k-1 of the natural parameters are independent, and the set of k natural parameters is nonidentifiable. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
*Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent of the constant that is added.
*Variant 3 shows how to make the parameters identifiable in a convenient way by setting C = -\log p_k. This effectively "pivots" around p_k and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access p_k, so that effectively the model has only k-1 parameters, both of the usual and natural kind.

Variants 1 and 2 are not actually standard exponential families at all. Rather, they are ''curved exponential families'', i.e. there are k-1 independent parameters embedded in a k-dimensional parameter space. Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function A(\boldsymbol\eta), which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the ith sufficient statistic should be p_i. (This does emerge correctly when using the form of A(\boldsymbol\eta) shown in variant 3.)
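
To make the relationship between variant 3's parameter mappings and the softmax concrete, here is a minimal numerical sketch (in Python with NumPy; the function names are ours, not part of any standard library):

```python
import numpy as np

def categorical_to_natural(p):
    # Variant-3 natural parameters: eta_i = log(p_i / p_k); the last entry is 0.
    return np.log(p / p[-1])

def natural_to_categorical(eta):
    # Inverse mapping: the softmax function.
    e = np.exp(eta - eta.max())   # subtracting the max does not change the result
    return e / e.sum()

p = np.array([0.2, 0.5, 0.3])
eta = categorical_to_natural(p)                     # last component is 0 by construction
print(np.allclose(natural_to_categorical(eta), p))  # True
```

Because the last natural parameter is identically 0, only k-1 numbers are actually free, matching the identifiability discussion above.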


Moments and cumulants of the sufficient statistic


Normalization of the distribution

We start with the normalization of the probability distribution. In general, any non-negative function ''f''(''x'') that serves as the kernel of a probability distribution (the part encoding all dependence on ''x'') can be made into a proper distribution by normalizing: i.e.
:p(x) = \frac{1}{Z} f(x)
where
:Z = \int_x f(x) \,dx.
The factor ''Z'' is sometimes termed the ''normalizer'' or ''partition function'', based on an analogy to statistical physics. In the case of an exponential family where
:p(x; \boldsymbol\eta) = g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)},
the kernel is
:K(x) = h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)}
and the partition function is
:Z = \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} \,dx.
Since the distribution must be normalized, we have
:1 = \int_x g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)}\, dx = g(\boldsymbol\eta) \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} \,dx = g(\boldsymbol\eta) Z.
In other words,
:g(\boldsymbol\eta) = \frac{1}{Z}
or equivalently
:A(\boldsymbol\eta) = - \log g(\boldsymbol\eta) = \log Z.
This justifies calling ''A'' the ''log-normalizer'' or ''log-partition function''.
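
As a quick numerical illustration of the relation A = log Z (a sketch not tied to any particular family in the table, using SciPy for the integral), one can normalize a Gaussian kernel directly:

```python
import numpy as np
from scipy.integrate import quad

# Kernel of a standard normal: all dependence on x, without the normalizing constant.
f = lambda x: np.exp(-0.5 * x**2)

Z, _ = quad(f, -np.inf, np.inf)   # partition function
print(Z, np.sqrt(2 * np.pi))      # both ~2.50663
print(np.log(Z))                  # the log-partition A = log Z, ~0.9189
```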


Moment-generating function of the sufficient statistic

Now, the moment-generating function of ''T''(''x'') is
:M_T(u) \equiv \operatorname{E}\left[e^{u^\top T(x)} \mid \eta\right] = \int_x h(x) e^{(\eta+u)^\top T(x)-A(\eta)} \,dx = e^{A(\eta + u)-A(\eta)},
proving the earlier statement that
:K(u\mid\eta) = A(\eta+u) - A(\eta)
is the cumulant generating function for ''T''. An important subclass of exponential families are the natural exponential families, which have a similar form for the moment-generating function for the distribution of ''x''.
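
For instance, the identity M_T(u) = e^{A(\eta+u)-A(\eta)} can be checked directly for the Bernoulli family, whose log-partition function is A(η) = log(1 + e^η) (a sketch with arbitrary parameter values):

```python
import numpy as np

p, u = 0.3, 0.7
eta = np.log(p / (1 - p))            # natural parameter of Bernoulli(p)
A = lambda t: np.log1p(np.exp(t))    # log-partition function

mgf_direct = (1 - p) + p * np.exp(u)        # E[e^{uX}], summed over x in {0, 1}
mgf_via_A = np.exp(A(eta + u) - A(eta))     # e^{A(eta+u) - A(eta)}
print(np.allclose(mgf_direct, mgf_via_A))   # True
```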


Differential identities for cumulants

In particular, using the properties of the cumulant generating function,
:\operatorname{E}(T_{j}) = \frac{ \partial A(\eta) }{ \partial \eta_{j} }
and
:\operatorname{cov}\left (T_i,\ T_j \right) = \frac{ \partial^2 A(\eta) }{ \partial \eta_i \, \partial \eta_j }.
The first two raw moments and all mixed second moments can be recovered from these two identities. Higher-order moments and cumulants are obtained by higher derivatives. This technique is often useful when ''T'' is a complicated function of the data, whose moments are difficult to calculate by integration. Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally. In the one-dimensional case, we have
:p(x) = g(\eta) h(x) e^{\eta T(x)}.
This must be normalized, so
:1 = \int_x p(x) \,dx = \int_x g(\eta) h(x) e^{\eta T(x)} \,dx = g(\eta) \int_x h(x) e^{\eta T(x)} \,dx.
Take the derivative of both sides with respect to ''η'':
:\begin{align} 0 &= g(\eta) \frac{d}{d\eta} \int_x h(x) e^{\eta T(x)} \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \,dx \\ &= g(\eta) \int_x h(x) \left(\frac{d}{d\eta} e^{\eta T(x)}\right) \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \,dx \\ &= g(\eta) \int_x h(x) e^{\eta T(x)} T(x) \,dx + g'(\eta)\int_x h(x) e^{\eta T(x)} \,dx \\ &= \int_x T(x) g(\eta) h(x) e^{\eta T(x)} \,dx + \frac{g'(\eta)}{g(\eta)}\int_x g(\eta) h(x) e^{\eta T(x)} \,dx \\ &= \int_x T(x) p(x) \,dx + \frac{g'(\eta)}{g(\eta)}\int_x p(x) \,dx \\ &= \operatorname{E}[T(x)] + \frac{g'(\eta)}{g(\eta)} \\ &= \operatorname{E}[T(x)] + \frac{d}{d\eta} \log g(\eta) \end{align}
Therefore,
:\operatorname{E}[T(x)] = - \frac{d}{d\eta} \log g(\eta) = \frac{d}{d\eta} A(\eta).
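
These identities are easy to sanity-check numerically. The following sketch uses the Poisson family, for which A(η) = e^η with η = log λ, and approximates the derivatives by finite differences (the step size h is an arbitrary choice):

```python
import numpy as np

lam = 2.5
eta, h = np.log(lam), 1e-5
A = np.exp                    # Poisson log-partition function: A(eta) = e^eta

mean = (A(eta + h) - A(eta - h)) / (2 * h)            # dA/deta      ~ E[T] = lam
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # d^2A/deta^2  ~ Var[T] = lam
print(mean, var)              # both ~2.5, the Poisson mean and variance
```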


Example 1

As an introductory example, consider the gamma distribution, whose distribution is defined by
:p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}.
Referring to the above table, we can see that the natural parameter is given by
:\eta_1 = \alpha-1,
:\eta_2 = -\beta,
the reverse substitutions are
:\alpha = \eta_1+1,
:\beta = -\eta_2,
the sufficient statistics are (\log x, x), and the log-partition function is
:A(\eta_1,\eta_2) = \log \Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2).
We can find the mean of the sufficient statistics as follows. First, for ''η''1:
:\begin{align} \operatorname{E}[\log x] &= \frac{ \partial A(\eta_1,\eta_2) }{ \partial \eta_1 } = \frac{ \partial }{ \partial \eta_1 } \left(\log\Gamma(\eta_1+1) - (\eta_1+1) \log(-\eta_2)\right) \\ &= \psi(\eta_1+1) - \log(-\eta_2) \\ &= \psi(\alpha) - \log \beta, \end{align}
where \psi(x) is the digamma function (the derivative of the log gamma function), and we used the reverse substitutions in the last step. Now, for ''η''2:
:\begin{align} \operatorname{E}[x] &= \frac{ \partial A(\eta_1,\eta_2) }{ \partial \eta_2 } = \frac{ \partial }{ \partial \eta_2 } \left(\log \Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2)\right) \\ &= -(\eta_1+1)\frac{1}{-\eta_2}(-1) = \frac{\eta_1+1}{-\eta_2} \\ &= \frac{\alpha}{\beta}, \end{align}
again making the reverse substitution in the last step. To compute the variance of ''x'', we just differentiate again:
:\begin{align} \operatorname{Var}(x) &= \frac{\partial^2 A\left(\eta_1,\eta_2 \right)}{\partial \eta_2^2} = \frac{\partial}{\partial \eta_2} \frac{\eta_1+1}{-\eta_2} \\ &= \frac{\eta_1+1}{\eta_2^2} \\ &= \frac{\alpha}{\beta^2}. \end{align}
All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.
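
The closed forms above can also be verified by simulation; a sketch using NumPy and SciPy, with arbitrary parameter values:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0
x = rng.gamma(shape=alpha, scale=1 / beta, size=1_000_000)

print(np.mean(np.log(x)), digamma(alpha) - np.log(beta))   # E[log x]: ~0.23 both
print(np.mean(x), alpha / beta)                            # E[x]:     ~1.5 both
print(np.var(x), alpha / beta**2)                          # Var(x):   ~0.75 both
```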


Example 2

As another example, consider a real-valued random variable ''X'' with density
:p_\theta (x) = \frac{ \theta e^{-x} }{\left(1 + e^{-x} \right)^{\theta + 1} }
indexed by shape parameter \theta \in (0,\infty) (this is called the skew-logistic distribution). The density can be rewritten as
:\frac{ e^{-x} } { 1 + e^{-x} } \exp\left( -\theta \log\left(1 + e^{-x} \right) + \log(\theta)\right).
Notice this is an exponential family with natural parameter
:\eta = -\theta,
sufficient statistic
:T = \log\left (1 + e^{-x} \right),
and log-partition function
:A(\eta) = -\log(\theta) = -\log(-\eta).
So using the first identity,
:\operatorname{E}(\log(1 + e^{-X})) = \operatorname{E}(T) = \frac{ \partial A(\eta) }{ \partial \eta } = \frac{ \partial }{ \partial \eta } \left[-\log(-\eta)\right] = \frac{1}{-\eta} = \frac{1}{\theta},
and using the second identity,
:\operatorname{var}(\log\left(1 + e^{-X} \right)) = \frac{ \partial^2 A(\eta) }{ \partial \eta^2 } = \frac{ \partial }{ \partial \eta } \left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.
This example illustrates a case where using this method is very simple, but the direct calculation would be nearly impossible.
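
While the analytic integral is hard, the identities can still be sanity-checked by simulation. Integrating the density gives the CDF F(x) = (1 + e^{-x})^{-θ}, which can be inverted for sampling; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
u = rng.uniform(size=1_000_000)
x = -np.log(u**(-1 / theta) - 1)   # inverse-CDF sampling from F(x) = (1 + e^{-x})^{-theta}

t = np.logaddexp(0, -x)            # sufficient statistic T = log(1 + e^{-x})
print(np.mean(t), 1 / theta)       # ~0.5 both
print(np.var(t), 1 / theta**2)     # ~0.25 both
```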


Example 3

The final example is one where integration would be extremely difficult. This is the case of the Wishart distribution, which is defined over matrices. Even taking derivatives is a bit tricky, as it involves matrix calculus, but the respective identities are listed in that article. From the above table, we can see that the natural parameter is given by
:\boldsymbol\eta_1 = -\frac12\mathbf{V}^{-1},
:\eta_2 = \frac{n-p-1}{2},
the reverse substitutions are
:\mathbf{V} = -\frac12{\boldsymbol\eta_1}^{-1},
:n = 2\eta_2+p+1,
and the sufficient statistics are (\mathbf{X}, \log|\mathbf{X}|). The log-partition function is written in various forms in the table, to facilitate differentiation and back-substitution. We use the following forms:
:A(\boldsymbol\eta_1, n) = -\frac{n}{2}\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(\frac{n}{2}\right),
:A(\mathbf{V},\eta_2) = \left(\eta_2+\frac{p+1}{2}\right)(p\log 2 + \log|\mathbf{V}|) + \log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right).
;Expectation of \mathbf{X} (associated with ''η''1)
To differentiate with respect to \boldsymbol\eta_1, we need the following matrix calculus identity:
:\frac{\partial \log |a\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^{\mathsf T}
Then:
:\begin{align} \operatorname{E}[\mathbf{X}] &= \frac{ \partial A\left(\boldsymbol\eta_1,\cdots \right) }{ \partial \boldsymbol\eta_1 } \\ &= \frac{ \partial }{ \partial \boldsymbol\eta_1 } \left[-\frac{n}{2}\log|-\boldsymbol\eta_1| + \log\Gamma_p\left(\frac{n}{2}\right)\right] \\ &= -\frac{n}{2}(\boldsymbol\eta_1^{-1})^{\mathsf T} \\ &= \frac{n}{2}(-\boldsymbol\eta_1^{-1})^{\mathsf T} \\ &= n(\mathbf{V})^{\mathsf T} \\ &= n\mathbf{V} \end{align}
The last line uses the fact that \mathbf{V} is symmetric, and therefore it is the same when transposed.
;Expectation of \log|\mathbf{X}| (associated with ''η''2)
Now, for ''η''2, we first need to expand the part of the log-partition function that involves the multivariate gamma function:
:\log \Gamma_p(a) = \log \left(\pi^{\frac{p(p-1)}{4}}\prod_{j=1}^p \Gamma\left(a+\frac{1-j}{2}\right)\right) = \frac{p(p-1)}{4} \log \pi + \sum_{j=1}^p \log \Gamma\left(a+\frac{1-j}{2}\right)
We also need the digamma function:
:\psi(x) = \frac{d}{dx} \log \Gamma(x).
Then:
:\begin{align} \operatorname{E}[\log|\mathbf{X}|] &= \frac{\partial A\left (\ldots,\eta_2 \right)}{\partial \eta_2} \\ &= \frac{\partial}{\partial \eta_2} \left[\left(\eta_2+\frac{p+1}{2}\right)(p\log 2 + \log|\mathbf{V}|) + \log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right)\right] \\ &= \frac{\partial}{\partial \eta_2} \left[\left(\eta_2+\frac{p+1}{2}\right)(p\log 2 + \log|\mathbf{V}|) + \frac{p(p-1)}{4} \log \pi + \sum_{j=1}^p \log \Gamma\left(\eta_2+\frac{p+1}{2}+\frac{1-j}{2}\right)\right] \\ &= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi\left(\eta_2+\frac{p+1}{2}+\frac{1-j}{2}\right) \\ &= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi\left(\frac{n-p-1}{2}+\frac{p+1}{2}+\frac{1-j}{2}\right) \\ &= p\log 2 + \log|\mathbf{V}| + \sum_{j=1}^p \psi\left(\frac{n+1-j}{2}\right) \end{align}
This latter formula is listed in the Wishart distribution article. Both of these expectations are needed when deriving the variational Bayes update equations in a Bayes network involving a Wishart distribution (which is the conjugate prior of the multivariate normal distribution). Computing these formulas using integration would be much more difficult. The first one, for example, would require matrix integration.
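
Both results can nonetheless be checked by simulation, e.g. with SciPy's Wishart sampler (a sketch with arbitrary V and n):

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import digamma

n, V = 7, np.array([[2.0, 0.3], [0.3, 1.0]])
p = V.shape[0]
X = wishart.rvs(df=n, scale=V, size=200_000, random_state=0)

print(X.mean(axis=0))   # ~ n * V, the first expectation

sample = np.linalg.slogdet(X)[1].mean()   # Monte Carlo estimate of E[log|X|]
exact = p * np.log(2) + np.linalg.slogdet(V)[1] \
        + sum(digamma((n + 1 - j) / 2) for j in range(1, p + 1))
print(sample, exact)    # agree to Monte Carlo error
```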


Entropy


Relative entropy

The relative entropy (Kullback–Leibler divergence, KL divergence) of two distributions in an exponential family has a simple expression as the Bregman divergence between the natural parameters with respect to the log-normalizer. The relative entropy is defined in terms of an integral, while the Bregman divergence is defined in terms of a derivative and inner product, and thus is easier to calculate and has a closed-form expression (assuming the derivative has a closed-form expression). Further, the Bregman divergence in terms of the natural parameters and the log-normalizer equals the Bregman divergence of the dual parameters (expectation parameters), in the opposite order, for the convex conjugate function. Fixing an exponential family with log-normalizer A (with convex conjugate A^*), writing P_{A,\theta} for the distribution in this family corresponding to a fixed value of the natural parameter \theta (writing \theta' for another value, and with \eta, \eta' for the corresponding dual expectation/moment parameters), writing \rm{KL} for the KL divergence, and B_A for the Bregman divergence, the divergences are related as:
:\rm{KL}(P_{A,\theta} \parallel P_{A,\theta'}) = B_A(\theta' \parallel \theta) = B_{A^*}(\eta \parallel \eta').
The KL divergence is conventionally written with respect to the ''first'' parameter, while the Bregman divergence is conventionally written with respect to the ''second'' parameter, and thus this can be read as "the relative entropy is equal to the Bregman divergence defined by the log-normalizer on the swapped natural parameters", or equivalently as "equal to the Bregman divergence defined by the dual to the log-normalizer on the expectation parameters".
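
For a one-parameter illustration, a sketch using the Poisson family, where A(η) = e^η and the KL divergence has the well-known closed form λ' − λ + λ log(λ/λ'):

```python
import numpy as np

lam, lam2 = 2.0, 3.5
eta, eta2 = np.log(lam), np.log(lam2)
A, dA = np.exp, np.exp              # A(eta) = e^eta, so A'(eta) = e^eta as well

kl = lam2 - lam + lam * np.log(lam / lam2)           # KL(Poisson(lam) || Poisson(lam2))
bregman = A(eta2) - A(eta) - dA(eta) * (eta2 - eta)  # B_A(eta' || eta)
print(np.allclose(kl, bregman))                      # True: KL equals the swapped Bregman
```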


Maximum-entropy derivation

Exponential families arise naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values? The information entropy of a probability distribution ''dF''(''x'') can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a ''reference measure'' ''dH''(''x'') with the same support as ''dF''(''x''). The entropy of ''dF''(''x'') relative to ''dH''(''x'') is
:S[dF\mid dH] = -\int \frac{dF}{dH}\log\frac{dF}{dH}\,dH
or
:S[dF\mid dH] = \int\log\frac{dH}{dF}\,dF
where ''dF''/''dH'' and ''dH''/''dF'' are Radon–Nikodym derivatives. The ordinary definition of entropy for a discrete distribution supported on a set ''I'', namely
:S=-\sum_{i\in I} p_i\log p_i
''assumes'', though this is seldom pointed out, that ''dH'' is chosen to be the counting measure on ''I''. Consider now a collection of observable quantities (random variables) ''Ti''. The probability distribution ''dF'' whose entropy with respect to ''dH'' is greatest, subject to the conditions that the expected value of ''Ti'' be equal to ''ti'', is an exponential family with ''dH'' as reference measure and (''T''1, ..., ''Tn'') as sufficient statistic. The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting ''T''0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to ''T''0. For examples of such derivations, see Maximum entropy probability distribution.
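
As a small worked instance of this variational problem (a sketch: finite support, counting measure as reference, one moment constraint), one can solve for the Lagrange multiplier numerically:

```python
import numpy as np
from scipy.optimize import brentq

# Maximum-entropy distribution on {0,...,5} with E[X] constrained to 2.1:
# the solution has the exponential-family form p_i proportional to exp(eta * i).
support = np.arange(6)

def mean_given(eta):
    w = np.exp(eta * support)
    return (support * w).sum() / w.sum()

eta = brentq(lambda e: mean_given(e) - 2.1, -10.0, 10.0)   # Lagrange multiplier
p = np.exp(eta * support)
p /= p.sum()
print(p @ support)   # 2.1, as constrained
```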


Role in statistics


Classical estimation: sufficiency

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose ''Xk'', (where ''k'' = 1, 2, 3, ... ''n'') are independent, identically distributed random variables. Only if their distribution is one of the ''exponential family'' of distributions is there a sufficient statistic ''T''(''X''1, ..., ''Xn'') whose number of scalar components does not increase as the sample size ''n'' increases; the statistic ''T'' may be a vector or a single scalar number, but whatever it is, its size will neither grow nor shrink when more data are obtained. As a counterexample, if these conditions are relaxed, the family of uniform distributions (either discrete or continuous, with either or both bounds unknown) has a sufficient statistic, namely the sample maximum, sample minimum, and sample size, but does not form an exponential family, as the domain varies with the parameters.


Bayesian estimation: conjugate distributions

Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to an exponential family there exists a conjugate prior, which is often also in an exponential family. A conjugate prior π for the parameter \boldsymbol\eta of an exponential family
:f(x\mid\boldsymbol\eta) = h(x) \exp \left ( {\boldsymbol\eta}^{\rm T}\mathbf{T}(x) -A(\boldsymbol\eta)\right )
is given by
:p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi - \nu A(\boldsymbol\eta) \right ),
or equivalently
:p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi \right ), \qquad \boldsymbol\chi \in \mathbb{R}^s
where ''s'' is the dimension of \boldsymbol\eta and \nu > 0 and \boldsymbol\chi are hyperparameters (parameters controlling parameters). \nu corresponds to the effective number of observations that the prior distribution contributes, and \boldsymbol\chi corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. f(\boldsymbol\chi,\nu) is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). A(\boldsymbol\eta) and equivalently g(\boldsymbol\eta) are the same functions as in the definition of the distribution over which π is the conjugate prior.

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.

An arbitrary likelihood will not belong to an exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

To show that the above prior distribution is a conjugate prior, we can derive the posterior. First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter:
:p_F(x\mid\boldsymbol \eta) = h(x) g(\boldsymbol\eta) \exp\left(\boldsymbol\eta^{\rm T} \mathbf{T}(x)\right)
Then, for data \mathbf{X} = (x_1,\ldots,x_n), the likelihood is computed as follows:
:p(\mathbf{X}\mid\boldsymbol\eta) =\left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T}\sum_{i=1}^n \mathbf{T}(x_i) \right)
Then, for the above conjugate prior:
:\begin{align}p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) &= f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \propto g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi)\end{align}
We can then compute the posterior as follows:
:\begin{align} p(\boldsymbol\eta\mid\mathbf{X},\boldsymbol\chi,\nu)& \propto p(\mathbf{X}\mid\boldsymbol\eta) p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) \\ &= \left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T} \sum_{i=1}^n \mathbf{T}(x_i)\right) f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \\ &\propto g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T}\sum_{i=1}^n \mathbf{T}(x_i)\right) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \\ &\propto g(\boldsymbol\eta)^{\nu + n} \exp\left(\boldsymbol\eta^{\rm T} \left(\boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i)\right)\right) \end{align}
The last line is the kernel of the posterior distribution, i.e.
:p(\boldsymbol\eta\mid\mathbf{X},\boldsymbol\chi,\nu) = p_\pi\left(\boldsymbol\eta \,\middle|\, \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i),\ \nu + n\right)
This shows that the posterior has the same form as the prior. The data \mathbf{X} enters into this equation ''only'' in the expression
:\mathbf{T}(\mathbf{X}) = \sum_{i=1}^n \mathbf{T}(x_i),
which is termed the sufficient statistic of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size; it has only as many components as the components of \boldsymbol\eta (equivalently, the number of parameters of the distribution of a single data point). The update equations are as follows:
:\begin{align} \boldsymbol\chi' &= \boldsymbol\chi + \mathbf{T}(\mathbf{X}) \\ &= \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i) \\ \nu' &= \nu + n \end{align}
This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. This can be seen clearly in the various examples of update equations shown in the conjugate prior page. Because of the way that the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms; a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions don't exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form, often specifically because the above form is defined over the natural parameter \boldsymbol\eta while conjugate priors are usually defined over the actual parameter \boldsymbol\theta.
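
As a minimal sketch of these update equations, consider the Bernoulli family, where T(x) = x. Mapping the natural-form hyperparameters back to the mean parameter gives a Beta(χ, ν − χ) prior on p (modulo the parameterization caveat just discussed), so the posterior mean is χ'/ν':

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)   # 50 Bernoulli observations with true p = 0.7

chi, nu = 1.0, 2.0                  # prior pseudo-observations: a uniform Beta(1, 1)
chi_post = chi + x.sum()            # chi' = chi + T(X) = chi + sum of the data
nu_post = nu + x.size               # nu'  = nu + n
print(chi_post / nu_post)           # posterior mean of p, close to 0.7
```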


Hypothesis testing: uniformly most powerful tests

A one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic ''T''(''x''), provided that ''η''(''θ'') is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis ''H''0: ''θ'' ≥ ''θ''0 ''vs''. ''H''1: ''θ'' < ''θ''0.


Generalized linear models

Exponential families form the basis for the distribution functions used in generalized linear models (GLM), a class of model that encompasses many of the commonly used regression models in statistics. Examples include logistic regression using the binomial family and Poisson regression.
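
A minimal GLM sketch (assuming the statsmodels package is available; the data here are synthetic):

```python
import numpy as np
import statsmodels.api as sm

# Poisson regression: the canonical (log) link models the natural parameter
# eta = log(mu) as a linear function of the covariates.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
y = rng.poisson(np.exp(0.5 + 1.2 * z))   # synthetic counts, true coefficients (0.5, 1.2)

X = sm.add_constant(z)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)   # close to [0.5, 1.2]
```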


See also

* Exponential dispersion model
* Gibbs measure
* Modified half-normal distribution
* Natural exponential family




External links


A primer on the exponential family of distributions




jMEF: A Java library for exponential families