In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the
principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of
prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.
Definition of entropy and differential entropy
If X is a continuous random variable with probability density p(x), then the differential entropy of X is defined as

H(X) = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx.
If X is a discrete random variable with distribution given by

\operatorname{Pr}(X = x_k) = p_k \quad \text{for } k = 1, 2, \ldots

then the entropy of X is defined as

H(X) = -\sum_{k \ge 1} p_k \log p_k.

The seemingly divergent term p_k \log p_k is replaced by zero whenever p_k = 0.
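As a minimal sketch (illustrative, not part of the article text), the discrete entropy with this zero convention can be computed as follows; the function name and example probabilities are ours:

```python
# Minimal sketch (not from the article): discrete entropy with the convention
# that terms with p_k = 0 contribute zero.
import math

def discrete_entropy(probs, base=math.e):
    """H(X) = -sum_k p_k log p_k, skipping terms where p_k == 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(discrete_entropy([0.5, 0.5, 0.0]))          # ln 2 ~= 0.693 nats
print(discrete_entropy([0.5, 0.5, 0.0], base=2))  # exactly 1 bit
```

The `base` argument also illustrates the change-of-base remark below: switching from the natural logarithm to base 2 merely rescales the result from nats to bits.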
This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing H(X) will also maximize the more general forms.
The base of the logarithm is not important, as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, resulting in a unit of nats for the entropy.
However, the chosen measure is crucial, even though the typical use of the Lebesgue measure is often defended as a "natural" choice: which measure is chosen determines the entropy and the consequent maximum entropy distribution.
Distributions with measured constants
Many statistical distributions of applicable interest are those for which the
moments or other measurable quantities are constrained to be constants. The following theorem by
Ludwig Boltzmann
gives the form of the probability density under these constraints.
Continuous case
Suppose S is a closed subset of the real numbers \mathbb{R} and we choose to specify n measurable functions f_1, \ldots, f_n and n numbers a_1, \ldots, a_n. We consider the class C of all real-valued random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n moment conditions:

\operatorname{E}[f_j(X)] \geq a_j \quad \text{for } j = 1, \ldots, n.
If there is a member in C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the following form:

p(x) = \exp\left(\sum_{j=0}^{n} \lambda_j f_j(x)\right) \quad \text{for all } x \in S,

where we assume that f_0(x) = 1. The constant \lambda_0 and the n Lagrange multipliers \boldsymbol\lambda = (\lambda_1, \ldots, \lambda_n) solve the constrained optimization problem with a_0 = 1 (which ensures that p integrates to unity):

\max_{\lambda_0, \boldsymbol\lambda} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \int_S \exp\left(\sum_{j=0}^{n} \lambda_j f_j(x)\right) dx \right\} \quad \text{subject to } \boldsymbol\lambda \succeq \mathbf{0}.

Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution because the objective function in the optimization is concave in \boldsymbol\lambda.

Note that when the moment constraints are equalities (instead of inequalities), that is,

\operatorname{E}[f_j(X)] = a_j \quad \text{for } j = 1, \ldots, n,

then the constraint condition \boldsymbol\lambda \succeq \mathbf{0} can be dropped, which makes optimization over the Lagrange multipliers unconstrained.
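As a numerical sketch (not from the article), the equality-constrained case can be attacked by maximizing the concave objective above directly. The truncated support, tolerances, and variable names below are illustrative assumptions; with constraints on E[X] and E[X²] the recovered density should approximate a normal density:

```python
# Sketch (illustrative, not from the article): numerically recover the maximum
# entropy density for equality constraints E[X] = m and E[X^2] = m^2 + s^2 by
# maximizing sum_j lambda_j a_j - integral_S exp(sum_j lambda_j f_j(x)) dx
# over (lambda_0, lambda_1, lambda_2).  The support is truncated to [-20, 20]
# for the numerics; the result should be close to the N(m, s^2) density.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import norm

m, s = 1.0, 2.0
a = np.array([1.0, m, m**2 + s**2])          # a_0 = 1 enforces normalization
f = [lambda x: 1.0, lambda x: x, lambda x: x**2]
L = 20.0                                     # truncated support [-L, L]

def neg_dual(lam):
    integrand = lambda x: np.exp(sum(l * fj(x) for l, fj in zip(lam, f)))
    Z = quad(integrand, -L, L, limit=200)[0]
    return -(np.dot(lam, a) - Z)

res = minimize(neg_dual, x0=np.array([0.0, 0.0, -0.5]), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 5000})
lam = res.x

# Compare the recovered density with the normal density at a few points.
p = lambda x: np.exp(sum(l * fj(x) for l, fj in zip(lam, f)))
for x in [-2.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  maxent p(x)={p(x):.4f}  normal pdf={norm(m, s).pdf(x):.4f}")
```

The two columns printed at the end should approximately agree, reflecting the fact that the normal distribution is the maximum entropy distribution for fixed first and second moments.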
Discrete case
Suppose S = \{x_1, x_2, \ldots\} is a (finite or infinite) discrete subset of the reals, and that we choose to specify n functions f_1, \ldots, f_n and n numbers a_1, \ldots, a_n. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n moment conditions

\operatorname{E}[f_j(X)] \geq a_j \quad \text{for } j = 1, \ldots, n.

If there exists a member of class C which assigns positive probability to all members of S, and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

\operatorname{Pr}(X = x_k) = \exp\left(\sum_{j=0}^{n} \lambda_j f_j(x_k)\right) \quad \text{for } k = 1, 2, \ldots,

where we assume that f_0(x) = 1 and the constants \lambda_0, \boldsymbol\lambda = (\lambda_1, \ldots, \lambda_n) solve the constrained optimization problem with a_0 = 1:

\max_{\lambda_0, \boldsymbol\lambda} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \sum_{x_k \in S} \exp\left(\sum_{j=0}^{n} \lambda_j f_j(x_k)\right) \right\} \quad \text{subject to } \boldsymbol\lambda \succeq \mathbf{0}.

Again as above, if the moment conditions are equalities (instead of inequalities), then the constraint condition \boldsymbol\lambda \succeq \mathbf{0} is not present in the optimization.
Proof in the case of equality constraints
In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

\int_{-\infty}^{\infty} f_j(x) p(x) \, dx = a_j, \quad j = 0, 1, \ldots, n,

with f_0(x) = 1 and a_0 = 1. We consider the functional

J(p) = -\int_{-\infty}^{\infty} p(x) \ln p(x) \, dx + \lambda_0 \left( \int_{-\infty}^{\infty} p(x) \, dx - 1 \right) + \sum_{j=1}^{n} \lambda_j \left( \int_{-\infty}^{\infty} f_j(x) p(x) \, dx - a_j \right),

where \lambda_0 and the \lambda_j, j = 1, \ldots, n, are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability (total probability one). The other constraints require that the measurements of the functions f_j equal the given constants up to order n. The entropy attains an extremum when the functional derivative is equal to zero:

\frac{\delta J}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \sum_{j=1}^{n} \lambda_j f_j(x) = 0.

Therefore, the extremal entropy probability distribution in this case must be of the form (absorbing the constant -1 into \lambda_0)

p(x) = e^{\lambda_0 - 1} \exp\left( \sum_{j=1}^{n} \lambda_j f_j(x) \right) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right),

remembering that f_0(x) = 1. It can be verified that this is the maximal solution by checking that the variation around this solution is always negative.
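As an illustrative check (not part of the original proof), taking S = [0, ∞) with the single constraint function f_1(x) = x and a_1 = 1/λ recovers the exponential density discussed further below:

```latex
% Worked instance of the stationarity condition above, under the assumptions
% stated in the lead-in (S = [0,\infty), f_1(x) = x, a_1 = 1/\lambda).
\ln p(x) = \lambda_0 - 1 + \lambda_1 x
\;\Longrightarrow\;
p(x) = e^{\lambda_0 - 1}\, e^{\lambda_1 x},
\qquad x \ge 0 .
% Normalization requires \lambda_1 < 0 and gives e^{\lambda_0 - 1} = -\lambda_1;
% the mean constraint \int_0^\infty x\, p(x)\, dx = 1/\lambda then forces
% -1/\lambda_1 = 1/\lambda, so
p(x) = \lambda\, e^{-\lambda x}.
```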
Uniqueness of the maximum
Suppose p and p' are distributions satisfying the expectation-constraints. Letting \alpha \in (0, 1) and considering the distribution q = \alpha p + (1 - \alpha) p', it is clear that this distribution satisfies the expectation-constraints and furthermore has as support \operatorname{supp}(q) = \operatorname{supp}(p) \cup \operatorname{supp}(p'). From basic facts about entropy, it holds that H(q) \geq \alpha H(p) + (1 - \alpha) H(p'). Taking the limits \alpha \to 1 and \alpha \to 0, respectively, yields H(q) \geq H(p) and H(q) \geq H(p').

It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support — ''i.e.'' the distribution is almost everywhere strictly positive. In particular, the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, that is, it must be a local extreme. Thus it suffices to show that the local extreme is unique in order to show that the entropy-maximising distribution is unique (this also shows that the local extreme is the global maximum).
Suppose p and p' are local extremes. Reformulating the above computations, these are characterised by parameters \vec\lambda, \vec\lambda' \in \mathbb{R}^n via

p(x) = \frac{e^{\langle \vec\lambda, \vec f(x)\rangle}}{C(\vec\lambda)},

and similarly for p', where C(\vec\lambda) = \int_S e^{\langle \vec\lambda, \vec f(x)\rangle} \, dx. We now note a series of identities: via the satisfaction of the expectation-constraints and utilising gradients / directional derivatives, one has

D \log C(\cdot)\big|_{\vec\lambda} = \vec a = \operatorname{E}_p[\vec f(X)],

and similarly for \vec\lambda'. Letting u = \vec\lambda' - \vec\lambda, one obtains:

0 = \langle u, \vec a - \vec a \rangle = D \log C(\cdot)\big|_{\vec\lambda'}[u] - D \log C(\cdot)\big|_{\vec\lambda}[u] = D^2 \log C(\cdot)\big|_{\gamma}[u, u],

where \gamma = \theta \vec\lambda + (1 - \theta)\vec\lambda' for some \theta \in (0, 1). Computing further, one has

0 = D^2 \log C(\cdot)\big|_{\gamma}[u, u] = \operatorname{Var}_q\big[\langle u, \vec f(X)\rangle\big],

where q is similar to the distribution above, only parameterised by \gamma. ''Assuming'' that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant (which ''e.g.'' holds if the observables are independent and not a.e. constant), it holds that \langle u, \vec f(X)\rangle has non-zero variance unless u = 0. By the above equation it is thus clear that the latter must be the case. Hence \vec\lambda' - \vec\lambda = u = 0, so the parameters characterising the local extrema p, p' are identical, which means that the distributions themselves are identical. Thus, the local extreme is unique and, by the above discussion, the maximum is unique – provided a local extreme actually exists.
Caveats
Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy.
[For example, the class of all continuous distributions ''X'' on R with E(''X'') = 0 and E(''X''²) = E(''X''³) = 1 (see Cover, Ch 12).] It is also possible that the expected value restrictions for the class ''C'' force the probability distribution to be zero in certain subsets of ''S''. In that case our theorem doesn't apply, but one can work around this by shrinking the set ''S''.
Examples
Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as p(x) = \exp(\ln p(x)) and compare to the expression of the theorem above. By choosing \ln p(x) \to f(x) to be the measurable function and

\int \exp(f(x)) f(x) \, dx = -H

to be the constant, p(x) is the maximum entropy probability distribution under the constraint

\int p(x) f(x) \, dx = -H.

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure \ln p(x) \to f(x) and finding that f(x) can be separated into parts.
A table of examples of maximum entropy distributions is given in Lisman (1972)
and Park & Bera (2009).
Uniform and piecewise uniform distributions
The uniform distribution on the interval [''a'', ''b''] is the maximum entropy distribution among all continuous distributions which are supported in the interval [''a'', ''b''], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision ''a'' = ''a''_0 < ''a''_1 < ... < ''a''_''k'' = ''b'' of the interval [''a'', ''b''] and probabilities ''p''_1, ..., ''p''_''k'' that add up to one, then we can consider the class of all continuous distributions such that

\operatorname{Pr}(a_{j-1} \le X < a_j) = p_j \quad \text{for } j = 1, \ldots, k.

The density of the maximum entropy distribution for this class is constant on each of the intervals [''a''_{''j''−1}, ''a''_''j''). The uniform distribution on the finite set \{x_1, \ldots, x_n\} (which assigns a probability of 1/''n'' to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
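A small computational sketch (ours, not from the article): given an assumed subdivision and interval probabilities, the maximum entropy density is constant on each interval, and its differential entropy follows directly:

```python
# Sketch (illustrative, not from the article): build the piecewise-uniform
# maximum entropy density for a subdivision a_0 < ... < a_k with interval
# probabilities p_1, ..., p_k, and compute its differential entropy.
import numpy as np

def piecewise_uniform_density(edges, probs):
    """Return a density function that is constant on each [a_{j-1}, a_j)."""
    edges = np.asarray(edges, dtype=float)
    probs = np.asarray(probs, dtype=float)
    widths = np.diff(edges)
    heights = probs / widths          # constant density on each interval
    def pdf(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(widths) - 1)
        inside = (x >= edges[0]) & (x <= edges[-1])
        return np.where(inside, heights[idx], 0.0)
    return pdf, heights, widths

edges = [0.0, 1.0, 3.0, 4.0]          # subdivision a_0 < a_1 < a_2 < a_3
probs = [0.2, 0.5, 0.3]               # interval probabilities summing to one
pdf, heights, widths = piecewise_uniform_density(edges, probs)

# Differential entropy of a piecewise-constant density:
#   H = -sum_j p_j * log(p_j / width_j)
H = -np.sum(probs * np.log(heights))
print("pdf(2.0) =", pdf(2.0))
print("interval densities:", heights)
print("differential entropy (nats):", H)
```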
Positive and specified mean: the exponential distribution
The exponential distribution, for which the density function is

p(x \mid \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0 \text{ (and 0 otherwise)},

is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a specified mean of 1/λ.
In the case of distributions supported on [0,∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable.
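An illustrative numerical check (not from the article; the comparison distributions are our choices): a few nonnegative distributions matched to the same mean all have smaller differential entropy than the exponential distribution.

```python
# Illustrative check (not from the article): among common distributions on
# [0, inf) matched to the same mean 1/lam, the exponential distribution
# should show the largest differential entropy.
import math
from scipy import stats

lam = 2.0
mean = 1.0 / lam

candidates = {
    "exponential":  stats.expon(scale=mean),
    # gamma with shape k and scale theta has mean k*theta; match the mean
    "gamma(k=2)":   stats.gamma(a=2.0, scale=mean / 2.0),
    "gamma(k=0.5)": stats.gamma(a=0.5, scale=mean / 0.5),
    # half-normal has mean sigma*sqrt(2/pi); choose sigma to match the mean
    "half-normal":  stats.halfnorm(scale=mean / math.sqrt(2.0 / math.pi)),
}

print("analytic maximum: 1 - ln(lam) =", 1.0 - math.log(lam))
for name, dist in candidates.items():
    print(f"{name:12s} mean={dist.mean():.4f}  entropy={dist.entropy():.4f} nats")
```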
Specified mean and variance: the normal distribution
The normal distribution ''N''(''μ'', ''σ''²), for which the density function is

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),

has maximum entropy among all real-valued distributions supported on (−∞,∞) with a specified variance ''σ''² (a particular moment). The same is true when the mean ''μ'' and the variance ''σ''² are specified (the first two moments), since entropy is translation invariant on (−∞,∞). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)
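As an illustrative numerical check (ours, not from the article), several real-valued distributions matched to the same variance all stay below the analytic normal entropy ½ ln(2πeσ²):

```python
# Sketch (illustrative, not from the article): compare the differential entropy
# of several real-valued distributions that share the same variance sigma^2.
# The normal distribution attains the analytic maximum 0.5*ln(2*pi*e*sigma^2).
import math
from scipy import stats

sigma = 1.5
var = sigma ** 2

candidates = {
    "normal":      stats.norm(scale=sigma),
    # Laplace with scale b has variance 2*b^2; match the variance
    "laplace":     stats.laplace(scale=math.sqrt(var / 2.0)),
    # Uniform of width w has variance w^2/12; match the variance
    "uniform":     stats.uniform(loc=-math.sqrt(3.0 * var), scale=2.0 * math.sqrt(3.0 * var)),
    # Student's t(5) has variance 5/3 at unit scale; rescale to match
    "student t(5)": stats.t(df=5, scale=math.sqrt(var * 3.0 / 5.0)),
}

print("analytic maximum:", 0.5 * math.log(2.0 * math.pi * math.e * var))
for name, dist in candidates.items():
    print(f"{name:12s} var={dist.var():.4f}  entropy={dist.entropy():.4f} nats")
```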
Discrete distributions with specified mean
Among all the discrete distributions supported on the set \{x_1, \ldots, x_n\} with a specified mean μ, the maximum entropy distribution has the following shape:

\operatorname{Pr}(X = x_k) = C r^{x_k} \quad \text{for } k = 1, \ldots, n,

where the positive constants ''C'' and ''r'' can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
For example, suppose a large number ''N'' of dice are thrown, and you are told that the sum of all the shown numbers is ''S''. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with \{x_1, \ldots, x_6\} = \{1, \ldots, 6\} and ''μ'' = ''S''/''N''.
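A numerical sketch (not from the article; the function name and the example mean are ours) that determines ''C'' and ''r'' for the dice case by a one-dimensional root find:

```python
# Sketch (illustrative, not from the article): solve for the constants C and r
# of the maximum entropy distribution Pr(X = k) = C * r**k on {1, ..., 6}
# with a specified mean mu = S / N.
from scipy.optimize import brentq

def maxent_die_probs(mu, faces=range(1, 7)):
    faces = list(faces)

    def mean_given_r(r):
        weights = [r ** k for k in faces]
        total = sum(weights)
        return sum(k * w for k, w in zip(faces, weights)) / total

    if abs(mu - sum(faces) / len(faces)) < 1e-12:
        r = 1.0                       # mean 3.5 gives the uniform case, r = 1
    else:
        # mean_given_r is increasing in r; bracket the root and solve for r
        r = brentq(lambda r: mean_given_r(r) - mu, 1e-9, 1e9)
    weights = [r ** k for k in faces]
    C = 1.0 / sum(weights)
    return C, r, [C * w for w in weights]

# Example: observed average S / N = 4.5 (skewed toward high faces)
C, r, probs = maxent_die_probs(4.5)
print("r =", r)
print("probabilities:", [round(p, 4) for p in probs])
print("check mean:", sum(k * p for k, p in zip(range(1, 7), probs)))
```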
Finally, among all the discrete distributions supported on the infinite set \{x_1, x_2, \ldots\} with mean ''μ'', the maximum entropy distribution has the shape:

\operatorname{Pr}(X = x_k) = C r^{x_k} \quad \text{for } k = 1, 2, \ldots,

where again the constants ''C'' and ''r'' are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that ''x''_''k'' = ''k'', this gives

C = \frac{1}{\mu - 1}, \qquad r = \frac{\mu - 1}{\mu},

such that the respective maximum entropy distribution is the geometric distribution.
Circular random variables
For a continuous random variable \theta_i distributed about the unit circle, the von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified or, equivalently, the circular mean and circular variance are specified.
When the mean and variance of the angles \theta_i modulo 2\pi are specified, the wrapped normal distribution maximizes the entropy.
Maximizer for specified mean, variance and skew
There exists an upper bound on the entropy of continuous random variables on \mathbb{R} with a specified mean, variance, and skew. However, there is ''no distribution which achieves this upper bound'', because

p(x) = c \exp(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3)

is unbounded when \lambda_3 \neq 0 (see Cover & Thomas (2006: chapter 12)).
However, the maximum entropy is ε-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many ''σ'' larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower-order moments.
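A numerical sketch of this perturbation argument (ours, not from the article; the mixture weight, bump location, and bump width are illustrative assumptions):

```python
# Sketch (illustrative, not from the article): perturb a standard normal by
# mixing in a tiny, narrow bump far above the mean and watch the skewness move
# much more than the mean, the variance, or the differential entropy.
import numpy as np
from scipy import stats
from scipy.integrate import quad

eps, x0, s = 1e-4, 12.0, 0.5           # small weight, far location, bump width
base = stats.norm(0.0, 1.0)
bump = stats.norm(x0, s)

def pdf(x):
    return (1.0 - eps) * base.pdf(x) + eps * bump.pdf(x)

# Raw moments of the mixture combine the (known) normal moments of each part.
def raw_moments(dist):
    m, v = dist.mean(), dist.var()
    return m, v + m**2, m**3 + 3*m*v    # E[X], E[X^2], E[X^3] for a normal

b1, b2, b3 = raw_moments(base)
u1, u2, u3 = raw_moments(bump)
m1 = (1 - eps)*b1 + eps*u1
m2 = (1 - eps)*b2 + eps*u2
m3 = (1 - eps)*b3 + eps*u3
var = m2 - m1**2
skew = (m3 - 3*m1*var - m1**3) / var**1.5

# Differential entropy by quadrature, with a hint at the bump location.
entropy = quad(lambda x: -pdf(x) * np.log(pdf(x)), -30, 30,
               points=[x0 - 2*s, x0, x0 + 2*s], limit=200)[0]

print(f"mean     {m1:.5f}  (unperturbed: 0)")
print(f"variance {var:.5f}  (unperturbed: 1)")
print(f"skewness {skew:.5f}  (unperturbed: 0)")
print(f"entropy  {entropy:.5f}  (unperturbed: {base.entropy():.5f})")
```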
This is a special case of the general case in which the exponential of any odd-order polynomial in ''x'' will be unbounded on \mathbb{R}. For example,

c \, e^{\lambda x}

will likewise be unbounded on \mathbb{R}, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if ''x'' lies in the interval [0, ∞) and ''λ'' < 0, the exponential distribution will result).
Maximizer for specified mean and deviation risk measure
Every distribution with log-concave density is a maximal entropy distribution with specified mean and deviation risk measure ''D''.
In particular, the maximal entropy distribution with specified mean \operatorname{E}(X) = \mu and deviation D(X) = d is:
* The normal distribution N(\mu, d^2), if ''D'' is the standard deviation;
* The Laplace distribution, if ''D'' is the average absolute deviation;
* The distribution with density of the form f(x) = c \exp\left(ax + b\,((x - \mu)_-)^2\right), if ''D'' is the standard lower semi-deviation, where ''a'', ''b'', ''c'' are constants and the function (\,\cdot\,)_- returns only the negative values of its argument, otherwise zero.
Other examples
In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, together with the constraint that ''x'' be included in the support of the probability density, which is listed in the fourth column.
Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.
For reference, \Gamma(x) is the gamma function, \psi(x) is the digamma function, B(p, q) is the beta function, and \gamma_{\mathrm{E}} is the Euler–Mascheroni constant.
The maximum entropy principle can be used to upper bound the entropy of statistical mixtures.
See also
* Exponential family
* Gibbs measure
* Partition function (mathematics)
* Maximal entropy random walk - maximizing entropy rate for a graph
Notes
Citations
References
* Cover, T. M.; Thomas, J. A. (2006), ''Elements of Information Theory'' (2nd ed.), Wiley
* F. Nielsen, R. Nock (2017), "MaxEnt upper bounds for the differential entropy of univariate continuous distributions", ''IEEE Signal Processing Letters'', 24(4), 402–406
* I. J. Taneja (2001), ''Generalized Information Measures and Their Applications''
* Nader Ebrahimi, Ehsan S. Soofi, Refik Soyer (2008), "Multivariate maximum entropy identification, transformation, and dependence", ''Journal of Multivariate Analysis'' 99: 1217–1231