In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of the typical set used in theories of data compression. Roughly speaking, the theorem states that although there are many series of results that may be produced by a random process, the one actually produced is most probably from a loosely defined set of outcomes that all have approximately the same chance of being the one actually realized. (This is a consequence of the law of large numbers and ergodic theory.) Although there are individual outcomes which have a higher probability than any outcome in this set, the vast number of outcomes in the set almost guarantees that the outcome will come from the set. One way of intuitively understanding the property is through Cramér's large deviation theorem, which states that the probability of a large deviation from the mean decays exponentially with the number of samples. Such results are studied in large deviations theory; intuitively, it is the large deviations that would violate equipartition, but these are unlikely. In the field of pseudorandom number generation, a candidate generator of undetermined quality whose output sequence lies too far outside the typical set by some statistical criteria is rejected as insufficiently random. Thus, although the typical set is loosely defined, practical notions arise concerning ''sufficient'' typicality.


Definition

Given a discrete-time stationary ergodic stochastic process X on the probability space (\Omega, B, p), the asymptotic equipartition property is an assertion that, almost surely,

-\frac{1}{n} \log p(X_1, X_2, \dots, X_n) \to H(X) \quad \text{as} \quad n \to \infty,

where H(X) or simply H denotes the entropy rate of X, which must exist for all discrete-time stationary processes, including the ergodic ones. The asymptotic equipartition property is proved for finite-valued (i.e. |\Omega| < \infty) stationary ergodic stochastic processes in the Shannon–McMillan–Breiman theorem using ergodic theory, and for any i.i.d. source directly using the law of large numbers, in both the discrete-valued case (where H is simply the entropy of a symbol) and the continuous-valued case (where H is the differential entropy instead). The definition of the asymptotic equipartition property can also be extended to certain classes of continuous-time stochastic processes for which a typical set exists for long enough observation time. The convergence is proven almost sure in all cases.
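
For example, for an i.i.d. Bernoulli source with \Pr(X_i = 1) = 1/4, the entropy rate is H = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811 bits per symbol. A length-n sequence containing k ones has probability (1/4)^k (3/4)^{n-k}, so -\frac{1}{n}\log_2 p(X_1, \ldots, X_n) = \frac{k}{n}\log_2 4 + \frac{n-k}{n}\log_2\frac{4}{3}; since k/n \to 1/4 almost surely, this tends to \frac{1}{4}\log_2 4 + \frac{3}{4}\log_2\frac{4}{3} = H. In other words, almost every long output sequence has probability close to 2^{-nH}.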


Discrete-time i.i.d. sources

Given X is an i.i.d. source which may take values in the alphabet \mathcal{X}, its time series X_1, \ldots, X_n is i.i.d. with entropy H(X). The weak law of large numbers gives the asymptotic equipartition property with convergence in probability,

\lim_{n\to\infty} \Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X)\right| > \varepsilon\right] = 0 \qquad \forall \varepsilon > 0,

since the entropy is equal to the expectation of -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n). The strong law of large numbers asserts the stronger almost sure convergence,

\Pr\left[\lim_{n\to\infty} -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = H(X)\right] = 1.

Convergence in the sense of L^1 asserts an even stronger statement:

\lim_{n\to\infty} \mathbb{E}\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X)\right|\right] = 0.
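
The convergence is easy to check numerically. The following is a minimal Python sketch (the four-letter alphabet and its distribution are illustrative assumptions, not from the source): because -\frac{1}{n}\log_2 p(X_1,\ldots,X_n) is a sample mean of the i.i.d. terms -\log_2 p(X_i), the law of large numbers drives it to H(X) as n grows.

```python
import numpy as np

# A minimal sketch (illustration, not from the original article): the AEP
# for an i.i.d. source.  Since
#   -(1/n) log2 p(X_1, ..., X_n) = -(1/n) * sum_i log2 p(X_i)
# is a sample mean of i.i.d. terms, the law of large numbers drives it to
# the entropy H(X) = E[-log2 p(X)].

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])  # source distribution (assumed here)
H = -np.sum(p * np.log2(p))              # entropy in bits; exactly 1.75 here

for n in (100, 10_000, 1_000_000):
    x = rng.choice(len(p), size=n, p=p)  # an i.i.d. sample path of length n
    empirical = -np.log2(p[x]).mean()    # -(1/n) log2 p(X_1, ..., X_n)
    print(n, empirical)                  # converges toward H = 1.75
```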


Discrete-time finite-valued stationary ergodic sources

Consider a finite-valued sample space \Omega, i.e. |\Omega| < \infty, for the discrete-time stationary ergodic process X := \{X_n\} defined on the probability space (\Omega, B, p). The Shannon–McMillan–Breiman theorem, due to Claude Shannon, Brockway McMillan, and Leo Breiman, states that we have convergence in the sense of L^1. Chung Kai-lai generalized this to the case where X may take values in a countably infinite set, provided that the entropy rate is still finite.
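
Since the theorem covers stationary ergodic sources beyond the i.i.d. case, a hedged Python sketch with a two-state Markov chain (the transition matrix and seed are illustrative assumptions) shows -\frac{1}{n} \log p(X_1, \ldots, X_n) approaching the entropy rate along a sample path.

```python
import numpy as np

# A minimal sketch (illustrative, not from the article): the
# Shannon-McMillan-Breiman theorem for a stationary ergodic Markov chain.
# Its entropy rate is H = -sum_i pi_i sum_j T_ij log T_ij, and along a
# typical sample path -(1/n) log p(X_1, ..., X_n) converges to H.

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])      # two-state transition matrix (rows sum to 1)
pi = np.array([0.8, 0.2])       # stationary distribution: pi @ T == pi

H = -np.sum(pi[:, None] * T * np.log(T))   # entropy rate in nats (~0.395)

n = 200_000
x = np.empty(n, dtype=np.int64)
x[0] = rng.choice(2, p=pi)      # start the chain in stationarity
for k in range(1, n):
    x[k] = rng.choice(2, p=T[x[k - 1]])

# the joint probability factorizes over transitions for a Markov chain
log_p = np.log(pi[x[0]]) + np.log(T[x[:-1], x[1:]]).sum()
print(-log_p / n, H)            # both approximately 0.395 nats per symbol
```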


Non-stationary discrete-time source producing independent symbols

The assumptions of stationarity, ergodicity, and identical distribution of the random variables are not essential for the asymptotic equipartition property to hold. Indeed, as is quite clear intuitively, the asymptotic equipartition property requires only some form of the law of large numbers to hold, which is fairly general. However, the expression needs to be suitably generalized, and the conditions need to be formulated precisely. Consider a source that produces independent symbols, possibly with different output statistics at each instant, for which the statistics of the process are known completely, that is, the marginal distribution of the process seen at each time instant is known. The joint distribution is just the product of the marginals. Then, under the condition (which can be relaxed) that \mathrm{Var}[\log p(X_i)] < M for all ''i'', for some ''M'' > 0, the following holds (AEP):

\lim_{n\to\infty} \Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - \overline{H}_n(X)\right| < \varepsilon\right] = 1 \qquad \forall \varepsilon > 0,

where \overline{H}_n(X) = \frac{1}{n} H(X_1, X_2, \ldots, X_n).
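
A short Python sketch of this case (the time-varying bias profile is an invented illustration): drawing independent Bernoulli(q_i) symbols with a slowly oscillating bias, the per-symbol log-probability tracks the average entropy \overline{H}_n(X).

```python
import numpy as np

# A minimal sketch (illustrative, not from the article): a non-stationary
# source with independent but non-identically distributed symbols.
# Symbol i is Bernoulli(q_i) with a time-varying bias q_i, and the claim
# is that -(1/n) log p(X_1, ..., X_n) - Hbar_n -> 0, where Hbar_n is the
# average of the per-symbol entropies H(X_1), ..., H(X_n).

rng = np.random.default_rng(0)
n = 1_000_000
i = np.arange(n)
q = 0.3 + 0.2 * np.sin(2 * np.pi * i / 1000)   # bias oscillates in [0.1, 0.5],
                                               # so Var[log p(X_i)] is bounded

x = (rng.random(n) < q).astype(np.int64)       # independent Bernoulli(q_i) draws
p_xi = np.where(x == 1, q, 1 - q)              # probability of each outcome

empirical = -np.log2(p_xi).mean()              # -(1/n) log2 p(X_1, ..., X_n)
Hbar = -(q * np.log2(q) + (1 - q) * np.log2(1 - q)).mean()
print(empirical, Hbar)                         # the two agree closely
```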


Applications

The asymptotic equipartition property for the non-stationary discrete-time independent process leads us to (among other results) the source coding theorem for non-stationary sources (with independent output symbols) and the noisy-channel coding theorem for non-stationary memoryless channels.


Measure-theoretic form

T is a measure-preserving map on the probability space \Omega. If P is a finite or countable partition of \Omega, then its entropy is H(P) := -\sum_{p \in P} \mu(p) \ln \mu(p), with the convention that 0 \ln 0 = 0. We only consider partitions with finite entropy: H(P) < \infty.

If P is a finite or countable partition of \Omega, then we construct a sequence of partitions by iterating the map:

P^{(n)} := P \vee T^{-1}P \vee \dots \vee T^{-(n-1)}P,

where P \vee Q is the least upper bound partition, that is, the least refined partition that refines both P and Q:

P \vee Q := \{p \cap q : p \in P, q \in Q\}.

Write P(x) for the set in P that x falls in. So, for example, P^{(n)}(x) is the n-letter initial segment of the (P, T)-name of x. Write I_P(x) for the information (in units of ''nats'') about x we can recover if we know which element of the partition P that x falls in:

I_P(x) := -\ln \mu(P(x)).

Similarly, the conditional information of partition P, conditional on partition Q, about x, is

I_{P \mid Q}(x) := -\ln \frac{\mu(P(x) \cap Q(x))}{\mu(Q(x))}.

The quantity h_T(P) is the Kolmogorov–Sinai entropy of T relative to P:

h_T(P) := \lim_n \frac{1}{n} H(P^{(n)}) = \lim_n \mathbb{E}_{x \sim \mu}\left[\frac{1}{n} I_{P^{(n)}}(x)\right].

In other words, by definition, there is convergence in expectation. The SMB theorem states that when T is ergodic, there is convergence in L^1. If T is not necessarily ergodic, then the underlying probability space splits up into multiple subsets, each invariant under T; these invariant sets generate the \sigma-algebra \mathcal{I}. In this case, we still have L^1 convergence to some function, but that function is no longer a constant function. When T is ergodic, \mathcal{I} is trivial, and so the function x \mapsto \mathbb{E}\left[\lim_n I_{P \mid \vee_{k=1}^{n} T^{-k}P} \,\middle|\, \mathcal{I}\right] simplifies into the constant function x \mapsto \mathbb{E}\left[\lim_n I_{P \mid \vee_{k=1}^{n} T^{-k}P}\right], which by definition equals \lim_n H(P \mid \vee_{k=1}^{n} T^{-k}P), which equals h_T(P) by a proposition.
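
As an illustration (the map and partition are standard textbook choices, not taken from this article), consider the doubling map with the two-cell partition. The cells of P^{(n)} are dyadic intervals of measure 2^{-n}, so the information rate converges to h_T(P) = \ln 2, which the following Python sketch estimates empirically.

```python
import numpy as np

# A minimal sketch (illustrative): the information function for the doubling
# map T(x) = 2x mod 1, which preserves Lebesgue measure on [0, 1).  With the
# partition P = {[0, 1/2), [1/2, 1)}, the join
# P^(n) = P v T^{-1}P v ... v T^{-(n-1)}P consists of dyadic intervals of
# length 2^{-n}, so (1/n) I_{P^(n)}(x) = -(1/n) ln mu(P^(n)(x)) should
# converge to h_T(P) = ln 2.

rng = np.random.default_rng(0)
n, num_points = 12, 500_000
x = rng.random(num_points)          # points drawn from the invariant measure

# record the (P, T)-name of each point: the cell of P visited at steps 0..n-1
codes = np.zeros(num_points, dtype=np.int64)
for _ in range(n):
    codes = 2 * codes + (x >= 0.5)
    x = (2.0 * x) % 1.0

# empirical measure of each cell of P^(n), then the average information rate
counts = np.bincount(codes, minlength=2**n)
mu = counts[codes] / num_points
print((-np.log(mu) / n).mean(), np.log(2))   # both near ln 2 ~ 0.693
```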


Continuous-time stationary ergodic sources

Discrete-time functions can be interpolated to continuous-time functions. If such an interpolation ''f'' is measurable, we may define the continuous-time stationary process accordingly as \tilde{X} := f \circ X. If the asymptotic equipartition property holds for the discrete-time process, as in the i.i.d. or finite-valued stationary ergodic cases shown above, it automatically holds for the continuous-time stationary process derived from it by some measurable interpolation, i.e.

-\frac{1}{n} \log p(\tilde{X}_0^\tau) \to H(X),

where ''n'' corresponds to the degrees of freedom in time \tau. Here nH(X)/\tau and H(X) are the entropy per unit time and per degree of freedom, respectively, as defined by Shannon. An important class of such continuous-time stationary processes is the bandlimited stationary ergodic process with the sample space being a subset of the continuous \mathcal{L}_2 functions. The asymptotic equipartition property holds if the process is white, in which case the time samples are i.i.d., or if there exists ''T'' > 1/2''W'', where ''W'' is the nominal bandwidth, such that the ''T''-spaced time samples take values in a finite set, in which case we have the discrete-time finite-valued stationary ergodic process. Any time-invariant operation also preserves the asymptotic equipartition property, stationarity and ergodicity, and we may easily turn a stationary process into a non-stationary one without losing the asymptotic equipartition property by nulling out a finite number of time samples in the process.
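
A hedged Python sketch of the white case (the variance and sample count are illustrative assumptions): sampling a white bandlimited Gaussian process at the Nyquist spacing gives i.i.d. Gaussian samples, and -\frac{1}{n}\ln p of the sample vector approaches the differential entropy \frac{1}{2}\ln(2\pi e \sigma^2) per degree of freedom.

```python
import numpy as np

# A minimal sketch (illustrative): a white bandlimited Gaussian process
# sampled at the Nyquist rate T = 1/(2W) yields i.i.d. N(0, sigma^2)
# samples, so the AEP holds with H equal to the differential entropy
#   h = (1/2) ln(2 pi e sigma^2)   (nats per degree of freedom).

rng = np.random.default_rng(0)
sigma = 2.0
n = 1_000_000                         # number of Nyquist-spaced samples
x = rng.normal(0.0, sigma, size=n)    # the i.i.d. time samples

log_density = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
empirical = -log_density.mean()                  # -(1/n) ln p(x_1, ..., x_n)
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # differential entropy
print(empirical, h)                              # both approximately 2.112
```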


Category theory

A category theoretic definition for the equipartition property is given by Gromov. Misha Gromov (2013). ''In a Search for a Structure, Part 1: On Entropy''. ''(See page 5, where the equipartition property is called the 'Bernoulli approximation theorem'.)'' Given a sequence of Cartesian powers P^N = P \times \cdots \times P of a measure space ''P'', this sequence admits an ''asymptotically equivalent'' sequence H_N of homogeneous measure spaces (''i.e.'' all sets have the same measure; all morphisms are invariant under the group of automorphisms, and thus factor as a morphism to the terminal object).

The above requires a definition of ''asymptotic equivalence''. This is given in terms of a distance function, giving how much an injective correspondence differs from an isomorphism. An injective correspondence \pi : P \to Q is a partially defined map that is a bijection; that is, it is a bijection between a subset P' \subset P and Q' \subset Q. Then define

|P - Q|_\pi = |P \setminus P'| + |Q \setminus Q'|,

where |''S''| denotes the measure of a set ''S''. In what follows, the measures of ''P'' and ''Q'' are taken to be 1, so that the measure spaces are probability spaces. This distance |P - Q|_\pi is commonly known as the earth mover's distance or Wasserstein metric. Similarly, define

|\log P : Q|_\pi = \frac{\sup_{p \in P'} \left| \log |p| - \log |\pi(p)| \right|}{\log |\mathrm{set}(P')|},

with |\mathrm{set}(P)| taken to be the counting measure on ''P''. Thus, this definition requires that ''P'' be a finite measure space. Finally, let

\mathrm{dist}_\pi(P, Q) = |P - Q|_\pi + |\log P : Q|_\pi.

A sequence of injective correspondences \pi_N : P_N \to Q_N is then asymptotically equivalent when

\mathrm{dist}_{\pi_N}(P_N, Q_N) \to 0 \quad \text{as} \quad N \to \infty.

Given a homogeneous space sequence H_N that is asymptotically equivalent to P^N, the entropy H(P) of ''P'' may be taken as

H(P) = \lim_{N \to \infty} \frac{1}{N} \log |\mathrm{set}(H_N)|.
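
A toy computation of these distances in Python (the spaces P, Q and the correspondence \pi are invented for illustration; the formulas implemented are exactly the ones defined above):

```python
import math

# A toy computation (illustrative, not from the article) of the distance
# between two finite probability spaces under an injective correspondence
# pi, using the definitions above.  P and Q map atoms to their measures;
# pi is a partial bijection given as a dict from atoms of P to atoms of Q.

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"x": 0.5, "y": 0.5}
pi = {"a": "x", "b": "y"}            # defined on P' = {a, b}, onto Q' = {x, y}

# |P - Q|_pi = |P \ P'| + |Q \ Q'|  (measure of the unmatched parts)
set_dist = (sum(m for p, m in P.items() if p not in pi)
            + sum(m for q, m in Q.items() if q not in pi.values()))

# |log P : Q|_pi: worst log-measure discrepancy over P', normalized by the
# log of the counting measure (number of atoms) of P'
log_dist = max(abs(math.log(P[p]) - math.log(Q[pi[p]])) for p in pi)
log_dist /= math.log(len(pi))

print(set_dist + log_dist)           # dist_pi(P, Q) ~ 0.2 + 0.74 = 0.94
```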


See also

* Cramér's theorem (large deviations)
* Noisy-channel coding theorem
* Shannon's source coding theorem


References


Journal articles

* Claude E. Shannon. "A Mathematical Theory of Communication". ''Bell System Technical Journal'', July/October 1948.
* Sergio Verdú and Te Sun Han. "The Role of the Asymptotic Equipartition Property in Noiseless Source Coding". ''IEEE Transactions on Information Theory'', 43(3): 847–857, 1997.


Textbooks

* {{cite book |last=MacKay |first=David J.C. |author-link=David J. C. MacKay |url=http://www.inference.phy.cam.ac.uk/mackay/itila/book.html |title=Information Theory, Inference, and Learning Algorithms |publisher=Cambridge University Press |year=2003 |isbn=0-521-64298-1}}