Shannon–McMillan–Breiman Theorem

In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of the typical set used in theories of data compression. Roughly speaking, the theorem states that although there are many sequences of results that may be produced by a random process, the one actually produced is most probably from a loosely defined set of outcomes that all have approximately the same chance of being the one actually realized. (This is a consequence of the law of large numbers and ergodic theory.) Although there are individual outcomes which have a higher probability than any outcome in this set, the vast number of outcomes in the set almost guarantees that the outcome will come from the set. One way of intuitively understanding the property is through Cramér's large deviation theorem, which states that the probability of a large deviation from the mean decays exponentially with the number of samples. Such results are studied in large deviations theory; intuitively, it is the large deviations that would violate equipartition, but these are unlikely. In the field of pseudorandom number generation, a candidate generator of undetermined quality whose output sequence lies too far outside the typical set by some statistical criteria is rejected as insufficiently random. Thus, although the typical set is loosely defined, practical notions arise concerning ''sufficient'' typicality.
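As a quick numerical illustration (not part of the original article; the Bernoulli source, its bias and the run lengths are arbitrary choices), the following Python sketch simulates an i.i.d. Bernoulli(0.3) bit source and checks that the per-symbol information of each realized sequence clusters near the entropy H ≈ 0.881 bits:

```python
import math
import random

def aep_demo(p=0.3, n=10_000, trials=5, seed=0):
    """Empirical check that -1/n * log2 p(X_1, ..., X_n) concentrates
    near the per-symbol entropy H(X) for an i.i.d. Bernoulli(p) source."""
    rng = random.Random(seed)
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # entropy of one symbol, in bits
    rates = []
    for _ in range(trials):
        log_p = 0.0
        for _ in range(n):
            x = rng.random() < p                  # symbol is 1 with probability p
            log_p += math.log2(p if x else 1 - p)
        rates.append(-log_p / n)                  # empirical per-symbol information
    return h, rates

h, rates = aep_demo()
```

Every trial's empirical rate lands within a few hundredths of a bit of H, even though the individual sequences realized are different and individually astronomically unlikely.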


Definition

Given a discrete-time stationary ergodic stochastic process X on the probability space (\Omega, B, p), the asymptotic equipartition property is the assertion that

-\frac{1}{n} \log p(X_1, X_2, \dots, X_n) \to H(X) \quad \text{as} \quad n\to\infty

where H(X), or simply H, denotes the entropy rate of X, which must exist for all discrete-time stationary processes, including the ergodic ones. The asymptotic equipartition property is proved for finite-valued (i.e. |\Omega| < \infty) stationary ergodic stochastic processes in the Shannon–McMillan–Breiman theorem using ergodic theory, and for any i.i.d. source directly using the law of large numbers in both the discrete-valued case (where H is simply the entropy of a symbol) and the continuous-valued case (where H is the differential entropy instead). The definition of the asymptotic equipartition property can also be extended to certain classes of continuous-time stochastic processes for which a typical set exists for long enough observation times. The convergence is proven almost sure in all cases.


Discrete-time i.i.d. sources

Given that X is an i.i.d. source which may take values in the alphabet \mathcal{X}, its time series X_1,\ldots,X_n is i.i.d. with entropy H(X). The weak law of large numbers gives the asymptotic equipartition property with convergence in probability,

\lim_{n\to\infty}\Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X)\right| > \varepsilon\right] = 0 \qquad \forall \varepsilon > 0,

since the entropy is equal to the expectation of -\frac{1}{n} \log p(X_1, X_2, \ldots , X_n). The strong law of large numbers asserts the stronger almost sure convergence,

\Pr\left[\lim_{n\to\infty} -\frac{1}{n} \log p(X_1, X_2,\ldots, X_n) = H(X)\right] = 1.
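The weak-law statement can be checked by Monte Carlo simulation. In this sketch (the source bias, ε and trial counts are arbitrary choices, not from the text), the estimated probability that the empirical per-symbol information deviates from H(X) by more than ε falls as the block length n grows:

```python
import math
import random

def outside_fraction(n, p=0.3, eps=0.05, trials=200, seed=1):
    """Estimate Pr[ | -1/n * log2 p(X^n) - H(X) | > eps ] for an
    i.i.d. Bernoulli(p) source by Monte Carlo over independent runs."""
    rng = random.Random(seed)
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    bad = 0
    for _ in range(trials):
        log_p = sum(math.log2(p if rng.random() < p else 1 - p) for _ in range(n))
        if abs(-log_p / n - h) > eps:
            bad += 1
    return bad / trials

# the deviation probability should shrink as the block length n grows
f_short, f_long = outside_fraction(100), outside_fraction(10_000)
```

For short blocks a sizable fraction of sequences falls outside the ε-band around H(X); for long blocks essentially none do, as the weak law requires.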


Discrete-time finite-valued stationary ergodic sources

Consider a finite-valued sample space \Omega, i.e. |\Omega| < \infty, for the discrete-time stationary ergodic process X := \{X_n\} defined on the probability space (\Omega, B, p). The asymptotic equipartition property for such a stochastic source is known as the Shannon–McMillan–Breiman theorem, due to Claude Shannon, Brockway McMillan, and Leo Breiman.
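The theorem can be illustrated numerically on the simplest non-i.i.d. stationary ergodic source, a Markov chain. In this sketch (the two-state chain and its transition matrix are invented for illustration), the entropy rate is computed in closed form and compared with a long sample path's per-symbol log-probability:

```python
import math
import random

def smb_demo(n=100_000, seed=2):
    """Sketch of the Shannon–McMillan–Breiman theorem on a 2-state
    stationary ergodic Markov chain: a sample path's per-symbol
    log-probability converges to the entropy rate."""
    P = [[0.9, 0.1], [0.2, 0.8]]   # transition probabilities
    pi = [2 / 3, 1 / 3]            # stationary distribution (solves pi = pi P)

    def h2(q):                     # binary entropy in bits
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    # entropy rate: stationary average of the next-step entropy in each state
    rate = pi[0] * h2(P[0][1]) + pi[1] * h2(P[1][1])

    rng = random.Random(seed)
    x = 0 if rng.random() < pi[0] else 1        # start the chain in stationarity
    log_p = math.log2(pi[x])
    for _ in range(n - 1):
        nxt = 0 if rng.random() < P[x][0] else 1
        log_p += math.log2(P[x][nxt])
        x = nxt
    return rate, -log_p / n

rate, empirical = smb_demo()
```

For this chain the entropy rate is about 0.553 bits per symbol, and the empirical value tracks it closely despite the dependence between successive symbols.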


Non-stationary discrete-time source producing independent symbols

The assumptions of stationarity/ergodicity/identical distribution of random variables are not essential for the asymptotic equipartition property to hold. Indeed, as is quite clear intuitively, the asymptotic equipartition property requires only some form of the law of large numbers to hold, which is fairly general. However, the expression needs to be suitably generalized, and the conditions need to be formulated precisely. We assume that the source is producing independent symbols, with possibly different output statistics at each instant. We assume that the statistics of the process are known completely, that is, the marginal distribution of the process seen at each time instant is known. The joint distribution is just the product of the marginals. Then, under the condition (which can be relaxed) that \mathrm{Var}[\log p(X_i)] < M for all i, for some M > 0, the following holds (AEP):

\lim_{n\to\infty}\Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - \overline{H}_n(X)\right| < \varepsilon\right] = 1 \qquad \forall \varepsilon > 0

where \overline{H}_n(X) = \frac{1}{n} H(X_1,X_2,\ldots,X_n).
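As a numerical sketch of this generalized statement (the cycling biases are an arbitrary choice, not from the text), the following compares the empirical per-symbol information of an independent but non-identically distributed bit source against the average entropy \overline{H}_n(X):

```python
import math
import random

def nonstationary_aep(n=20_000, seed=3):
    """AEP sketch for independent, non-identically distributed bits:
    symbol i is Bernoulli(p_i), with p_i cycling through fixed biases.
    Compares -1/n * log2 p(X_1..X_n) with the average entropy H_bar_n."""
    biases = [0.1, 0.3, 0.5, 0.7]

    def h2(q):                     # binary entropy in bits
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    rng = random.Random(seed)
    log_p = 0.0
    avg_h = 0.0
    for i in range(n):
        p = biases[i % len(biases)]
        x = rng.random() < p
        log_p += math.log2(p if x else 1 - p)
        avg_h += h2(p)
    return -log_p / n, avg_h / n   # empirical rate vs H_bar_n

empirical, h_bar = nonstationary_aep()
```

Because the symbols are independent, the joint probability factors into the product of the marginals, and the law of large numbers pulls the empirical rate toward \overline{H}_n(X) even though no single marginal entropy describes the source.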


Applications

The asymptotic equipartition property for the non-stationary discrete-time independent process leads us to (among other results) the source coding theorem for non-stationary sources (with independent output symbols) and the noisy-channel coding theorem for non-stationary memoryless channels.


Continuous-time stationary ergodic sources

Discrete-time functions can be interpolated to continuous-time functions. If such an interpolation ''f'' is measurable, we may define the continuous-time stationary process accordingly as \tilde{X} := f\circ X. If the asymptotic equipartition property holds for the discrete-time process, as in the i.i.d. or finite-valued stationary ergodic cases shown above, it automatically holds for the continuous-time stationary process derived from it by some measurable interpolation, i.e.

-\frac{1}{n} \log p(\tilde{X}_0^\tau) \to H(X)

where ''n'' corresponds to the degrees of freedom in time \tau. nH(X)/\tau and H(X) are the entropy per unit time and the entropy per degree of freedom respectively, as defined by Shannon. An important class of such continuous-time stationary processes is the bandlimited stationary ergodic process with the sample space being a subset of the continuous \mathcal{L}_2 functions. The asymptotic equipartition property holds if the process is white, in which case the time samples are i.i.d., or if there exists ''T'' > 1/2''W'', where ''W'' is the nominal bandwidth, such that the ''T''-spaced time samples take values in a finite set, in which case we have the discrete-time finite-valued stationary ergodic process. Any time-invariant operation also preserves the asymptotic equipartition property, stationarity and ergodicity, and we may easily turn a stationary process into a non-stationary one without losing the asymptotic equipartition property by nulling out a finite number of time samples in the process.


Category theory

A category theoretic definition for the equipartition property is given by Gromov (Misha Gromov (2012), "In a Search for a Structure, Part 1: On Entropy"; see page 5, where the equipartition property is called the 'Bernoulli approximation theorem').
Given a sequence of Cartesian powers P^N = P\times \cdots \times P of a measure space ''P'', this sequence admits an ''asymptotically equivalent'' sequence H_N of homogeneous measure spaces (''i.e.'' all sets have the same measure; all morphisms are invariant under the group of automorphisms, and thus factor as a morphism to the terminal object).

The above requires a definition of ''asymptotic equivalence''. This is given in terms of a distance function, giving how much an injective correspondence differs from an isomorphism. An injective correspondence \pi: P\to Q is a partially defined map that is a bijection; that is, it is a bijection between a subset P'\subset P and Q'\subset Q. Then define

|P-Q|_\pi = |P \setminus P'| + |Q \setminus Q'|

where |S| denotes the measure of a set ''S''. In what follows, the measures of ''P'' and ''Q'' are taken to be 1, so that the measure spaces are probability spaces. This distance |P-Q|_\pi is commonly known as the earth mover's distance or Wasserstein metric. Similarly, define

|\log P:Q|_\pi = \frac{\sup_{p\in P'} \left|\log p - \log \pi(p)\right|}{\log |\mathrm{set}(P)|}

with |\mathrm{set}(P)| taken to be the counting measure on ''P''. Thus, this definition requires that ''P'' be a finite measure space. Finally, let

\mathrm{dist}_\pi(P,Q) = |P-Q|_\pi + |\log P:Q|_\pi.

A sequence of injective correspondences \pi_N: P_N\to Q_N is then asymptotically equivalent when

\mathrm{dist}_{\pi_N}(P_N,Q_N) \to 0 \quad\text{as}\quad N\to\infty.

Given a homogeneous space sequence H_N that is asymptotically equivalent to P^N, the entropy H(P) of ''P'' may be taken as

H(P) = \lim_{N\to\infty}\frac{1}{N} \log |\mathrm{set}(H_N)|.
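The set-difference part of this distance is straightforward to compute for finite probability spaces. The following sketch (the helper name and the example spaces are hypothetical, purely for illustration) evaluates |P-Q|_\pi for a partial bijection given as a dictionary:

```python
def measure_distance(P, Q, pi):
    r"""|P - Q|_pi = |P \ P'| + |Q \ Q'| for an injective correspondence pi,
    given as a dict mapping points of P' to points of Q'.  P and Q map
    each point to its measure; |S| is the total measure of the set S."""
    matched_p = set(pi)               # P' = domain of pi
    matched_q = set(pi.values())      # Q' = image of pi
    return (sum(m for x, m in P.items() if x not in matched_p)
            + sum(m for y, m in Q.items() if y not in matched_q))

# two 3-point probability spaces matched on two of their points
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"u": 0.5, "v": 0.3, "w": 0.2}
pi = {"a": "u", "b": "v"}             # leaves c and w uncovered
d = measure_distance(P, Q, pi)        # 0.2 + 0.2 = 0.4
```

The distance vanishes exactly when the correspondence is total and surjective, i.e. a genuine bijection of the underlying point sets.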


See also

* Cramér's theorem (large deviations)
* Shannon's source coding theorem
* Noisy-channel coding theorem


Notes


References


Journal articles

* Claude E. Shannon. "A Mathematical Theory of Communication". ''Bell System Technical Journal'', July/October 1948.
* Sergio Verdu and Te Sun Han. "The Role of the Asymptotic Equipartition Property in Noiseless Source Coding". ''IEEE Transactions on Information Theory'', 43(3): 847–857, 1997.


Textbooks

* David J. C. MacKay (2003). ''Information Theory, Inference, and Learning Algorithms''. Cambridge University Press. ISBN 0-521-64298-1. http://www.inference.phy.cam.ac.uk/mackay/itila/book.html