This article discusses how information theory (a branch of mathematics studying the transmission, processing, and storage of information) is related to measure theory (a branch of mathematics concerned with integration and probability).


Measures in information theory

Many of the concepts in information theory have separate definitions and formulas for continuous and discrete cases. For example, entropy \Eta(X) is usually defined for discrete random variables, whereas for continuous random variables the related concept of differential entropy, written h(X), is used (see Cover and Thomas, 2006, chapter 8). Both these concepts are mathematical expectations, but the expectation is defined with an integral for the continuous case and a sum for the discrete case.

These separate definitions can be more closely related in terms of measure theory. For discrete random variables, probability mass functions can be considered density functions with respect to the counting measure. Thinking of both the integral and the sum as integration on a measure space allows for a unified treatment.

Consider the formula for the differential entropy of a continuous random variable X with range \mathbb{X} and probability density function f(x):

: h(X) = -\int_{\mathbb{X}} f(x) \log f(x) \,dx.

This can usually be interpreted as the following Riemann–Stieltjes integral:

: h(X) = -\int_{\mathbb{X}} f(x) \log f(x) \,d\mu(x),

where \mu is the Lebesgue measure.
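For example, if ''X'' is uniformly distributed on the interval (0, a), so that f(x) = 1/a there and 0 elsewhere, the formula gives

: h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \,dx = \log a,

which is negative whenever a < 1; unlike the discrete entropy \Eta(X), the differential entropy need not be non-negative.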
If instead X is discrete, with range \Omega a finite set, f a probability mass function on \Omega, and \nu the counting measure on \Omega, we can write:

: \Eta(X) = -\sum_{x \in \Omega} f(x) \log f(x) = -\int_\Omega f(x) \log f(x) \,d\nu(x).

The integral expression, and the general concept, are identical to the continuous case; the only difference is the measure used. In both cases the probability density function f is the Radon–Nikodym derivative of the probability measure with respect to the measure against which the integral is taken.

If P is the probability measure induced by X, then the integral can also be taken directly with respect to P:

: h(X) = -\int_{\mathbb{X}} \log \frac{dP}{d\mu} \,dP.
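The unified viewpoint is easy to check numerically. The following sketch (an illustrative Python example; the routine and the distribution choices are ours, not taken from the references) evaluates the same expression -\int f \log f \,d\mu once against the counting measure and once against a grid approximation of Lebesgue measure:

<syntaxhighlight lang="python">
import numpy as np

def entropy_wrt_measure(density, points, weights):
    """Evaluate -sum_i f(points[i]) log f(points[i]) * weights[i],
    i.e. -∫ f log f dμ for a measure μ giving weight weights[i] to points[i].

    For the counting measure every weight is 1; for (a grid approximation of)
    Lebesgue measure the weight of a grid point is the width of its grid cell.
    """
    f = density(points)
    mask = f > 0                      # adopt the convention 0 log 0 = 0
    return -np.sum(f[mask] * np.log(f[mask]) * weights[mask])

# Discrete case: a fair six-sided die, counting measure on Omega = {1, ..., 6}.
omega = np.arange(1, 7)
pmf = lambda x: np.full_like(x, 1 / 6, dtype=float)
print(entropy_wrt_measure(pmf, omega, np.ones(6)))        # ≈ log 6 ≈ 1.79 nats

# Continuous case: Uniform(0, 2), Lebesgue measure approximated on a fine grid.
grid = np.linspace(0.0, 2.0, 20001)
cell_widths = np.full_like(grid, grid[1] - grid[0])
pdf = lambda x: np.where((x >= 0.0) & (x <= 2.0), 0.5, 0.0)
print(entropy_wrt_measure(pdf, grid, cell_widths))        # ≈ log 2 ≈ 0.69 nats
</syntaxhighlight>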
If instead of the underlying measure \mu we take another probability measure Q, we are led to the Kullback–Leibler divergence: let P and Q be probability measures over the same space. Then if P is absolutely continuous with respect to Q, written P \ll Q, the Radon–Nikodym derivative \frac{dP}{dQ} exists and the Kullback–Leibler divergence can be expressed in its full generality:

: D_\operatorname{KL}(P \parallel Q) = \int_{\operatorname{supp} P} \frac{dP}{dQ} \log \frac{dP}{dQ} \,dQ = \int_{\operatorname{supp} P} \log \frac{dP}{dQ} \,dP,

where the integral runs over the support of P. Note that we have dropped the negative sign: the Kullback–Leibler divergence is always non-negative due to Gibbs' inequality.
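In the discrete case the Radon–Nikodym derivative \frac{dP}{dQ} is simply the ratio of the two probability mass functions, so the general integral reduces to the familiar sum. A minimal sketch (illustrative Python; the example distributions are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = ∫ log(dP/dQ) dP, taken over the support of P.

    For discrete P and Q given as probability vectors, dP/dQ is the pointwise
    ratio p/q, and integrating against P becomes a sum weighted by p.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0                       # integrate over supp(P) only
    if np.any(q[support] == 0):
        return np.inf                     # P is not absolutely continuous w.r.t. Q
    return np.sum(p[support] * np.log(p[support] / q[support]))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))   # ≈ 0.17 nats; non-negative, as Gibbs' inequality requires
print(kl_divergence(p, p))   # 0.0; the divergence of a distribution from itself vanishes
</syntaxhighlight>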


Entropy as a "measure"

There is an analogy between Shannon's basic "measures" of the information content of random variables and a measure over sets. Namely the joint entropy, conditional entropy, and mutual information can be considered as the measure of a set union, set difference, and set intersection, respectively (Reza pp. 106–108). If we associate the existence of abstract sets \tilde X and \tilde Y to arbitrary discrete random variables ''X'' and ''Y'', somehow representing the information borne by ''X'' and ''Y'', respectively, such that:

* \mu(\tilde X \cap \tilde Y) = 0 whenever ''X'' and ''Y'' are unconditionally independent, and
* \tilde X = \tilde Y whenever ''X'' and ''Y'' are such that either one is completely determined by the other (i.e. by a bijection);

where \mu is a signed measure over these sets, and we set:

: \begin{align} \Eta(X) & = \mu(\tilde X), \\ \Eta(Y) & = \mu(\tilde Y), \\ \Eta(X,Y) & = \mu(\tilde X \cup \tilde Y), \\ \Eta(X\mid Y) & = \mu(\tilde X \setminus \tilde Y), \\ \operatorname{I}(X;Y) & = \mu(\tilde X \cap \tilde Y); \end{align}

we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal signed measure
over sets, as commonly illustrated in an information diagram. This allows the sum of two measures to be written:

: \mu(A)+\mu(B)=\mu(A\cup B)+\mu(A\cap B)

and the analog of Bayes' theorem (\mu(A)+\mu(B\setminus A)=\mu(B)+\mu(A\setminus B)) allows the difference of two measures to be written:

: \mu(A)-\mu(B)=\mu(A\setminus B)-\mu(B\setminus A)

This can be a handy mnemonic device in some situations, e.g.

: \begin{align} \Eta(X,Y) & =\Eta(X)+\Eta(Y\mid X) & \mu(\tilde X \cup \tilde Y) & =\mu(\tilde X)+\mu(\tilde Y \setminus \tilde X) \\ \operatorname{I}(X;Y) & =\Eta(X)-\Eta(X\mid Y) & \mu(\tilde X \cap \tilde Y) & =\mu(\tilde X)-\mu(\tilde X \setminus \tilde Y) \end{align}

Note that measures (expectation values of the logarithm) of true probabilities are called "entropy" and generally represented by the letter ''H'', while other measures are often referred to as "information" or "correlation" and generally represented by the letter ''I''. For notational simplicity, the letter ''I'' is sometimes used for all measures.
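These relations can be verified numerically for any particular joint distribution. The sketch below (illustrative Python; the joint probability table is arbitrary) computes the entropies and the mutual information of a pair of binary variables and checks the set-style identities above:

<syntaxhighlight lang="python">
import numpy as np

# Joint pmf of (X, Y): rows index the values of X, columns the values of Y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])

def H(p):
    """Shannon entropy, in bits, of the probability values in the array p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)      # marginal distributions
H_X, H_Y, H_XY = H(px), H(py), H(pxy)
# Conditional entropies from the conditionals: H(X|Y) = sum_y p(y) H(X | Y=y)
H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
H_Y_given_X = sum(px[i] * H(pxy[i, :] / px[i]) for i in range(len(px)))
# Mutual information directly: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I_XY = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

assert np.isclose(H_XY, H_X + H_Y_given_X)     # mu(X∪Y) = mu(X) + mu(Y \ X)
assert np.isclose(I_XY, H_X - H_X_given_Y)     # mu(X∩Y) = mu(X) - mu(X \ Y)
assert np.isclose(H_X + H_Y, H_XY + I_XY)      # mu(X) + mu(Y) = mu(X∪Y) + mu(X∩Y)
</syntaxhighlight>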


Multivariate mutual information

Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. (See Reza pp. 106–108 for an informal but rather complete discussion.) Namely \Eta(X,Y,Z,\cdots) needs to be defined in the obvious way as the entropy of a joint distribution, and a multivariate mutual information \operatorname{I}(X;Y;Z;\cdots) defined in a suitable manner so that we can set:

: \begin{align} \Eta(X,Y,Z,\cdots) & = \mu(\tilde X \cup \tilde Y \cup \tilde Z \cup \cdots), \\ \operatorname{I}(X;Y;Z;\cdots) & = \mu(\tilde X \cap \tilde Y \cap \tilde Z \cap \cdots); \end{align}

in order to define the (signed) measure over the whole σ-algebra. There is no single universally accepted definition for the multivariate mutual information, but the one that corresponds here to the measure of a set intersection is due to Fano (1966, pp. 57–59). The definition is recursive. As a base case the mutual information of a single random variable is defined to be its entropy: \operatorname{I}(X)=\Eta(X). Then for n\geq 2 we set

: \operatorname{I}(X_1; \cdots; X_n) = \operatorname{I}(X_1; \cdots; X_{n-1}) - \operatorname{I}(X_1; \cdots; X_{n-1}\mid X_n),

where the conditional mutual information is defined as

: \operatorname{I}(X_1; \cdots; X_{n-1}\mid X_n) = \mathbb{E}_{X_n}\big(\operatorname{I}(X_1; \cdots; X_{n-1})\mid X_n\big).

The first step in the recursion yields Shannon's definition \operatorname{I}(X_1;X_2)=\Eta(X_1)-\Eta(X_1\mid X_2). The multivariate mutual information (the same as interaction information except for a change in sign) of three or more random variables can be negative as well as positive: let ''X'' and ''Y'' be two independent fair coin flips, and let ''Z'' be their exclusive or. Then \operatorname{I}(X;Y;Z) = -1 bit.

Many other variations are possible for three or more random variables: for example, \operatorname{I}(X,Y;Z) is the mutual information of the joint distribution of ''X'' and ''Y'' relative to ''Z'', and can be interpreted as \mu((\tilde X \cup \tilde Y) \cap \tilde Z). Many more complicated expressions can be built this way and still have meaning, e.g. \operatorname{I}(X,Y;Z\mid W) or \Eta(X,Z\mid W,Y).
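The XOR example can be checked directly from the recursion above, since \operatorname{I}(X;Y;Z) = \operatorname{I}(X;Y) - \operatorname{I}(X;Y\mid Z). The following sketch (illustrative Python) enumerates the joint distribution of two fair coin flips and their exclusive or and evaluates both terms in bits:

<syntaxhighlight lang="python">
import numpy as np
from itertools import product

# Joint distribution of (X, Y, Z), where X and Y are fair coins and Z = X XOR Y.
joint = {}
for x, y in product([0, 1], repeat=2):
    joint[(x, y, x ^ y)] = 0.25

def H(dist, axes):
    """Entropy, in bits, of the marginal of `dist` on the coordinates in `axes`."""
    marginal = {}
    for outcome, prob in dist.items():
        key = tuple(outcome[a] for a in axes)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * np.log2(q) for q in marginal.values() if q > 0)

I_XY = H(joint, [0]) + H(joint, [1]) - H(joint, [0, 1])            # I(X;Y) = 0
I_XY_given_Z = (H(joint, [0, 2]) + H(joint, [1, 2])
                - H(joint, [0, 1, 2]) - H(joint, [2]))             # I(X;Y|Z) = 1
print(I_XY - I_XY_given_Z)   # I(X;Y;Z) = I(X;Y) - I(X;Y|Z) = -1.0 bit
</syntaxhighlight>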


References

* Thomas M. Cover and Joy A. Thomas. ''Elements of Information Theory'', second edition, 2006. New Jersey: Wiley and Sons.
* Fazlollah M. Reza. ''An Introduction to Information Theory''. New York: McGraw–Hill, 1961. New York: Dover, 1994.
* R. M. Fano. ''Transmission of Information: a statistical theory of communications''. MIT Press, 1966. ISBN 978-0-262-56169-3. https://archive.org/details/TransmissionOfInformationAStatisticalTheoryOfCommunicationRobertFano
* R. W. Yeung, "On entropy, information inequalities, and Groups."


See also

* Information theory
* Measure theory
* Set theory