In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory. The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable. The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average"; this is the average amount of self-information an observer would expect to gain about a random variable when measuring it. The information content can be expressed in various units of information, of which the most common is the "bit" (more correctly called the shannon), as explained below.


Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:

1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as follows: \mathrm{I}(x) := -\log_b[\Pr(x)] = -\log_b(P). The base b corresponds to the scaling factor above. Different choices of b correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a "bit"; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart). Formally, given a random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as \operatorname{I}_X(x) := -\log[p_X(x)] = \log\left(\frac{1}{p_X(x)}\right). The use of the notation \operatorname{I}_X(x) for self-information above is not universal. Since the notation \operatorname{I}(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital H(X) for the entropy.
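As an illustration, the definition translates directly into a minimal Python sketch (not part of the original article; the function and variable names are illustrative) that computes the information content of an outcome with a given probability in an arbitrary base:

    import math

    def information_content(p, base=2):
        """Self-information -log_base(p) of an outcome with probability p in (0, 1]."""
        if not 0 < p <= 1:
            raise ValueError("probability must lie in (0, 1]")
        return -math.log(p, base)

    print(information_content(0.5))           # 1.0 shannon (bit)
    print(information_content(0.5, math.e))   # ~0.693 nat
    print(information_content(0.5, 10))       # ~0.301 hartley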


Properties


Monotonically decreasing function of probability

For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than the measurement of more common values. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function. While standard probabilities are represented by real numbers in the interval [0, 1], self-informations are represented by extended real numbers in the interval [0, \infty]. In particular, we have the following, for any choice of logarithmic base:

* If a particular event has a 100% probability of occurring, then its self-information is -\log(1) = 0: its occurrence is "perfectly non-surprising" and yields no information.
* If a particular event has a 0% probability of occurring, then its self-information is -\log(0) = \infty: its occurrence is "infinitely surprising".

From this, we can get a few general properties:

* Intuitively, more information is gained from observing an unexpected event: it is "surprising".
** For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also lottery mathematics.)
* This establishes an implicit relationship between the self-information of a random variable and its variance.


Relationship to log-odds

The Shannon information is closely related to the log-odds. In particular, given some event x, suppose that p(x) is the probability of x occurring, and that p(\lnot x) = 1 - p(x) is the probability of x not occurring. Then we have the following definition of the log-odds: \text{lo}(x) = \log\left(\frac{p(x)}{p(\lnot x)}\right). This can be expressed as a difference of two Shannon informations: \text{lo}(x) = \mathrm{I}(\lnot x) - \mathrm{I}(x). In other words, the log-odds can be interpreted as the level of surprise when the event ''doesn't'' happen, minus the level of surprise when the event ''does'' happen.
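The identity above is easy to check numerically; the following sketch (illustrative only, with an arbitrarily chosen probability) verifies it:

    import math

    def info(p):                       # self-information in shannons
        return -math.log2(p)

    p = 0.8                            # example probability of the event x
    log_odds = math.log2(p / (1 - p))  # log-odds in the same base as info()
    print(math.isclose(log_odds, info(1 - p) - info(p)))  # True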


Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables X,\, Y with probability mass functions p_X(x) and p_Y(y) respectively. The joint probability mass function is p_{X,Y}\!\left(x, y\right) = \Pr(X = x,\, Y = y) = p_X\!(x)\,p_Y\!(y) because X and Y are independent. The information content of the outcome (X, Y) = (x, y) is \begin{align} \operatorname{I}_{X,Y}(x, y) &= -\log_2\left[p_{X,Y}(x, y)\right] = -\log_2\left[p_X\!(x)\,p_Y\!(y)\right] \\ &= -\log_2\left[p_X\!(x)\right] - \log_2\left[p_Y\!(y)\right] \\ &= \operatorname{I}_X(x) + \operatorname{I}_Y(y). \end{align} See the two-dice example below. The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
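A quick numerical check of additivity (an illustrative sketch with arbitrarily chosen probabilities for the two independent events):

    import math

    info = lambda p: -math.log2(p)   # self-information in shannons
    p_x, p_y = 0.5, 1 / 6            # two hypothetical independent events
    print(math.isclose(info(p_x * p_y), info(p_x) + info(p_y)))  # True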


Relationship to entropy

The Shannon entropy of the random variable X above is defined as \begin{align} \Eta(X) &= \sum_{x} -p_X(x) \log p_X(x) \\ &= \sum_{x} p_X(x) \operatorname{I}_X(x) \\ &= \operatorname{E}\left[\operatorname{I}_X(X)\right], \end{align} by definition equal to the expected information content of measurement of X. The expectation is taken over the discrete values of its support. Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies \Eta(X) = \operatorname{I}(X; X), where \operatorname{I}(X;X) is the mutual information of X with itself. For continuous random variables the corresponding concept is differential entropy.
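The "expected self-information" reading of entropy translates directly into code; this sketch (not from the article; the function name is illustrative) computes the entropy of a discrete distribution given as a list of probabilities:

    import math

    def entropy(pmf):
        """Expected self-information, in shannons, of a discrete distribution."""
        return sum(-p * math.log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin
    print(entropy([1/6] * 6))    # ~2.585 bits for a fair six-sided die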


Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was coined by Myron Tribus in his 1961 book ''Thermostatics and Thermodynamics''.

R. B. Bernstein and R. D. Levine (1972) "Entropy and Chemical Change. I. Characterization of Product (and Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency", ''The Journal of Chemical Physics'' 57, 434–44.

Tribus, Myron (1961), ''Thermodynamics and Thermostatics: An Introduction to Energy, Information and States of Matter, with Engineering Applications'' (D. Van Nostrand, 24 West 40 Street, New York 18, New York, U.S.A.), pp. 64–6.

When the event is a random realization (of a variable), the self-information of the variable is defined as the expected value of the self-information of the realization. Self-information is an example of a proper scoring rule.


Examples


Fair coin toss

Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads \text{H} and tails \text{T} (see fair coin and obverse and reverse) are one half each, p_X(\text{H}) = p_X(\text{T}) = \tfrac{1}{2} = 0.5. Upon measuring the variable as heads, the associated information gain is \operatorname{I}_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\!\tfrac{1}{2} = 1, so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails \text{T} is \operatorname{I}_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\!\tfrac{1}{2} = 1 \text{ Sh}.


Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X \sim \mathrm{DU}[1, 6] with probability mass function p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases} The probability of rolling a 4 is p_X(4) = \frac{1}{6}, as for any other valid roll. The information content of rolling a 4 is thus \operatorname{I}_X(4) = -\log_2 p_X(4) = -\log_2\!\tfrac{1}{6} = \log_2 6 \approx 2.585\; \text{Sh} of information.
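The fair-coin and fair-die figures above are straightforward to reproduce; this short sketch (illustrative only) prints both:

    import math

    print(-math.log2(1 / 2))  # 1.0 Sh: fair coin landing heads
    print(-math.log2(1 / 6))  # ~2.585 Sh: fair die rolling a 4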


Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables X,\, Y \sim \mathrm{DU}[1, 6], each corresponding to an independent fair six-sided die roll. The joint distribution of X and Y is \begin{align} p_{X,Y}\!\left(x, y\right) &= \Pr(X = x,\, Y = y) = p_X\!(x)\,p_Y\!(y) \\ &= \begin{cases} \frac{1}{36}, & x, y \in [1, 6] \cap \mathbb{N} \\ 0, & \text{otherwise.} \end{cases} \end{align} The information content of the random variate (X, Y) = (2,\, 4) is \begin{align} \operatorname{I}_{X,Y}(2, 4) &= -\log_2\!\tfrac{1}{36} = \log_2 36 = 2 \log_2 6 \\ &\approx 5.169925 \text{ Sh}, \end{align} and can also be calculated by additivity of events: \begin{align} \operatorname{I}_{X,Y}(2, 4) &= -\log_2\!\tfrac{1}{6} - \log_2\!\tfrac{1}{6} = 2\log_2 6 \\ &\approx 5.169925 \text{ Sh}. \end{align}


Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg\, (X = k \vee Y = k) \\ 1, & \quad X = k\, \veebar \, Y = k \\ 2, & \quad X = k\, \wedge \, Y = k \end{cases} for k \in \{1, 2, 3, 4, 5, 6\}; then \sum_{k=1}^{6} C_k = 2 and the counts have the multinomial distribution \begin{align} f(c_1,\ldots,c_6) &= \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) \\ &= \begin{cases} \displaystyle \binom{2}{c_1, \ldots, c_6} \frac{1}{36}, & \text{when } \sum_{i=1}^6 c_i = 2 \\ 0, & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{1}{18}, & \text{when exactly two } c_k \text{ are } 1 \\ \frac{1}{36}, & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases} \end{align}

To verify this, the 6 outcomes (X, Y) \in \left\{(k, k)\right\}_{k=1}^{6} = \left\{(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)\right\} correspond to the events C_k = 2 and have a total probability of \tfrac{1}{6}. These are the only outcomes faithfully preserved without the identity of which die rolled which value, because the two values are the same. Without knowledge to distinguish the dice rolling the other numbers, the other \binom{6}{2} = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability \tfrac{1}{18}. Indeed, 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for example the events A_k = \{(X, Y) = (k, k)\} and B_{j,k} = \{c_j = 1\} \cap \{c_k = 1\} for j \ne k, 1 \leq j, k \leq 6. For example, A_2 = \{X = 2 \text{ and } Y = 2\} and B_{3,4} = \{(3, 4), (4, 3)\}. The information contents are \operatorname{I}(A_2) = -\log_2\!\tfrac{1}{36} = 5.169925 \text{ Sh} and \operatorname{I}\left(B_{3,4}\right) = -\log_2\!\tfrac{1}{18} = 4.169925 \text{ Sh}. Let \text{Same} = \bigcup_{i=1}^{6} A_i be the event that both dice rolled the same value and \text{Diff} = \overline{\text{Same}} be the event that the dice differed. Then \Pr(\text{Same}) = \tfrac{1}{6} and \Pr(\text{Diff}) = \tfrac{5}{6}. The information contents of the events are \operatorname{I}(\text{Same}) = -\log_2\!\tfrac{1}{6} = 2.5849625 \text{ Sh} and \operatorname{I}(\text{Diff}) = -\log_2\!\tfrac{5}{6} = 0.2630344 \text{ Sh}.
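These probabilities and information contents can be checked by brute-force enumeration of the 36 equally likely ordered outcomes; the following sketch (illustrative, not from the article) reproduces the four values above:

    import math
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely ordered pairs

    def info_of(event):
        """Self-information, in shannons, of an event given as a predicate on (x, y)."""
        p = sum(1 for o in outcomes if event(o)) / len(outcomes)
        return -math.log2(p)

    print(info_of(lambda o: o == (2, 2)))             # I(A_2)      ~5.1699 Sh
    print(info_of(lambda o: set(o) == {3, 4}))        # I(B_{3,4})  ~4.1699 Sh
    print(info_of(lambda o: o[0] == o[1]))            # I(Same)     ~2.5850 Sh
    print(info_of(lambda o: o[0] != o[1]))            # I(Diff)     ~0.2630 Sh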


Information from sum of dice

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair six-sided dice rolls, the random variable Z = X + Y has probability mass function p_Z(z) = p_X(x) * p_Y(y), where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = \frac{4}{36} = \frac{1}{9}. Therefore, the information asserted is \operatorname{I}_Z(5) = -\log_2\!\tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}.
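A short sketch (illustrative, not from the article) computes p_Z by discrete convolution of the two single-die distributions and then the information content of Z = 5:

    import math

    p_die = {k: 1 / 6 for k in range(1, 7)}   # pmf of one fair die

    # Discrete convolution: p_Z(z) = sum over x, y with x + y = z of p_X(x) * p_Y(y)
    p_sum = {}
    for x, px in p_die.items():
        for y, py in p_die.items():
            p_sum[x + y] = p_sum.get(x + y, 0) + px * py

    print(p_sum[5])                 # 4/36 = 1/9
    print(-math.log2(p_sum[5]))     # ~3.1699 Sh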


General discrete uniform distribution

Generalizing the example above, consider a general discrete uniform random variable (DURV) X \sim \mathrm{DU}[a, b], \quad a, b \in \mathbb{Z}, \ b \ge a. For convenience, define N := b - a + 1. The probability mass function is p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases} In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable. The information gain of any observation X = k is \operatorname{I}_X(k) = -\log_2\!\tfrac{1}{N} = \log_2 N \text{ Sh}.
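So for an equiprobable support of size N, the information content of any single observation is simply log2(N) shannons; a brief illustrative loop makes the point:

    import math

    for N in (2, 6, 256):    # fair coin, fair die, a uniformly random byte value
        print(N, math.log2(N), "Sh per observation")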


Special case: constant random variable

If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure p_X(k) = \delta_b(k). The only value X can take is deterministically b, so the information content of any measurement of X is \operatorname{I}_X(b) = -\log_2 1 = 0. In general, there is no information gained from measuring a known value.


Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support \mathcal{S} = \bigl\{s_i\bigr\}_{i=1}^{N} and probability mass function given by p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases} For the purposes of information theory, the values s \in \mathcal{S} do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p. Without loss of generality, we can assume the categorical distribution is supported on the set [N] = \left\{1, 2, \ldots, N\right\}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well. The information of the outcome X = x is given as \operatorname{I}_X(x) = -\log_2 p_X(x). From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
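A categorical distribution is naturally represented as a mapping from outcomes to probabilities; this sketch (illustrative, with made-up categories and probabilities) computes the information content of each outcome:

    import math

    pmf = {"red": 0.5, "green": 0.25, "blue": 0.125, "yellow": 0.125}  # hypothetical categories

    for outcome, p in pmf.items():
        print(f"I({outcome}) = {-math.log2(p):.3f} Sh")
    # red: 1.000, green: 2.000, blue: 3.000, yellow: 3.000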


Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin: "''Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.''" Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event \omega_n depends only on the probability of that event: \operatorname I(\omega_n) = f(\operatorname P(\omega_n)) for some function f(\cdot) to be determined below. If \operatorname P(\omega_n) = 1, then \operatorname I(\omega_n) = 0. If \operatorname P(\omega_n) < 1, then \operatorname I(\omega_n) > 0.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is that of the compound message of both independent events A and B occurring. The quantity of information of the compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively: \operatorname I(C) = \operatorname I(A \cap B) = \operatorname I(A) + \operatorname I(B). Because of the independence of events A and B, the probability of event C is \operatorname P(C) = \operatorname P(A \cap B) = \operatorname P(A) \cdot \operatorname P(B). However, applying function f(\cdot) results in \begin{align} \operatorname I(C) &= \operatorname I(A) + \operatorname I(B) \\ f(\operatorname P(C)) &= f(\operatorname P(A)) + f(\operatorname P(B)) \\ &= f\big(\operatorname P(A) \cdot \operatorname P(B)\big). \end{align}

Thanks to work on Cauchy's functional equation, the only monotone functions f(\cdot) having the property that f(x \cdot y) = f(x) + f(y) are the logarithm functions \log_b(x). The only operational difference between logarithms of different bases is that of different scaling constants, so we may assume f(x) = K \log(x), where \log is the natural logarithm. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, this requires that K < 0.

Taking into account these properties, the self-information \operatorname I(\omega_n) associated with outcome \omega_n with probability \operatorname P(\omega_n) is defined as: \operatorname I(\omega_n) = -\log(\operatorname P(\omega_n)) = \log \left(\frac{1}{\operatorname P(\omega_n)} \right). The smaller the probability of event \omega_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of \operatorname I(\omega_n) is the bit; this is the most common practice. When using the natural logarithm of base e, the unit is the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 bits (probability 15/16). See above for detailed examples.
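The closing illustration is straightforward to verify; this sketch (illustrative only) computes both figures for four fair coin tosses:

    import math

    p_all_heads = (1 / 2) ** 4           # probability 1/16 of one specific 4-toss outcome
    print(-math.log2(p_all_heads))       # 4.0 bits
    print(-math.log2(1 - p_all_heads))   # ~0.093 bits for any result other than that outcome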


See also

* Surprisal analysis


References


Further reading

* C. E. Shannon, "A Mathematical Theory of Communication", ''Bell System Technical Journal'', Vol. 27, pp. 379–423 (Part I), 1948.


External links


* Examples of surprisal measures
* Bayesian Theory of Surprise: http://ilab.usc.edu/surprise/