Directed information, I(X^n\to Y^n), is an information theory measure that quantifies the information flow from the random process X^n = \{X_1,X_2,\dots,X_n\} to the random process Y^n = \{Y_1,Y_2,\dots,Y_n\}. The term ''directed information'' was coined by James Massey and is defined as

:I(X^n\to Y^n) \triangleq \sum_{i=1}^n I(X^i;Y_i|Y^{i-1}),

where I(X^i;Y_i|Y^{i-1}) is the conditional mutual information I(X_1,X_2,\dots,X_i;Y_i|Y_1,Y_2,\dots,Y_{i-1}). Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.


Marko's theory of bidirectional communication

Massey's directed information was motivated by Marko's earlier work on developing a theory of bidirectional communication.


Estimation and optimization

Estimating and optimizing the directed information is challenging because it is a sum of n terms, where n may be large. In many cases, one is interested in the limiting average, that is, the limit of \frac{1}{n} I(X^n\to Y^n) as n grows to infinity, termed a multi-letter expression.


Estimation

The estimation of directed information from given samples is a hard problem, since the directed information expression does not depend on the samples themselves but on the joint distribution P(x^n,y^n), which is unknown. There exist several estimation algorithms, based on context tree weighting, on empirical parametric distributions, and on long short-term memory (LSTM) networks.
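
To make the target quantity concrete, the following sketch computes I(X^n\to Y^n) exactly when the joint distribution is available, using the identity I(X^n\to Y^n)=\mathbf{E}[\log (P(Y^n||X^n)/P(Y^n))] (causal conditioning is defined below); an estimator would replace the true pmf with an estimated one. The function name and dict-based pmf representation are illustrative choices, not part of any published algorithm.

<syntaxhighlight lang="python">
import itertools
from collections import defaultdict
from math import log2

def directed_information(p_joint, n):
    """Exact I(X^n -> Y^n) in bits from a joint pmf.

    p_joint maps (x_tuple, y_tuple) -> probability, both tuples of
    length n.  Uses I(X^n -> Y^n) = E[log P(Y^n || X^n) / P(Y^n)],
    where P(y^n || x^n) = prod_i P(y_i | y^{i-1}, x^i).
    """
    # Prefix marginals P(y^i), P(x^i, y^i), and P(x^i, y^{i-1}).
    p_y = defaultdict(float)
    p_xy = defaultdict(float)
    for (x, y), p in p_joint.items():
        for i in range(1, n + 1):
            p_y[y[:i]] += p
            p_xy[(x[:i], y[:i])] += p
            p_xy[(x[:i], y[:i - 1])] += p

    di = 0.0
    for (x, y), p in p_joint.items():
        if p == 0.0:
            continue
        log_ratio = -log2(p_y[y])  # -log P(y^n)
        for i in range(1, n + 1):
            # P(y_i | y^{i-1}, x^i) = P(x^i, y^i) / P(x^i, y^{i-1})
            log_ratio += log2(p_xy[(x[:i], y[:i])] / p_xy[(x[:i], y[:i - 1])])
        di += p * log_ratio
    return di

# Sanity check: Y_i = X_i with i.i.d. uniform binary inputs gives
# I(X^n -> Y^n) = n bits.
n = 3
p_identity = {(x, x): 2.0 ** (-n) for x in itertools.product([0, 1], repeat=n)}
print(directed_information(p_identity, n))  # -> 3.0
</syntaxhighlight>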


Optimization

The maximization of the directed information is a fundamental problem in information theory. For a fixed sequence of channel distributions \{P(y_i|x^i,y^{i-1})\}_{i=1}^n, the objective is to optimize I(X^n\to Y^n) over the channel input distributions \{P(x_i|x^{i-1},y^{i-1})\}_{i=1}^n. There exist algorithms for optimizing the directed information based on the Blahut-Arimoto algorithm, Markov decision processes, recurrent neural networks, reinforcement learning, and graphical methods (the Q-graphs). For the Blahut-Arimoto algorithm, the main idea is to start with the last element of the directed information and proceed backward. For Markov decision processes, the main idea is to transform the optimization into an infinite-horizon average-reward Markov decision process. For recurrent neural networks, the main idea is to model the input distribution using a recurrent neural network and to optimize its parameters by gradient descent. For reinforcement learning, the main idea is to solve the Markov decision process formulation of the capacity using reinforcement learning tools, which make it possible to handle large or even continuous alphabets.
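
For intuition, in the memoryless case without feedback the directed information reduces to ordinary mutual information, and the classical Blahut-Arimoto alternating update already solves the resulting maximization; the directed-information variants build the same alternation backward over the n terms. A minimal sketch of the classical iteration (the interface and names are illustrative, not from the directed-information literature):

<syntaxhighlight lang="python">
import numpy as np

def blahut_arimoto(W, iters=200):
    """Classical Blahut-Arimoto: max_p I(X;Y) for a DMC.

    W[x, y] = P(y | x) is the channel matrix.  Returns the capacity
    in bits and the optimizing input distribution.
    """
    nx, ny = W.shape
    p = np.full(nx, 1.0 / nx)
    for _ in range(iters):
        with np.errstate(divide="ignore", invalid="ignore"):
            q = W * p[:, None]                  # proportional to P(x) W(y|x)
            q /= q.sum(axis=0, keepdims=True)   # posterior q(x|y)
            logq = np.where(q > 0, np.log2(q), 0.0)
        p = np.exp2((W * logq).sum(axis=1))     # multiplicative update
        p /= p.sum()
    py = p @ W                                  # output distribution
    with np.errstate(divide="ignore", invalid="ignore"):
        logterm = np.where(W > 0, np.log2(W / py[None, :]), 0.0)
    return float((p[:, None] * W * logterm).sum()), p

# Binary symmetric channel, crossover 0.1: capacity = 1 - h(0.1), about 0.531 bits.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
print(round(C, 3), p_opt)  # about 0.531 with the uniform input [0.5, 0.5]
</syntaxhighlight>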


Causal conditioning

The essence of directed information is the causal conditioning operation, denoted by (\cdot||\cdot). Recall the notation x^n = \{x_1,x_2,\dots,x_n\} and y^n = \{y_1,y_2,\dots,y_n\}. The probability of y^n causally conditioned on x^n is denoted by P(y^n||x^n) and is defined as

:P(y^n||x^n) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^i).

Note that this is very similar to the chain rule of regular conditioning, P(y^n|x^n) = \prod_{i=1}^n P(y_i|y^{i-1},x^n); the difference is the conditioning on the prefix x^i versus the full sequence x^n. One can also introduce a delay into the causal conditioning, i.e.,

:P(y^n||x^{n-1}) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^{i-1}).


Properties

The chain rule for causal conditioning is

:P(y^n,x^n) = P(y^n||x^n) P(x^n||y^{n-1}).

The chain rule implies two things: first, any joint distribution P(y^n,x^n) can be decomposed into a product of P(y^n||x^n) and P(x^n||y^{n-1}); and second, any product of valid P(y^n||x^n) and P(x^n||y^{n-1}) results in a joint distribution P(y^n,x^n). The causal conditioning probability P(y^n||x^n) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^i) is indeed a probability vector, i.e.,

:P(y^n||x^n)\geq 0,\ \forall (y^n,x^n),
:\sum_{y^n} P(y^n||x^n)=1,\ \forall x^n.

Directed information can be written in terms of causal conditioning as follows:

:I(X^n \to Y^n)=\mathbf{E}\left[\log \frac{P(Y^n||X^n)}{P(Y^n)}\right].
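
Because the product of the two causal-conditioning factors telescopes into the joint distribution, the chain rule is easy to verify numerically. A small sketch on a randomly drawn joint pmf (all function names are illustrative):

<syntaxhighlight lang="python">
import itertools
import random
from collections import defaultdict

def prefix_marginals(p_joint, n):
    """P(x^i, y^j) for all prefix pairs 0 <= i, j <= n."""
    m = defaultdict(float)
    for (x, y), p in p_joint.items():
        for i in range(n + 1):
            for j in range(n + 1):
                m[(x[:i], y[:j])] += p
    return m

def causal_y_given_x(x, y, m):
    """P(y^n || x^n) = prod_i P(y_i | y^{i-1}, x^i)."""
    r = 1.0
    for i in range(1, len(y) + 1):
        r *= m[(x[:i], y[:i])] / m[(x[:i], y[:i - 1])]
    return r

def causal_x_given_y_delayed(x, y, m):
    """P(x^n || y^{n-1}) = prod_i P(x_i | x^{i-1}, y^{i-1})."""
    r = 1.0
    for i in range(1, len(x) + 1):
        r *= m[(x[:i], y[:i - 1])] / m[(x[:i - 1], y[:i - 1])]
    return r

# Random joint pmf on binary sequence pairs of length 2, then check
# P(y^n, x^n) = P(y^n || x^n) * P(x^n || y^{n-1}) pointwise.
random.seed(0)
n = 2
seqs = list(itertools.product([0, 1], repeat=n))
weights = {(x, y): random.random() for x in seqs for y in seqs}
total = sum(weights.values())
p_joint = {k: v / total for k, v in weights.items()}
m = prefix_marginals(p_joint, n)
for (x, y), p in p_joint.items():
    prod = causal_y_given_x(x, y, m) * causal_x_given_y_delayed(x, y, m)
    assert abs(p - prod) < 1e-12
</syntaxhighlight>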


Conservation law of information

This law, established by James Massey and his son, gives intuition for the directed information and its relation to mutual information. The law states that for any X^n, Y^n, the following equality holds:

:I(X^n;Y^n)= I(X^n \to Y^n)+I(Y^{n-1} \to X^n),

where I(Y^{n-1} \to X^n) is the directed information from the one-step-delayed sequence Y^{n-1} to X^n, defined through the delayed causal conditioning P(x^n||y^{n-1}).
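
Continuing the sketch from the causal-conditioning section, the conservation law can be checked numerically on the same random pmf; the directed informations are computed directly from their expected log-ratio forms.

<syntaxhighlight lang="python">
from math import log2

# Continuing the sketch above (p_joint, m, and the causal-conditioning
# helpers are reused).  Check Massey's conservation law:
#   I(X^n; Y^n) = I(X^n -> Y^n) + I(Y^{n-1} -> X^n).
mi = sum(p * log2(p / (m[(x, ())] * m[((), y)]))
         for (x, y), p in p_joint.items())
di_x_to_y = sum(p * log2(causal_y_given_x(x, y, m) / m[((), y)])
                for (x, y), p in p_joint.items())
di_ydelay_to_x = sum(p * log2(causal_x_given_y_delayed(x, y, m) / m[(x, ())])
                     for (x, y), p in p_joint.items())
assert abs(mi - (di_x_to_y + di_ydelay_to_x)) < 1e-12
</syntaxhighlight>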


Causally conditioned entropy and directed information

The causally conditioned entropy is defined as

:H(X^n||Y^n)=\mathbf{E}\left[-\log P(X^n||Y^n)\right],

and the directed information can be written as the decrease in the uncertainty (entropy) of Y^n due to the causal side information X^n, i.e.,

:I(X^n\to Y^n) = H(Y^n) - H(Y^n||X^n).

Similarly, one may causally condition on a string of pairs (Y^n,Z^n) = \{(Y_1,Z_1),(Y_2,Z_2),\dots,(Y_n,Z_n)\}:

:H(X^n||Y^n,Z^n)=\mathbf{E}\left[-\log P(X^n||Y^n,Z^n)\right],

and the directed information from X^n to Y^n causally conditioned on Z^n is

:I(X^n\to Y^n||Z^n) = H(Y^n||Z^n) - H(Y^n||X^n,Z^n).
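
As before, the entropy-drop identity can be verified on the running example:

<syntaxhighlight lang="python">
# Still continuing the running sketch: directed information as an
# entropy drop, I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n).
h_y = -sum(p * log2(m[((), y)]) for (x, y), p in p_joint.items())
h_y_cc_x = -sum(p * log2(causal_y_given_x(x, y, m))
                for (x, y), p in p_joint.items())
assert abs(di_x_to_y - (h_y - h_y_cc_x)) < 1e-12
</syntaxhighlight>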


Relation of directed information to transfer entropy

There is also a related derived quantity, transfer entropy, which can be seen as a truncated or mismatched version of directed information.

