Directed information, I(X^n\to Y^n), is an information theory measure that quantifies the information flow from the random process X^n = \{X_1,X_2,\dots,X_n\} to the random process Y^n = \{Y_1,Y_2,\dots,Y_n\}. The term ''directed information'' was coined by James Massey and is defined as

:I(X^n\to Y^n) \triangleq \sum_{i=1}^n I(X^i;Y_i|Y^{i-1}),

where I(X^i;Y_i|Y^{i-1}) is the conditional mutual information I(X_1,X_2,\dots,X_i;Y_i|Y_1,Y_2,\dots,Y_{i-1}). Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.


Marko's theory of bidirectional communication

Massey's directed information was motivated by Marko's earlier work on developing a theory of bidirectional communication.


Estimation and optimization

Estimating and optimizing the directed information is challenging because it is a sum of n terms, where n may be large. In many cases, one is interested in the limiting average, that is, the limit of \frac{1}{n} I(X^n\to Y^n) as n grows to infinity, termed a multi-letter expression.


Estimation

The estimation of directed information from given samples is a hard problem, since the directed information expression does not depend on the samples themselves but on the joint distribution P(x^n,y^n), which is unknown. There exist several estimation algorithms, based on context tree weighting, on empirical parametric distributions, and on long short-term memory (LSTM) networks.
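
To make the target quantity concrete, the following sketch computes I(X^n\to Y^n) exactly when the joint distribution is available, using the identity I(X^n\to Y^n)=\mathbf{E}[\log (P(Y^n||X^n)/P(Y^n))] (causal conditioning is defined below); an estimator would replace the true pmf with an estimated one. The function name and dict-based pmf representation are illustrative choices, not part of any published algorithm.

<syntaxhighlight lang="python">
import itertools
from collections import defaultdict
from math import log2

def directed_information(p_joint, n):
    """Exact I(X^n -> Y^n) in bits from a joint pmf.

    p_joint maps (x_tuple, y_tuple) -> probability, both tuples of
    length n.  Uses I(X^n -> Y^n) = E[log P(Y^n || X^n) / P(Y^n)],
    where P(y^n || x^n) = prod_i P(y_i | y^{i-1}, x^i).
    """
    # Prefix marginals P(y^i), P(x^i, y^i), and P(x^i, y^{i-1}).
    p_y = defaultdict(float)
    p_xy = defaultdict(float)
    for (x, y), p in p_joint.items():
        for i in range(1, n + 1):
            p_y[y[:i]] += p
            p_xy[(x[:i], y[:i])] += p
            p_xy[(x[:i], y[:i - 1])] += p

    di = 0.0
    for (x, y), p in p_joint.items():
        if p == 0.0:
            continue
        log_ratio = -log2(p_y[y])  # -log P(y^n)
        for i in range(1, n + 1):
            # P(y_i | y^{i-1}, x^i) = P(x^i, y^i) / P(x^i, y^{i-1})
            log_ratio += log2(p_xy[(x[:i], y[:i])] / p_xy[(x[:i], y[:i - 1])])
        di += p * log_ratio
    return di

# Sanity check: Y_i = X_i with i.i.d. uniform binary inputs gives
# I(X^n -> Y^n) = n bits.
n = 3
p_identity = {(x, x): 2.0 ** (-n) for x in itertools.product([0, 1], repeat=n)}
print(directed_information(p_identity, n))  # -> 3.0
</syntaxhighlight>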


Optimization

The maximization of the directed information is a fundamental problem in information theory. For a fixed sequence of channel distributions \{P(y_i|x^i,y^{i-1})\}_{i=1}^n, the objective is to optimize I(X^n\to Y^n) over the channel input distributions \{P(x_i|x^{i-1},y^{i-1})\}_{i=1}^n. There exist algorithms for optimizing the directed information based on the Blahut-Arimoto algorithm, Markov decision processes, recurrent neural networks, reinforcement learning, and graphical methods (the Q-graphs). For the Blahut-Arimoto algorithm, the main idea is to start with the last element of the directed information and proceed backward. For Markov decision processes, the main idea is to transform the optimization into an infinite-horizon average-reward Markov decision process. For recurrent neural networks, the main idea is to model the input distribution using a recurrent neural network and to optimize its parameters by gradient descent. For reinforcement learning, the main idea is to solve the Markov decision process formulation of the capacity using reinforcement learning tools, which make it possible to handle large or even continuous alphabets.
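
For intuition, in the memoryless case without feedback the directed information reduces to ordinary mutual information, and the classical Blahut-Arimoto alternating update already solves the resulting maximization; the directed-information variants build the same alternation backward over the n terms. A minimal sketch of the classical iteration (the interface and names are illustrative, not from the directed-information literature):

<syntaxhighlight lang="python">
import numpy as np

def blahut_arimoto(W, iters=200):
    """Classical Blahut-Arimoto: max_p I(X;Y) for a DMC.

    W[x, y] = P(y | x) is the channel matrix.  Returns the capacity
    in bits and the optimizing input distribution.
    """
    nx, ny = W.shape
    p = np.full(nx, 1.0 / nx)
    for _ in range(iters):
        with np.errstate(divide="ignore", invalid="ignore"):
            q = W * p[:, None]                  # proportional to P(x) W(y|x)
            q /= q.sum(axis=0, keepdims=True)   # posterior q(x|y)
            logq = np.where(q > 0, np.log2(q), 0.0)
        p = np.exp2((W * logq).sum(axis=1))     # multiplicative update
        p /= p.sum()
    py = p @ W                                  # output distribution
    with np.errstate(divide="ignore", invalid="ignore"):
        logterm = np.where(W > 0, np.log2(W / py[None, :]), 0.0)
    return float((p[:, None] * W * logterm).sum()), p

# Binary symmetric channel, crossover 0.1: capacity = 1 - h(0.1), about 0.531 bits.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
print(round(C, 3), p_opt)  # about 0.531 with the uniform input [0.5, 0.5]
</syntaxhighlight>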


Causal conditioning

The essence of directed information is the causal conditioning operation, denoted by (\cdot||\cdot). Recall the notation x^n = \{x_1,x_2,\dots,x_n\} and y^n = \{y_1,y_2,\dots,y_n\}. The probability of y^n causally conditioned on x^n is denoted by P(y^n||x^n) and is defined as

:P(y^n||x^n) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^i).

Note that this is very similar to the chain rule of regular conditioning, P(y^n|x^n) = \prod_{i=1}^n P(y_i|y^{i-1},x^n); the difference is the conditioning on the prefix x^i versus the full sequence x^n. One can also introduce a delay into the causal conditioning, i.e.,

:P(y^n||x^{n-1}) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^{i-1}).


Properties

The chain rule for causal conditioning is

:P(y^n,x^n) = P(y^n||x^n) P(x^n||y^{n-1}).

The chain rule implies two things: first, any joint distribution P(y^n,x^n) can be decomposed into a product of P(y^n||x^n) and P(x^n||y^{n-1}); and second, any product of valid P(y^n||x^n) and P(x^n||y^{n-1}) results in a joint distribution P(y^n,x^n). The causal conditioning probability P(y^n||x^n) \triangleq \prod_{i=1}^n P(y_i|y^{i-1},x^i) is indeed a probability vector, i.e.,

:P(y^n||x^n)\geq 0,\ \forall (y^n,x^n),
:\sum_{y^n} P(y^n||x^n)=1,\ \forall x^n.

Directed information can be written in terms of causal conditioning as follows:

:I(X^n \to Y^n)=\mathbf{E}\left[\log \frac{P(Y^n||X^n)}{P(Y^n)}\right].
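
Because the product of the two causal-conditioning factors telescopes into the joint distribution, the chain rule is easy to verify numerically. A small sketch on a randomly drawn joint pmf (all function names are illustrative):

<syntaxhighlight lang="python">
import itertools
import random
from collections import defaultdict

def prefix_marginals(p_joint, n):
    """P(x^i, y^j) for all prefix pairs 0 <= i, j <= n."""
    m = defaultdict(float)
    for (x, y), p in p_joint.items():
        for i in range(n + 1):
            for j in range(n + 1):
                m[(x[:i], y[:j])] += p
    return m

def causal_y_given_x(x, y, m):
    """P(y^n || x^n) = prod_i P(y_i | y^{i-1}, x^i)."""
    r = 1.0
    for i in range(1, len(y) + 1):
        r *= m[(x[:i], y[:i])] / m[(x[:i], y[:i - 1])]
    return r

def causal_x_given_y_delayed(x, y, m):
    """P(x^n || y^{n-1}) = prod_i P(x_i | x^{i-1}, y^{i-1})."""
    r = 1.0
    for i in range(1, len(x) + 1):
        r *= m[(x[:i], y[:i - 1])] / m[(x[:i - 1], y[:i - 1])]
    return r

# Random joint pmf on binary sequence pairs of length 2, then check
# P(y^n, x^n) = P(y^n || x^n) * P(x^n || y^{n-1}) pointwise.
random.seed(0)
n = 2
seqs = list(itertools.product([0, 1], repeat=n))
weights = {(x, y): random.random() for x in seqs for y in seqs}
total = sum(weights.values())
p_joint = {k: v / total for k, v in weights.items()}
m = prefix_marginals(p_joint, n)
for (x, y), p in p_joint.items():
    prod = causal_y_given_x(x, y, m) * causal_x_given_y_delayed(x, y, m)
    assert abs(p - prod) < 1e-12
</syntaxhighlight>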


Conservation law of information

This law, established by James Massey and his son, gives intuition for the directed information and its relation to mutual information. The law states that for any X^n, Y^n, the following equality holds:

:I(X^n;Y^n)= I(X^n \to Y^n)+I(Y^{n-1} \to X^n),

where I(Y^{n-1} \to X^n) is the directed information from the one-step-delayed sequence Y^{n-1} to X^n, defined through the delayed causal conditioning P(x^n||y^{n-1}).
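
Continuing the sketch from the causal-conditioning section, the conservation law can be checked numerically on the same random pmf; the directed informations are computed directly from their expected log-ratio forms.

<syntaxhighlight lang="python">
from math import log2

# Continuing the sketch above (p_joint, m, and the causal-conditioning
# helpers are reused).  Check Massey's conservation law:
#   I(X^n; Y^n) = I(X^n -> Y^n) + I(Y^{n-1} -> X^n).
mi = sum(p * log2(p / (m[(x, ())] * m[((), y)]))
         for (x, y), p in p_joint.items())
di_x_to_y = sum(p * log2(causal_y_given_x(x, y, m) / m[((), y)])
                for (x, y), p in p_joint.items())
di_ydelay_to_x = sum(p * log2(causal_x_given_y_delayed(x, y, m) / m[(x, ())])
                     for (x, y), p in p_joint.items())
assert abs(mi - (di_x_to_y + di_ydelay_to_x)) < 1e-12
</syntaxhighlight>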


Causally conditioned entropy and directed information

The causally conditioned entropy is defined as

:H(X^n||Y^n)=\mathbf{E}\left[-\log P(X^n||Y^n)\right],

and the directed information can be written as the decrease in the uncertainty (entropy) of Y^n due to the causal side information X^n, i.e.,

:I(X^n\to Y^n) = H(Y^n) - H(Y^n||X^n).

Similarly, one may causally condition on a string of pairs (Y^n,Z^n) = \{(Y_1,Z_1),(Y_2,Z_2),\dots,(Y_n,Z_n)\}:

:H(X^n||Y^n,Z^n)=\mathbf{E}\left[-\log P(X^n||Y^n,Z^n)\right],

and the directed information from X^n to Y^n causally conditioned on Z^n is

:I(X^n\to Y^n||Z^n) = H(Y^n||Z^n) - H(Y^n||X^n,Z^n).
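
As before, the entropy-drop identity can be verified on the running example:

<syntaxhighlight lang="python">
# Still continuing the running sketch: directed information as an
# entropy drop, I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n).
h_y = -sum(p * log2(m[((), y)]) for (x, y), p in p_joint.items())
h_y_cc_x = -sum(p * log2(causal_y_given_x(x, y, m))
                for (x, y), p in p_joint.items())
assert abs(di_x_to_y - (h_y - h_y_cc_x)) < 1e-12
</syntaxhighlight>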


Relation of directed information to transfer entropy

There is also a related derived quantity, transfer entropy, which can be seen as a truncated or mismatched version of directed information.

