Jensen–Shannon Divergence

In probability theory and statistics, the Jensen–Shannon divergence, named after Johan Jensen and Claude Shannon, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance. The closer the Jensen–Shannon distance is to zero, the more similar the two distributions are.


Definition

Consider the set M_+^1(A) of probability distributions, where A is a set provided with some σ-algebra of measurable subsets. In particular, we can take A to be a finite or countable set with all subsets being measurable. The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence D(P \parallel Q). It is defined by

:\mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M),

where M = \frac{1}{2}(P + Q) is a mixture distribution of P and Q. The geometric Jensen–Shannon divergence (or G-Jensen–Shannon divergence) yields a closed-form formula for the divergence between two Gaussian distributions by taking the geometric mean.

A more general definition, allowing for the comparison of more than two probability distributions, is

:\begin{aligned} \mathrm{JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) &= \sum_i \pi_i D(P_i \parallel M) \\ &= H(M) - \sum_{i=1}^n \pi_i H(P_i), \end{aligned}

where M := \sum_{i=1}^n \pi_i P_i, the weights \pi_1, \ldots, \pi_n are selected for the probability distributions P_1, P_2, \ldots, P_n, and H(P) is the Shannon entropy of distribution P. For the two-distribution case described above, P_1 = P, P_2 = Q and \pi_1 = \pi_2 = \frac{1}{2}. Hence, for those distributions P, Q,

:\mathrm{JSD}(P \parallel Q) = H(M) - \frac{1}{2}\bigl(H(P) + H(Q)\bigr).
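To make the two-distribution definition concrete, here is a minimal Python sketch that computes the Jensen–Shannon divergence directly from the formula above, using the base-2 logarithm; the example distributions p and q are illustrative values only.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback–Leibler divergence D(p || q) in bits; terms with p = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen–Shannon divergence: D(p || m)/2 + D(q || m)/2 with m = (p + q)/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example distributions over three outcomes (values chosen only for illustration).
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(jsd(p, q))   # symmetric: jsd(p, q) == jsd(q, p), and the value is always finite
```

For reference, SciPy's scipy.spatial.distance.jensenshannon returns the Jensen–Shannon distance, i.e. the square root of this divergence, so jensenshannon(p, q, base=2) ** 2 should agree with the value printed here.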


Bounds

The Jensen–Shannon divergence is bounded by 1 for two discrete probability distributions, given that one uses the base-2 logarithm:

:0 \leq \mathrm{JSD}(P \parallel Q) \leq 1.

With this normalization, it is a lower bound on the total variation distance between P and Q:

:\mathrm{JSD}(P \parallel Q) \leq \frac{1}{2} \|P - Q\|_1 = \frac{1}{2} \sum_{\omega} |P(\omega) - Q(\omega)|.

With the base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is \ln(2). In general, the bound in base b is \log_b(2):

:0 \leq \mathrm{JSD}(P \parallel Q) \leq \log_b(2).

More generally, for more than two probability distributions the Jensen–Shannon divergence is bounded by \log_b(n):

:0 \leq \mathrm{JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) \leq \log_b(n).


Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable X drawn from the mixture distribution of P and Q and the binary indicator variable Z that is used to switch between P and Q to produce the mixture. Let X be some abstract function on the underlying set of events that discriminates well between events, and choose the value of X according to P if Z = 0 and according to Q if Z = 1, where Z is equiprobable. That is, we are choosing X according to the probability measure M = (P + Q)/2, and its distribution is the mixture distribution. We compute

:\begin{aligned} I(X; Z) &= H(X) - H(X \mid Z) \\ &= -\sum M \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\ &= -\sum \frac{P}{2} \log M - \sum \frac{Q}{2} \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\ &= \frac{1}{2} \sum P \left( \log P - \log M \right) + \frac{1}{2} \sum Q \left( \log Q - \log M \right) \\ &= \mathrm{JSD}(P \parallel Q). \end{aligned}

It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1, because mutual information is non-negative and bounded by H(Z) = 1 in the base-2 logarithm.

One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to the Kullback–Leibler divergence and mutual information) and measure how reliably one can decide whether a given response comes from the joint distribution or from the product distribution, subject to the assumption that these are the only two possibilities.
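The identity I(X; Z) = JSD(P ∥ Q) can be checked numerically. The sketch below (an illustration with arbitrary distributions and base-2 logarithms) computes the mutual information as H(X) − H(X | Z) with Z equiprobable and compares it with the divergence computed from its Kullback–Leibler form.

```python
import numpy as np

def entropy(dist):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-np.sum(dist * np.log2(dist)))

def kl(p, q):
    """Kullback–Leibler divergence D(p || q) in bits."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.10, 0.40, 0.50])   # distribution of X when Z = 0
q = np.array([0.80, 0.15, 0.05])   # distribution of X when Z = 1
m = 0.5 * (p + q)                  # marginal distribution of X under the mixture M

# I(X; Z) = H(X) - H(X | Z), with Z uniform on {0, 1}.
mutual_information = entropy(m) - 0.5 * (entropy(p) + entropy(q))

# Jensen–Shannon divergence from its Kullback–Leibler form.
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

assert np.isclose(mutual_information, jsd)
print(mutual_information, jsd)
```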


Quantum Jensen–Shannon divergence

The generalization of probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD). It is defined for a set of density matrices (\rho_1, \ldots, \rho_n) and a probability distribution \pi = (\pi_1, \ldots, \pi_n) as

:\mathrm{QJSD}(\rho_1, \ldots, \rho_n) = S\left(\sum_{i=1}^n \pi_i \rho_i\right) - \sum_{i=1}^n \pi_i S(\rho_i),

where S(\rho) is the von Neumann entropy of \rho. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives the upper bound on the amount of classical information encoded by the quantum states (\rho_1, \ldots, \rho_n) under the prior distribution \pi (see Holevo's theorem). The quantum Jensen–Shannon divergence for \pi = \left(\frac{1}{2}, \frac{1}{2}\right) and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states, and it was recently shown that this metric property holds for mixed states as well. The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
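A minimal numerical sketch of the quantum Jensen–Shannon divergence, assuming NumPy and two illustrative 2×2 density matrices (the states, the equal weights, and the natural-log von Neumann entropy are arbitrary choices for the example):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues (natural log)."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]          # 0 * log 0 is taken to be 0
    return float(-np.sum(eigvals * np.log(eigvals)))

def qjsd(rhos, weights):
    """QJSD(rho_1, ..., rho_n) = S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i)."""
    mixture = sum(w * r for w, r in zip(weights, rhos))
    return von_neumann_entropy(mixture) - sum(
        w * von_neumann_entropy(r) for w, r in zip(weights, rhos)
    )

# Two example qubit states: the pure state |0><0| and the maximally mixed state.
rho1 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho2 = np.array([[0.5, 0.0], [0.0, 0.5]])
print(qjsd([rho1, rho2], [0.5, 0.5]))
```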


Jensen–Shannon centroid

The centroid C* of a finite set of probability distributions can be defined as the minimizer of the average of the Jensen–Shannon divergences between a candidate probability distribution and the prescribed set of distributions:

:C^* = \arg\min_{Q} \frac{1}{n} \sum_{i=1}^n \mathrm{JSD}(P_i \parallel Q).

An efficient algorithm (CCCP), based on difference-of-convex-functions programming, has been reported for calculating the Jensen–Shannon centroid of a set of discrete distributions (histograms).
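As an illustration of the centroid definition (this is not the CCCP algorithm mentioned above), the following sketch approximates C* for a small set of histograms by direct numerical minimization over the probability simplex, using a softmax parametrization and SciPy's general-purpose optimizer; the example histograms are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import jensenshannon

def js_centroid(distributions):
    """Numerically approximate C* = arg min_Q (1/n) * sum_i JSD(P_i || Q)."""
    distributions = np.asarray(distributions, dtype=float)
    n, k = distributions.shape

    def objective(logits):
        # Softmax parametrization keeps the candidate centroid on the simplex.
        q = np.exp(logits - logits.max())
        q /= q.sum()
        # jensenshannon returns the distance, so square it to get the divergence.
        return sum(jensenshannon(p, q, base=2) ** 2 for p in distributions) / n

    result = minimize(objective, np.zeros(k), method="Nelder-Mead")
    q = np.exp(result.x - result.x.max())
    return q / q.sum()

# Three illustrative histograms over the same three bins.
histograms = [[0.1, 0.4, 0.5], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]
print(js_centroid(histograms))
```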


Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison, in protein surface comparison, in the social sciences, in the quantitative study of history, in fire experiments, and in machine learning.



External links


* Ruby gem for calculating JS divergence
* The jensenshannon distance function in SciPy (scipy.spatial.distance.jensenshannon)
* https://sites.santafe.edu/~simon/page7/page7.html THOTH: a python package for the efficient estimation of information-theoretic quantities from empirical data
* statcomp R library for calculating complexity measures including Jensen-Shannon Divergence