Chow–Liu Tree

In probability theory and statistics, a Chow–Liu tree is an efficient method for constructing a second-order product approximation of a joint probability distribution, first described in a paper by Chow and Liu (1968). The goals of such a decomposition, as with such Bayesian networks in general, may be either data compression or inference.


The Chow–Liu representation

The Chow–Liu method describes a joint probability distribution P(X_1, X_2, \ldots, X_n) as a product of second-order conditional and marginal distributions. For example, the six-dimensional distribution P(X_1, X_2, X_3, X_4, X_5, X_6) might be approximated as

: P^{\prime}(X_1, X_2, X_3, X_4, X_5, X_6) = P(X_6 \mid X_5)\, P(X_5 \mid X_2)\, P(X_4 \mid X_2)\, P(X_3 \mid X_2)\, P(X_2 \mid X_1)\, P(X_1)

where each new term in the product introduces just one new variable, and the product can be represented as a first-order dependency tree. The Chow–Liu algorithm (below) determines which conditional probabilities are to be used in the product approximation. In general, unless there are no third-order or higher-order interactions, the Chow–Liu approximation is indeed an ''approximation'', and cannot capture the complete structure of the original distribution. The Chow–Liu tree has also been analysed in modern terms as a Bayesian network.
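
To make the factorization concrete, here is a small sketch (in Python; the variable names and probability tables are invented for illustration and are not from the original paper) that evaluates such a first-order dependency tree: each variable contributes one factor conditioned on its single parent, and the root contributes its marginal.

# Sketch: evaluating a first-order dependency tree factorization.
# The tree is given as a parent map; each node stores P(node | parent),
# or a marginal for the root.  All names and numbers are made up.

def tree_joint_probability(assignment, parents, tables):
    """P'(x) = prod_i P(x_i | x_parent(i)), with the root using its marginal."""
    prob = 1.0
    for var, parent in parents.items():
        if parent is None:                       # root: marginal P(x_var)
            prob *= tables[var][assignment[var]]
        else:                                    # conditional P(x_var | x_parent)
            prob *= tables[var][(assignment[var], assignment[parent])]
    return prob

# A tiny three-variable chain with binary variables: X1 -> X2 -> X3.
parents = {"X1": None, "X2": "X1", "X3": "X2"}
tables = {
    "X1": {0: 0.6, 1: 0.4},                                      # P(X1)
    "X2": {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8},  # P(X2 | X1)
    "X3": {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5},  # P(X3 | X2)
}
print(tree_joint_probability({"X1": 1, "X2": 1, "X3": 0}, parents, tables))
# 0.4 * 0.8 * 0.5 = 0.16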


The Chow–Liu algorithm

Chow and Liu show how to select second-order terms for the product approximation so that, among all such second-order approximations (first-order dependency trees), the constructed approximation P^{\prime} has the minimum Kullback–Leibler divergence to the actual distribution P, and is thus the ''closest'' approximation in the classical information-theoretic sense. The Kullback–Leibler divergence between a second-order product approximation and the actual distribution is shown to be

: D(P \parallel P^{\prime}) = -\sum I(X_i; X_{j(i)}) + \sum H(X_i) - H(X_1, X_2, \ldots, X_n)

where I(X_i; X_{j(i)}) is the mutual information between variable X_i and its parent X_{j(i)}, and H(X_1, X_2, \ldots, X_n) is the joint entropy of the variable set \{X_1, X_2, \ldots, X_n\}. Since the terms \sum H(X_i) and H(X_1, X_2, \ldots, X_n) are independent of the dependency ordering in the tree, only the sum of the pairwise mutual informations, \sum I(X_i; X_{j(i)}), determines the quality of the approximation. Thus, if every branch (edge) of the tree is given a weight corresponding to the mutual information between the variables at its vertices, then the tree which provides the optimal second-order approximation to the target distribution is just the ''maximum-weight tree''.

The equation above also highlights the role of the dependencies in the approximation: when no dependencies exist, the first term in the equation is absent, and we have only an approximation based on first-order marginals; the distance between the approximation and the true distribution is then due to the redundancies that are not accounted for when the variables are treated as independent. As we specify second-order dependencies, we begin to capture some of that structure and reduce the distance between the two distributions.

Chow and Liu provide a simple algorithm for constructing the optimal tree: at each stage of the procedure the algorithm simply adds the maximum mutual information pair to the tree. See the original paper for full details. A more efficient tree construction algorithm for the common case of sparse data was outlined in later work.

Chow and Wagner proved in a later paper that the learning of the Chow–Liu tree is consistent given samples (or observations) drawn i.i.d. from a tree-structured distribution. In other words, the probability of learning an incorrect tree decays to zero as the number of samples tends to infinity. The main idea of the proof is the continuity of the mutual information in the pairwise marginal distributions. More recently, the exponential rate of convergence of the error probability was established (Tan, Anandkumar, Tong and Willsky 2009).
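
The weighting-and-spanning-tree view above lends itself to a short illustration. The following sketch (Python; the function names, the plug-in empirical estimates, and the use of Kruskal's algorithm for the maximum-weight tree are illustrative choices, not details from the original paper) estimates pairwise mutual information from discrete samples and keeps the highest-weight edges that do not form a cycle.

# Minimal sketch of Chow-Liu tree construction from discrete data.
# Assumption (not from the original paper): the distribution is estimated
# empirically from a list of samples, each a tuple of discrete values.

from collections import Counter
from itertools import combinations
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i; X_j) in nats."""
    n = len(data)
    p_i = Counter(row[i] for row in data)
    p_j = Counter(row[j] for row in data)
    p_ij = Counter((row[i], row[j]) for row in data)
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

def chow_liu_tree(data, num_vars):
    """Edges of a maximum-weight spanning tree, where each candidate
    edge (i, j) is weighted by the empirical mutual information."""
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in combinations(range(num_vars), 2)}
    # Kruskal's algorithm, taking edges in order of decreasing weight.
    parent = list(range(num_vars))          # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # keeps the edge set acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy usage: X0 and X1 are perfectly coupled, X2 is nearly independent.
samples = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (1, 1, 0)]
print(chow_liu_tree(samples, num_vars=3))   # edge (0, 1) is always selected

To recover the conditional factors of the product approximation, the resulting undirected tree is rooted at an arbitrary variable and each edge is directed away from the root, giving one conditional distribution P(X_i \mid X_{j(i)}) per non-root variable.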


Variations on Chow–Liu trees

The obvious problem which arises when the actual distribution is not in fact a second-order dependency tree can still, in some cases, be addressed by fusing or aggregating densely connected subsets of variables to obtain a "large-node" Chow–Liu tree, or by extending the idea of greedy maximum-branch-weight selection to non-tree (multiple-parent) structures. (Similar techniques of variable substitution and construction are common in the Bayes network literature, e.g., for dealing with loops.)

Generalizations of the Chow–Liu tree are the so-called t-cherry junction trees. It has been proved that t-cherry junction trees provide an approximation of a discrete multivariate probability distribution that is at least as good as the one given by the Chow–Liu tree; third-order and general ''k''th-order t-cherry junction trees have also been studied. The second-order t-cherry junction tree is in fact the Chow–Liu tree.


See also

* Bayesian network
* Knowledge representation


References

* Chow, C. K.; Liu, C. N. (1968). "Approximating discrete probability distributions with dependence trees". ''IEEE Transactions on Information Theory'', 14 (3): 462–467.
* Tan, V. Y. F.; Anandkumar, A.; Tong, L.; Willsky, A. S. (2009). "A Large-Deviation Analysis for the Maximum-Likelihood Learning of Tree Structures". ''International Symposium on Information Theory (ISIT)'', July 2009.