
In statistics and machine learning, the hierarchical Dirichlet process (HDP) is a nonparametric Bayesian approach to clustering grouped data. It uses a Dirichlet process for each group of data, with the Dirichlet processes for all groups sharing a base distribution which is itself drawn from a Dirichlet process. This method allows groups to share statistical strength via sharing of clusters across groups. The base distribution being drawn from a Dirichlet process is important, because draws from a Dirichlet process are atomic probability measures, and the atoms will appear in all group-level Dirichlet processes. Since each atom corresponds to a cluster, clusters are shared across all groups. It was developed by Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David Blei and published in the ''Journal of the American Statistical Association'' in 2006, as a formalization and generalization of the infinite hidden Markov model published in 2002.


Model

This model description is sourced from Teh et al. (2006). The HDP is a model for grouped data, meaning that the data items come in multiple distinct groups. For example, in a topic model words are organized into documents, with each document formed by a bag (group) of words (data items). Indexing groups by j=1,\dots,J, suppose each group consists of data items x_{j1},\dots,x_{jn_j}.

The HDP is parameterized by a base distribution H that governs the a priori distribution over data items, and a number of concentration parameters that govern the a priori number of clusters and the amount of sharing across groups. The jth group is associated with a random probability measure G_j which has distribution given by a Dirichlet process:

: \begin{align} G_j \mid G_0 &\sim \operatorname{DP}(\alpha_j, G_0) \end{align}

where \alpha_j is the concentration parameter associated with the group, and G_0 is the base distribution shared across all groups. In turn, the common base distribution is Dirichlet process distributed:

: \begin{align} G_0 &\sim \operatorname{DP}(\alpha_0, H) \end{align}

with concentration parameter \alpha_0 and base distribution H. Finally, to relate the Dirichlet processes back to the observed data, each data item x_{ji} is associated with a latent parameter \theta_{ji}:

: \begin{align} \theta_{ji} \mid G_j &\sim G_j \\ x_{ji} \mid \theta_{ji} &\sim F(\theta_{ji}) \end{align}

The first line states that each parameter has a prior distribution given by G_j, while the second line states that each data item has a distribution F(\theta_{ji}) parameterized by its associated parameter. The resulting model is called an HDP mixture model, with the HDP referring to the hierarchically linked set of Dirichlet processes, and the mixture model referring to the way the Dirichlet processes are related to the data items.
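To make the generative structure concrete, the following sketch simulates from an HDP mixture model using a truncated stick-breaking approximation. The truncation level K, the Gaussian base distribution H over component means, the Gaussian likelihood F, and all numerical settings are illustrative assumptions rather than part of the HDP itself; the group weights are drawn as Dirichlet(\alpha_j \beta), which is the form a Dirichlet process draw takes when its base measure is discrete with finitely many atoms.

```python
# Generative sketch of an HDP mixture model with a truncated stick-breaking
# representation (truncation level K). Assumed, illustrative choices:
# H = Normal(0, 5^2) over component means and F(theta) = Normal(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

K = 50                       # truncation level for the "infinite" set of atoms
alpha0 = 1.0                 # top-level concentration parameter
alpha_j = [1.0, 1.0, 1.0]    # per-group concentration parameters
n_j = [100, 100, 100]        # number of data items in each group

# Top-level DP: G_0 ~ DP(alpha0, H), represented by atoms theta*_k and weights beta.
theta_star = rng.normal(0.0, 5.0, size=K)               # atoms theta*_k ~ H
v = rng.beta(1.0, alpha0, size=K)                        # stick-breaking fractions
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()                                       # renormalise after truncation

groups = []
for a, n in zip(alpha_j, n_j):
    # Group-level DP: G_j ~ DP(alpha_j, G_0). Since G_0 is discrete with weights
    # beta, the group weights are Dirichlet(alpha_j * beta) over the SAME atoms,
    # which is how clusters come to be shared across groups.
    pi_j = rng.dirichlet(a * beta)
    z = rng.choice(K, size=n, p=pi_j)                    # latent cluster assignments
    x = rng.normal(theta_star[z], 1.0)                   # x_ji ~ F(theta_ji)
    groups.append(x)
```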
To understand how the HDP implements a clustering model, and how clusters become shared across groups, recall that draws from a Dirichlet process are atomic probability measures with probability one. This means that the common base distribution G_0 has a form which can be written as:

: \begin{align} G_0 &= \sum_{k=1}^\infty \pi_{0k}\delta_{\theta^*_k} \end{align}

where there are an infinite number of atoms, \theta^*_k, k=1,2,\dots, assuming that the overall base distribution H has infinite support. Each atom is associated with a mass \pi_{0k}. The masses have to sum to one since G_0 is a probability measure. Since G_0 is itself the base distribution for the group-specific Dirichlet processes, each G_j will have atoms given by the atoms of G_0, and can itself be written in the form:

: \begin{align} G_j &= \sum_{k=1}^\infty \pi_{jk}\delta_{\theta^*_k} \end{align}

Thus the set of atoms is shared across all groups, with each group having its own group-specific atom masses. Relating this representation back to the observed data, we see that each data item is described by a mixture model:

: \begin{align} x_{ji} \mid G_j &\sim \sum_{k=1}^\infty \pi_{jk} F(\theta^*_k) \end{align}

where the atoms \theta^*_k play the role of the mixture component parameters, while the masses \pi_{jk} play the role of the mixing proportions. In conclusion, each group of data is modeled using a mixture model, with mixture components shared across all groups but mixing proportions being group-specific. In clustering terms, we can interpret each mixture component as modeling a cluster of data items, with clusters shared across all groups, and each group, having its own mixing proportions, composed of different combinations of clusters.
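As a small self-contained illustration of this last point, the sketch below evaluates the group-specific mixture density p(x \mid G_j) = \sum_k \pi_{jk} F(x \mid \theta^*_k) for two groups that share the same atoms but have different mixing proportions; the atoms, weights and Gaussian form of F are made-up illustrative values.

```python
# Illustrative only: two groups share the same atoms (clusters) but have
# group-specific mixing proportions, as in the HDP mixture model.
import numpy as np
from scipy.stats import norm

theta_star = np.array([-3.0, 0.0, 4.0])   # shared atoms theta*_k
pi_1 = np.array([0.7, 0.2, 0.1])          # mixing proportions of group 1
pi_2 = np.array([0.1, 0.3, 0.6])          # group 2 reuses the same atoms

def mixture_density(x, pi, atoms, sigma=1.0):
    """p(x | G_j) under F(theta) = Normal(theta, sigma^2)."""
    return float(np.sum(pi * norm.pdf(x, loc=atoms, scale=sigma)))

print(mixture_density(0.0, pi_1, theta_star))   # group 1 density at x = 0
print(mixture_density(0.0, pi_2, theta_star))   # group 2 density at x = 0
```

The same data value receives different densities in different groups purely because the mixing proportions differ, while the cluster locations are shared.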


Applications

The HDP mixture model is a natural nonparametric generalization of latent Dirichlet allocation, where the number of topics can be unbounded and learnt from data. Here each group is a document consisting of a bag of words, each cluster is a topic, and each document is a mixture of topics. The HDP is also a core component of the infinite hidden Markov model (Beal, M.J., Ghahramani, Z. and Rasmussen, C.E. (2002). "The infinite hidden Markov model" (PDF). Advances in Neural Information Processing Systems 14: 577–585. Cambridge, MA: MIT Press.), which is a nonparametric generalization of the hidden Markov model allowing the number of states to be unbounded and learnt from data (Fox, Emily B., et al. "A sticky HDP-HMM with application to speaker diarization." The Annals of Applied Statistics (2011): 1020–1056.).
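As a usage illustration for the topic-modelling case, the sketch below fits an HDP topic model with the gensim library. This assumes gensim is installed and that the HdpModel API behaves as shown (it may differ between versions); the tiny corpus is invented purely for demonstration.

```python
# Hedged sketch: HDP topic modelling with gensim (API assumed; may vary by version).
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# A made-up toy corpus: each document is a bag of words (one group of data items).
texts = [
    ["dirichlet", "process", "cluster", "bayesian", "prior"],
    ["hidden", "markov", "model", "states", "sequence"],
    ["topic", "model", "document", "words", "bayesian"],
]

dictionary = Dictionary(texts)                          # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

hdp = HdpModel(corpus=corpus, id2word=dictionary)       # number of topics is inferred
for topic in hdp.print_topics(num_topics=5, num_words=4):
    print(topic)
```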


Generalizations

The HDP can be generalized in a number of directions. The Dirichlet processes can be replaced by Pitman-Yor processes and gamma processes, resulting in the hierarchical Pitman-Yor process and the hierarchical gamma process. The hierarchy can be deeper, with multiple levels of groups arranged in a hierarchy. Such an arrangement has been exploited in the sequence memoizer, a Bayesian nonparametric model for sequences which has a multi-level hierarchy of Pitman-Yor processes. In addition, the Bayesian multi-domain learning (BMDL) model derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization, allowing accurate cancer subtyping even when the number of samples for a specific cancer type is small (Hajiramezanali, E., Dadaneh, S. Z., Karbalayghareh, A., Zhou, Z. and Qian, X. "Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data" (PDF). 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.).


See also

* Chinese restaurant process


References

* Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. (2006). "Hierarchical Dirichlet Processes". Journal of the American Statistical Association. 101 (476): 1566–1581.