In probability theory, Dirichlet processes (after the distribution associated with Peter Gustav Lejeune Dirichlet) are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution. As an example, a bag of 100 real-world dice is a ''random probability mass function (random pmf)''—to sample this random pmf you put your hand in the bag and draw out a die, that is, you draw a pmf. A bag of dice manufactured using a crude process 100 years ago will likely have probabilities that deviate wildly from the uniform pmf, whereas a bag of state-of-the-art dice used by Las Vegas casinos may have barely perceptible imperfections. We can model the randomness of pmfs with the Dirichlet distribution.

The Dirichlet process is specified by a base distribution H and a positive real number \alpha called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit of \alpha\rightarrow 0, the realizations are all concentrated at a single value, while in the limit of \alpha\rightarrow\infty the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as \alpha increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.

The Dirichlet process was formally introduced by Thomas S. Ferguson in 1973. It has since been applied in data mining and machine learning, among others for natural language processing, computer vision and bioinformatics.


Introduction

Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion. Specifically, suppose that the generation of values X_1,X_2,\dots can be simulated by the following algorithm.

:Input: H (a probability distribution called base distribution), \alpha (a positive real number called scaling parameter)
:For n\ge 1:
:::: a) With probability \frac{\alpha}{\alpha+n-1} draw X_n from H.
:::: b) With probability \frac{n_x}{\alpha+n-1} set X_n=x, where n_x is the number of previous observations of x.
::::: (Formally, n_x := |\{j\colon X_j=x \text{ and } j<n\}|, where |\cdot| denotes the number of elements in the set.)

At the same time, another common model for data is that the observations X_1,X_2,\dots are assumed to be independent and identically distributed (i.i.d.) according to some (random) distribution P. The goal of introducing Dirichlet processes is to be able to describe the procedure outlined above in this i.i.d. model. The observations X_1,X_2,\dots in the algorithm are not independent, since we have to consider the previous results when generating the next value. They are, however, exchangeable. This fact can be shown by calculating the joint probability distribution of the observations and noticing that the resulting formula depends only on which x values occur among the observations and how many repetitions they each have. Because of this exchangeability, de Finetti's representation theorem applies and implies that the observations X_1,X_2,\dots are conditionally independent given a (latent) distribution P. This P is a random variable itself and has a distribution. This distribution (over distributions) is called a Dirichlet process (\operatorname{DP}). In summary, this means that we get an equivalent procedure to the above algorithm:

# Draw a distribution P from \operatorname{DP}\left(H,\alpha\right).
# Draw observations X_1,X_2,\dots independently from P.

In practice, however, drawing a concrete distribution P is impossible, since its specification requires an infinite amount of information. This is a common phenomenon in the context of Bayesian non-parametric statistics, where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters. The key insight is that in many applications the infinite-dimensional distributions appear only as an intermediary computational device and are not required for either the initial specification of prior beliefs or for the statement of the final inference.
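The sequential scheme above is straightforward to simulate directly. The following is a minimal sketch in Python (NumPy is assumed as a dependency, and the standard normal base distribution in the usage line is purely illustrative):

```python
import numpy as np

def sample_sequence(n, alpha, base_draw, rng=None):
    """Draw X_1, ..., X_n by the 'rich get richer' scheme above.

    base_draw: callable taking an RNG and returning one draw from H.
    """
    rng = np.random.default_rng() if rng is None else rng
    draws = []
    for i in range(n):  # i plays the role of n - 1 in the text's indexing
        # a) With probability alpha / (alpha + i), draw a fresh value from H.
        if rng.uniform() < alpha / (alpha + i):
            draws.append(base_draw(rng))
        # b) Otherwise repeat a uniformly chosen previous value; this selects
        #    value x with probability n_x / (alpha + i), as in step b) above.
        else:
            draws.append(draws[rng.integers(i)])
    return draws

# Example: H = standard normal, moderate concentration.
xs = sample_sequence(1000, alpha=5.0, base_draw=lambda r: r.normal())
```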


Formal definition

Given a measurable set ''S'', a base probability distribution ''H'' and a positive real number \alpha, the Dirichlet process \operatorname{DP}(H, \alpha) is a stochastic process whose sample path (or realization, i.e. an infinite sequence of random variates drawn from the process) is a probability distribution over ''S'', such that the following holds. For any measurable finite partition of ''S'', denoted \{B_i\}_{i=1}^n,

:if X \sim \operatorname{DP}(H,\alpha),
:then (X(B_1),\dots,X(B_n)) \sim \operatorname{Dir}(\alpha H(B_1),\dots, \alpha H(B_n)),

where \operatorname{Dir} denotes the Dirichlet distribution and the notation X \sim D means that the random variable X has the distribution D.
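For example, take S = \mathbb{R}, H the standard normal distribution and the two-cell partition B_1 = (-\infty, 0], B_2 = (0, \infty). Then for X \sim \operatorname{DP}(H,\alpha) the pair (X(B_1),X(B_2)) follows \operatorname{Dir}(\alpha/2, \alpha/2), i.e. the random mass X(B_1) is \operatorname{Beta}(\alpha/2,\alpha/2)-distributed with mean H(B_1)=1/2 and variance H(B_1)(1-H(B_1))/(\alpha+1)=\tfrac{1}{4(\alpha+1)}; a large \alpha thus forces realizations to track the base distribution closely.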


Alternative views

There are several equivalent views of the Dirichlet process. Besides the formal definition above, the Dirichlet process can be defined implicitly through de Finetti's theorem as described in the first section; this is often called the Chinese restaurant process. A third alternative is the stick-breaking process, which defines the Dirichlet process constructively by writing a distribution sampled from the process as f(x)=\sum_{k=1}^\infty\beta_k\delta_{x_k}(x), where \{x_k\}_{k=1}^\infty are samples from the base distribution H, \delta_{x_k} is an indicator function centered on x_k (zero everywhere except for \delta_{x_k}(x_k)=1) and the \beta_k are defined by a recursive scheme that repeatedly samples from the beta distribution \operatorname{Beta}(1,\alpha).


The Chinese restaurant process

A widely employed metaphor for the Dirichlet process is based on the so-called Chinese restaurant process. The metaphor is as follows: Imagine a Chinese restaurant which customers enter one after another. A new customer sits down at a table with a probability proportional to the number of customers already sitting there. Additionally, a customer opens a new table with a probability proportional to the scaling parameter \alpha. After infinitely many customers have entered, one obtains a probability distribution over infinitely many tables. This probability distribution over the tables is a random sample of the probabilities of observations drawn from a Dirichlet process with scaling parameter \alpha. If one associates draws from the base measure H with every table, the resulting distribution over the sample space S is a random sample of a Dirichlet process. The Chinese restaurant process is related to the Pólya urn sampling scheme, which yields samples from finite Dirichlet distributions.

Because customers sit at a table with a probability proportional to the number of customers already sitting at the table, two properties of the DP can be deduced:
# The Dirichlet process exhibits a self-reinforcing property: the more often a given value has been sampled in the past, the more likely it is to be sampled again.
# Even if H is a distribution over an uncountable set, there is a nonzero probability that two samples will have exactly the same value, because the probability mass will concentrate on a small number of tables.
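The table dynamics are easy to simulate. Below is a minimal sketch in Python (NumPy assumed) that tracks only table occupancies; draws from the base measure H would then be attached to the tables as labels:

```python
import numpy as np

def chinese_restaurant(n_customers, alpha, rng=None):
    """Return the table sizes after seating n_customers."""
    rng = np.random.default_rng() if rng is None else rng
    tables = []  # tables[k] = number of customers at table k
    for _ in range(n_customers):
        # Seating probabilities: existing tables in proportion to their
        # occupancy, a new table in proportion to alpha.
        weights = np.array(tables + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)      # open a new table
        else:
            tables[choice] += 1
    return tables

print(chinese_restaurant(1000, alpha=2.0))  # a few large tables, many small ones
```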


The stick-breaking process

A third approach to the Dirichlet process is the so-called stick-breaking process view. Conceptually, this involves repeatedly breaking off and discarding a random fraction (sampled from a beta distribution) of a "stick" that is initially of length 1. Remember that draws from a Dirichlet process are distributions over a set S. As noted previously, the distribution drawn is discrete with probability 1. In the stick-breaking process view, we explicitly use the discreteness and give the probability mass function of this (random) discrete distribution as:

: f(\theta) = \sum_{k=1}^\infty \beta_k \cdot \delta_{\theta_k}(\theta)

where \delta_{\theta_k} is the indicator function which evaluates to zero everywhere, except for \delta_{\theta_k}(\theta_k)=1. Since this distribution is random itself, its mass function is parameterized by two sets of random variables: the locations \left\{\theta_k\right\}_{k=1}^\infty and the corresponding probabilities \left\{\beta_k\right\}_{k=1}^\infty. In the following, we present without proof what these random variables are.

The locations \theta_k are independent and identically distributed according to H, the base distribution of the Dirichlet process. The probabilities \beta_k are given by a procedure resembling the breaking of a unit-length stick (hence the name):

:\beta_k = \beta'_k\cdot\prod_{i=1}^{k-1}\left(1-\beta'_i\right)

where \beta'_k are independent random variables with the beta distribution \operatorname{Beta}(1,\alpha). The resemblance to 'stick-breaking' can be seen by considering \beta_k as the length of a piece of a stick. We start with a unit-length stick and in each step we break off a portion of the remaining stick according to \beta'_k and assign this broken-off piece to \beta_k. The formula can be understood by noting that after the first ''k'' − 1 values have their portions assigned, the length of the remainder of the stick is \prod_{i=1}^{k-1}\left(1-\beta'_i\right), and this piece is broken according to \beta'_k and gets assigned to \beta_k. The smaller \alpha is, the less of the stick will be left for subsequent values (on average), yielding more concentrated distributions.

The stick-breaking process is similar to the construction where one samples sequentially from marginal beta distributions in order to generate a sample from a Dirichlet distribution.
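A truncated stick-breaking sampler makes this construction concrete. The sketch below (Python, NumPy assumed) cuts the infinite sum off at a fixed number of atoms and folds the leftover stick into the last weight; the truncation is an approximation chosen for illustration, not part of the definition:

```python
import numpy as np

def stick_breaking(alpha, base_draw, n_atoms, rng=None):
    """Approximate draw from DP(H, alpha), truncated to n_atoms atoms.

    Returns (locations, weights): a discrete distribution over S.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=n_atoms)        # the beta'_k
    # Length of the remaining stick before break k: prod_{i<k} (1 - beta'_i).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                       # beta_k
    weights[-1] += 1.0 - weights.sum()                # fold leftover stick into the tail
    locations = np.array([base_draw(rng) for _ in range(n_atoms)])  # theta_k ~ H
    return locations, weights

locs, w = stick_breaking(alpha=3.0, base_draw=lambda r: r.normal(), n_atoms=200)
```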


The Pólya urn scheme

Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Pólya urn scheme sometimes called the Blackwell–MacQueen sampling scheme. Imagine that we start with an urn filled with \alpha black balls. Then we proceed as follows: # Each time we need an observation, we draw a ball from the urn. # If the ball is black, we generate a new (non-black) colour uniformly, label a new ball this colour, drop the new ball into the urn along with the ball we drew, and return the colour we generated. # Otherwise, label a new ball with the colour of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the colour we observed. The resulting distribution over colours is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new colour, we instead pick a random value from a base distribution H and use that value to label the new ball, the resulting distribution over labels will be the same as the distribution over the values in a Dirichlet process.


Use as a prior distribution

The Dirichlet process can be used as a prior distribution to estimate the probability distribution that generates the data. In this section, we consider the model

: \begin{align} P & \sim \operatorname{DP}(H, \alpha)\\ X_1, \ldots, X_n \mid P & \overset{\text{iid}}{\sim} P. \end{align}

The Dirichlet process satisfies prior conjugacy, posterior consistency, and a Bernstein–von Mises theorem.


Prior conjugacy

In this model, the posterior distribution is again a Dirichlet process. This means that the Dirichlet process is a conjugate prior for this model. The posterior distribution is given by

: \begin{align} P \mid X_1, \ldots, X_n &\sim \operatorname{DP}\left(\frac{\alpha}{\alpha + n} H + \frac{1}{\alpha + n} \sum_{i=1}^n \delta_{X_i}, \; \alpha + n \right) \\ &= \operatorname{DP}\left(\frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} \mathbb{P}_n, \; \alpha + n \right) \end{align}

where \mathbb{P}_n is the empirical measure defined below.
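The conjugate update is simply a reweighting of the base measure toward the observed data. A minimal sketch (Python, NumPy assumed; base_draw stands in for sampling from H):

```python
import numpy as np

def posterior_base_draw(data, alpha, base_draw, rng=None):
    """One draw from the posterior base measure
    (alpha * H + sum_i delta_{X_i}) / (alpha + n)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    if rng.uniform() < alpha / (alpha + n):
        return base_draw(rng)         # prior component H
    return data[rng.integers(n)]      # empirical component delta_{X_i}

# The posterior itself is DP(posterior base, alpha + n); feeding this draw
# function into the stick-breaking sketch above yields approximate
# posterior realizations of P.
```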


Posterior consistency

If we take the frequentist view of probability, we believe there is a true probability distribution P_0 that generated the data. Then it turns out that the Dirichlet process is consistent in the weak topology, which means that for every weak neighbourhood U of P_0, the posterior probability of U converges to 1.


Bernstein–von Mises theorem

In order to interpret credible sets as confidence sets, a Bernstein–von Mises theorem is needed. In the case of the Dirichlet process we compare the posterior distribution with the empirical process \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}. Suppose \mathcal{F} is a P_0-Donsker class, i.e.

: \sqrt{n} \left( \mathbb{P}_n - P_0 \right) \rightsquigarrow G_{P_0}

for some Brownian bridge G_{P_0}. Suppose also that there exists a function F such that F(x) \geq \sup_{f \in \mathcal{F}} f(x) and \int F^2 \, \mathrm{d}H < \infty. Then, P_0-almost surely,

: \sqrt{n} \left( P - \mathbb{P}_n \right) \mid X_1, \cdots, X_n \rightsquigarrow G_{P_0}.

This implies that the credible sets one constructs are asymptotic confidence sets, and that Bayesian inference based on the Dirichlet process is asymptotically also valid frequentist inference.


Use in Dirichlet mixture models

To understand what Dirichlet processes are and the problem they solve, we consider the example of data clustering. It is a common situation that data points are assumed to be distributed in a hierarchical fashion where each data point belongs to a (randomly chosen) cluster and the members of a cluster are further distributed randomly within that cluster.


Example 1

For example, we might be interested in how people will vote on a number of questions in an upcoming election. A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate and then model the event that a voter says "Yes" to any particular question as a Bernoulli random variable with the probability dependent on which political cluster they belong to. By looking at how votes were cast in previous years on similar pieces of legislation one could fit a predictive model using a simple clustering algorithm such as ''k''-means. That algorithm, however, requires knowing in advance the number of clusters that generated the data. In many situations, it is not possible to determine this ahead of time, and even when we can reasonably assume a number of clusters we would still like to be able to check this assumption. For example, in the voting example above the division into liberal, conservative and moderate might not be finely tuned enough; attributes such as religion, class or race could also be critical for modelling voter behaviour, resulting in more clusters in the model.


Example 2

As another example, we might be interested in modelling the velocities of galaxies using a simple model assuming that the velocities are clustered, for instance by assuming each velocity is distributed according to the normal distribution v_i\sim N(\mu_k,\sigma^2), where the ''i''th observation belongs to the ''k''th cluster of galaxies with common expected velocity. In this case it is far from obvious how to determine a priori how many clusters (of common velocities) there should be, and any model for this would be highly suspect and should be checked against the data. By using a Dirichlet process prior for the distribution of cluster means we circumvent the need to explicitly specify ahead of time how many clusters there are, although the concentration parameter still controls it implicitly.

We consider this example in more detail. A first naive model is to presuppose that there are K clusters of normally distributed velocities with common known fixed variance \sigma^2. Denoting the event that the ''i''th observation is in the ''k''th cluster as z_i=k, we can write this model as:

: \begin{align} (v_i \mid z_i=k, \mu_k) & \sim N(\mu_k,\sigma^2) \\ \operatorname{P}(z_i=k) & = \pi_k \\ (\boldsymbol{\pi}\mid \alpha) & \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right) \\ \mu_k & \sim H(\lambda) \end{align}

That is, we assume that the data belong to K distinct clusters with means \mu_k and that \pi_k is the (unknown) prior probability of a data point belonging to the ''k''th cluster. We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior \operatorname{Dir}\left(\alpha/K\cdot\mathbf{1}_K\right). Here \operatorname{Dir} denotes the Dirichlet distribution and \mathbf{1}_K denotes a vector of length K where each element is 1. We further assign independent and identical prior distributions H(\lambda) to each of the cluster means, where H may be any parametric distribution with parameters denoted as \lambda. The hyper-parameters \alpha and \lambda are taken to be known fixed constants, chosen to reflect our prior beliefs about the system. To understand the connection to Dirichlet process priors we rewrite this model in an equivalent but more suggestive form:

: \begin{align} (v_i \mid \tilde{\mu}_i) & \sim N(\tilde{\mu}_i,\sigma^2) \\ \tilde{\mu}_i & \sim G=\sum_{k=1}^K \pi_k \delta_{\mu_k} (\tilde{\mu}_i) \\ (\boldsymbol{\pi}\mid \alpha) & \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K \right) \\ \mu_k & \sim H(\lambda) \end{align}

Instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster, we now think of each observation as being associated with a parameter \tilde{\mu}_i drawn from some discrete distribution G with support on the K means. That is, we are now treating the \tilde{\mu}_i as being drawn from the random distribution G, and our prior information is incorporated into the model by the distribution over distributions G.

We would now like to extend this model to work without pre-specifying a fixed number of clusters K. Mathematically, this means we would like to select a random prior distribution G(\tilde{\mu}_i)=\sum_{k=1}^\infty\pi_k \delta_{\mu_k}(\tilde{\mu}_i) where the values of the cluster means \mu_k are again independently distributed according to H\left(\lambda\right) and the distribution over \pi_k is symmetric over the infinite set of clusters. This is exactly what is accomplished by the model:

: \begin{align} (v_i \mid \tilde{\mu}_i) & \sim N(\tilde{\mu}_i,\sigma^2)\\ \tilde{\mu}_i & \sim G \\ G & \sim \operatorname{DP}(H(\lambda),\alpha) \end{align}

With this in hand we can better understand the computational merits of the Dirichlet process. Suppose that we wanted to draw n observations from the naive model with exactly K clusters. A simple algorithm for doing this would be to draw K values of \mu_k from H(\lambda), a distribution \boldsymbol{\pi} from \operatorname{Dir}\left(\alpha/K\cdot\mathbf{1}_K\right), and then for each observation independently sample the cluster k with probability \pi_k and the value of the observation according to N\left(\mu_k,\sigma^2\right). It is easy to see that this algorithm does not work in the case where we allow infinite clusters, because this would require sampling an infinite-dimensional parameter \boldsymbol{\pi}. However, it is still possible to sample observations v_i. One can e.g. use the Chinese restaurant representation described above and calculate the probability for used clusters and for a new cluster to be created, as shown in the sketch below. This avoids having to explicitly specify \boldsymbol{\pi}.
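A sketch of that last point follows (Python, NumPy assumed; the base measure H(\lambda) is taken to be a wide normal prior on the cluster means purely for illustration). The sampler never represents \boldsymbol{\pi} explicitly:

```python
import numpy as np

def sample_dp_mixture(n, alpha, sigma, mean_prior_draw, rng=None):
    """Draw n observations v_i from the DP mixture via the
    Chinese restaurant representation (no infinite pi needed)."""
    rng = np.random.default_rng() if rng is None else rng
    counts, means, v = [], [], []
    for _ in range(n):
        # Existing clusters are chosen in proportion to their size,
        # a new cluster in proportion to alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):                    # open a new cluster
            counts.append(1)
            means.append(mean_prior_draw(rng))  # mu_k ~ H(lambda)
        else:
            counts[k] += 1
        v.append(rng.normal(means[k], sigma))   # v_i ~ N(mu_k, sigma^2)
    return np.array(v)

vs = sample_dp_mixture(500, alpha=1.0, sigma=1.0,
                       mean_prior_draw=lambda r: r.normal(0.0, 10.0))
```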
Other solutions are based on a truncation of clusters: a (high) upper bound on the true number of clusters is introduced, and all clusters beyond this bound are treated as one cluster.

Fitting the model described above based on observed data D means finding the posterior distribution p\left(\boldsymbol{\pi},\boldsymbol{\mu}\mid D\right) over cluster probabilities and their associated means. In the infinite-dimensional case it is obviously impossible to write down the posterior explicitly. It is, however, possible to draw samples from this posterior using a modified Gibbs sampler. This is the critical fact that makes the Dirichlet process prior useful for inference.
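As a hedged sketch of such a sampler, the following Python code (NumPy assumed) implements a collapsed, Chinese-restaurant-style Gibbs sweep in the spirit of Neal's Algorithm 3 for the normal mixture above, with a conjugate N(\mu_0,\tau^2) base measure chosen so the predictive densities are available in closed form:

```python
import numpy as np

def dp_mixture_gibbs(v, alpha, sigma, mu0, tau, n_iters=200, rng=None):
    """Collapsed Gibbs sampler for the DP mixture of normals with known
    sigma and a N(mu0, tau^2) base measure. Returns final assignments."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(v)
    z = np.zeros(n, dtype=int)               # start with everything in one cluster
    counts, sums = {0: n}, {0: float(np.sum(v))}

    def predictive(x, m, s):
        # Posterior-predictive density of x for a cluster with m members
        # summing to s (conjugate normal-normal update).
        var_post = 1.0 / (1.0 / tau**2 + m / sigma**2)
        mu_post = var_post * (mu0 / tau**2 + s / sigma**2)
        var_pred = var_post + sigma**2
        return np.exp(-0.5 * (x - mu_post)**2 / var_pred) / np.sqrt(2 * np.pi * var_pred)

    for _ in range(n_iters):
        for i in range(n):
            # Remove observation i from its current cluster.
            k = z[i]
            counts[k] -= 1
            sums[k] -= v[i]
            if counts[k] == 0:
                del counts[k], sums[k]
            # CRP-weighted predictive probabilities: existing clusters + new one.
            keys = list(counts)
            probs = [counts[k2] * predictive(v[i], counts[k2], sums[k2]) for k2 in keys]
            probs.append(alpha * predictive(v[i], 0, 0.0))
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            if choice == len(keys):           # open a new cluster
                k_new = max(counts, default=-1) + 1
                counts[k_new], sums[k_new] = 0, 0.0
                z[i] = k_new
            else:
                z[i] = keys[choice]
            counts[z[i]] += 1
            sums[z[i]] += v[i]
    return z
```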


Applications of the Dirichlet process

Dirichlet processes are frequently used in ''Bayesian nonparametric statistics''. "Nonparametric" here does not mean a parameter-less model, but rather a model in which representations grow as more data are observed. Bayesian nonparametric models have gained considerable popularity in the field of machine learning because of the above-mentioned flexibility, especially in unsupervised learning. In a Bayesian nonparametric model, the prior and posterior distributions are not parametric distributions, but stochastic processes. The fact that the Dirichlet distribution is a probability distribution on the simplex of sets of non-negative numbers that sum to one makes it a good candidate to model distributions over distributions or distributions over functions. Additionally, the nonparametric nature of this model makes it an ideal candidate for clustering problems where the distinct number of clusters is unknown beforehand. In addition, the Dirichlet process has also been used for developing mixture of expert models in the context of supervised learning algorithms (regression or classification settings); examples include mixtures of Gaussian process experts, where the number of required experts must be inferred from the data.

As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models. In this case, S is the parametric set of component distributions. The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point, in turn, a value is drawn from this sample distribution and used as the component distribution for that data point. The fact that there is no limit to the number of distinct components which may be generated makes this kind of model appropriate for the case when the number of mixture components is not well-defined in advance. Examples include the infinite mixture of Gaussians model, as well as associated mixture regression models (e.g. Sotirios P. Chatzis, Dimitrios Korkinof, and Yiannis Demiris, "A nonparametric Bayesian approach toward robot learning by demonstration," Robotics and Autonomous Systems, vol. 60, no. 6, pp. 789–802, June 2012). The infinite nature of these models also lends them to natural language processing applications, where it is often desirable to treat the vocabulary as an infinite, discrete set.

The Dirichlet process can also be used for nonparametric hypothesis testing, i.e. to develop Bayesian nonparametric versions of the classical nonparametric hypothesis tests, e.g. the sign test, the Wilcoxon rank-sum test, the Wilcoxon signed-rank test, etc. For instance, Bayesian nonparametric versions of the Wilcoxon rank-sum test and the Wilcoxon signed-rank test have been developed by using the imprecise Dirichlet process, a prior-ignorance Dirichlet process.


Related distributions

* The Pitman–Yor process is a generalization of the Dirichlet process to accommodate power-law tails.
* The hierarchical Dirichlet process extends the ordinary Dirichlet process for modelling grouped data.


References


External links


Introduction to the Dirichlet Distribution and Related Processes by Frigyik, Kapila and Gupta

Yee Whye Teh's overview of Dirichlet processes

Webpage for the NIPS 2003 workshop on non-parametric Bayesian methods

Michael Jordan's NIPS 2005 tutorial: ''Nonparametric Bayesian Methods: Dirichlet Processes, Chinese Restaurant Processes and All That''

Peter Green's summary of construction of Dirichlet Processes

Peter Green's paper on probabilistic models of Dirichlet Processes with implications for statistical modelling and analysis

Zoubin Ghahramani's UAI 2005 tutorial on Nonparametric Bayesian methods

GIMM software for performing cluster analysis using Infinite Mixture Models


by Zhiyuan Weng