Conditional Random Fields

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

Other examples where CRFs are used are: labeling or parsing of sequential data for natural language processing or biological sequences, part-of-speech tagging, shallow parsing, named entity recognition, gene finding, peptide critical functional region finding, and object recognition and image segmentation in computer vision.


Description

CRFs are a type of discriminative undirected probabilistic graphical model. Lafferty, McCallum and Pereira define a CRF on observations \boldsymbol{X} and random variables \boldsymbol{Y} as follows:

Let G = (V, E) be a graph such that \boldsymbol{Y} = (\boldsymbol{Y}_v)_{v \in V}, so that \boldsymbol{Y} is indexed by the vertices of G. Then (\boldsymbol{X}, \boldsymbol{Y}) is a conditional random field when each random variable \boldsymbol{Y}_v, conditioned on \boldsymbol{X}, obeys the Markov property with respect to the graph; that is, its probability is dependent only on its neighbours in G: P(\boldsymbol{Y}_v | \boldsymbol{X}, \{\boldsymbol{Y}_w : w \neq v\}) = P(\boldsymbol{Y}_v | \boldsymbol{X}, \{\boldsymbol{Y}_w : w \sim v\}), where w \sim v means that w and v are neighbors in G.

What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets \boldsymbol{X} and \boldsymbol{Y}, the observed and output variables, respectively; the conditional distribution p(\boldsymbol{Y} | \boldsymbol{X}) is then modeled.
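
For a strictly positive distribution, the Hammersley–Clifford theorem lets the Markov condition above be rewritten as a product of factors over the cliques of G. The form below is the usual way this is written in the CRF literature rather than something stated in this section; the symbols \psi_c (clique potentials), \mathcal{C} (the set of cliques of G) and Z (the partition function) are notation introduced here for illustration.

```latex
p(\boldsymbol{Y} \mid \boldsymbol{X})
  = \frac{1}{Z(\boldsymbol{X})} \prod_{c \in \mathcal{C}} \psi_c(\boldsymbol{Y}_c, \boldsymbol{X}),
\qquad
Z(\boldsymbol{X})
  = \sum_{\boldsymbol{Y}'} \prod_{c \in \mathcal{C}} \psi_c(\boldsymbol{Y}'_c, \boldsymbol{X})
```

Note that Z depends on \boldsymbol{X} but not on \boldsymbol{Y}: only the output variables are summed out, which is what makes the model conditional rather than joint.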


Inference

For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is basically the same as for a Markov random field (MRF) and the same arguments hold. However, there exist special cases for which exact inference is feasible:
* If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithms for the case of hidden Markov models (HMMs).
* If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions.

If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include:
* Loopy belief propagation
* Alpha expansion
* Mean field inference
* Linear programming relaxations
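
The chain case mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes log-space scores split into per-position `unary` scores and a single shared `pairwise` matrix, and all function names and numbers are made up for the example. `log_partition` is the forward pass (sum over all labelings) and `viterbi` is the corresponding max pass.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(labels, unary, pairwise):
    """Unnormalised log-score of one complete labeling of the chain."""
    total = sum(unary[i][y] for i, y in enumerate(labels))
    total += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return total


def log_partition(unary, pairwise):
    """Forward algorithm: log of the sum of exp(score) over all labelings."""
    n_labels = len(unary[0])
    alpha = list(unary[0])
    for i in range(1, len(unary)):
        alpha = [
            unary[i][b]
            + logsumexp([alpha[a] + pairwise[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)


def viterbi(unary, pairwise):
    """Max-product pass: highest-scoring labeling and its score."""
    n_labels = len(unary[0])
    delta = list(unary[0])
    backpointers = []
    for i in range(1, len(unary)):
        new_delta, pointers = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels), key=lambda a: delta[a] + pairwise[a][b])
            pointers.append(best_a)
            new_delta.append(delta[best_a] + pairwise[best_a][b] + unary[i][b])
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda b: delta[b])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(delta)


# Hypothetical scores: 3 positions, 2 labels, neighbours prefer equal labels.
unary = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
pairwise = [[0.8, -0.5], [-0.5, 0.8]]

best_path, best_score = viterbi(unary, pairwise)
log_z = log_partition(unary, pairwise)
print(best_path, math.exp(best_score - log_z))  # best labeling and its probability
```

Both passes cost time proportional to the sequence length times the square of the number of labels, whereas enumerating every labeling grows exponentially with the sequence length.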


Parameter Learning

Learning the parameters \theta is usually done by maximum likelihood learning for p(Y_i | X_i; \theta). If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. It can be solved, for example, using gradient descent algorithms or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used.
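
As a rough sketch of the fully observed case, the snippet below fits a tiny log-linear chain model by plain gradient ascent on the conditional log-likelihood. Everything in it is an assumption made for illustration: the two features, the toy data, the step size, and the small L2 penalty (added so the weights stay finite on such a small data set). The partition function is computed by brute-force enumeration, which is only viable for very short sequences. The gradient uses the standard identity for log-linear models: observed feature counts minus expected feature counts under the current model.

```python
import math
from itertools import product

LABELS = (0, 1)


def features(x, y):
    """Hypothetical feature vector for a binary chain:
    [positions where the label equals the observation,
     adjacent positions that share a label]."""
    agree = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    smooth = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [agree, smooth]


def dot(theta, f):
    return sum(t * v for t, v in zip(theta, f))


def log_likelihood_and_grad(theta, data):
    """Conditional log-likelihood sum_i log p(y_i | x_i; theta) and its gradient."""
    ll = 0.0
    grad = [0.0] * len(theta)
    for x, y in data:
        candidates = list(product(LABELS, repeat=len(x)))
        feats = [features(x, c) for c in candidates]
        scores = [dot(theta, f) for f in feats]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        observed = features(x, y)
        ll += dot(theta, observed) - log_z
        for k in range(len(theta)):
            expected_k = sum(math.exp(s - log_z) * f[k] for s, f in zip(scores, feats))
            grad[k] += observed[k] - expected_k
    return ll, grad


def fit(data, steps=200, lr=0.1, l2=0.1):
    """Gradient ascent on the L2-penalised conditional log-likelihood."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        _, grad = log_likelihood_and_grad(theta, data)
        theta = [t + lr * (g - l2 * t) for t, g in zip(theta, grad)]
    return theta


# Toy training pairs (x, y): labels mostly copy the observations.
data = [((0, 0, 1, 1), (0, 0, 1, 1)), ((1, 0, 0), (1, 1, 0))]
theta = fit(data)
print(theta, log_likelihood_and_grad(theta, data)[0])
```

In practice the enumeration would be replaced by the forward-backward recursions on a chain, and the hand-written update loop by an off-the-shelf optimizer such as L-BFGS.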


Examples

In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables X represents a sequence of observations and Y represents a hidden (or unknown) state variable that needs to be inferred given the observations. The Y_i are structured to form a chain, with an edge between each Y_{i-1} and Y_i. As well as having a simple interpretation of the Y_i as "labels" for each element in the input sequence, this layout admits efficient algorithms for:
* model ''training'', learning the conditional distributions between the Y_i and feature functions from some corpus of training data.
* ''decoding'', determining the probability of a given label sequence Y given X.
* ''inference'', determining the ''most likely'' label sequence Y given X.

The conditional dependency of each Y_i on X is defined through a fixed set of ''feature functions'' of the form f(i, Y_{i-1}, Y_i, X), which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for Y_i. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_i.

Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.

Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence X at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.
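
To make the feature-function view concrete, here is a small hedged sketch. The label set, the four feature functions, their weights, and the example sentence are all invented for illustration. Each function has the signature f(i, y_prev, y, x) described above, and one of them looks ahead in x, which an HMM emission could not do. Probabilities are obtained by brute-force normalisation over every label sequence, so this only works for very short inputs.

```python
import math
from itertools import product

LABELS = ("OTHER", "NAME")


def f_capitalised(i, y_prev, y, x):
    return 1.0 if x[i][0].isupper() and y == "NAME" else 0.0


def f_name_continues(i, y_prev, y, x):
    return 1.0 if y_prev == "NAME" and y == "NAME" else 0.0


def f_followed_by_said(i, y_prev, y, x):
    # Inspects a later element of the input sequence, not just x[i].
    return 1.0 if i + 1 < len(x) and x[i + 1] == "said" and y == "NAME" else 0.0


def f_lowercase_other(i, y_prev, y, x):
    return 1.0 if x[i][0].islower() and y == "OTHER" else 0.0


# (feature function, weight) pairs; the weights are arbitrary real numbers.
WEIGHTED_FEATURES = [
    (f_capitalised, 1.5),
    (f_name_continues, 0.5),
    (f_followed_by_said, 1.0),
    (f_lowercase_other, 1.0),
]


def score(y, x):
    """Weighted sum of all feature functions over all positions."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None
        for f, w in WEIGHTED_FEATURES:
            total += w * f(i, y_prev, y[i], x)
    return total


def probability(y, x):
    """p(y | x) by normalising exp(score) over every label sequence."""
    z = sum(math.exp(score(c, x)) for c in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


def most_likely(x):
    return max(product(LABELS, repeat=len(x)), key=lambda c: score(c, x))


x = ["yesterday", "Ada", "Lovelace", "said", "hello"]
best = most_likely(x)
print(best, probability(best, x))
```

The weights here are plain real numbers rather than probabilities; only the final normalisation turns the scores into a distribution over label sequences.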


Variants


Higher-order CRFs and semi-Markov CRFs

CRFs can be extended into higher order models by making each Y_i dependent on a fixed number k of previous variables Y_{i-k}, ..., Y_{i-1}. In conventional formulations of higher order CRFs, training and inference are only practical for small values of k (such as k ≤ 5), since their computational cost increases exponentially with k.

The CRF-infinity approach ameliorates these issues by leveraging concepts and tools from the field of Bayesian nonparametrics. It constitutes a CRF-type model that is capable of learning infinitely-long temporal dynamics in a scalable fashion. This is effected by introducing a novel potential function for CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely-long dynamics in sequential observations. To render such a model computationally tractable, CRF-infinity employs a mean-field approximation of the postulated novel potential functions (which are driven by an SM). This allows for devising efficient approximate training and inference algorithms for the model, without undermining its capability to capture and model temporal dependencies of arbitrary length.

There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length ''segmentations'' of the label sequence Y. This provides much of the power of higher-order CRFs to model long-range dependencies of the Y_i, at a reasonable computational cost.

Finally, large-margin models for structured prediction, such as the structured support vector machine, can be seen as an alternative training procedure to CRFs.
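
One way to see the exponential cost of conventional higher-order models is the standard reduction of an order-k chain to a first-order chain whose states are k-tuples of labels. The sketch below only counts states and allowed transitions in that reduction; the function names and the three-label set are hypothetical, and it says nothing about how CRF-infinity or semi-CRFs avoid the blow-up.

```python
from itertools import product


def composite_states(labels, k):
    """States of the equivalent first-order chain: all k-tuples of labels."""
    return list(product(labels, repeat=k))


def count_transitions(labels, k):
    """Count pairs (s, t) where t shifts s by one position, i.e.
    s = (y_{i-k}, ..., y_{i-1}) and t = (y_{i-k+1}, ..., y_i)."""
    states = composite_states(labels, k)
    return sum(1 for s in states for t in states if s[1:] == t[:-1])


labels = ("B", "I", "O")
for k in range(1, 5):
    print(k, len(composite_states(labels, k)), count_transitions(labels, k))
```

With n labels there are n^k composite states and n^(k+1) allowed transitions per position, so each extra step of history multiplies the per-position work of the forward and Viterbi recursions by n.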


Latent-dynamic conditional random field

Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively.

In an LDCRF, as in any sequence tagging task, given a sequence of observations \mathbf{x} = x_1, \dots, x_n, the main problem the model must solve is how to assign a sequence of labels \mathbf{y} = y_1, \dots, y_n from a finite set of labels. Instead of directly modeling P(\mathbf{y} | \mathbf{x}) as an ordinary linear-chain CRF would do, a set of latent variables \mathbf{h} is "inserted" between \mathbf{x} and \mathbf{y} using the chain rule of probability:

P(\mathbf{y} | \mathbf{x}) = \sum_\mathbf{h} P(\mathbf{y} | \mathbf{h}, \mathbf{x}) P(\mathbf{h} | \mathbf{x})

This allows capturing latent structure between the observations and labels. While LDCRFs can be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called the latent-variable perceptron has been developed for them as well, based on Collins' structured perceptron algorithm. These models find applications in computer vision, specifically gesture recognition from video streams, and in shallow parsing.
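
The sum over latent sequences can be illustrated with a toy example. It rests on one assumption not stated in this section: each label owns a disjoint set of hidden states, so P(y | h, x) is 1 when every h_j belongs to the states of y_j and 0 otherwise. The hidden-state names, the scoring function, and the observations are invented, and the sum is done by brute-force enumeration rather than by the chain recursions a real implementation would use.

```python
import math
from itertools import product

# Assumed disjoint hidden-state sets per label (hypothetical names).
HIDDEN_STATES = {"gesture": ("g1", "g2"), "rest": ("r1",)}
STATE_TO_LABEL = {h: y for y, hs in HIDDEN_STATES.items() for h in hs}
ALL_HIDDEN = tuple(STATE_TO_LABEL)


def hidden_score(h, x):
    """Made-up unnormalised log-score of a hidden sequence h given observations x."""
    emit = {"g1": 1.0, "g2": -1.0, "r1": 0.0}
    total = sum(emit[hj] * xj for hj, xj in zip(h, x))
    total += sum(0.5 for a, b in zip(h, h[1:]) if a == b)
    return total


def p_hidden(x):
    """P(h | x) for every hidden sequence, as a chain CRF over hidden states."""
    seqs = list(product(ALL_HIDDEN, repeat=len(x)))
    weights = [math.exp(hidden_score(h, x)) for h in seqs]
    z = sum(weights)
    return {h: w / z for h, w in zip(seqs, weights)}


def p_labels(y, x):
    """P(y | x) = sum over h of P(y | h, x) P(h | x), with P(y | h, x)
    equal to 1 exactly when each h_j is one of the hidden states of y_j."""
    return sum(
        p
        for h, p in p_hidden(x).items()
        if all(STATE_TO_LABEL[hj] == yj for hj, yj in zip(h, y))
    )


x = [0.8, -0.3, 0.1]
for y in product(HIDDEN_STATES, repeat=len(x)):
    print(y, round(p_labels(y, x), 4))
```

Because every hidden sequence maps to exactly one label sequence under this assumption, the label probabilities sum to one without any further normalisation.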


See also

* Hammersley–Clifford theorem
* Maximum entropy Markov model (MEMM)


Further reading

* McCallum, A.: Efficiently inducing features of conditional random fields. In: ''Proc. 19th Conference on Uncertainty in Artificial Intelligence''. (2003)
* Wallach, H.M.: Conditional random fields: An introduction. Technical report MS-CIS-04-21, University of Pennsylvania (2004)
* Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. In "Introduction to Statistical Relational Learning". Edited by Lise Getoor and Ben Taskar. MIT Press. (2006)
* Klinger, R., Tomanek, K.: Classical Probabilistic Models and Conditional Random Fields. Algorithm Engineering Report TR07-2-013, Department of Computer Science, Dortmund University of Technology, December 2007. ISSN 1864-4503