Conditional random fields (CRFs) are a class of
statistical modeling methods often applied in
pattern recognition
and
machine learning
and used for
structured prediction. Whereas a
classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a
graphical model, which represents the presence of dependencies between the predictions. The kind of graph used depends on the application. For example, in
natural language processing
, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
Other examples where CRFs are used include:
labeling or
parsing
of sequential data for
natural language processing
or
biological sequences,
part-of-speech tagging,
shallow parsing,
named entity recognition,
gene finding, peptide critical functional region finding, and
object recognition
and
image segmentation in
computer vision.
Description
CRFs are a type of
discriminative undirected probabilistic graphical model.
Lafferty,
McCallum and Pereira
define a CRF on observations $X$ and random variables $Y$ as follows:
Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$.
Then $(X, Y)$ is a conditional random field when each random variable $Y_v$, conditioned on $X$, obeys the Markov property with respect to the graph; that is, its probability is dependent only on its neighbours in $G$:

$p(Y_v \mid X, \{Y_w : w \neq v\}) = p(Y_v \mid X, \{Y_w : w \sim v\}),$

where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.
What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(Y \mid X)$ is then modeled.
Inference
For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is basically the same as for an
Markov random field (MRF), and the same arguments hold.
However, there exist special cases for which exact inference is feasible:
* If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithms for the case of HMMs; a decoding sketch follows this list.
* If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions.
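For the linear-chain case, exact decoding is the Viterbi recursion. The following is a minimal sketch, assuming the model has already been reduced to a per-position score table unary (T × K) and a transition score table trans (K × K); the function and variable names are illustrative, not an established API.

```python
import numpy as np

def viterbi(unary, trans):
    """Most likely label sequence under a linear-chain CRF.

    unary: (T, K) per-position label scores (log-potentials).
    trans: (K, K) transition scores, trans[i, j] scoring label i -> j.
    Returns a list of T label indices.
    """
    T, K = unary.shape
    score = np.empty((T, K))            # best score of a path ending in label k at step t
    back = np.zeros((T, K), dtype=int)  # backpointers
    score[0] = unary[0]
    for t in range(1, T):
        # cand[i, j]: best path ending in i at t-1, extended with transition i -> j
        cand = score[t - 1][:, None] + trans + unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]    # best final label
    for t in range(T - 1, 0, -1):       # follow backpointers right to left
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```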
If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include:
* Loopy belief propagation
* Alpha expansion
* Mean field inference
* Linear programming relaxations
Parameter learning
Learning the parameters $\theta$ is usually done by maximum likelihood learning for $p(Y_i \mid X_i; \theta)$. If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex.
It can be solved, for example, using gradient descent algorithms or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used.
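As a concrete illustration, the sketch below computes the negative log-likelihood of a single labelled sequence and its exact gradient for a linear-chain CRF via the forward-backward recursions. The unary and trans score tables are again illustrative assumptions; in a full model they would be weighted sums of feature functions, with the chain rule carrying the gradient back to the feature weights.

```python
import numpy as np
from scipy.special import logsumexp

def crf_nll_and_grad(unary, trans, labels):
    """Negative log-likelihood of one labelled sequence under a linear-chain
    CRF, with exact gradients w.r.t. the (T, K) unary and (K, K) transition
    score tables, computed by the forward-backward recursions."""
    T, K = unary.shape
    alpha = np.empty((T, K))   # forward: log-sum of all prefix scores ending in k
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + trans, axis=0)
    log_z = logsumexp(alpha[-1])
    beta = np.zeros((T, K))    # backward: log-sum of all suffix scores from k
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(trans + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    # expected counts under the model ...
    g_unary = np.exp(alpha + beta - log_z)
    g_trans = np.zeros((K, K))
    for t in range(1, T):
        pair = alpha[t - 1][:, None] + trans + (unary[t] + beta[t])[None, :]
        g_trans += np.exp(pair - log_z)
    # ... minus the observed counts give the gradient of the NLL
    score = unary[0, labels[0]]
    g_unary[0, labels[0]] -= 1.0
    for t in range(1, T):
        score += trans[labels[t - 1], labels[t]] + unary[t, labels[t]]
        g_unary[t, labels[t]] -= 1.0
        g_trans[labels[t - 1], labels[t]] -= 1.0
    return log_z - score, g_unary, g_trans
```

The resulting gradients can be fed to plain gradient descent or to an off-the-shelf quasi-Newton routine such as scipy.optimize.minimize with method='L-BFGS-B'.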
Examples
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables $X$ represents a sequence of observations and $Y$ represents a hidden (or unknown) state variable that needs to be inferred given the observations. The $Y_i$ are structured to form a chain, with an edge between each $Y_{i-1}$ and $Y_i$. As well as having a simple interpretation of the $Y_i$ as "labels" for each element in the input sequence, this layout admits efficient algorithms for:
* model ''training'', learning the conditional distributions between the $Y_i$ and feature functions from some corpus of training data.
* ''decoding'', determining the probability of a given label sequence $Y$ given $X$.
* ''inference'', determining the ''most likely'' label sequence $Y$ given $X$.
The conditional dependency of each $Y_i$ on $X$ is defined through a fixed set of ''feature functions'' of the form $f(i, y_{i-1}, y_i, \mathbf{x})$, which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for $Y_i$. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for $Y_i$.
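With a weight $\lambda_k$ attached to each feature function $f_k$, this combination takes the standard log-linear form of Lafferty et al.:

$p(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(i, y_{i-1}, y_i, \mathbf{x}) \Big),$

where the partition function $Z(\mathbf{x})$ sums the same exponentiated score over all candidate label sequences, so that the distribution normalizes.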
Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.
Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence $X$ at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.
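To make this reading concrete, the following small sketch (an illustration, not an established recipe) builds linear-chain score tables from an HMM's log-probabilities; decoding with them, for example with the Viterbi sketch above, recovers the HMM's most likely state sequence.

```python
import numpy as np

def hmm_as_crf_scores(obs, log_pi, log_A, log_B):
    """Score tables for a linear-chain CRF that reproduce an HMM.

    obs:    length-T list of observation indices.
    log_pi: (K,) log initial-state probabilities.
    log_A:  (K, K) log transition probabilities.
    log_B:  (K, V) log emission probabilities.
    """
    unary = np.stack([log_B[:, o] for o in obs])  # (T, K) emission scores
    unary[0] += log_pi                            # fold in the initial distribution
    return unary, log_A                           # transition scores are just log_A
```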
Variants
Higher-order CRFs and semi-Markov CRFs
CRFs can be extended into higher-order models by making each $Y_i$ dependent on a fixed number $k$ of previous variables $Y_{i-k}, \ldots, Y_{i-1}$. In conventional formulations of higher-order CRFs, training and inference are only practical for small values of $k$ (such as $k \leq 5$), since their computational cost increases exponentially with $k$.
However, more recent work has ameliorated these issues by leveraging concepts and tools from the field of Bayesian nonparametrics. Specifically, the CRF-infinity approach constitutes a CRF-type model that is capable of learning infinitely long temporal dynamics in a scalable fashion. This is effected by introducing a novel potential function for CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely long dynamics in sequential observations. To render such a model computationally tractable, CRF-infinity employs a mean-field approximation of the postulated novel potential functions (which are driven by an SM). This allows for devising efficient approximate training and inference algorithms for the model without undermining its capability to capture and model temporal dependencies of arbitrary length.
There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length ''segmentations'' of the label sequence $Y$. This provides much of the power of higher-order CRFs to model long-range dependencies of the $Y_i$, at a reasonable computational cost.
Finally, large-margin models for
structured prediction, such as the
structured Support Vector Machine, can be seen as an alternative training procedure to CRFs.
Latent-dynamic conditional random field
Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRF for sequence tagging tasks. They are
latent variable models that are trained discriminatively.
In an LDCRF, like in any sequence tagging task, given a sequence of observations $\mathbf{x} = x_1, \ldots, x_n$, the main problem the model must solve is how to assign a sequence of labels $\mathbf{y} = y_1, \ldots, y_n$ from one finite set of labels $Y$. Instead of directly modeling $P(\mathbf{y} \mid \mathbf{x})$ as an ordinary linear-chain CRF would do, a set of latent variables $\mathbf{h}$ is "inserted" between $\mathbf{x}$ and $\mathbf{y}$ using the chain rule of probability:

$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}) \, P(\mathbf{h} \mid \mathbf{x})$

This allows capturing latent structure between the observations and labels.
While LDCRFs can be trained using quasi-Newton methods, a specialized version of the
perceptron
algorithm called the latent-variable perceptron has been developed for them as well, based on Collins'
structured perceptron algorithm.
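For orientation, here is a minimal sketch of one epoch of Collins-style structured perceptron training for a fully observed linear chain; the latent-variable variant additionally maximizes over the hidden states $\mathbf{h}$. It reuses the illustrative viterbi decoder sketched earlier, and all names are assumptions rather than an established API.

```python
import numpy as np

def perceptron_epoch(W, trans, data):
    """One epoch of the structured perceptron on a linear chain.

    W:     (D, K) weights mapping D input features to K label scores.
    trans: (K, K) transition weights.
    data:  iterable of (X, labels) pairs with X of shape (T, D).
    """
    for X, labels in data:
        pred = viterbi(X @ W, trans)           # decode with current weights
        for t, (g, p) in enumerate(zip(labels, pred)):
            if g != p:                         # promote gold label, demote prediction
                W[:, g] += X[t]
                W[:, p] -= X[t]
            if t > 0 and (labels[t - 1], g) != (pred[t - 1], p):
                trans[labels[t - 1], g] += 1.0
                trans[pred[t - 1], p] -= 1.0
    return W, trans
```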
These models find applications in
computer vision
, specifically
gesture recognition
from video streams
and
shallow parsing.
See also
* Hammersley–Clifford theorem
* Maximum entropy Markov model (MEMM)
Further reading
* McCallum, A.: Efficiently inducing features of conditional random fields. In: ''Proc. 19th Conference on Uncertainty in Artificial Intelligence'' (2003)
* Wallach, H.M.: Conditional random fields: An introduction. Technical report MS-CIS-04-21, University of Pennsylvania (2004)
* Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. In: ''Introduction to Statistical Relational Learning''. Edited by Lise Getoor and Ben Taskar. MIT Press (2006)
* Klinger, R., Tomanek, K.: Classical Probabilistic Models and Conditional Random Fields. Algorithm Engineering Report TR07-2-013, Department of Computer Science, Dortmund University of Technology, December 2007. ISSN 1864-4503