statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for

sequence labeling In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. A common example of a sequence labeling task is part of ...

that combines features of

hidden Markov model A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent (or ''hidden'') Markov process (referred to as X). An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X ...

s (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a

Markov chain In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally ...

rather than being conditionally independent of each other. MEMMs find applications in

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, specifically in

part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...

and information extraction.

Model

Suppose we have a sequence of observations

O_1, \dots, O_n

that we seek to tag with the labels

S_1, \dots, S_n

that maximize the conditional probability

P(S_1, \dots, S_n \mid O_1, \dots, O_n)

. In a MEMM, this probability is factored into Markov transition probabilities, where the probability of transitioning to a particular label depends only on the observation at that position and the previous position's label: :

P(S_1, \dots, S_n \mid O_1, \dots, O_n) = \prod_^nP(S_t \mid S_,O_t).

Each of these transition probabilities comes from the same general distribution

P(s\mid s',o)

. For each possible label value of the previous label

s'

, the probability of a certain label

s

is modeled in the same way as a maximum entropy classifier: :

P(s\mid s',o) = P_(s\mid o) = \frac \exp \left( \sum_a \lambda_a f_a(o,s)\right).

Here, the

f_a(o,s)

are real-valued or categorical feature-functions, and

Z(o,s')

is a normalization term ensuring that the distribution sums to one. This form for the distribution corresponds to the

maximum entropy probability distribution In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, ...

satisfying the constraint that the empirical expectation for the feature is equal to the expectation given the model: :

\quad \text a .

The parameters

\lambda_a

can be estimated using generalized iterative scaling. Furthermore, a variant of the Baum–Welch algorithm, which is used for training HMMs, can be used to estimate parameters when training data has incomplete or missing labels. The optimal state sequence

S_1, \dots, S_n

can be found using a very similar Viterbi algorithm to the one used for HMMs. The dynamic program uses the forward probability: :

\alpha_(s) = \sum_ \alpha_t(s') P_(s\mid o_).

Strengths and weaknesses

An advantage of MEMMs rather than HMMs for sequence tagging is that they offer increased freedom in choosing features to represent observations. In sequence tagging situations, it is useful to use domain knowledge to design special-purpose features. In the original paper introducing MEMMs, the authors write that "when trying to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; however, knowing that the word is capitalized, that is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive (in conjunction with the context provided by the state-transition structure)." Useful sequence tagging features, such as these, are often non-independent. Maximum entropy models do not assume independence between features, but generative observation models used in HMMs do. Therefore, MEMMs allow the user to specify many correlated, but informative features. Another advantage of MEMMs versus HMMs and conditional random fields (CRFs) is that training can be considerably more efficient. In HMMs and CRFs, one needs to use some version of the forward–backward algorithm as an inner loop in training. However, in MEMMs, estimating the parameters of the maximum-entropy distributions used for the transition probabilities can be done for each transition distribution in isolation. A drawback of MEMMs is that they potentially suffer from the "label bias problem," where states with low-entropy transition distributions "effectively ignore their observations." Conditional random fields were designed to overcome this weakness, which had already been recognised in the context of neural network-based Markov models in the early 1990s. Another source of label bias is that training is always done with respect to known previous tags, so the model struggles at test time when there is uncertainty in the previous tag.

References

{{reflist, 2 Markov models Statistical natural language processing