Brill Tagger

The Brill tagger is an inductive method for part-of-speech tagging. It was invented by Eric Brill and described in his 1993 PhD thesis. It can be summarized as an "error-driven transformation-based tagger". It is:
* a form of supervised learning, which aims to minimize error; and
* a transformation-based process, in the sense that a tag is assigned to each word and then changed using a set of predefined rules.

In the transformation process, a known word is first assigned its most frequent tag, while an unknown word is naively assigned the tag "noun". High accuracy is eventually achieved by applying the rules iteratively to change incorrect tags. This approach ensures that valuable information, such as the morphosyntactic construction of words, is employed in the automatic tagging process.
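To make "error-driven" concrete, the following is a minimal sketch of how a candidate rule might be scored against a hand-tagged (gold) corpus: count the tags it would fix, subtract the correct tags it would break, and prefer the rule with the largest net gain. The rule shape (a triple of from-tag, to-tag, and previous tag) and the function names `net_gain` and `best_rule` are assumptions made for illustration, not Brill's actual code or rule templates.

```python
from typing import List, Tuple

# Hypothetical rule shape: change from_tag to to_tag when the previous tag is prev_tag.
Rule = Tuple[str, str, str]


def net_gain(rule: Rule, current: List[str], gold: List[str]) -> int:
    """Errors fixed minus errors introduced if `rule` were applied to `current`."""
    from_tag, to_tag, prev_tag = rule
    gain = 0
    for i in range(1, len(current)):
        if current[i] == from_tag and current[i - 1] == prev_tag:
            if gold[i] == to_tag:
                gain += 1   # the rule would correct a wrong tag
            elif gold[i] == from_tag:
                gain -= 1   # the rule would break a tag that was already right
    return gain


def best_rule(candidates: List[Rule], current: List[str], gold: List[str]) -> Rule:
    """Pick the candidate with the largest net error reduction."""
    return max(candidates, key=lambda r: net_gain(r, current, gold))


if __name__ == "__main__":
    current = ["DT", "VB", "VBD", "DT", "VB"]   # tags after initialization
    gold = ["DT", "NN", "VBD", "DT", "NN"]      # hand-corrected tags
    candidates = [("VB", "NN", "DT"), ("DT", "NN", "VBD")]
    print(best_rule(candidates, current, gold))  # ('VB', 'NN', 'DT')
```

In a full learner this selection would presumably be repeated: apply the chosen rule, rescore the remaining candidates, and stop once no rule yields a positive gain.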


Algorithm

The algorithm starts with initialization, which assigns each word its most probable tag (for example, "dog" is more often a noun than a verb). Then "patches" are determined via rules that correct (probable) tagging errors made in the initialization phase (Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing (ANLC '92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155. doi:10.3115/974499.974526):
* Initialization:
** Known words (in vocabulary): assigning the most frequent tag associated to a form of the word
** Unknown word
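The initialization step can be sketched roughly as follows. This is an illustrative sketch only: the names `train_lexicon` and `initial_tags` are hypothetical, the toy corpus is made up, and the fallback for unknown words simply uses the naive "noun" default mentioned in the introduction (written here as the Penn Treebank tag `NN`).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train_lexicon(tagged_corpus: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each word form to the tag it carries most often in the corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def initial_tags(tokens: List[str], lexicon: Dict[str, str],
                 default_tag: str = "NN") -> List[str]:
    """Known words get their most frequent tag; unknown words get default_tag."""
    return [lexicon.get(token, default_tag) for token in tokens]


if __name__ == "__main__":
    corpus = [("the", "DT"), ("dog", "NN"), ("dog", "NN"),
              ("dog", "VB"), ("barks", "VBZ")]
    lexicon = train_lexicon(corpus)
    # "zorps" is not in the lexicon, so it falls back to NN.
    print(initial_tags(["the", "dog", "zorps"], lexicon))  # ['DT', 'NN', 'NN']
```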


Rules and processing

The input text is first tokenized, or broken into words. Typically in natural language processing, contractions such as "'s", "n't", and the like are considered separate word tokens, as are punctuation marks.

A dictionary and some morphological rules then provide an initial tag for each word token. For example, a simple lookup would reveal that "dog" may be a noun or a verb (the most frequent tag is simply chosen), while an unknown word will be assigned some tag(s) based on capitalization, various prefix or suffix strings, etc. (such morphological analyses, which Brill calls "lexical rules", may vary between implementations).

After all word tokens have (provisional) tags, contextual rules apply iteratively to correct the tags by examining small amounts of context. This is where the Brill method differs from other part-of-speech tagging methods, such as those using hidden Markov models. Rules are reapplied repeatedly until a threshold is reached or no more rules can apply.

Brill rules are of the general form:

tag1 → tag2 IF Condition

where the Condition tests the preceding and/or following word tokens, or their tags (the notation for such rules differs between implementations). For example, in Brill's notation:

IN NN WDPREVTAG DT while

would change the tag of a word from IN (preposition or subordinating conjunction) to NN (common noun) if the preceding word's tag is DT (determiner) and the word itself is "while". This covers cases like "all the while" or "in a while", where "while" should be tagged as a noun rather than with the IN tag of its more common use (many rules are more general). Rules should only operate if the tag being changed is also known to be permissible, either for the word in question or in principle (for example, most adjectives in English can also be used as nouns).

Rules of this kind can be implemented by simple finite-state machines. See Part-of-speech tagging for more general information, including descriptions of the Penn Treebank and other sets of tags.

Typical Brill taggers use a few hundred rules, which may be developed by linguistic intuition or by machine learning on a pre-tagged corpus.
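As an illustration of how a contextual rule such as "IN NN WDPREVTAG DT while" might be applied, here is a minimal sketch. The `ContextRule` class and the functions `apply_rule` and `apply_rules` are hypothetical names chosen for this example; only the "previous tag, optionally a specific word" condition is modelled, changes take effect immediately within a left-to-right pass, and `max_passes` stands in for the stopping threshold. Actual implementations support more condition types and may differ in all of these choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ContextRule:
    from_tag: str
    to_tag: str
    prev_tag: str
    word: Optional[str] = None  # None means the rule applies to any word


def apply_rule(rule: ContextRule, tokens: List[str], tags: List[str]) -> List[str]:
    """Return a copy of tags with the rule applied left to right."""
    out = list(tags)
    for i in range(1, len(tokens)):
        if (out[i] == rule.from_tag
                and out[i - 1] == rule.prev_tag
                and (rule.word is None or tokens[i] == rule.word)):
            out[i] = rule.to_tag
    return out


def apply_rules(rules: List[ContextRule], tokens: List[str], tags: List[str],
                max_passes: int = 10) -> List[str]:
    """Reapply all rules until nothing changes or max_passes is reached."""
    current = list(tags)
    for _ in range(max_passes):
        updated = current
        for rule in rules:
            updated = apply_rule(rule, tokens, updated)
        if updated == current:
            break
        current = updated
    return current


if __name__ == "__main__":
    # Roughly corresponds to "IN NN WDPREVTAG DT while".
    rule = ContextRule(from_tag="IN", to_tag="NN", prev_tag="DT", word="while")
    tokens = ["in", "a", "while"]
    tags = ["IN", "DT", "IN"]  # provisional tags from initialization
    print(apply_rules([rule], tokens, tags))  # ['IN', 'DT', 'NN']
```

The check that the new tag is permissible for the word, mentioned above, is omitted here for brevity.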


Code

Brill's code pages at Johns Hopkins University are no longer on the web. An archived version of a mirror of the Brill tagger, at its latest version as it was available at Plymouth Tech, can be found on Archive.org. The software uses the MIT License.



External links


Brill taggers trained for specific languages:
* Dutch (online and offline version)
* New Norwegian
* Danish (online demo)
* English (online demo)
* taggerXML: modernized version of Eric Brill's part-of-speech tagger (source code of the Danish and English versions above)