The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language (UCREL).
It has an overall accuracy rate of 96-97%, and the latest version (CLAWS4) was used to tag around 100 million words of the British National Corpus.
History
A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token), although computational applications generally use more fine-grained POS tags like 'noun-plural'. Developed in the early 1980s, CLAWS was built to meet the growing and changing needs of part-of-speech annotation. Originally created to add part-of-speech tags to the LOB (Lancaster-Oslo/Bergen) corpus of British English, the CLAWS tagset has since been adapted to other languages as well, including Urdu and Arabic.
Since its inception, CLAWS has been praised for its functionality and adaptability. It is not without flaws, however: although its error rate is only 1.5% when judged by major categories, about 3.3% of ambiguities remain unresolved. Ambiguity arises with words such as ''flies'', which can be classified as either a noun or a verb. Such ambiguities have motivated the successive upgrades and tagsets that CLAWS has gone through.
Rules and processing
CLAWS uses a hidden Markov model (HMM) to estimate the likelihood of tag sequences and so choose the most probable part-of-speech label for each word.
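To make the idea concrete, the following is a minimal sketch of HMM tagging with the Viterbi algorithm. It is not the CLAWS implementation: the three-tag inventory, the probability tables, and the function name `viterbi` are all hypothetical and chosen only to show how transition and emission likelihoods combine to resolve an ambiguous word such as ''flies''.

```python
import math

# Hypothetical toy tagset and probabilities, invented for illustration only.
# They are not taken from CLAWS or from any real corpus.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET":  {"the": 1.0},
    "NOUN": {"fly": 0.5, "flies": 0.5},
    "VERB": {"fly": 0.5, "flies": 0.5},
}


def _log(p):
    """Log probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most likely tag sequence for `words` under the toy model."""
    if not words:
        return []
    # Each column maps tag -> (best log probability so far, previous tag).
    columns = [{
        t: (_log(START[t]) + _log(EMIT[t].get(words[0], 0.0)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        prev_col = columns[-1]
        col = {}
        for t in TAGS:
            # Pick the predecessor tag that maximises the path score into t.
            best_prev = max(TAGS, key=lambda p: prev_col[p][0] + _log(TRANS[p][t]))
            score = prev_col[best_prev][0] + _log(TRANS[best_prev][t])
            col[t] = (score + _log(EMIT[t].get(word, 0.0)), best_prev)
        columns.append(col)
    # Backtrack from the best final tag through the stored predecessors.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for col in reversed(columns[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "fly", "flies"]))  # ['DET', 'NOUN', 'VERB']
print(viterbi(["the", "flies"]))         # ['DET', 'NOUN']
```

In this toy model ''flies'' is equally likely as a noun or a verb in isolation, so the choice is driven entirely by the preceding tag: after a determiner it is tagged as a noun, and after a noun as a verb.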
Sample output
This excerpt from Bram Stoker's ''Dracula'' (1897) has been tagged using both the CLAWS C5 and C7 tagsets. CLAWS output generally looks like this, with the most likely part-of-speech tag following each word.
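As a rough sketch of how output of this kind could be read programmatically, the snippet below splits tagged text into (word, tag) pairs. It assumes each token is written as the word, an underscore, and then its tag, with tokens separated by whitespace; that layout is an assumption here. The sample string is hand-written for illustration using C5-style tag names and is not actual CLAWS output or part of the ''Dracula'' excerpt, and `parse_tagged` is a hypothetical helper name.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs.

    The tag is taken to be everything after the last underscore, so words
    that themselves contain underscores are kept intact. Tokens without an
    underscore are returned with a tag of None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            pairs.append((token, None))
        else:
            pairs.append((word, tag))
    return pairs


# Hand-written illustrative input, not real tagger output.
sample = "The_AT0 fly_NN1 flies_VVZ ._PUN"
for word, tag in parse_tagged(sample):
    print(f"{word}\t{tag}")
```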
Tagsets
CLAWS1 tagset
The first tagset developed for CLAWS, the CLAWS1 tagset, has 132 word tags. In form and application, the C1 tagset is similar to the Brown Corpus tags.
CLAWS2 tagset
From 1983 to 1986, the updated versions leading to CLAWS2 were part of a larger effort to handle tasks such as recognizing sentence breaks. The aim was to remove the need for manual pre-processing of a text before tagging, relying instead on optional manual post-editing of the automatic annotation where needed. The CLAWS2 tagset has 166 word tags.
CLAWS4 tagset
CLAWS4 was used to tag the 100-million-word British National Corpus (BNC). A general-purpose grammatical tagger, it is a successor of the CLAWS1 tagger. In tagging the BNC, the many rounds of work that went into CLAWS4 focused on making the CLAWS program independent of the tagsets. For example, the BNC project used two tagset versions: "a main tagset (C5) with 62 tags with which the whole of the corpus has been tagged, and a larger (C7) tagset with 152 tags, which has been used to make a selected 'core' sample corpus of two million words." The latest version of CLAWS4 is offered by UCREL, a research center of Lancaster University.
CLAWS5 tagset
The CLAWS5 tagset, which was used for the BNC, has over 60 tags.
CLAWS6 tagset
The CLAWS6 tagset was used for the BNC sampler corpus and the COLT corpus. It has over 160 tags, including 13 determiner subtypes.
CLAWS7 tagset
The CLAWS7 tagset is the current standard. It differs from the CLAWS6 tagset only in its punctuation tags.
CLAWS8 tagset
The CLAWS8 tagset extends the C7 tagset with further distinctions in the determiner and pronoun categories, as well as 37 new auxiliary tags for forms of ''be'', ''do'', and ''have''.
External links
* CLAWS part-of-speech tagger for English
* Brill tagger
* Part-of-speech tagging
* Sliding window based part-of-speech tagging
* British National Corpus (BNC)
* Brown Corpus
* Lancaster University
* Hidden Markov model