lemmatisation
   HOME

TheInfoList



OR:

Lemmatisation ( or lemmatization) in
linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike
stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...
, lemmatisation depends on correctly identifying the intended
part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assi ...
and meaning of a word in a sentence, as well as within the larger
context Context may refer to: * Context (language use), the relevant constraints of the communicative situation that influence language use, language variation, and discourse summary Computing * Context (computing), the virtual environment required to su ...
surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.


Description

In many languages, words appear in several ''
inflected In linguistic morphology, inflection (or inflexion) is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and defi ...
'' forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the ''lemma'' for the word. The association of the base form with a part of speech is often called a '' lexeme'' of the word. Lemmatisation is closely related to
stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...
. The difference is that a stemmer operates on a single word ''without'' knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster. The reduced "accuracy" may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall accuracy, or true positive rate, when compared to lemmatisation. Nonetheless, stemming reduces
precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...
, or the proportion of positively-labeled instances that are actually positive, for such systems. For instance: #The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up. #The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatisation. #The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context. Document indexing software like
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
can store the base stemmed format of the word without the knowledge of meaning, but only considering word formation grammar rules. The stemmed word itself might not be a valid word: 'lazy', as seen in the example below, is stemmed by many stemmers to 'lazi'. This is because the purpose of stemming is not to produce the appropriate lemma – that is a more challenging task that requires knowledge of context. The main purpose of stemming is to map different forms of a word to a single form. As a rule-based algorithm, dependent only upon the spelling of a word, it sacrifices accuracy to ensure that, for example, when 'laziness' is stemmed to 'lazi', it has the same stem as 'lazy'.


Algorithms

A trivial way to do lemmatization is by simple dictionary lookup. This works well for straightforward inflected forms, but a
rule-based system In computer science, a rule-based system is used to store and manipulate knowledge to interpret information in a useful way. It is often used in artificial intelligence applications and research. Normally, the term ''rule-based system'' is appli ...
will be needed for other cases, such as in languages with long
compound words In linguistics, a compound is a lexeme (less precisely, a word or sign) that consists of more than one stem. Compounding, composition or nominal composition is the process of word formation that creates compound lexemes. Compounding occurs when ...
. Such rules can be either hand-crafted or learned automatically from an
annotated An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For ann ...
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
.


Use in biomedicine

Morphological analysis of published biomedical literature can yield useful results. Morphological processing of biomedical text can be more effective by a specialised lemmatisation program for biomedicine, and may improve the accuracy of practical
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
tasks.


See also

*
Canonicalization In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare diff ...


References


External links

{{Natural Language Processing Computational linguistics Tasks of natural language processing de:Lemma (Lexikografie)#Lemmatisierung