Linguistic categories include
* Lexical category, a part of speech such as ''noun'', ''preposition'', etc.
* Syntactic category, a similar concept which can also include phrasal categories
* Grammatical category, a grammatical feature such as ''tense'', ''gender'', etc.
The definition of linguistic categories is a major concern of linguistic theory, and thus the definition and naming of categories vary across different theoretical frameworks and grammatical traditions for different languages. The operationalization of linguistic categories in lexicography, computational linguistics, natural language processing, corpus linguistics, and terminology management typically requires resource-, problem- or application-specific definitions of linguistic categories. In cognitive linguistics, it has been argued that linguistic categories have a prototype structure like that of the categories of common words in a language.[John R Taylor (1995) ''Linguistic Categorization: Prototypes in Linguistic Theory'', 2nd ed., ch.2 p.21]
Linguistic category inventories
To facilitate interoperability between lexical resources, linguistic annotations and annotation tools, and to allow the systematic handling of linguistic categories across different theoretical frameworks, a number of inventories of linguistic categories have been developed and are in use; examples are given below. The practical objective of such inventories is to perform quantitative evaluation (for language-specific inventories), to train NLP tools, or to facilitate cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in Universal grammar, but has also been heavily criticized.
Part-of-Speech tagsets
Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their case (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags. For example, NN is used for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech.[Universal POS tags]
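The contrast between fine-grained tags and a coarse tag with separate features can be illustrated with a minimal sketch. It covers only the three Brown-style tags named above; the coarse labels (`NOUN`, `PROPN`), the feature name `Number` and the function `decompose` are assumptions made for illustration and are not taken from any official mapping.

```python
# Hypothetical decomposition table: fine-grained tag -> (coarse tag, features).
# Only the three Brown-style tags named in the text are covered; the coarse
# labels and feature names are illustrative assumptions, not an official mapping.
FINE_TO_COARSE = {
    "NN": ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "NP": ("PROPN", {"Number": "Sing"}),
}


def decompose(tag):
    """Return (coarse_tag, features) for a fine-grained tag.

    Tags missing from the table are passed through unchanged with no features.
    """
    coarse, features = FINE_TO_COARSE.get(tag, (tag, {}))
    return coarse, dict(features)  # copy so callers cannot mutate the table


if __name__ == "__main__":
    for tag in ("NN", "NNS", "NP", "VB"):
        print(tag, decompose(tag))
```

With such a decomposition, a tool that only needs the coarse class can ignore the feature dictionary, while a tool that needs number distinctions can still recover them.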
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging ''words'' in agglutinative languages such as Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as ''Ncmsan'' for Category=Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.
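A positional descriptor such as ''Ncmsan'' can be decoded by reading one character per attribute. The following is a minimal sketch under stated assumptions: the tables contain only the codes needed for this one example, the function name `decode_msd` is hypothetical, and treating `-` as "attribute not applicable" is an assumed convention rather than something stated in the text.

```python
# Position-by-position attribute tables for noun descriptors.
# Assumption: only the codes needed for the example "Ncmsan" are listed;
# real tagset specifications define many more categories and values.
NOUN_POSITIONS = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"a": "accusative"}),
    ("Animate", {"n": "no"}),
]
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(msd):
    """Decode a positional morphosyntactic descriptor into attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in descriptor: {msd!r}")
    category, positions = CATEGORIES[msd[0]]
    if len(msd) - 1 > len(positions):
        raise ValueError(f"descriptor too long for {category}: {msd!r}")
    decoded = {"Category": category}
    for (attribute, values), code in zip(positions, msd[1:]):
        if code == "-":
            continue  # assumed convention: attribute not applicable
        if code not in values:
            raise ValueError(f"unknown code {code!r} for {attribute}")
        decoded[attribute] = values[code]
    return decoded


if __name__ == "__main__":
    print(decode_msd("Ncmsan"))
```

Keeping the tables as data rather than code makes it straightforward to extend the sketch with further categories or values from a concrete tagset specification.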
The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project.
Multilingual annotation schemes
For Western European languages, cross-linguistically applicable annotation schemes for parts-of-speech, morphosyntax and syntax have been developed with the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the European Commission that ran within the DG XIII Linguistic Research and Engineering programme from 1994 to 1998, coordinated by Consorzio Pisa Ricerche, Pisa, Italy. The EAGLES guidelines provide guidance for markup to be used with text corpora, particularly for identifying features relevant in computational linguistics and lexicography.
Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for ''de facto'' standards and rules of best practice for:
* Large-scale language resources (such as text corpora, computational lexicons and speech corpora);
* Means of manipulating such knowledge, via computational linguistic formalisms, markup languages and various software tools;
* Means of assessing and evaluating resources, tools and products.
The EAGLES guidelines have also inspired subsequent work for other regions, e.g., Eastern Europe.
A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al. have proposed a "universal", but highly reductionist, tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition (hardly a "universal" coincidence), etc.). Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies) and morphosyntax (Interset interlingua, partially building on the Multext-East/EAGLES tradition) in the context of the Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.
The annotation scheme has its roots in three related projects: Stanford Dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets. The UD annotation scheme uses a representation in the form of dependency trees as opposed to phrase structure trees. As of February 2019, there are just over 100 treebanks of more than 70 languages available in the UD inventory. The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features (individual languages or resources can introduce additional features). In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., ''aux:pass'' for an auxiliary (UD ''aux'') used to mark passive voice.
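The way a secondary label accompanies the UD label, as in ''aux:pass'', can be sketched as follows. The helper names `split_relation` and `same_universal_relation` are hypothetical; the only assumption about the format is that the secondary label follows the first colon, as in the example above.

```python
def split_relation(label):
    """Split a dependency label into (universal relation, secondary label or None)."""
    universal, _, subtype = label.partition(":")
    if not universal:
        raise ValueError(f"missing universal relation in {label!r}")
    return universal, (subtype or None)


def same_universal_relation(a, b):
    """Compare two labels while ignoring any secondary, language-specific label."""
    return split_relation(a)[0] == split_relation(b)[0]


if __name__ == "__main__":
    print(split_relation("aux:pass"))
    print(split_relation("aux"))
    print(same_universal_relation("aux:pass", "aux"))
```

Comparing only the first component is what keeps extended labels usable in cross-linguistic queries: a tool unaware of a particular extension can still fall back on the universal relation.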
The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology, frame semantics and coreference. For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages, e.g., Icelandic, Old English, Middle English, Middle Low German, Early Modern High German, Yiddish, Portuguese, Japanese, Arabic and Chinese.
Conventions for interlinear glosses
In linguistics, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines (''inter-'' + ''linear''), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text (IGT), or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules.[Comrie, B., Haspelmath, M., & Bickel, B. (2008). The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig. Retrieved January 28, 2010.] Wikipedia also provides a List of glossing abbreviations that draws on this and other sources.
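The word-by-word alignment that makes an interlinear gloss readable can be sketched in a few lines. The Latin sentence and its glosses below are an illustrative example constructed for this sketch, not one quoted from the Leipzig Glossing Rules, and the function name `align_gloss` is hypothetical.

```python
def align_gloss(words, glosses):
    """Pad words and glosses so that each gloss starts under its word."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    # Column width is the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    text_line = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return text_line, gloss_line


if __name__ == "__main__":
    # Illustrative example: Latin "puella rosam amat".
    words = ["puella", "rosam", "amat"]
    glosses = ["girl.NOM.SG", "rose.ACC.SG", "love.3SG.PRS"]
    for line in align_gloss(words, glosses):
        print(line)
    print("'The girl loves the rose.'")
```

The sketch measures width with `len`, which is adequate for plain Latin-script text but would misalign scripts with combining or double-width characters.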
General Ontology for Linguistic Description (GOLD)
GOLD ("General Ontology for Linguistic Description") is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen (2003). Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry (ISOcat); it is, however, more stringently structured.
GOLD was maintained by the LINGUIST List and others from 2007 to 2010. The RELISH project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat. As of 2018, GOLD data remains an important terminology hub in the context of the Linguistic Linked Open Data cloud, but as it is no longer actively maintained, its function is increasingly taken over by OLiA (for linguistic annotation, building on GOLD and ISOcat) and lexinfo.net (for dictionary metadata, building on ISOcat).
ISO 12620 (ISO TC37 Data Category Registry, ISOcat)
ISO 12620 is a standard from ISO/TC 37 that defines a ''Data Category Registry'', a registry for registering linguistic terms used in various fields of translation, computational linguistics and natural language processing and defining mappings both between different terms and the same terms used in different systems.
An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see above).
The goal of the registry is that new systems can reuse existing terminology, or at least be easily mapped to existing terminology, to aid interoperability. The standard is used by other standards such as Lexical Markup Framework (ISO 24613:2008), and a number of terminologies have been added to the registry, including the EAGLES guidelines, the National Corpus of Polish, and the TermBase eXchange format from the Localization Industry Standards Association.
However, the current edition, ISO 12620:2019, no longer provides a registry of terms for language technology and terminology; it is now restricted to terminology resources, hence the revised title "Management of terminology resources — Data category specifications". Accordingly, ISOcat is no longer actively developed. As of May 2020, the successor systems, CLARIN Concept Registry and DatCatInfo, are only emerging.
For linguistic categories relevant to lexical resources, the ''lexinfo'' vocabulary represents an established community standard, in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework (LMF), lexinfo builds on (the LMF section of) ISOcat.[Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020). ''Linguistic Linked Data'' (pp. 137-160). Springer, Cham.] Unlike ISOcat, however, lexinfo is actively maintained and, as of May 2020, is being extended in a community effort.
Ontologies of Linguistic Annotation (OLiA)
Similar in spirit to GOLD, the Ontologies of Linguistic Annotation (OLiA) provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for linguistic annotation and linguistic corpora in the form of an ontology. In addition, they also provide machine-readable annotation schemes for more than 100 languages, linked with the OLiA reference model. The OLiA ontologies represent a major hub of annotation terminology in the (Linguistic) Linked Open Data cloud, with applications for search, retrieval and machine learning over heterogeneously annotated language resources.
In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines,[Chiarcos, C. (2008). An ontology of linguistic annotations. In ''LDV Forum'' (Vol. 23, No. 1, pp. 1-16).] GOLD, ISOcat, CLARIN Concept Registry, Universal Dependencies,[Christian Chiarcos, Maxim Ionov and Christian Fäth (2020), Annotation interoperability in the post-ISOcat era, LREC 2020] lexinfo, etc. The OLiA ontologies thus enable interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.
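The idea of linking scheme-specific tags to a shared reference inventory, so that heterogeneously annotated data can be queried uniformly, can be sketched as follows. Everything in this sketch is a placeholder: the scheme names, the linking table, the category labels (`Noun`, `CommonNoun`, `ProperNoun`) and the small hierarchy are hypothetical and are not identifiers taken from OLiA, GOLD or ISOcat.

```python
# Hypothetical linking table: (annotation scheme, tag) -> reference category.
# Scheme names, tags and category labels are illustrative placeholders,
# not identifiers taken from OLiA, GOLD or ISOcat.
LINKS = {
    ("brown", "NN"): "CommonNoun",
    ("brown", "NNS"): "CommonNoun",
    ("brown", "NP"): "ProperNoun",
    ("universal", "NOUN"): "Noun",
}

# Hypothetical fragment of a reference hierarchy: category -> parent category.
PARENT = {"CommonNoun": "Noun", "ProperNoun": "Noun"}


def reference_categories(scheme, tag):
    """Return the reference category linked to a tag, followed by its ancestors."""
    category = LINKS.get((scheme, tag))
    result = []
    while category is not None:
        result.append(category)
        category = PARENT.get(category)
    return result


def select(tokens, wanted):
    """Keep word forms whose tag maps to `wanted` or to one of its subcategories."""
    return [
        form
        for form, scheme, tag in tokens
        if wanted in reference_categories(scheme, tag)
    ]


if __name__ == "__main__":
    tokens = [
        ("dogs", "brown", "NNS"),
        ("Paris", "brown", "NP"),
        ("cat", "universal", "NOUN"),
        ("runs", "brown", "VBZ"),
    ]
    print(select(tokens, "Noun"))
    print(select(tokens, "CommonNoun"))
```

Because the query is phrased against the reference categories rather than against any one tagset, adding a further annotation scheme only requires extending the linking table.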
External links
Leipzig Glossing Rules
ISOcat
DatCatInfo Data Category Repository (DCR)