Language resource

In linguistics and language technology, a language resource is a "[c]omposition of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications" (LD4LT (2020), ''The Metashare Ontology as Created by the LD4LT Community Group'', W3C Community Group Linked Data for Language Technology (LD4LT), development branch, version of Mar 10, 2020).
According to Bird & Simons (2003), this includes
# data, i.e., "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar",
# tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data", and
# advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data".

The last of these is usually referred to as "best practices" or "(community) standards". In a narrower sense, the term ''language resource'' is applied specifically to resources that are available in ''digital form'', "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management".


Typology

As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LREMap, METASHARE, and, for data, the LLOD classification). Important classes of language resources include
# data
## lexical resources, e.g., machine-readable dictionaries,
## linguistic corpora, i.e., digital collections of natural language data,
## linguistic databases such as the Cross-Linguistic Linked Data collection,
# tools
## linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools),
## applications for search and retrieval over such data (corpus management systems) and for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.),
# metadata and vocabularies
## vocabularies, repositories of linguistic terminology and language metadata, e.g., MetaShare (for language resource metadata), the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource), or the Glottolog database (identifiers for language varieties and bibliographical database).
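
The classes listed above can be written down as a small lookup structure. The following Python sketch is illustrative only: the dictionary keys, the subclass labels and the names `TYPOLOGY`, `LanguageResource` and `top_level_class` are assumptions made for this example and do not follow LREMap, METASHARE or the LLOD classification.

```python
from dataclasses import dataclass

# Hypothetical encoding of the classes named in the typology above.
# The labels are illustrative, not taken from any published standard.
TYPOLOGY = {
    "data": ["lexical resource", "corpus", "linguistic database"],
    "tools": ["annotation tool", "corpus management system", "automated annotation"],
    "metadata and vocabularies": ["vocabulary"],
}


@dataclass(frozen=True)
class LanguageResource:
    name: str
    subclass: str  # must be one of the labels listed in TYPOLOGY


def top_level_class(resource):
    """Return the top-level class whose subclasses include resource.subclass."""
    for top, subclasses in TYPOLOGY.items():
        if resource.subclass in subclasses:
            return top
    raise ValueError(f"unknown subclass: {resource.subclass!r}")


# Examples drawn from the resources mentioned in the text.
flex = LanguageResource("FLEx", "annotation tool")
glottolog = LanguageResource("Glottolog", "vocabulary")
print(top_level_class(flex))
print(top_level_class(glottolog))
```

A flat mapping is enough here because the typology has only two levels; a resource that falls under several classes would need a list of subclasses instead of a single label.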


Language resource publication, dissemination and creation

A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:
* a series of International Conferences on Language Resources and Evaluation (LREC),
* the European Language Resources Association (ELRA, EU-based) and the Linguistic Data Consortium (LDC, US-based), which represent commercial hosting and dissemination platforms for language resources,
* the Open Language Archives Community (OLAC), which provides and aggregates language resource metadata,
* the ''Language Resources and Evaluation Journal'' (LREJ),
* the European Language Grid, a European platform for language technologies (e.g., services), data and resources.

The development of standards and best practices for language resources is the subject of several community groups and standardization efforts, including:
* ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources,
* the W3C Community Group ''Best Practices for Multilingual Linked Open Data'' (BPMLOD), working on best practice recommendations for publishing language resources as Linked Data or in RDF,
* the W3C Community Group ''Linked Data for Language Technology'' (LD4LT), working on linguistic annotations on the web and language resource metadata,
* the W3C Community Group ''Ontology-Lexica'' (OntoLex), working on lexical resources,
* the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources and developing the Linguistic Linked Open Data cloud,
* the Text Encoding Initiative (TEI), working on XML-based specifications for language resources and digitally edited text.
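
To give a rough idea of what publishing language resource metadata in RDF can look like, the following Python sketch serializes a metadata record as N-Triples. It is a minimal illustration under stated assumptions: the resource URI under `example.org`, the record contents and the function names are hypothetical, and the choice of Dublin Core terms as property vocabulary is made for this example only, not prescribed by any of the groups listed above.

```python
# Namespace of the Dublin Core metadata terms (title, language, type, ...).
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(subject_uri, metadata):
    """Serialize a flat {term: value} record as one N-Triples line per entry."""
    lines = []
    for term, value in sorted(metadata.items()):
        lines.append(
            f'<{subject_uri}> <{DCTERMS}{term}> "{escape_literal(value)}" .'
        )
    return "\n".join(lines)


# Hypothetical metadata record for an invented corpus.
record = {"title": "Example corpus", "language": "en", "type": "corpus"}
print(to_ntriples("http://example.org/resource/example-corpus", record))
```

Each output line is a subject, a property and a plain string literal. A real publication would more likely use an RDF library and typed or language-tagged literals.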

