Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc. The development of terminology extraction is also essential to the language industry.

One of the first steps to model a knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature.

Typically, approaches to automatic term extraction make use of linguistic processors (part-of-speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases. Noun phrases include compounds (e.g. "credit card"), adjective noun phrases (e.g. "local tourist information office"), and prepositional noun phrases (e.g. "board of directors"). In English, the first two (compounds and adjective noun phrases) are the most frequent. Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for semantic similarity, knowledge management, human translation and machine translation, etc.
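The two-stage approach described above (linguistic candidate extraction followed by statistical filtering) can be illustrated with a minimal sketch. It assumes the text has already been part-of-speech tagged; the simplified tag names (`ADJ`, `NOUN`, `PREP`), the function names, and the plain frequency threshold are illustrative choices made for this example, not a method prescribed by the literature. Real systems typically use a trained tagger and stronger association or domain-specificity measures.

```python
import re
from collections import Counter

# Map the simplified tags (an assumption of this sketch) to single letters so
# that tag sequences can be matched with an ordinary regular expression.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PREP": "P"}

# Candidate patterns over tag letters:
#   NPN     prepositional noun phrases, e.g. "board of directors"
#   [AN]+N  compounds and adjective noun phrases, e.g. "credit card"
CANDIDATE_PATTERN = re.compile(r"NPN|[AN]+N")


def extract_candidates(tagged_sentence):
    """Return multi-word noun phrase candidates from one tagged sentence."""
    words = [word.lower() for word, _ in tagged_sentence]
    # Tags outside TAG_CODES become "x", which no pattern can match.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_sentence)
    return [
        " ".join(words[m.start():m.end()])
        for m in CANDIDATE_PATTERN.finditer(codes)
    ]


def filter_candidates(tagged_sentences, min_count=2):
    """Keep candidates that occur at least min_count times in the corpus."""
    counts = Counter(
        candidate
        for sentence in tagged_sentences
        for candidate in extract_candidates(sentence)
    )
    return {term: n for term, n in counts.items() if n >= min_count}


# Toy, hand-tagged corpus (hypothetical data for illustration only).
sentences = [
    [("The", "DET"), ("credit", "NOUN"), ("card", "NOUN"),
     ("was", "VERB"), ("declined", "VERB")],
    [("The", "DET"), ("board", "NOUN"), ("of", "PREP"), ("directors", "NOUN"),
     ("issued", "VERB"), ("a", "DET"), ("credit", "NOUN"), ("card", "NOUN")],
    [("The", "DET"), ("local", "ADJ"), ("tourist", "NOUN"),
     ("information", "NOUN"), ("office", "NOUN"), ("closed", "VERB")],
    [("The", "DET"), ("board", "NOUN"), ("of", "PREP"), ("directors", "NOUN"),
     ("met", "VERB")],
]

terms = filter_candidates(sentences)
```

On this toy corpus, "credit card" and "board of directors" each occur twice and survive the filter, while "local tourist information office" occurs once and is dropped. Because regular-expression matches do not overlap, a phrase such as "executive board of directors" would yield only "executive board" here; a fuller implementation would need to enumerate nested candidates.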


Bilingual terminology extraction

The methods for terminology extraction can be applied to parallel corpora. Combined with, for example, co-occurrence statistics, candidates for term translations can be obtained. Bilingual terminology can also be extracted from comparable corpora (corpora containing texts of the same text type and domain that are not translations of each other).
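As a rough illustration of how co-occurrence statistics can propose term translations from a parallel corpus, the sketch below scores every source/target term pair by the Dice coefficient over sentence-aligned data. It assumes that sentence alignment and monolingual term extraction have already been done; the Dice coefficient is just one of several possible association measures, and the function names and sample data are hypothetical.

```python
from collections import Counter


def dice_scores(aligned_pairs):
    """Score (source term, target term) pairs by the Dice coefficient.

    aligned_pairs is a list of (source_terms, target_terms) tuples, one per
    aligned sentence pair, each holding the set of terms found on that side.
    """
    source_counts = Counter()
    target_counts = Counter()
    pair_counts = Counter()
    for source_terms, target_terms in aligned_pairs:
        source_counts.update(source_terms)
        target_counts.update(target_terms)
        # Count each source/target term pair seen in the same aligned unit.
        pair_counts.update((s, t) for s in source_terms for t in target_terms)
    return {
        (s, t): 2 * n / (source_counts[s] + target_counts[t])
        for (s, t), n in pair_counts.items()
    }


def best_translations(aligned_pairs):
    """For each source term, pick the target term with the highest score."""
    best = {}
    for (s, t), score in dice_scores(aligned_pairs).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy English-German aligned term sets (hypothetical data).
pairs = [
    ({"credit card"}, {"Kreditkarte"}),
    ({"credit card", "board of directors"}, {"Kreditkarte", "Vorstand"}),
    ({"board of directors"}, {"Vorstand"}),
]

candidates = best_translations(pairs)
```

With this data, "credit card" co-occurs with "Kreditkarte" in both sentence pairs where either appears, so that pair scores 1.0, while the spurious pairing with "Vorstand" scores 0.5. The output is a ranked list of translation candidates, which would normally still be validated by a terminologist or a further filtering step.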


See also

* Computational linguistics
* Glossary
* Natural language processing
* Domain ontology
* Subject indexing
* Taxonomy (general)
* Terminology
* Text mining
* Text simplification

