Ontology learning (ontology extraction, ontology generation, or ontology acquisition) is the automatic or semi-automatic creation of
ontologies
In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains ...
, including extracting the corresponding
domain's terms and the relationships between the
concepts
Concepts are defined as abstract ideas. They are understood to be the fundamental building blocks of the concept behind principles, thoughts and beliefs.
They play an important role in all aspects of cognition. As such, concepts are studied by sev ...
that these terms represent from a
corpus
Corpus is Latin for "body". It may refer to:
Linguistics
* Text corpus, in linguistics, a large and structured set of texts
* Speech corpus, in linguistics, a large set of speech audio files
* Corpus linguistics, a branch of linguistics
Music
* ...
of natural language text, and encoding them with an
ontology language
In computer science and artificial intelligence, ontology languages are formal languages used to construct ontologies. They allow the encoding of knowledge about specific domains and often include reasoning rules that support the processing of th ...
for easy retrieval. As
building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
Typically, the process starts by
extracting terms and concepts or
noun phrase
In linguistics, a noun phrase, or nominal (phrase), is a phrase that has a noun or pronoun as its head or performs the same grammatical function as a noun. Noun phrases are very common cross-linguistically, and they may be the most frequently oc ...
s from plain text using linguistic processors such as
part-of-speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...
and
phrase chunking
Phrase chunking is a phase of natural language processing that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases, abbreviated as NP, VP, and PP, respectively. Typically, each subconstituent or ...
. Then statistical
or symbolic
techniques are used to extract
relation signatures, often based on pattern-based or definition-based hypernym extraction techniques.
Procedure
Ontology learning (OL) is used to (semi-)automatically extract whole ontologies from natural language text.
[Cimiano, Philipp; Völker, Johanna; Studer, Rudi (2006). "Ontologies on Demand? - A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text", ''Information, Wissenschaft und Praxis'', 57, p. 315 - 320, http://people.aifb.kit.edu/pci/Publications/iwp06.pdf (retrieved: 18.06.2012).][Wong, W., Liu, W. & Bennamoun, M. (2012),]
Ontology Learning from Text: A Look back and into the Future
. ACM Computing Surveys, Volume 44, Issue 4, Pages 20:1-20:36. The process is usually split into the following eight tasks, which are not all necessarily applied in every ontology learning system.
Domain terminology extraction
During the domain
terminology extraction
Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a give ...
step, domain-specific terms are extracted, which are used in the following step (concept discovery) to derive concepts. Relevant terms can be determined, e.g., by calculation of the
TF/IDF values or by application of the C-value / NC-value method. The resulting list of terms has to be filtered by a domain expert. In the subsequent step, similarly to coreference resolution in
information extraction
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
, the OL system determines synonyms, because they share the same meaning and therefore correspond to the same concept. The most common methods therefore are clustering and the application of statistical similarity measures.
Concept discovery
In the concept discovery step, terms are grouped to meaning bearing units, which correspond to an abstraction of the world and therefore to
concept
Concepts are defined as abstract ideas. They are understood to be the fundamental building blocks of the concept behind principles, thoughts and beliefs.
They play an important role in all aspects of cognition. As such, concepts are studied by s ...
s. The grouped terms are these domain-specific terms and their synonyms, which were identified in the domain terminology extraction step.
Concept hierarchy derivation
In the concept hierarchy derivation step, the OL system tries to arrange the extracted concepts in a taxonomic structure. This is mostly achieved with unsupervised
hierarchical clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into ...
methods. Because the result of such methods is often noisy, a supervision step, e.g., user evaluation, is added. A further method for the derivation of a concept hierarchy exists in the usage of several patterns that should indicate a
sub- or supersumption relationship. Patterns like “X, that is a Y” or “X is a Y” indicate that X is a subclass of Y. Such pattern can be analyzed efficiently, but they often occur too infrequently to extract enough sub- or supersumption relationships. Instead, bootstrapping methods are developed, which learn these patterns automatically and therefore ensure broader coverage.
Learning of non-taxonomic relations
In the learning of non-taxonomic relations step, relationships are extracted that do not express any sub- or supersumption. Such relationships are, e.g., works-for or located-in. There are two common approaches to solve this subtask. The first is based upon the extraction of anonymous associations, which are named appropriately in a second step. The second approach extracts verbs, which indicate a relationship between entities, represented by the surrounding words. The result of both approaches need to be evaluated by an ontologist to ensure accuracy.
Rule discovery
During
rule discovery,
[Johanna Völker; ]Pascal Hitzler
Pascal Hitzler is a German American computer scientist specializing in Semantic Web and Artificial Intelligence. He is endowed Lloyd T. Smith Creativity in Engineering Chair and Director of the Center for Artificial Intelligence and Data Science ...
; Cimiano, Philipp (2007). "Acquisition of OWL DL Axioms from Lexical Resources", ''Proceedings of the 4th European conference on The Semantic Web'', p. 670 - 685, http://smartweb.dfki.de/Vortraege/lexo_2007.pdf (retrieved: 18.06.2012). axioms (formal description of concepts) are generated for the extracted concepts. This can be achieved, e.g., by analyzing the syntactic structure of a natural language definition and the application of transformation rules on the resulting dependency tree. The result of this process is a list of axioms, which, afterwards, is comprehended to a concept description. This output is then evaluated by an ontologist.
Ontology population
At this step, the ontology is augmented with instances of concepts and properties. For the augmentation with instances of concepts, methods based on the matching of lexico-syntactic patterns are used. Instances of properties are added through the application of
bootstrapping methods, which collect relation tuples.
Concept hierarchy extension
In this step, the OL system tries to extend the taxonomic structure of an existing ontology with further concepts. This can be performed in a supervised manner with a trained classifier or in an unsupervised manner via the application of
similarity measure
In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such meas ...
s.
Frame and Event detection
During frame/event detection, the OL system tries to extract complex relationships from text, e.g., who departed from where to what place and when. Approaches range from applying SVM with
kernel method
In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...
s to
semantic role labeling In natural language processing, semantic role labeling (also called shallow semantic parsing or slot-filling) is the process that assigns labels to words or phrases in a sentence that indicates their semantic role in the sentence, such as that of a ...
(SRL)
[Coppola B.; Gangemi A.; Gliozzo A.; Picca D.; Presutti V. (2009).]
Frame Detection over the Semantic Web
, ''Proceedings of the European Semantic Web Conference (ESWC2009), Springer, 2009.'' to deep
semantic parsing Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Application ...
techniques.
[Presutti V.; Draicchio F.; Gangemi A. (2009).]
Knowledge extraction based on Discourse Representation Theory and Linguistic Frames
, ''Proceedings of the Conference on Knowledge Engineering and Knowledge Management (EKAW2012), LNCS, Springer, 2012.''
Tools
Dog4Dag (Dresden Ontology Generator for Directed Acyclic Graphs) is an ontology generation plugin for Protégé 4.1 and OBOEdit 2.1. It allows for term generation, sibling generation, definition generation, and relationship induction. Integrated into Protégé 4.1 and OBO-Edit 2.1, DOG4DAG allows ontology extension for all common ontology formats (e.g., OWL and OBO). Limited largely to EBI and Bio Portal lookup service extensions.
[Thomas Wächter, Götz Fabian, Michael Schroeder: DOG4DAG: semi-automated ontology generation in OBO-Edit and Protégé. SWAT4LS London, 2011. http://www.biotec.tu-dresden.de/research/schroeder/dog4dag/]
See also
*
Automatic taxonomy construction
Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence ...
*
Computational linguistics
Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
*
Domain ontology
In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains ...
*
Information extraction
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
*
Natural language understanding
Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an ...
*
Semantic Web
*
Text mining
Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
Bibliography
* P. Buitelaar, P. Cimiano (Eds.)
Ontology Learning and Population: Bridging the Gap between Text and Knowledge ''Series information for Frontiers in Artificial Intelligence and Applications'', IOS Press, 2008.
* P. Buitelaar, P. Cimiano, and B. Magnini (Eds.)
Ontology Learning from Text: Methods, Evaluation and Applications ''Series information for Frontiers in Artificial Intelligence and Applications'', IOS Press, 2005.
* Wong, W. (2009),
Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge. Doctor of Philosophy thesis, University of Western Australia.
* Wong, W., Liu, W. & Bennamoun, M. (2012),
Ontology Learning from Text: A Look back and into the Future. ACM Computing Surveys, Volume 44, Issue 4, Pages 20:1-20:36.
* Thomas Wächter, Götz Fabian, Michael Schroeder: DOG4DAG: semi-automated ontology generation in OBO-Edit and Protégé. SWAT4LS London, 2011.
References
{{Natural Language Processing
Natural language processing
Ontology learning (computer science)
eu:Terminologia ateratze