A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.

Concept and applications

The concept of relationship extraction was first introduced during the 7th Message Understanding Conference in 1998. Relationship extraction involves identifying relations between entities, and it usually focuses on binary relations. Application domains where relationship extraction is useful include the extraction of gene-disease relationships and protein-protein interactions. Current relationship extraction studies use machine learning techniques that treat relationship extraction as a classification problem.
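To illustrate what treating relationship extraction as a classification problem can look like, the following is a minimal sketch, not a method taken from any particular study. It assumes a candidate entity pair has already been identified in a tokenized sentence, uses the words between the two entities as features, and trains a small naive Bayes classifier with add-one smoothing. The function and class names, the two relation labels, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, head_idx, tail_idx):
    """Return the lowercased tokens strictly between the two entity positions."""
    lo, hi = sorted((head_idx, tail_idx))
    return [t.lower() for t in tokens[lo + 1:hi]]


class NaiveBayesRelationClassifier:
    """Multinomial naive Bayes over context words, with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # examples: iterable of (feature_list, relation_label) pairs
        for features, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(features)
            self.vocab.update(features)

    def predict(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior plus smoothed log likelihood of each context word
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in features:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: (tokens, head index, tail index, relation label)
train = [
    (["BRCA1", "is", "associated", "with", "cancer"], 0, 4, "gene-disease"),
    (["CFTR", "mutations", "cause", "fibrosis"], 0, 3, "gene-disease"),
    (["RAF1", "binds", "to", "MEK1"], 0, 3, "protein-protein"),
    (["TP53", "interacts", "with", "MDM2"], 0, 3, "protein-protein"),
]

clf = NaiveBayesRelationClassifier()
clf.fit((context_features(toks, h, t), label) for toks, h, t, label in train)

query = ["HTT", "mutations", "cause", "disease"]
print(clf.predict(context_features(query, 0, 3)))
```

Practical systems use far richer features or learned representations, but the framing is the same: each candidate entity pair becomes one instance, and the relation type is its class label.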

Never-Ending Language Learning (NELL) is a semantic machine learning system developed by a research team at Carnegie Mellon University that extracts relationships from the open web.

Approaches

There are several methods for extracting relationships, including text-based relationship extraction. These methods either rely on pretrained relationship structure information or learn the structure in order to reveal relationships. Another approach to this problem involves the use of domain ontologies. There is also an approach based on the visual detection of meaningful relationships in the parametric values of objects listed in a data table, which shift positions as the table is permuted automatically under the control of the software user.

The poor coverage, rarity and development cost of structured resources such as semantic lexicons (e.g. WordNet, UMLS) and domain ontologies (e.g. the Gene Ontology) have given rise to new approaches based on broad, dynamic background knowledge on the Web. For instance, the ARCHILES technique uses only Wikipedia and search engine page counts to acquire coarse-grained relations for constructing lightweight ontologies.

The relationships can be represented using a variety of formalisms and languages. One such representation language for data on the Web is RDF.

More recently, end-to-end systems that jointly learn to extract entity mentions and their semantic relations have been proposed, with strong potential to obtain high performance. Most of the reported systems have demonstrated their approach on English datasets. However, data and systems have also been described for other languages, e.g., Russian and Vietnamese.
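As an illustration of the RDF representation mentioned in this section, the sketch below serializes extracted (head, relation, tail) triples as N-Triples, one of the plain-text RDF syntaxes. It is a minimal example under stated assumptions: the `http://example.org/` namespace, the helper names `iri` and `to_ntriples`, and the sample relation are hypothetical, and every term is emitted as an IRI rather than a literal.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def iri(kind, name):
    """Build an IRI such as <http://example.org/entity/breast_cancer>."""
    return f"<{BASE}{kind}/{quote(name.replace(' ', '_'))}>"


def to_ntriples(relations):
    """Serialize (head, relation, tail) tuples as N-Triples, skipping exact repeats."""
    lines = []
    for head, relation, tail in relations:
        line = f"{iri('entity', head)} {iri('relation', relation)} {iri('entity', tail)} ."
        if line not in lines:
            lines.append(line)
    return "\n".join(lines)


extracted = [
    ("BRCA1", "associated_with", "breast cancer"),
    ("BRCA1", "associated_with", "breast cancer"),  # repeated mention of the same relation
]
print(to_ntriples(extracted))
```

Each output line is a subject, predicate, object statement terminated by a period, so the result can be loaded by any RDF toolkit that reads N-Triples.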

Datasets

Researchers have constructed multiple datasets for benchmarking relationship extraction methods. One such dataset is DocRED, a document-level relationship extraction dataset released in 2019. It uses relations from Wikidata and text from the English Wikipedia. The dataset has been used by other researchers, and a prediction competition has been set up at CodaLab.
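To give a sense of how a document-level dataset of this kind can be consumed, the following sketch reads one DocRED-style record and lists its annotated relations. The field names (`sents`, `vertexSet`, `labels`, `h`, `t`, `r`, `evidence`) are assumed to follow the layout of the published DocRED JSON files and should be checked against the actual release. The record itself is a hypothetical example written for this illustration, not an entry from the dataset, and `relation_triples` is a hypothetical helper.

```python
import json

# Hypothetical record in an assumed DocRED-style layout (not real dataset content).
record_json = """
{
  "title": "Hanoi",
  "sents": [["Hanoi", "is", "the", "capital", "of", "Vietnam", "."]],
  "vertexSet": [
    [{"name": "Hanoi", "sent_id": 0, "pos": [0, 1], "type": "LOC"}],
    [{"name": "Vietnam", "sent_id": 0, "pos": [5, 6], "type": "LOC"}]
  ],
  "labels": [{"h": 0, "t": 1, "r": "P17", "evidence": [0]}]
}
"""


def relation_triples(doc):
    """Return (head name, relation id, tail name) for each labeled entity pair."""
    entities = doc["vertexSet"]
    triples = []
    for label in doc.get("labels", []):
        # Each entity is a list of mentions; use the first mention's surface form.
        head = entities[label["h"]][0]["name"]
        tail = entities[label["t"]][0]["name"]
        triples.append((head, label["r"], tail))
    return triples


doc = json.loads(record_json)
print(relation_triples(doc))
```

The relation identifier here is a Wikidata property ID (P17, "country"), reflecting the dataset's use of Wikidata relations over Wikipedia text.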

