UBY is a large-scale lexical-semantic resource for natural language processing (NLP) developed at the Ubiquitous Knowledge Processing Lab (UKP) in the Department of Computer Science of the Technische Universität Darmstadt. UBY is based on the ISO standard Lexical Markup Framework (LMF) and combines information from several expert-constructed and collaboratively constructed resources for English and German. UBY applies a word sense alignment approach (a subfield of word sense disambiguation) for combining information about nouns and verbs. Currently, UBY contains 12 integrated resources in English and German.

Included resources

* English resources: WordNet, Wiktionary, Wikipedia, FrameNet, VerbNet, OmegaWiki

* German resources: German Wikipedia, German Wiktionary, OntoWiktionary, GermaNet and IMSLex-Subcat

* Multilingual resources: OmegaWiki

Format

UBY-LMF is a format for standardizing lexical resources for Natural Language Processing (NLP). UBY-LMF conforms to the ISO standard for lexicons: LMF, designed within the ISO-TC37, and constitutes a so-called serialization of this abstract standard. In accordance with the LMF, all attributes and other linguistic terms introduced in UBY-LMF refer to standardized descriptions of their meaning in ISOCat.
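As a rough illustration of what a serialization of an LMF-style lexicon model can look like, the following Python sketch builds a tiny XML document with the standard library. The element names follow the LMF core classes (LexicalResource, Lexicon, LexicalEntry, Lemma, FormRepresentation, Sense); the attribute names, the helper functions `build_entry` and `build_resource`, and the sample data are assumptions made for this example and are not taken from the actual UBY-LMF definition.

```python
import xml.etree.ElementTree as ET


def build_entry(lexicon, entry_id, lemma, pos, senses):
    """Attach one LexicalEntry with a Lemma and its Senses to a Lexicon element."""
    # Attribute names here are illustrative, not the real UBY-LMF attributes.
    entry = ET.SubElement(lexicon, "LexicalEntry", id=entry_id, partOfSpeech=pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    ET.SubElement(lemma_el, "FormRepresentation", writtenForm=lemma)
    for sense_id, definition in senses:
        sense = ET.SubElement(entry, "Sense", id=sense_id)
        ET.SubElement(sense, "Definition", text=definition)
    return entry


def build_resource(name, language, entries):
    """Build a LexicalResource holding a single Lexicon for one language."""
    root = ET.Element("LexicalResource", name=name)
    lexicon = ET.SubElement(root, "Lexicon", languageIdentifier=language)
    for entry in entries:
        build_entry(lexicon, **entry)
    return root


# Toy data: one noun with two senses.
resource = build_resource("ToyResource", "en", [
    {
        "entry_id": "le1",
        "lemma": "bank",
        "pos": "noun",
        "senses": [
            ("s1", "a financial institution"),
            ("s2", "sloping land beside a river"),
        ],
    },
])

xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that every source resource is mapped onto the same small set of classes, so that tools can process all of them uniformly.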

Availability and versions

UBY is available as part of the open resource repository DKPro. DKPro UBY is a Java framework for creating and accessing sense-linked lexical resources in accordance with the UBY-LMF lexicon model. While the code of UBY is licensed under a mix of free licenses such as GPL and CC BY-SA, some of the included resources are under different licenses, such as academic use only.

There is also a Semantic Web version of UBY called lemonUby. lemonUby is based on the lemon model as proposed in the Monnet project. lemon is a model for representing lexicons and machine-readable dictionaries and for linking them to the Semantic Web and the Linked Data cloud.

UBY vs. BabelNet

BabelNet is an automatically constructed lexical-semantic resource that links Wikipedia to the most popular computational lexicons such as WordNet. At first glance, UBY and BabelNet seem to be identical and competing projects; however, the two resources follow different philosophies. In its early stage, BabelNet was primarily based on the alignment of WordNet and Wikipedia, which by the very nature of Wikipedia implied a strong focus on nouns, and especially named entities. Later on, the focus of BabelNet shifted more towards other parts of speech. UBY, however, focused from the very beginning on verb information, especially syntactic information, which is contained in resources such as VerbNet.

Another main difference is that UBY models the other resources completely and independently from each other, so that UBY can be used as a wholesale replacement for each of the contained resources. Collective access to multiple resources is provided through the available resource alignments. Moreover, the LMF model in UBY allows a unified way of accessing all resources together as well as individual resources. BabelNet, by contrast, follows an approach similar to WordNet and bakes selected information types into so-called Babel Synsets. This makes access to and processing of the knowledge more convenient; however, it blurs the lines between the linked knowledge bases. Additionally, BabelNet enriches the original resources, e.g., by providing automatically created translations for concepts which are not lexicalized in a particular language. Although this greatly increases coverage for multilingual applications, automatically inferred information is always prone to a certain degree of error.

In summary, due to the listed differences between the two resources, one or the other might be preferred depending on the particular application scenario. In fact, the two resources can be used together to provide extensive lexicographic knowledge, especially if they are linked to each other. The open and well-documented structure of the two resources is a crucial step towards this goal.
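The idea of keeping resources separate and combining them only through sense alignments can be sketched in a few lines of Python. This is a toy model, not the DKPro UBY API: the resource names, sense identifiers, record fields, and the functions `build_index`, `find_resource` and `collect` are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Each resource keeps its own sense inventory, untouched by the others.
# All identifiers and records below are made-up sample data.
RESOURCES = {
    "ResourceA": {
        "A:give-1": {"lemma": "give", "definition": "transfer possession of something"},
    },
    "ResourceB": {
        "B:give-13.1": {"lemma": "give", "syntax": "NP V NP PP.recipient"},
    },
}

# Alignments are stored outside the resources as pairs of sense identifiers.
ALIGNMENTS = [("A:give-1", "B:give-13.1")]


def build_index(alignments):
    """Turn the list of aligned pairs into a symmetric lookup table."""
    index = defaultdict(set)
    for left, right in alignments:
        index[left].add(right)
        index[right].add(left)
    return index


def find_resource(sense_id):
    """Return the name of the resource that defines the given sense."""
    for name, senses in RESOURCES.items():
        if sense_id in senses:
            return name
    raise KeyError(sense_id)


def collect(sense_id, index):
    """Gather the record of a sense and of every sense aligned with it."""
    combined = {}
    for sid in sorted({sense_id} | index[sense_id]):
        resource = find_resource(sid)
        combined[resource] = RESOURCES[resource][sid]
    return combined


index = build_index(ALIGNMENTS)
info = collect("A:give-1", index)
for resource, record in info.items():
    print(resource, record)
```

Because the alignments live outside the sense inventories, each resource can still be queried on its own, while a lookup that follows the alignments returns the combined view without merging the underlying records.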

Applications

UBY has been successfully used in different NLP tasks such as word sense disambiguation, word sense clustering, verb sense labeling and text classification. UBY also inspired other projects on the automatic construction of lexical-semantic resources. Furthermore, lemonUby was used to improve machine translation results, especially for finding translations of unknown words (J. P. McCrae, P. Cimiano: Mining translations from the web of open linked data, in: Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction, pp. 9-13, 2013).

External links

UBY website

UBY Browser

DKPro UBY project on Github

lemonUBY
