Explicit semantic analysis
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus, and a document (a string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is the English Wikipedia, though other corpora, including the Open Directory Project, have been used.

ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization, and has been used by this pair of researchers to compute what they refer to as "semantic relatedness" by means of the cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans", where Wikipedia articles (or ODP entries, or otherwise the titles of documents in the knowledge base corpus) are equated with concepts. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA), because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space.
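
In symbols (the notation here is ours, for illustration, not fixed by the original papers): if \mathbf{v}_w denotes the tf–idf column vector of a word w, then a document T = (w_1, \ldots, w_k) is represented by the centroid

:\mathbf{d}(T) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{v}_{w_i}.

Up to the constant factor 1/k, which cosine similarity ignores, this agrees with the vector summation described under Model below.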


Model

To perform the basic variant of ESA, one starts with a collection of texts, say, all Wikipedia articles; let the number of documents in the collection be N. These are all turned into "bags of words", i.e., term frequency histograms, stored in an inverted index. Using this inverted index, one can find for any word the set of Wikipedia articles containing this word; in the vocabulary of Egozi, Markovitch and Gabrilovich, "each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index." The output of the inverted index for a single-word query is a list of indexed documents (Wikipedia articles), each given a score depending on how often the word in question occurred in it (weighted by the total number of words in the document). Mathematically, this list is an N-dimensional vector of word-document scores, where a document not containing the query word has score zero. To compute the relatedness of two words, one compares their vectors (say \mathbf{u} and \mathbf{v}) by computing the cosine similarity,

:\mathsf{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2} \, \sqrt{\sum_{i=1}^{N} v_i^2}},

and this gives a numeric estimate of the semantic relatedness of the words. The scheme is extended from single words to multi-word texts by simply summing the vectors of all words in the text.
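
The pipeline can be illustrated with a short Python sketch. Everything below is a simplifying assumption rather than the original system: a three-document toy corpus stands in for Wikipedia, word scores are raw frequencies divided by document length instead of full tf–idf weights, and all names are invented.

from collections import Counter, defaultdict
from math import sqrt

# Toy "knowledge base": each document stands in for one Wikipedia article,
# and its title serves as the human-readable concept label.
corpus = {
    "Cat": "cat cats are small felines cats purr",
    "Dog": "dog dogs are loyal dogs bark",
    "Computer": "computer computers run programs",
}

concepts = list(corpus)  # concept labels (document titles)
N = len(concepts)        # number of documents = number of dimensions

# Inverted index: word -> {concept index: score}, where the score is the
# word's frequency in the document, weighted by the document's length.
inverted = defaultdict(dict)
for j, title in enumerate(concepts):
    tokens = corpus[title].split()
    for word, count in Counter(tokens).items():
        inverted[word][j] = count / len(tokens)

def word_vector(word):
    """N-dimensional concept vector for a single word (all zeros if unseen)."""
    row = inverted.get(word, {})
    return [row.get(j, 0.0) for j in range(N)]

def text_vector(text):
    """Multi-word texts are handled by summing the vectors of their words."""
    vec = [0.0] * N
    for word in text.split():
        for j, score in enumerate(word_vector(word)):
            vec[j] += score
    return vec

def cosine(u, v):
    """Cosine similarity sim(u, v) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(word_vector("cats"), word_vector("dogs")))          # 0.0: no toy article mentions both
print(cosine(text_vector("cats purr"), text_vector("felines")))  # 1.0: both trigger only "Cat"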


Analysis

ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as "concepts". To explain this observation, links have been shown between ESA and the generalized vector space model. Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using "a single application of ESA (text similarity)" and "just a single, extremely small and homogenous test collection of 50 news documents".


Applications


Word relatedness

ESA is considered by its authors a measure of semantic relatedness (as opposed to semantic similarity). On datasets used to benchmark relatedness of words, ESA outperforms other algorithms, including WordNet semantic similarity measures and the skip-gram neural network language model (Word2vec).


Document relatedness

ESA is used in commercial software packages for computing relatedness of documents. Domain-specific restrictions on the ESA model are sometimes used to provide more robust document matching.
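
Continuing the toy sketch from the Model section, one hypothetical form such a restriction could take is to keep only a whitelisted subset of concept dimensions before comparing documents; the whitelist and helper below are invented for illustration.

def restrict(vec, allowed):
    """Zero out every concept dimension not in the `allowed` set."""
    return [x if j in allowed else 0.0 for j, x in enumerate(vec)]

# Hypothetical domain restriction: compare documents only through the
# animal-related concepts of the toy corpus.
animal_concepts = {concepts.index("Cat"), concepts.index("Dog")}

u = text_vector("cats purr at my computer")
v = text_vector("felines sleep")
print(cosine(u, v))  # < 1: the off-topic "Computer" dimension lowers the score
print(cosine(restrict(u, animal_concepts), restrict(v, animal_concepts)))  # 1.0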


Extensions

Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA (Potthast, Stein and Anderka, 2008). CL-ESA exploits a document-aligned multilingual reference collection (e.g., again, Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
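
A minimal sketch of the CL-ESA idea, reusing the helpers from the Model section sketch (the two-language aligned toy corpus and all names are invented): each language gets its own inverted index, but aligned documents share the same concept index, so vectors built from different languages live in one space.

# Hypothetical document-aligned corpora: position j in each list describes
# the same concept in both languages.
aligned = {
    "en": ["cat cats purr felines", "dog dogs bark", "computer programs"],
    "de": ["katze katzen schnurren", "hund hunde bellen", "computer programme"],
}

def build_index(docs):
    index = defaultdict(dict)
    for j, doc in enumerate(docs):
        tokens = doc.split()
        for word, count in Counter(tokens).items():
            index[word][j] = count / len(tokens)
    return index

indexes = {lang: build_index(docs) for lang, docs in aligned.items()}
M = len(aligned["en"])  # shared, language-independent concept dimensions

def clesa_vector(text, lang):
    """Language-independent concept vector of `text`, indexed via `lang`."""
    vec = [0.0] * M
    for word in text.split():
        for j, score in indexes[lang].get(word, {}).items():
            vec[j] += score
    return vec

# An English and a German text about cats land on the same concept axis:
print(cosine(clesa_vector("cats purr", "en"), clesa_vector("katzen schnurren", "de")))  # 1.0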


See also

* Topic model


References

Martin Potthast, Benno Stein, and Maik Anderka, "A Wikipedia-based multilingual retrieval model", Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522–530, 2008.

External links


Explicit semantic analysis on Evgeniy Gabrilovich's homepage; has links to implementations.