Semantic spaces
Semantic spaces (also referred to as distributed semantic spaces or distributed semantic memory) in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: vocabulary mismatch (the fact that the same meaning can be expressed in many ways) and ambiguity of natural language (the fact that the same term can have several meanings).
The application of semantic spaces in natural language processing (NLP) aims at overcoming limitations of rule-based or model-based approaches operating on the keyword level. The main drawback with these approaches is their brittleness and the large manual effort required to create either rule-based NLP systems or training corpora for model learning. Both rule-based and machine-learning-based models are fixed on the keyword level and break down if the vocabulary differs from that defined in the rules or from the training material used for the statistical models.
Research in semantic spaces dates back to the 1990s. In 1996, two papers were published that raised a lot of attention around the general idea of creating semantic spaces: latent semantic analysis (LSA) and Hyperspace Analogue to Language (HAL). However, their adoption was limited by the large computational effort required to construct and use those semantic spaces.

A breakthrough with regard to the accuracy of modelling associative relations between words (e.g. "spider-web", "lighter-cigarette", as opposed to synonymous relations such as "whale-dolphin", "astronaut-driver") was achieved by explicit semantic analysis (ESA) in 2007. ESA was a novel approach, not based on machine learning, that represented words as vectors with 100,000 dimensions, where each dimension corresponds to an article in Wikipedia. However, practical applications of the approach are limited by the large number of dimensions required in the vectors.
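The idea behind such explicit representations can be illustrated with a small sketch: each word is represented by its weights (for example TF-IDF scores) across a document collection, and relatedness between two words is the cosine similarity of their document-weight vectors. The snippet below is a minimal illustration using scikit-learn on a four-document toy corpus; the corpus, the relatedness helper, and the library choice are assumptions for illustration only, not the original ESA implementation, which used roughly 100,000 Wikipedia articles as dimensions.

```python
# Minimal sketch of an ESA-style explicit vector space (illustrative only).
# Each word is represented by its TF-IDF weights across a document collection;
# here a tiny toy corpus stands in for the Wikipedia articles used by ESA.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "spiders spin webs to catch insects",
    "a lighter is used to light a cigarette",
    "whales and dolphins are marine mammals",
    "astronauts travel to space in rockets",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(documents)   # shape: (n_docs, n_terms)

# One row per term: the term's TF-IDF weight in every document. This row is
# the word's "explicit" semantic vector, with one dimension per document.
term_doc = doc_term.T.tocsr()
vocab = vectorizer.vocabulary_                   # term -> row index

def relatedness(word_a, word_b):
    """Cosine similarity between the explicit vectors of two words."""
    return cosine_similarity(term_doc[vocab[word_a]],
                             term_doc[vocab[word_b]])[0, 0]

print(relatedness("spiders", "webs"))     # associative pair -> high similarity
print(relatedness("spiders", "rockets"))  # unrelated pair  -> low similarity
```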
More recently, advances in neural network techniques, in combination with other new approaches (tensors), have led to a host of recent developments: Word2vec from Google, GloVe from Stanford University, and fastText from Facebook AI Research (FAIR).
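As a sketch of how such neural embedding models are used in practice, the snippet below trains a small word2vec model with the gensim library and queries the learned vector space for nearest neighbours. The library choice, toy corpus, and parameter values are illustrative assumptions rather than anything prescribed by the methods above, and a corpus this small will not produce meaningful vectors; the workflow, however, mirrors how dense embeddings are trained and queried on large corpora.

```python
# Minimal sketch of training and querying a dense word-embedding model with
# gensim's Word2Vec (toy corpus and settings are illustrative assumptions;
# real models are trained on corpora with billions of tokens).
from gensim.models import Word2Vec

# Tokenized toy corpus: each inner list is one sentence.
sentences = [
    ["the", "spider", "spins", "a", "web"],
    ["the", "astronaut", "flies", "a", "rocket"],
    ["a", "lighter", "lights", "a", "cigarette"],
    ["whales", "and", "dolphins", "swim", "in", "the", "ocean"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,   # dense vectors, vs. ~100,000 explicit ESA dimensions
    window=3,         # context window on each side of the target word
    min_count=1,      # keep every word, even in this tiny corpus
    epochs=50,
)

# Each word is now a 50-dimensional real-valued vector.
print(model.wv["spider"].shape)                  # (50,)
# Nearest neighbours by cosine similarity in the learned space.
print(model.wv.most_similar("spider", topn=3))
print(model.wv.similarity("spider", "web"))
```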
See also
* Word embedding
* Semantic folding
* Distributional–relational database