HOME

TheInfoList



OR:

In
linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
, statistical semantics applies the methods of
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
to the problem of determining the meaning of words or phrases, ideally through
unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
, to a degree of precision at least sufficient for the purpose of
information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
.


History

The term ''statistical semantics'' was first used by
Warren Weaver Warren Weaver (July 17, 1894 – November 24, 1978) was an American scientist, mathematician, and science administrator. He is widely recognized as one of the pioneers of machine translation and as an important figure in creating support for scien ...
in his well-known paper on
machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...
. He argued that
word sense disambiguation Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to consci ...
for machine translation should be based on the
co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can ...
frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J.R. Firth. This assumption is known in
linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
as the
distributional hypothesis Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. T ...
. Emile Delavenay defined ''statistical semantics'' as the "statistical study of meanings of words and their frequency and order of recurrence". "
Furnas Furnas is a civil parish in the municipality of Povoação on the island of São Miguel in the Portuguese Azores. The population in 2011 was 1,439, in an area of 34.43 km2. The parish is one of the largest in the island and in the Azores. I ...
et al. 1983" is frequently cited as a foundational contribution to statistical semantics. An early success in the field was
latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
.


Applications

Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy Philosophy (f ...
, by applying statistical techniques to large corpora: * Measuring the similarity in word meanings * Measuring the similarity in word relations * Modeling similarity-based generalization * Discovering words with a given relation * Classifying relations between words * Extracting keywords from documents * Measuring the cohesiveness of text * Discovering the different senses of words * Distinguishing the different senses of words * Subcognitive aspects of words * Distinguishing praise from criticism


Related fields

Statistical semantics focuses on the meanings of common words and the relations between common words, unlike
text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of
computational semantics Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural-language processing and computatio ...
, which is in turn a subfield of
computational linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
and
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...
. Many of the applications of statistical semantics (listed above) can also be addressed by
lexicon A lexicon is the vocabulary of a language or branch of knowledge (such as nautical or medical). In linguistics, a lexicon is a language's inventory of lexemes. The word ''lexicon'' derives from Koine Greek language, Greek word (), neuter of () ...
-based algorithms, instead of the
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another advantage is that they are usually easier to adapt to new languages or to noisier new text types from e.g. social media than lexicon-based algorithms are. However, the best performance on an application is often achieved by combining the two approaches.


See also

*
Co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can ...
*
Computational linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
*
Information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
*
Latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
*
Latent semantic indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
*
Semantic analytics Semantic analytics, also termed ''semantic relatedness'', is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF. Semantic analytics measures the relate ...
*
Semantic similarity Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools ...
*
Statistical natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
*
Text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical a ...
*
Text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
* Web mining


References


Sources

* * *: Reprinted in * * * * * * * * * * * * * * * * * * * * {{DEFAULTSORT:Statistical Semantics Applications of artificial intelligence Computational linguistics Information retrieval techniques Semantics Statistical natural language processing Applied statistics Computational fields of study