In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

History

The term ''statistical semantics'' was first used by Warren Weaver in his well-known paper on machine translation. He argued that word sense disambiguation for machine translation should be based on the co-occurrence frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J.R. Firth. This assumption is known in linguistics as the distributional hypothesis. Emile Delavenay defined ''statistical semantics'' as the "statistical study of meanings of words and their frequency and order of recurrence". "Furnas et al. 1983" is frequently cited as a foundational contribution to statistical semantics. An early success in the field was latent semantic analysis.
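The co-occurrence idea described above can be illustrated with a small sketch. It is not Weaver's own procedure; the function name `context_counts`, the window size, and the toy sentence are assumptions chosen only for illustration. It counts which words appear within a fixed window around each occurrence of a target word.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count words within `window` positions of each occurrence of `target`.

    Hypothetical helper for illustration; the window size is an assumption.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts

# Toy sentence (invented) in which "bank" appears in two different senses.
corpus = "the bank raised interest rates while the river bank flooded the road".split()
counts = context_counts(corpus, "bank", window=2)
print(counts.most_common())
```

In this toy sentence, the context words "interest" and "river" point to different senses of "bank", which is the kind of evidence a co-occurrence-based disambiguation method would rely on.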

Applications

Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of semantics, by applying statistical techniques to large corpora:

* Measuring the similarity in word meanings
* Measuring the similarity in word relations
* Modeling similarity-based generalization
* Discovering words with a given relation
* Classifying relations between words
* Extracting keywords from documents
* Measuring the cohesiveness of text
* Discovering the different senses of words
* Distinguishing the different senses of words
* Subcognitive aspects of words
* Distinguishing praise from criticism
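As a rough illustration of the first application, measuring the similarity in word meanings, the sketch below builds co-occurrence vectors from a tiny corpus and compares them with cosine similarity. Cosine similarity is one common choice, not the only one; the helper names `cooccurrence_vectors` and `cosine`, the window size, and the three sentences are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (invented): "cat" and "dog" share more contexts than "cat" and "car".
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vecs = cooccurrence_vectors(sentences, window=2)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_car = cosine(vecs["cat"], vecs["car"])
print(sim_cat_dog, sim_cat_car)
```

Under the distributional hypothesis, words that occur in similar contexts receive similar vectors, so "cat" scores closer to "dog" than to "car" here. Real systems would use far larger corpora and usually reweight the raw counts.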

Related fields

Statistical semantics focuses on the meanings of common words and the relations between common words, unlike text mining, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of computational semantics, which is in turn a subfield of computational linguistics and natural language processing.

Many of the applications of statistical semantics (listed above) can also be addressed by lexicon-based algorithms, instead of the corpus-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another advantage is that they are usually easier than lexicon-based algorithms to adapt to new languages or to noisier text types, such as social media. However, the best performance on an application is often achieved by combining the two approaches.

