DisCoCat

DisCoCat (Categorical Compositional Distributional) is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar (usually a pregroup grammar) are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
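As an illustration, consider a transitive sentence such as "Alice loves Bob". The worked equation below is a sketch under assumptions that are not fixed by the text above: nouns are given the pregroup type n, the transitive verb the type n^r s n^l, and N and S denote the vector spaces assigned to n and s, with orthonormal bases e_i of N and f_j of S. The coefficients a_i, b_k and L_{ijk} are placeholder names for the components of the word vectors.

```latex
\begin{align*}
% Grammatical derivation: the noun types cancel against the verb's adjoints.
n \cdot (n^{r}\, s\, n^{l}) \cdot n
  \;=\; (n\, n^{r})\; s\; (n^{l}\, n) \;\le\; s \\[4pt]
% Word meanings as vectors in the spaces given by their types.
\overrightarrow{\text{Alice}} = \sum_i a_i\, e_i, \qquad
\overrightarrow{\text{loves}} = \sum_{i,j,k} L_{ijk}\; e_i \otimes f_j \otimes e_k, \qquad
\overrightarrow{\text{Bob}} = \sum_k b_k\, e_k \\[4pt]
% Each cancellation becomes an inner product \epsilon_N : N \otimes N \to \mathbb{R}.
\overrightarrow{\text{Alice loves Bob}}
  \;=\; (\epsilon_N \otimes 1_S \otimes \epsilon_N)
        \bigl(\overrightarrow{\text{Alice}} \otimes \overrightarrow{\text{loves}} \otimes \overrightarrow{\text{Bob}}\bigr)
  \;=\; \sum_{i,j,k} a_i\, L_{ijk}\, b_k\; f_j
\end{align*}
```

The result is a vector in S alone: the grammar dictates which indices are contracted, and the word vectors supply the numbers.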


History

The framework was first introduced by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark as an application of categorical quantum mechanics to natural language processing. It started with the observation that pregroup grammars and quantum processes share a common mathematical structure: they both form a rigid category (also known as a non-symmetric compact closed category). As such, they both benefit from a graphical calculus, which allows purely diagrammatic reasoning. Although the analogy with quantum mechanics was kept informal at first, it eventually led to the development of quantum natural language processing.


Definition

There are multiple definitions of DisCoCat in the literature, depending on the choice made for the compositional aspect of the model. What all existing versions have in common is a categorical definition of DisCoCat as a structure-preserving functor from a category of grammar to a category of semantics, where the latter usually encodes the distributional hypothesis.

The original paper used the categorical product of FinVect with a pregroup seen as a posetal category. This approach does not work: all parallel arrows of a posetal category are equal, which means that pregroups cannot distinguish between different grammatical derivations for the same syntactically ambiguous sentence. Instead, one needs to consider the free rigid category \mathbf{G} generated by the pregroup grammar. That is, \mathbf{G} has generating objects for the words and the basic types of the grammar, and generating arrows w \to t for the dictionary entries which assign a pregroup type t to a word w. The arrows f: w_1 \dots w_n \to s are grammatical derivations for the sentence w_1 \dots w_n, which can be represented as string diagrams with cups and caps, i.e. adjunction units and counits.

With this definition of pregroup grammars as free rigid categories, DisCoCat models can be defined as strong monoidal functors F : \mathbf{G} \to \mathbf{FinVect}. Spelled out in detail, they assign a finite-dimensional vector space F(x) to each basic type x and a vector F(w) \in F(t) = F(t_1) \otimes \dots \otimes F(t_n) in the appropriate tensor product space to each dictionary entry w \to t where t = t_1 \dots t_n (objects for words are sent to the monoidal unit, i.e. F(w) = 1). The meaning of a sentence f: w_1 \dots w_n \to s is then given by a vector F(f) \in F(s) which can be computed as the contraction of a tensor network.

\mathbf{FinVect} is chosen as the category of semantics because vector spaces are the usual setting for distributional semantics in computational linguistics and natural language processing. The idea underlying the distributional hypothesis, "A word is characterized by the company it keeps", is particularly relevant when assigning meaning to words like adjectives or verbs, whose semantic connotation is strongly dependent on context.
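The tensor network contraction described above can be sketched in NumPy for a toy transitive sentence. This is a minimal illustration, not the implementation used by any DisCoCat library: the dimensions, the word vectors, the verb tensor and the function name `sentence_meaning` are all hypothetical choices made for the example. It assumes nouns live in a space N, sentences in a space S, and a transitive verb in N ⊗ S ⊗ N.

```python
import numpy as np

# Hypothetical toy model: 2-dimensional noun space N and sentence space S.
alice = np.array([1.0, 0.0])  # vector in N
bob = np.array([0.0, 1.0])    # vector in N

# Transitive verb as a tensor in N (x) S (x) N,
# with index order (subject, sentence, object).
loves = np.zeros((2, 2, 2))
loves[0, 0, 1] = 1.0  # subject=alice, object=bob -> first basis vector of S
loves[1, 1, 0] = 1.0  # subject=bob, object=alice -> second basis vector of S


def sentence_meaning(subj, verb, obj):
    """Contract the subject and object wires of the verb tensor.

    The two contractions play the role of the cups in the string
    diagram; the only open wire left is the sentence index s.
    """
    return np.einsum("i,isj,j->s", subj, verb, obj)


alice_loves_bob = sentence_meaning(alice, loves, bob)
bob_loves_alice = sentence_meaning(bob, loves, alice)
print(alice_loves_bob, bob_loves_alice)
```

Because the verb tensor is not symmetric in its subject and object indices, swapping the two nouns yields a different sentence vector, which is the sense in which the grammatical derivation determines how the word meanings are composed.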


Variations

Variations of DisCoCat have been proposed with a different choice for the grammar category. The main motivation is that pregroup grammars have been proved to be weakly equivalent to context-free grammars, i.e. they generate exactly the context-free languages. One such variation uses combinatory categorial grammar as the grammar category.


List of linguistic phenomena

The DisCoCat framework has been used to study the following phenomena from linguistics:
* Entailment
* Coordination
* Hyponymy and hypernymy
* Ambiguity with density matrices
* Discourse analysis
* Anaphora and ellipsis
* Language evolution


Applications in NLP

The DisCoCat framework has been applied to solve the following tasks in natural language processing:
* Word-sense disambiguation
* Semantic similarity
* Question answering
* Machine translation
* Anaphora resolution


See also

* Lambek calculus
* Pregroup grammar
* Distributional semantics
* Principle of compositionality
* String diagram
* Categorical quantum mechanics
* Quantum natural language processing


External links


* DisCoPy, a Python toolkit for computing with string diagrams
* lambeq, a Python library for quantum natural language processing

