Chomsky–Schützenberger Enumeration Theorem
   HOME

TheInfoList



OR:

In
formal language theory In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of symb ...
, the Chomsky–Schützenberger enumeration theorem is a theorem derived by
Noam Chomsky Avram Noam Chomsky (born December 7, 1928) is an American public intellectual: a linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called "the father of modern linguistics", Chomsky is ...
and
Marcel-Paul Schützenberger Marcel-Paul "Marco" Schützenberger (24 October 1920 – 29 July 1996) was a French mathematician and Doctor of Medicine. He worked in the fields of formal language, combinatorics, and information theory.Herbert Wilf, Dominique Foata, ''et al.' ...
about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of
formal languages In logic, mathematics, computer science, and linguistics, a formal language consists of string (computer science), words whose symbol (formal), letters are taken from an alphabet (formal languages), alphabet and are well-formedness, well-formed ...
and
abstract algebra In mathematics, more specifically algebra, abstract algebra or modern algebra is the study of algebraic structures. Algebraic structures include groups, rings, fields, modules, vector spaces, lattices, and algebras over a field. The term ''a ...
.


Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed. Let \mathbb denote the set of nonnegative integers. A ''power series'' over \mathbb is an
infinite series In mathematics, a series is, roughly speaking, a description of the operation of adding infinitely many quantities, one after the other, to a given starting quantity. The study of series is a major part of calculus and its generalization, math ...
of the form :f = f(x) = \sum_^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots with coefficients a_k in \mathbb. The ''multiplication'' of two formal
power series In mathematics, a power series (in one variable) is an infinite series of the form \sum_^\infty a_n \left(x - c\right)^n = a_0 + a_1 (x - c) + a_2 (x - c)^2 + \dots where ''an'' represents the coefficient of the ''n''th term and ''c'' is a const ...
f and g is defined in the expected way as the
convolution In mathematics (in particular, functional analysis), convolution is a operation (mathematics), mathematical operation on two function (mathematics), functions ( and ) that produces a third function (f*g) that expresses how the shape of one is ...
of the sequences a_n and b_n: :f(x)\cdot g(x) = \sum_^\infty \left(\sum_^k a_i b_\right) x^k. In particular, we write f^2 = f(x)\cdot f(x), f^3 = f(x)\cdot f(x)\cdot f(x), and so on. In analogy to
algebraic numbers An algebraic number is a number that is a root of a non-zero polynomial in one variable with integer (or, equivalently, rational) coefficients. For example, the golden ratio, (1 + \sqrt)/2, is an algebraic number, because it is a root of the po ...
, a power series f(x) is called ''algebraic'' over \mathbb(x), if there exists a finite set of polynomials p_0(x), p_1(x), p_2(x), \ldots , p_n(x) each with
rational Rationality is the quality of being guided by or based on reasons. In this regard, a person acts rationally if they have a good reason for what they do or a belief is rational if it is based on strong evidence. This quality can apply to an abi ...
coefficients such that :p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0. A context-free grammar is said to be ''unambiguous'' if every string generated by the grammar admits a unique parse tree or, equivalently, only one
leftmost derivation In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
. Having established the necessary notions, the theorem is stated as follows. :Chomsky–Schützenberger theorem. If L is a
context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
admitting an unambiguous context-free grammar, and a_k := , L \ \cap \Sigma^k , is the number of words of length k in L, then G(x)=\sum_^\infty a_k x^k is a power series over \mathbb that is algebraic over \mathbb(x). Proofs of this theorem are given by , and by .


Usage


Asymptotic estimates

The theorem can be used in
analytic combinatorics In combinatorics, the symbolic method is a technique for counting combinatorial objects. It uses the internal structure of the objects to derive formulas for their generating functions. The method is mostly associated with Philippe Flajolet an ...
to estimate the number of words of length ''n'' generated by a given unambiguous context-free grammar, as ''n'' grows large. The following example is given by : the unambiguous context-free grammar ''G'' over the alphabet has start symbol ''S'' and the following rules :''S'' → ''M'' , ''U'' :''M'' → 0''M''1''M'' , ''ε'' :''U'' → 0''S'' , 0''M''1''U''. To obtain an algebraic representation of the power series associated with a given context-free grammar ''G'', one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by ''x'', each occurrence of ''ε'' by the integer '1', each occurrence of '→' by '=', and each occurrence of ', ' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations: :''S'' = ''M'' + ''U'' :''M'' = ''M''²''x''² + 1 :''U'' = ''Sx'' + ''MUx''² In this system of equations, ''S'', ''M'', and ''U'' are functions of ''x'', so one could also write , , and . The equation system can be resolved after ''S'', resulting in a single algebraic equation: :. This quadratic equation has two solutions for ''S'', one of which is the algebraic power series . By applying methods from complex analysis to this equation, the number a_n of words of length ''n'' generated by ''G'' can be estimated, as ''n'' grows large. In this case, one obtains a_n \in O(2+\epsilon)^n but a_n \notin O(2-\epsilon)^n for each \epsilon>0. The following example is from : \left\{\begin{array} { l } { S \rightarrow X Y } \\ { T \rightarrow a T , T b T , Y c Y } \\ { Y \rightarrow Y a Y , c Y , a b T a Y Y a , X } \\ { X \rightarrow a , b , c } \end{array} \Rightarrow \left\{\begin{array}{l} s(z)=x(z) y(z) \\ t(z)=z t(z)+z t(z)^2+z y(z)^2 \\ y(z)=z y(z)^2+z y(z)+z^4 t(z) y(z)^2+x(z) \\ x(z)=3 z \end{array}\right.\right. which simplifies to s(z)^8-27\left(z^3-z^2\right) s(z)^5+\ldots+59049 z^{10}=0


Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the ''Goldstine language'' L_G over the alphabet \{a,b\} consists of the words a^{n_1}ba^{n_2}b\cdots a^{n_p}b with p\ge 1, n_i>0 for i \in \{1,2,\ldots,p\}, and n_j \neq j for some j \in \{1,2,\ldots,p\}. It is comparably easy to show that the language L_G is context-free. The harder part is to show that there does not exist an unambiguous grammar that generates L_G. This can be proved as follows: If g_k denotes the number of words of length k in L_G, then for the associated power series holds G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} . Using methods from
complex analysis Complex analysis, traditionally known as the theory of functions of a complex variable, is the branch of mathematical analysis that investigates Function (mathematics), functions of complex numbers. It is helpful in many branches of mathemati ...
, one can prove that this function is not algebraic over \mathbb{Q}(x). By the Chomsky-Schützenberger theorem, one can conclude that L_G does not admit an unambiguous context-free grammar.See for detailed account.


Notes


References

* * * * * * * {{DEFAULTSORT:Chomsky-Schutzenberger enumeration theorem Noam Chomsky Formal languages Theorems in discrete mathematics