Parikh's theorem in
theoretical computer science
Theoretical computer science is a subfield of computer science and mathematics that focuses on the Abstraction, abstract and mathematical foundations of computation.
It is difficult to circumscribe the theoretical areas precisely. The Associati ...
says that if one looks only at the number of occurrences of each
terminal symbol
In formal languages, terminal and nonterminal symbols are parts of the ''vocabulary'' under a formal grammar. ''Vocabulary'' is a finite, nonempty set of symbols. ''Terminal symbols'' are symbols that cannot be replaced by other symbols of the v ...
in a
context-free language
In formal language theory, a context-free language (CFL), also called a Chomsky type-2 language, is a language generated by a context-free grammar (CFG).
Context-free languages have many applications in programming languages, in particular, mos ...
, without regard to their order, then the language is indistinguishable from a
regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
.
It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. It was first proved by
Rohit Parikh in 1961 and republished in 1966.
Definitions and formal statement
Let
be an
alphabet
An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...
. The ''Parikh vector'' of a word is defined as the function
, given by
where
denotes the number of occurrences of the symbol
in the word
.
A subset of
is said to be ''linear'' if it is of the form
for some vectors
.
A subset of
is said to be ''semi-linear'' if it is a union of finitely many linear subsets.
In short, the image under
of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.
Two languages are said to be ''commutatively equivalent'' if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.
Proof
The second part is easy to prove.
The first part is less easy. The following proof is credited to Goldstine.
First we need a small strengthening of the
pumping lemma for context-free languages
In computer science, in particular in formal language theory, the pumping lemma for context-free languages, also known as the Bar-Hillel lemma, is a lemma that gives a property shared by all context-free languages and generalizes the pumping le ...
:
The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find
copies of some nonterminal symbol
in the longest path in the shortest derivation tree.
Now we prove the first part of Parikh's theorem, making use of the above lemma.
Strengthening for bounded languages
A language
is ''bounded'' if
for some fixed words
.
Ginsburg and Spanier
gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.
Call a linear set ''stratified'', if in its definition for each
the vector
has the property that it has at most two non-zero coordinates, and for each
if each of the vectors
has two non-zero coordinates,