Parikh's theorem in
theoretical computer science
computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory.
It is difficult to circumscribe the ...
says that if one looks only at the number of occurrences of each
terminal symbol
In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. ''Terminal symbols'' are the elementary symbols of the language defined by a formal grammar. ...
in a
context-free language
In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).
Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
, without regard to their order, then the language is indistinguishable from a
regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
.
It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. It was first proved by
Rohit Parikh in 1961 and republished in 1966.
Definitions and formal statement
Let
be an
alphabet
An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...
. The ''Parikh vector'' of a word is defined as the function
, given by
where
denotes the number of occurrences of the letter
in the word
.
A subset of
is said to be ''linear'' if it is of the form
for some vectors
.
A subset of
is said to be ''semi-linear'' if it is a union of finitely many linear subsets.
In short, the image under
of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.
Two languages are said to be ''commutatively equivalent'' if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.
Proof
The second part is easy to prove.
The first part is less easy. The following proof is credited to.
First we need a small strengthening of the
pumping lemma for context-free languages
Pumping may refer to:
* The operation of a pump, for moving a liquid from one location to another
**The use of a breast pump for extraction of milk
* Pumping (audio), a creative misuse of dynamic range compression
* Pumping (computer systems), the ...
:
The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find
copies of some nonterminal symbol
in the longest path in the shortest derivation tree.
Strengthening for bounded languages
A language
is ''bounded'' if
for some fixed words
.
Ginsburg and Spanier
gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.
Call a linear set ''stratified'', if in its definition for each
the vector
has the property that it has at most two non-zero coordinates, and for each
if each of the vectors
has two non-zero coordinates,