Parikh's theorem in

theoretical computer science Theoretical computer science is a subfield of computer science and mathematics that focuses on the Abstraction, abstract and mathematical foundations of computation. It is difficult to circumscribe the theoretical areas precisely. The Associati ...

says that if one looks only at the number of occurrences of each

terminal symbol In formal languages, terminal and nonterminal symbols are parts of the ''vocabulary'' under a formal grammar. ''Vocabulary'' is a finite, nonempty set of symbols. ''Terminal symbols'' are symbols that cannot be replaced by other symbols of the v ...

in a

context-free language In formal language theory, a context-free language (CFL), also called a Chomsky type-2 language, is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, mos ...

, without regard to their order, then the language is indistinguishable from a

regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...

. It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. It was first proved by Rohit Parikh in 1961 and republished in 1966.

Definitions and formal statement

Let

\Sigma=\

be an

alphabet An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...

. The ''Parikh vector'' of a word is defined as the function

p:\Sigma^*\to\mathbb^k

, given by

p(w)=(, w, _, , w, _, \ldots, , w, _)

where

, w, _

denotes the number of occurrences of the symbol

a_i

in the word

w

. A subset of

\mathbb^k

is said to be ''linear'' if it is of the form

u_0 + \mathbbu_1 + \dots + \mathbbu_m = \

for some vectors

u_0,\ldots,u_m

. A subset of

\mathbb^k

is said to be ''semi-linear'' if it is a union of finitely many linear subsets. In short, the image under

p

of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets. Two languages are said to be ''commutatively equivalent'' if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof

The second part is easy to prove. The first part is less easy. The following proof is credited to Goldstine. First we need a small strengthening of the

pumping lemma for context-free languages In computer science, in particular in formal language theory, the pumping lemma for context-free languages, also known as the Bar-Hillel lemma, is a lemma that gives a property shared by all context-free languages and generalizes the pumping le ...

: The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find

k

copies of some nonterminal symbol

A

in the longest path in the shortest derivation tree. Now we prove the first part of Parikh's theorem, making use of the above lemma.

Strengthening for bounded languages

A language

L

is ''bounded'' if

L\subset w_1^*\ldots w_k^*

for some fixed words

w_1,\ldots, w_k

. Ginsburg and Spanier gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages. Call a linear set ''stratified'', if in its definition for each

i\ge 1

the vector

u_i

has the property that it has at most two non-zero coordinates, and for each

i,j\ge 1

if each of the vectors

u_i,u_j

has two non-zero coordinates,

i_1 and j_1, respectively, then their order is ''not'' i_1 .
A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a

and that some context-free languages can only have

ambiguous grammar In computer science, an ambiguous grammar is a context-free grammar for which there exists a string (computer science), string that can have more than one leftmost derivation or parse tree. Every non-empty context-free language admits an ambiguous ...

s{{explain, reason=why exactly?, date=April 2017. Such languages are called '' inherently ambiguous languages''. From a

formal grammar A formal grammar is a set of Terminal and nonterminal symbols, symbols and the Production (computer science), production rules for rewriting some of them into every possible string of a formal language over an Alphabet (formal languages), alphabe ...

perspective, this means that some ambiguous

context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the fo ...

s cannot be converted to equivalent unambiguous context-free grammars.

References

Formal languages