Parikh's theorem in

theoretical computer science computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumscribe the ...

says that if one looks only at the number of occurrences of each

terminal symbol In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. ''Terminal symbols'' are the elementary symbols of the language defined by a formal grammar. ...

in a

context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...

, without regard to their order, then the language is indistinguishable from a

regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...

. It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. It was first proved by Rohit Parikh in 1961 and republished in 1966.

Definitions and formal statement

Let

\Sigma=\

be an

alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...

. The ''Parikh vector'' of a word is defined as the function

p:\Sigma^*\to\mathbb^k

, given by

p(w)=(, w, _, , w, _, \ldots, , w, _)

where

, w, _

denotes the number of occurrences of the letter

a_i

in the word

w

. A subset of

\mathbb^k

is said to be ''linear'' if it is of the form

u_0 + \mathbbu_1 + \dots + \mathbbu_m = \

for some vectors

u_0,\ldots,u_m

. A subset of

\mathbb^k

is said to be ''semi-linear'' if it is a union of finitely many linear subsets. In short, the image under

p

of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets. Two languages are said to be ''commutatively equivalent'' if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof

The second part is easy to prove. The first part is less easy. The following proof is credited to. First we need a small strengthening of the

pumping lemma for context-free languages Pumping may refer to: * The operation of a pump, for moving a liquid from one location to another **The use of a breast pump for extraction of milk * Pumping (audio), a creative misuse of dynamic range compression * Pumping (computer systems), the ...

: The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find

k

copies of some nonterminal symbol

A

in the longest path in the shortest derivation tree.

Strengthening for bounded languages

A language

L

is ''bounded'' if

L\subset w_1^*\ldots w_k^*

for some fixed words

w_1,\ldots, w_k

. Ginsburg and Spanier gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages. Call a linear set ''stratified'', if in its definition for each

i\ge 1

the vector

u_i

has the property that it has at most two non-zero coordinates, and for each

i,j\ge 1

if each of the vectors

u_i,u_j

has two non-zero coordinates,

i_1 and j_1, respectively, then their order is ''not'' i_1 .
A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a

and that some context-free languages can only have

ambiguous grammar In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string ...

s{{explain, reason=why exactly?, date=April 2017. Such languages are called ''

inherently ambiguous language In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string ...

s''. From a

formal grammar In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...

perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.

References

Formal languages