Indexed grammars are a generalization of
context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
s in that
nonterminal
In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. ''Terminal symbols'' are the elementary symbols of the language defined by a formal grammar. ...
s are equipped with lists of ''flags'', or ''index symbols''.
The language produced by an indexed grammar is called an
indexed language Indexed languages are a class of formal languages discovered by Alfred Aho; they are described by indexed grammars and can be recognized by nested stack automata.
Indexed languages are a proper subset of context-sensitive languages. They qualify as ...
.
Definition
Modern definition by Hopcroft and Ullman
In contemporary publications following Hopcroft and Ullman (1979),
an indexed grammar is formally defined a 5-tuple ''G'' = ⟨''N'',''T'',''F'',''P'',''S''⟩ where
* ''N'' is a set of variables or
nonterminal symbols,
* ''T'' is a set ("
alphabet
An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
") of terminal symbols,
* ''F'' is a set of so-called ''index symbols'', or ''indices'',
* ''S'' ∈ ''N'' is the ''
start symbol'', and
* ''P'' is a finite set of ''
productions''.
In productions as well as in derivations of indexed grammars, a string ("stack") ''σ'' ∈ ''F''
* of index symbols is attached to every nonterminal symbol ''A'' ∈ ''N'', denoted by ''A''
'σ''[" and " are meta symbols to indicate the stack.]
Terminal symbols may not be followed by index stacks.
For an index stack ''σ'' ∈ ''F''
* and a string ''α'' ∈ (''N'' ∪ ''T'')
* of nonterminal and terminal symbols, ''α''
'σ''denotes the result of attaching
'σ''to every nonterminal in ''α''; for example if ''α'' equals with ''a'',''d'' ∈ ''T'' terminal, and nonterminal symbols, then ''α''
'σ''denotes
Using this notation, each production in ''P'' has to be of the form
# ''A''
ƒâ†’ α
ƒ
# ''A''
ƒâ†’ ''B''
'f''σ or
# ''A''
'f''σ→ α
ƒ
where ''A'', ''B'' ∈ ''N'' are nonterminal symbols, ''f'' ∈ ''F'' is an index, ''σ'' ∈ ''F''
* is a string of index symbols, and ''α'' ∈ (''N'' ∪ ''T'')
* is a string of nonterminal and terminal symbols. Some authors write ".." instead of "''σ''" for the index stack in production rules; the rule of type 1, 2, and 3 then reads , and , respectively.
Derivations are similar to those in a
context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
except for the index stack attached to each nonterminal symbol.
When a production like e.g. ''A''
'σ''→ ''B''
'σ'''C''
'σ''is applied, the index stack of ''A'' is copied to both ''B'' and ''C''.
Moreover, a rule can push an index symbol onto the stack, or pop its "topmost" (i.e., leftmost) index symbol.
Formally, the relation ⇒ ("direct derivation") is defined on the set (''N''
*">'F''*ˆª''T'')
* of "sentential forms" as follows:
#If ''A''
'σ''→ ''α''
'σ''is a production of type 1, then β ''A''
'φ''''γ'' ⇒ ''β'' ''α''
'φ''''γ'', using the above definition. That is, the rule's left hand side's index stack ''φ'' is copied to each nonterminal of the right hand side.
#If ''A''
'σ''→ ''B''
'fσ''is a production of type 2, then ''β'' ''A''
'φ''''γ'' ⇒ ''β'' ''B''
'fφ''''γ''. That is, the right hand side's index stack is obtained from the left hand side's stack ''φ'' by pushing ''f'' onto it.
#If ''A''
'fσ''→ ''α''
'σ''is a production of type 3, then ''β'' ''A''
'fφ''''γ'' ⇒ ''β'' ''α''
'φ''''γ'', using again the definition of ''α''
'σ'' That is, the first index ''f'' is popped from the left hand side's stack, which is then distributed to each nonterminal of the right hand side.
As usual, the derivation relation is defined as the
reflexive transitive closure
In mathematics, a subset of a given set is closed under an operation of the larger set if performing that operation on members of the subset always produces a member of that subset. For example, the natural numbers are closed under addition, but ...
of direct derivation ⇒.
The language ''L''(''G'') = is the set of all strings of terminal symbols derivable from the start symbol.
Original definition by Aho
Historically, the concept of indexed grammars was first introduced by
Alfred Aho
Alfred Vaino Aho (born August 9, 1941) is a Canadian computer scientist best known for his work on programming languages, compilers, and related algorithms, and his textbooks on the art and science of computer programming.
Aho was elected into ...
(1968)
using a different formalism.
Aho defined an indexed grammar to be a 5-tuple (''N'',''T'',''F'',''P'',''S'') where
# ''N'' is a finite
alphabet
An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
of variables or
nonterminal symbols
# ''T'' is a finite alphabet of terminal symbols
# ''F'' ⊆
2''N'' × (''N'' ∪ ''T'') * is the finite set of so-called ''flags'' (each flag is itself a set of so-called ''index productions'')
# ''P'' ⊆ ''N'' × (''NF''
* ∪ ''T'')
* is the finite set of ''
productions''
# ''S'' ∈ ''N'' is the ''start symbol''
Direct derivations were as follows:
* A production ''p'' = (''A'' → ''X''
1''η''
1…''X''
''k''''η''
''k'') from ''P'' matches a nonterminal ''A'' ∈ ''N'' followed by its (possibly empty) string of flags ''ζ'' ∈ ''F''
*. In context, ''γ'' ''Aζ'' ''δ'', via ''p'', derives to ''γ'' ''X''
1''θ''
1…''X''
''k''''θ''
''k'' ''δ'', where ''θ''
''i'' = ''η''
''i''''ζ'' if ''X''
''i'' was a nonterminal and the empty word otherwise. The old flags of ''A'' are therefore ''copied'' to each new nonterminal produced by ''p''. Each such production can be simulated by appropriate productions of type 1 and 2 in the Hopcroft/Ullman formalism.
* An index production ''p'' = (''A'' → ''X''
1…''X''
''k'') ∈ ''f'' matches ''Afζ'' (the flag ''f'' it comes from must match the first symbol following the nonterminal ''A'') and copies the remaining index string ''ζ'' to each new nonterminal: ''γ'' ''Afζ'' ''δ'' derives to ''γ'' ''X''
1''θ''
1…''X''
''k''''θ''
''k'' ''δ'', where ''θ''
''i'' is the empty word when ''X''
''i'' is a terminal and ''ζ'' when it is a nonterminal. Each such production corresponds to a production of type 3 in the Hopcroft/Ullman formalism.
This formalism is e.g. used by Hayashi (1973, p. 65-66).
Examples
In practice, stacks of indices can count and remember what rules were applied and in which order. For example, indexed grammars can describe the context-sensitive language of word triples :
:
A derivation of ' is then
: ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ .
As another example, the grammar ''G'' = ⟨ , , , ''P'', ''S'' ⟩ produces the language , where the production set ''P'' consists of
:
An example derivation is
: ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ .
Both example languages are not context-free by the
pumping lemma In the theory of formal languages, the pumping lemma may refer to:
*Pumping lemma for regular languages, the fact that all sufficiently long strings in such a language have a substring that can be repeated arbitrarily many times, usually used to pro ...
.
Properties
Hopcroft and
Ullman tend to consider indexed languages as a "natural" class, since they are generated by several formalisms other than indexed grammars, viz.
*
Aho's one-way
nested stack automata
In automata theory, a nested stack automaton is a finite automaton that can make use of a stack containing data which can be additional stacks.
Like a stack automaton, a nested stack automaton may step up or down in the stack, and read the cur ...
*
Fischer
Fischer is a German occupational surname, meaning fisherman. The name Fischer is the fourth most common German surname. The English version is Fisher.
People with the surname A
* Abraham Fischer (1850–1913) South African public official
* Ad ...
's macro grammars
*
Greibach's automata with stacks of stacks
*
Maibaum's algebraic characterization
Hayashi
generalized the
pumping lemma In the theory of formal languages, the pumping lemma may refer to:
*Pumping lemma for regular languages, the fact that all sufficiently long strings in such a language have a substring that can be repeated arbitrarily many times, usually used to pro ...
to indexed grammars.
Conversely, Gilman gives a "shrinking lemma" for indexed languages.
Linear indexed grammars
Gerald Gazdar
Gerald James Michael Gazdar, FBA (born 24 February 1950) is a British linguist and computer scientist.
Education
He was educated at Heath Mount School, Bradfield College, the University of East Anglia (BA, 1970) and the University of Reading ...
has defined a second class, the linear indexed grammars (LIG), by requiring that at most one nonterminal in each production be specified as receiving the stack,
[all other nonterminals receive an empty stack]
whereas in an ordinary indexed grammar, all nonterminals receive copies of the stack.
Formally, a linear indexed grammar is defined similar to an ordinary indexed grammar, but the production's form requirements are modified to:
# ''A''
'σ''→ ''α''[] ''B''
'σ''''β''[],
# ''A''
'σ''→ ''α''[] ''B''
'fσ''''β''[],
# ''A''
'fσ''→ ''α''[] ''B''
'σ''''β''[],
where ''A'', ''B'', ''f'', ''σ'', ''α'' are used as
above, and ''β'' ∈ (''N'' ∪ ''T'')
* is a string of nonterminal and terminal symbols like ''α''.
[In order to generate any string at all, some productions must be admitted having no nonterminal symbol on their right hand side. However, Gazdar didn't discuss this issue.] Also, the direct derivation relation ⇒ is defined similar to above. This new class of grammars defines a strictly smaller class of languages,
which belongs to the
mildly context-sensitive classes.
The language is generable by an indexed grammar, but not by a linear indexed grammar, while both and are generable by a linear indexed grammar.
If both the original and the modified production rules are admitted, the language class remains the indexed languages.
Example
Letting σ denote an arbitrary sequence of stack symbols, we can define a grammar for the language ''L'' = as
:
To derive the string ''abc'' we have the steps:
:''S''[] ⇒ ''aS''[''f'']''c'' ⇒ ''aT''[''f'']''c'' ⇒ ''aT''[]''bc'' ⇒ ''abc''
Similarly:
:''S''[] ⇒ ''aS''[''f'']''c'' ⇒ ''aaS''[''ff'']''cc'' ⇒ ''aaT''[''ff'']''cc'' ⇒ ''aaT''[''f'']''bcc'' ⇒ ''aaT''[]''bbcc'' ⇒ ''aabbcc''
Computational power
The linearly indexed languages are a subset of the indexed languages, and thus all LIGs can be recoded as IGs, making the LIGs strictly less powerful than the IGs. A conversion from a LIG to an IG is relatively simple. LIG rules in general look approximately like
, modulo the push/pop part of a rewrite rule. The symbols
and
represent strings of terminal and/or non-terminal symbols, and any non-terminal symbol in either must have an empty stack, by the definition of a LIG. This is, of course, counter to how IGs are defined: in an IG, the non-terminals whose stacks are not being pushed to or popped from must have exactly the same stack as the rewritten non-terminal. Thus, somehow, we need to have non-terminals in
and
which, despite having non-empty stacks, behave as if they had empty stacks.
Consider the rule
as an example case. In converting this to an IG, the replacement for
must be some