Definition
Syntax
Formally, a parsing expression grammar consists of: * A finite set ''N'' of ''Semantics
The fundamental difference betweenOperational interpretation of parsing expressions
Each nonterminal in a parsing expression grammar essentially represents a parsingExamples
This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers.'('
and ')'
. The range -9/code> is a shortcut for the ten characters from '0'
to '9'
. (This range syntax is the same as the syntax used by regular expression
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
s.) The nonterminal symbols are the ones that expand to other rules: ''Value'', ''Power'', ''Product'', ''Sum'', and ''Expr''. Note that rules ''Sum'' and ''Product'' don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the ''Power'' rule (by referring to itself on the right) results in desired right-associativity of exponent. Also note that a rule like (with intention to achieve left-associativity) would cause infinite recursion, so it cannot be used in practice even though it can be expressed in the grammar.
The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
, this construct yields the classic dangling else ambiguity.)
S ← 'if' C 'then' S 'else' S / 'if' C 'then' S
The following recursive rule matches Pascal-style nested comment syntax, . The comment symbols appear in single quotes to distinguish them from PEG operators.
Begin ← '(*'
End ← '*)'
C ← Begin N* End
N ← C / (!Begin !End Z)
Z ← any single character
The parsing expression matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression matches the text "foo" but only if it is ''not'' followed by the text "bar". The expression matches a single "a" but only if it is not part of an arbitrarily long sequence of a's followed by a b.
The parsing expression matches and consumes an arbitrary-length sequence of a's and b's. The production rule describes the simple context-free "matching language" .
The following parsing expression grammar describes the classic non-context-free language :
S ← &(A 'c') 'a'+ B !.
A ← 'a' A? 'b'
B ← 'b' B? 'c'
Implementing parsers from parsing expression grammars
Any parsing expression grammar can be converted directly into a recursive descent parser
In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus t ...
.
Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
performance in the worst case.
It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a ''packrat parser'', which always runs in linear time
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
, at the cost of substantially greater storage space requirements. A packrat parser
is a form of parser
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions, ensuring that each parsing function is only invoked at most once at a given input position. Because of this memoization, a packrat parser has the ability to parse many context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
s and ''any'' parsing expression grammar (including some that do not represent context-free languages) in linear time. Examples of memoized recursive descent parsers are known from at least as early as 1993.
This analysis of the performance of a packrat parser assumes that enough memory is available to hold all of the memoized results; in practice, if there is not enough memory, some parsing functions might have to be invoked more than once at the same input position, and consequently the parser could take more than linear time.
It is also possible to build LL parser
In computer science, an LL parser (Left-to-right, leftmost derivation) is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.
An LL parser is called a ...
s and LR parser
In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parse ...
s from parsing expression grammars, with better worst-case performance than a recursive descent parser, but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.
Advantages
Compared to pure regular expressions
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" o ...
(i.e. without back-references), PEGs are strictly more powerful, but require significantly more memory. For example, a regular expression inherently cannot find an arbitrary number of matched pairs of parentheses, because it is not recursive, but a PEG can. However, a PEG will require an amount of memory proportional to the length of the input, while a regular expression matcher will require only a constant amount of memory.
Any PEG can be parsed in linear time by using a packrat parser, as described above.
Many CFGs contain ambiguities, even when they're intended to describe unambiguous languages. The "dangling else
The dangling else is a problem in programming of parser generators in which an optional else clause in an if–then(–else) statement results in nested conditionals being ambiguous. Formally, the reference context-free grammar of the language i ...
" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization.
Disadvantages
Memory consumption
PEG parsing is typically carried out via ''packrat parsing'', which uses memoization
In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again. Memoization ...
to eliminate redundant parsing steps. Packrat parsing requires storage proportional to the total input size, rather than the depth of the parse tree as with LR parsers. This is a significant difference in many domains: for example, hand-written source code has an effectively constant expression nesting depth independent of the length of the program—expressions nested beyond a certain depth tend to get refactored.
For some grammars and some inputs, the depth of the parse tree can be proportional to the input size,
so both an LR parser and a packrat parser will appear to have the same worst-case asymptotic performance. A more accurate analysis would take the depth of the parse tree into account separately from the input size. This is similar to a situation which arises in graph algorithms
The following is a list of well-known algorithms along with one-line descriptions for each.
Automated planning
Combinatorial algorithms
General combinatorial algorithms
* Brent's algorithm: finds a cycle in function value iterations using on ...
: the Bellman–Ford algorithm
The Bellman–Ford algorithm is an algorithm that computes shortest paths from a single source vertex to all of the other vertices in a weighted digraph.
It is slower than Dijkstra's algorithm for the same problem, but more versatile, as it is ...
and Floyd–Warshall algorithm
In computer science, the Floyd–Warshall algorithm (also known as Floyd's algorithm, the Roy–Warshall algorithm, the Roy–Floyd algorithm, or the WFI algorithm) is an algorithm for finding shortest paths in a directed weighted graph with p ...
appear to have the same running time () if only the number of vertices is considered. However, a more precise analysis which accounts for the number of edges as a separate parameter assigns the Bellman–Ford algorithm
The Bellman–Ford algorithm is an algorithm that computes shortest paths from a single source vertex to all of the other vertices in a weighted digraph.
It is slower than Dijkstra's algorithm for the same problem, but more versatile, as it is ...
a time of , which is quadratic for sparse graphs with .
Indirect left recursion
A PEG is called ''well-formed'' if it contains no '' left-recursive'' rules, i.e., rules that allow a nonterminal to expand to an expression in which the same nonterminal occurs as the leftmost symbol. For a left-to-right top-down parser, such rules cause infinite regress: parsing will continually expand the same nonterminal without moving forward in the string.
Therefore, to allow packrat parsing, left recursion must be eliminated. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line:
Value ← -9. / '(' Expr ')'
Product ← Expr (('*' / '/') Expr)*
Sum ← Expr (('+' / '-') Expr)*
Expr ← Product / Sum / Value
In this new grammar, matching an Expr
requires testing if a Product
matches while matching a Product
requires testing if an Expr
matches. Because the term appears in the leftmost position, these rules make up a circular definition
A circular definition is a description that uses the term(s) being defined as part of the description or assumes that the term(s) being described are already known. There are several kinds of circular definition, and several ways of characteris ...
that cannot be resolved. (Circular definitions that can be resolved exist—such as in the original formulation from the first example—but such definitions are required not to exhibit pathological recursion.) However, left-recursive rules can always be rewritten to eliminate left-recursion. For example, the following left-recursive CFG rule:
string-of-a ← string-of-a 'a' , 'a'
can be rewritten in a PEG using the plus operator:
string-of-a ← 'a'+
The process of rewriting ''indirectly'' left-recursive rules is complex in some packrat parsers, especially when semantic actions are involved.
With some modification, traditional packrat parsing can support direct left recursion,
but doing so results in a loss of the linear-time parsing property which is generally the justification for using PEGs and packrat parsing in the first place. Only the OMeta OMeta is a specialized object-oriented programming language for pattern matching, developed by Alessandro Warth and Ian Piumarta in 2007 under the Viewpoints Research Institute. The language is based on Parsing Expression Grammars (PEGs) rather th ...
parsing algorithm supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parser
A GLR parser (GLR standing for "Generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundat ...
s support left recursion.
Expressive power
PEG packrat parsers cannot recognize some unambiguous nondeterministic CFG rules, such as the following:
S ← 'x' S 'x' , 'x'
Neither LL(k)
In computer science, an LL parser (Left-to-right, leftmost derivation) is a Top-down parsing, top-down parser for a restricted context-free language. It parses the input from Left to right, performing Context-free grammar#Derivations and syntax tre ...
nor LR(k) parsing algorithms are capable of recognizing this example. However, this grammar can be used by a general CFG parser like the CYK algorithm
In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke ...
. However, the ''language'' in question can be recognised by all these types of parser, since it is in fact a regular language (that of strings of an odd number of x's).
It is an open problem to give a concrete example of a context-free language which cannot be recognized by a parsing expression grammar. In particular, it is open whether a parsing expression grammar can recognize the language of palindromes.
Ambiguity detection and influence of rule order on language that is matched
LL(k) and LR(k) parser generators will fail to complete when the input grammar is ambiguous. This is a feature in the common case that the grammar is intended to be unambiguous but is defective. A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses.
The ordering of productions in a PEG grammar affects not only the resolution of ambiguity, but also the ''language matched''. For example, consider the first PEG example in Ford's paper
(example rewritten in pegjs.org/online notation, and labelled and ):
* :
* :
Ford notes that ''The second alternative in the latter PEG rule will never succeed because the first choice is always taken if the input string ... begins with 'a'.''.
Specifically, (i.e., the language matched by ) includes the input "ab", but does not.
Thus, adding a new option to a PEG grammar can ''remove'' strings from the language matched, e.g. is the addition of a rule to the single-production grammar , which contains a string not matched by .
Furthermore, constructing a grammar to match from PEG grammars and is not always a trivial task.
This is in stark contrast to CFG's, in which the addition of a new production cannot remove strings (though, it can introduce problems in the form of ambiguity),
and a (potentially ambiguous) grammar for can be constructed
S → start(G1) , start(G2)
Bottom-up PEG parsing
A pika parser uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also conveys optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.
See also
* Compiler Description Language
Compiler Description Language (CDL) is a programming language based on affix grammars. It is very similar to Backus–Naur form (BNF) notation. It was designed for the development of compilers. It is very limited in its capabilities and control f ...
(CDL)
* Formal grammar
In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...
* Regular expression
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
* Top-down parsing language Top-Down Parsing Language (TDPL) is a type of analytic grammar, analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsing, top-down parsers that s ...
* Comparison of parser generators
This is a list of notable lexer generators and parser generators for various language classes.
Regular languages
Regular languages are a category of languages (sometimes termed Chomsky Type 3) which can be matched by a state machine (more sp ...
* Parser combinator
In computer programming
Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming i ...
* Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
References
External links
Converting a string expression into a lambda expression using an expression parser
The Packrat Parsing and Parsing Expression Grammars Page
* The constructed language
A constructed language (sometimes called a conlang) is a language whose phonology, grammar, and vocabulary, instead of having developed naturally, are consciously devised for some purpose, which may include being devised for a work of fiction. ...
Lojban
Lojban (pronounced ) is a logical, constructed, human language created by the Logical Language Group which aims to be syntactically unambigious. It succeeds the Loglan project.
The Logical Language Group (LLG) began developing Lojban in 1987. ...
has
fairly large PEG grammar
allowing unambiguous parsing of Lojban text.
{{DEFAULTSORT:Parsing Expression Grammar
Formal languages