Parsing expression grammar
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a parsing expression grammar (PEG) is a type of
analytic Generally speaking, analytic (from el, ἀναλυτικός, ''analytikos'') refers to the "having the ability to analyze" or "division into elements or principles". Analytic or analytical can also have the following meanings: Chemistry * ...
formal grammar In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...
, i.e. it describes a
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sym ...
in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be em ...
s (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser. Unlike CFGs, PEGs cannot be
ambiguous Ambiguity is the type of meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations plausible. A common aspect of ambiguity is uncertainty. It is thus an attribute of any idea or statement ...
; a string has exactly one valid
parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...
or none. It is conjectured that there exist context-free languages that cannot be recognized by a PEG, but this is not yet proven. PEGs are well-suited to parsing computer languages (and artificial human languages such as
Lojban Lojban (pronounced ) is a logical, constructed, human language created by the Logical Language Group which aims to be syntactically unambigious. It succeeds the Loglan project. The Logical Language Group (LLG) began developing Lojban in 198 ...
), but not
natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
s where the performance of PEG algorithms is comparable to general CFG algorithms such as the
Earley algorithm In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may suffer problems with certain nullable grammars. The algorithm, named after its inv ...
.


Definition


Syntax

Formally, a parsing expression grammar consists of: * A finite set ''N'' of '' nonterminal symbols''. * A finite set Σ of ''
terminal symbol In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. ''Terminal symbols'' are the elementary symbols of the language defined by a formal grammar. ...
s'' that is disjoint from ''N''. * A finite set ''P'' of ''parsing rules''. * An expression ''eS'' termed the ''starting expression''. Each parsing rule in ''P'' has the form ''A'' ← ''e'', where ''A'' is a nonterminal symbol and ''e'' is a ''parsing expression''. A parsing expression is a hierarchical expression similar to a
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
, which is constructed in the following fashion: # An ''atomic parsing expression'' consists of: #* any terminal symbol, #* any nonterminal symbol, or #* the empty string ε. # Given any existing parsing expressions ''e'', ''e''1, and ''e''2, a new parsing expression can be constructed using the following operators: #* ''Sequence'': ''e''1 ''e''2 #* ''Ordered choice'': ''e''1 / ''e''2 #* ''Zero-or-more'': ''e''* #* ''One-or-more'': ''e''+ #* ''Optional'': ''e''? #* ''And-predicate'': &''e'' #* ''Not-predicate'': !''e''


Semantics

The fundamental difference between context-free grammars and parsing expression grammars is that the PEG's choice operator is ''ordered''. If the first alternative succeeds, the second alternative is ignored. Thus ordered choice is not
commutative In mathematics, a binary operation is commutative if changing the order of the operands does not change the result. It is a fundamental property of many binary operations, and many mathematical proofs depend on it. Most familiar as the name of ...
, unlike unordered choice as in context-free grammars. Ordered choice is analogous to soft cut operators available in some
logic programming Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic pro ...
languages. The consequence is that if a CFG is transliterated directly to a PEG, any ambiguity in the former is resolved by deterministically picking one parse tree from the possible parses. By carefully choosing the order in which the grammar alternatives are specified, a programmer has a great deal of control over which parse tree is selected. Like boolean context-free grammars, parsing expression grammars also add the and- and not- syntactic predicates. Because they can use an arbitrarily complex sub-expression to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility, in particular when reordering the alternatives cannot specify the exact parse tree desired.


Operational interpretation of parsing expressions

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results: * ''success'', in which the function may optionally move forward or ''consume'' one or more characters of the input string supplied to it, or * ''failure'', in which case no input is consumed. An atomic parsing expression consisting of a single terminal (i.e. literal) succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal ''A'' represents a recursive call to the nonterminal-function ''A''. A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure. The sequence operator ''e''1 ''e''2 first invokes ''e''1, and if ''e''1 succeeds, subsequently invokes ''e''2 on the remainder of the input string left unconsumed by ''e''1, and returns the result. If either ''e''1 or ''e''2 fails, then the sequence expression ''e''1 ''e''2 fails (consuming no input). The choice operator ''e''1 / ''e''2 first invokes ''e''1, and if ''e''1 succeeds, returns its result immediately. Otherwise, if ''e''1 fails, then the choice operator backtracks to the original input position at which it invoked ''e''1, but then calls ''e''2 instead, returning ''e''2's result. The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression ''e'', respectively. Unlike in
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be em ...
s and
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s, however, these operators ''always'' behave greedily, consuming as much input as possible and never backtracking. (Regular expression matchers may start by matching greedily, but will then backtrack and try shorter matches if they fail to match.) For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match. The and-predicate expression &''e'' invokes the sub-expression ''e'', and then succeeds if ''e'' succeeds and fails if ''e'' fails, but in either case ''never consumes any input''. The not-predicate expression !''e'' succeeds if ''e'' fails and fails if ''e'' succeeds, again consuming no input in either case.


Examples

This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers. Expr ← Sum Sum ← Product (('+' / '-') Product)* Product ← Power (('*' / '/') Power)* Power ← Value ('^' Power)? Value ← -9 / '(' Expr ')' In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as '(' and ')'. The range -9/code> is a shortcut for the ten characters from '0' to '9'. (This range syntax is the same as the syntax used by
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s.) The nonterminal symbols are the ones that expand to other rules: ''Value'', ''Power'', ''Product'', ''Sum'', and ''Expr''. Note that rules ''Sum'' and ''Product'' don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the ''Power'' rule (by referring to itself on the right) results in desired right-associativity of exponent. Also note that a rule like (with intention to achieve left-associativity) would cause infinite recursion, so it cannot be used in practice even though it can be expressed in the grammar. The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be em ...
, this construct yields the classic dangling else ambiguity.) S ← 'if' C 'then' S 'else' S / 'if' C 'then' S The following recursive rule matches Pascal-style nested comment syntax, . The comment symbols appear in single quotes to distinguish them from PEG operators. Begin ← '(*' End ← '*)' C ← Begin N* End N ← C / (!Begin !End Z) Z ← any single character The parsing expression matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression matches the text "foo" but only if it is ''not'' followed by the text "bar". The expression matches a single "a" but only if it is not part of an arbitrarily long sequence of a's followed by a b. The parsing expression matches and consumes an arbitrary-length sequence of a's and b's. The production rule describes the simple context-free "matching language" \ . The following parsing expression grammar describes the classic non-context-free language \ : S ← &(A 'c') 'a'+ B !. A ← 'a' A? 'b' B ← 'b' B? 'c'


Implementing parsers from parsing expression grammars

Any parsing expression grammar can be converted directly into a recursive descent parser. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case. It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a ''packrat parser'', which always runs in
linear time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
, at the cost of substantially greater storage space requirements. A packrat parser is a form of
parser Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lat ...
similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions, ensuring that each parsing function is only invoked at most once at a given input position. Because of this memoization, a packrat parser has the ability to parse many
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be em ...
s and ''any'' parsing expression grammar (including some that do not represent context-free languages) in linear time. Examples of memoized recursive descent parsers are known from at least as early as 1993. This analysis of the performance of a packrat parser assumes that enough memory is available to hold all of the memoized results; in practice, if there is not enough memory, some parsing functions might have to be invoked more than once at the same input position, and consequently the parser could take more than linear time. It is also possible to build LL parsers and LR parsers from parsing expression grammars, with better worst-case performance than a recursive descent parser, but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.


Advantages

Compared to pure regular expressions (i.e. without back-references), PEGs are strictly more powerful, but require significantly more memory. For example, a regular expression inherently cannot find an arbitrary number of matched pairs of parentheses, because it is not recursive, but a PEG can. However, a PEG will require an amount of memory proportional to the length of the input, while a regular expression matcher will require only a constant amount of memory. Any PEG can be parsed in linear time by using a packrat parser, as described above. Many CFGs contain ambiguities, even when they're intended to describe unambiguous languages. The " dangling else" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization.


Disadvantages


Memory consumption

PEG parsing is typically carried out via ''packrat parsing'', which uses
memoization In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again. Memoization ...
to eliminate redundant parsing steps. Packrat parsing requires storage proportional to the total input size, rather than the depth of the parse tree as with LR parsers. This is a significant difference in many domains: for example, hand-written source code has an effectively constant expression nesting depth independent of the length of the program—expressions nested beyond a certain depth tend to get refactored. For some grammars and some inputs, the depth of the parse tree can be proportional to the input size, so both an LR parser and a packrat parser will appear to have the same worst-case asymptotic performance. A more accurate analysis would take the depth of the parse tree into account separately from the input size. This is similar to a situation which arises in graph algorithms: the
Bellman–Ford algorithm The Bellman–Ford algorithm is an algorithm that computes shortest paths from a single source vertex to all of the other vertices in a weighted digraph. It is slower than Dijkstra's algorithm for the same problem, but more versatile, as it i ...
and
Floyd–Warshall algorithm In computer science, the Floyd–Warshall algorithm (also known as Floyd's algorithm, the Roy–Warshall algorithm, the Roy–Floyd algorithm, or the WFI algorithm) is an algorithm for finding shortest paths in a directed weighted graph with ...
appear to have the same running time (O(, V, ^3)) if only the number of vertices is considered. However, a more precise analysis which accounts for the number of edges as a separate parameter assigns the
Bellman–Ford algorithm The Bellman–Ford algorithm is an algorithm that computes shortest paths from a single source vertex to all of the other vertices in a weighted digraph. It is slower than Dijkstra's algorithm for the same problem, but more versatile, as it i ...
a time of O(, V, *, E, ), which is quadratic for sparse graphs with , E, \in O(, V, ).


Indirect left recursion

A PEG is called ''well-formed'' if it contains no '' left-recursive'' rules, i.e., rules that allow a nonterminal to expand to an expression in which the same nonterminal occurs as the leftmost symbol. For a left-to-right top-down parser, such rules cause infinite regress: parsing will continually expand the same nonterminal without moving forward in the string. Therefore, to allow packrat parsing, left recursion must be eliminated. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line: Value ← -9. / '(' Expr ')' Product ← Expr (('*' / '/') Expr)* Sum ← Expr (('+' / '-') Expr)* Expr ← Product / Sum / Value In this new grammar, matching an Expr requires testing if a Product matches while matching a Product requires testing if an Expr matches. Because the term appears in the leftmost position, these rules make up a
circular definition A circular definition is a description that uses the term(s) being defined as part of the description or assumes that the term(s) being described are already known. There are several kinds of circular definition, and several ways of character ...
that cannot be resolved. (Circular definitions that can be resolved exist—such as in the original formulation from the first example—but such definitions are required not to exhibit pathological recursion.) However, left-recursive rules can always be rewritten to eliminate left-recursion. For example, the following left-recursive CFG rule: string-of-a ← string-of-a 'a' , 'a' can be rewritten in a PEG using the plus operator: string-of-a ← 'a'+ The process of rewriting ''indirectly'' left-recursive rules is complex in some packrat parsers, especially when semantic actions are involved. With some modification, traditional packrat parsing can support direct left recursion, but doing so results in a loss of the linear-time parsing property which is generally the justification for using PEGs and packrat parsing in the first place. Only the
OMeta OMeta is a specialized object-oriented programming language for pattern matching, developed by Alessandro Warth and Ian Piumarta in 2007 under the Viewpoints Research Institute. The language is based on Parsing Expression Grammars (PEGs) rather th ...
parsing algorithm supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parsers support left recursion.


Expressive power

PEG packrat parsers cannot recognize some unambiguous nondeterministic CFG rules, such as the following: S ← 'x' S 'x' , 'x' Neither LL(k) nor LR(k) parsing algorithms are capable of recognizing this example. However, this grammar can be used by a general CFG parser like the CYK algorithm. However, the ''language'' in question can be recognised by all these types of parser, since it is in fact a regular language (that of strings of an odd number of x's). It is an open problem to give a concrete example of a context-free language which cannot be recognized by a parsing expression grammar. In particular, it is open whether a parsing expression grammar can recognize the language of palindromes.


Ambiguity detection and influence of rule order on language that is matched

LL(k) and LR(k) parser generators will fail to complete when the input grammar is ambiguous. This is a feature in the common case that the grammar is intended to be unambiguous but is defective. A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses. The ordering of productions in a PEG grammar affects not only the resolution of ambiguity, but also the ''language matched''. For example, consider the first PEG example in Ford's paper (example rewritten in pegjs.org/online notation, and labelled and ): * :   * :   Ford notes that ''The second alternative in the latter PEG rule will never succeed because the first choice is always taken if the input string ... begins with 'a'.''. Specifically, (i.e., the language matched by ) includes the input "ab", but does not. Thus, adding a new option to a PEG grammar can ''remove'' strings from the language matched, e.g. is the addition of a rule to the single-production grammar , which contains a string not matched by . Furthermore, constructing a grammar to match from PEG grammars and is not always a trivial task. This is in stark contrast to CFG's, in which the addition of a new production cannot remove strings (though, it can introduce problems in the form of ambiguity), and a (potentially ambiguous) grammar for can be constructed S → start(G1) , start(G2)


Bottom-up PEG parsing

A pika parser uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also conveys optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.


See also

* Compiler Description Language (CDL) *
Formal grammar In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...
*
Regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
* Top-down parsing language * Comparison of parser generators * Parser combinator * Python


References


External links


Converting a string expression into a lambda expression using an expression parser

The Packrat Parsing and Parsing Expression Grammars Page
* The
constructed language A constructed language (sometimes called a conlang) is a language whose phonology, grammar, and vocabulary, instead of having developed naturally, are consciously devised for some purpose, which may include being devised for a work of fiction. ...
Lojban Lojban (pronounced ) is a logical, constructed, human language created by the Logical Language Group which aims to be syntactically unambigious. It succeeds the Loglan project. The Logical Language Group (LLG) began developing Lojban in 198 ...
has
fairly large PEG grammar
allowing unambiguous parsing of Lojban text.

{{DEFAULTSORT:Parsing Expression Grammar Formal languages