computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...

, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a

parser Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...

is a function accepting strings as input and returning some structure as output, typically a

parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...

or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing. Parsers using combinators have been used extensively in the prototyping of compilers and processors for

domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging f ...

s such as natural-language interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing. In 1989, Richard Frost and John Launchbury demonstrated use of parser combinators to construct natural-language interpreters. Graham Hutton also used higher-order functions for basic parsing in 1992 and monadic parsing in 1996. S. D. Swierstra also exhibited the practical aspects of parser combinators in 2001. In 2008, Frost, Hafiz and Callaghan described a set of parser combinators in

Haskell Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...

that solve the long-standing problem of accommodating

left recursion In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language (on the left) and a suffix (on ...

, and work as a complete

top-down parsing Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-d ...

tool in polynomial time and space.

Basic idea

In any programming language that has

first-class function In computer science, a programming language is said to have first-class functions if it treats functions as first-class citizens. This means the language supports passing functions as arguments to other functions, returning them as the values from ...

s, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a production rule of a

context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...

(CFG) may have one or more alternatives and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s), or the alternative may consist of a single non-terminal or terminal or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can be used to combine each of these parsers, returning a new parser which can recognise any or all of the alternatives. In languages that support

operator overloading In computer programming, operator overloading, sometimes termed ''operator ad hoc polymorphism'', is a specific case of polymorphism, where different operators have different implementations depending on their arguments. Operator overloading is ...

, a parser combinator can take the form of an

infix An infix is an affix inserted inside a word stem (an existing word or the core of a family of words). It contrasts with ''adfix,'' a rare term for an affix attached to the outside of a stem, such as a prefix or suffix. When marking text for int ...

operator, used to glue different parsers to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code which is similar in structure to the rules of the formal grammar. As such, implementations can be thought of as executable specifications with all the associated advantages such as readability.

The combinators

To keep the discussion relatively straightforward, we discuss parser combinators in terms of ''recognizers'' only. If the input string is of length #input and its members are accessed through an index j, a recognizer is a parser which returns, as output, a set of indices representing positions at which the parser successfully finished recognizing a sequence of tokens that began at position j. An empty result set indicates that the recognizer failed to recognize any sequence beginning at index j. A non-empty result set indicates the recognizer ends at different positions successfully. * The empty recognizer recognizes the empty string. This parser always succeeds, returning a singleton set containing the current position: :

empty(j) = \

* A recognizer term ''x'' recognizes the terminal x. If the token at position j in the input string is x, this parser returns a singleton set containing j + 1; otherwise, it returns the empty set. :

term(x,j) = 
\begin 
  \left \,      & j \geq \#input\\
  \left \,  & j^ \mbox input=x\\
  \left \,      & \mbox
\end

Note that there may be multiple distinct ways to parse a string while finishing at the same index: this indicates an

ambiguous grammar In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string ...

. Simple recognizers do not acknowledge these ambiguities; each possible finishing index is listed only once in the result set. For a more complete set of results, a more complicated object such as a

must be returned. Following the definitions of two basic recognizers p and q, we can define two major parser combinators for alternative and sequencing: * The ‘alternative’ parser combinator, ⊕, applies both of the recognizers on the same input position j and sums up the results returned by both of the recognizers, which is eventually returned as the final result. It is used as an infix operator between p and q as follows: :

(p \oplus q)(j) = p(j) \cup q(j)

* The sequencing of recognizers is done with the ⊛ parser combinator. Like ⊕, it is used as an infix operator between p and q. But it applies the first recognizer p to the input position j, and if there is any successful result of this application, then the second recognizer q is applied to every element of the result set returned by the first recognizer. ⊛ ultimately returns the union of these applications of q. :

(p \circledast q)(j) =  \bigcup \

Examples

Consider a highly ambiguous

, s ::= ‘x’ s s , ε. Using the combinators defined earlier, we can modularly define executable notations of this grammar in a modern functional language (e.g.

) as s = term ‘x’ <*> s <*> s <+> empty. When the recognizer s is applied on an input sequence xxxxx at position 1, according to the above definitions it would return a result set .

Shortcomings and solutions

Parser combinators, like all

recursive descent parser In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus t ...

s, are not limited to the

s and thus do no global search for ambiguities in the LL(''k'') parsing First_k and Follow_k sets. Thus, ambiguities are not known until run-time if and until the input triggers them. In such cases, the recursive descent parser may default (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, resulting in semantic confusion (aliasing) in the use of the language. This leads to bugs by users of ambiguous programming languages, which are not reported at compile-time, and which are introduced not by human error, but by ambiguous grammar. The only solution that eliminates these bugs is to remove the ambiguities and use a context-free grammar. The simple implementations of parser combinators have some shortcomings, which are common in top-down parsing. Naïve combinatory parsing requires

exponential Exponential may refer to any of several mathematical topics related to exponentiation, including: *Exponential function, also: **Matrix exponential, the matrix analogue to the above * Exponential decay, decrease at a rate proportional to value *Exp ...

time and space when parsing an ambiguous context-free grammar. In 1996, Frost and Szydlowski demonstrated how

memoization In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again. Memoization ...

can be used with parser combinators to reduce the time complexity to polynomial. Later Frost used monads to construct the combinators for systematic and correct threading of memo-table throughout the computation. Like any top-down recursive descent parsing, the conventional parser combinators (like the combinators described above) will not terminate while processing a left-recursive grammar (e.g. s ::= s <*> term ‘x’, empty). A recognition algorithm that accommodates ambiguous grammars with direct left-recursive rules is described by Frost and Hafiz in 2006. The algorithm curtails the otherwise ever-growing left-recursive parse by imposing depth restrictions. That algorithm was extended to a complete parsing algorithm to accommodate indirect as well as direct left-recursion in

polynomial time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...

, and to generate compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars by Frost, Hafiz and Callaghan in 2007. This extended algorithm accommodates indirect left recursion by comparing its ‘computed context’ with ‘current context’. The same authors also described their implementation of a set of parser combinators written in the Haskell programming language based on the same algorithm.cf
X-SAIGA
— executable specifications of grammars

Notes

References

* * * * * * * * * * *

External links

parser-combinator

Common Lisp Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ''ANSI INCITS 226-1994 (S20018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived fro ...

parser combinator implementation
Parsec
Industrial strength, monadic parser combinator library for

parsec
Go version of Parsec
FParsec
F# adaptation of Parsec
csharp-monad
C# port of Parsec
Jparsec

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...

port of Parsec
Arcsecond

Javascript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...

fantasy-land compliant monadic parser combinator library
Ramble
R parser combinator implementation
nom

Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH ...

parser combinator implementation using zero copy.
pyparsing

Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...

parser combinator implementation, though it does not call itself that in its documentation.
ts-parsec

TypeScript TypeScript is a free and open source programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are ...

parser combinator library
Diesel

Swift Swift or SWIFT most commonly refers to: * SWIFT, an international organization facilitating transactions between banks ** SWIFT code * Swift (programming language) * Swift (bird), a family of birds It may also refer to: Organizations * SWIFT, ...

parser combinator library {{Parsers Parsing Formal languages Functional programming