In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a
pipeline
Pipeline may refer to:
Electronics, computers and computing
* Pipeline (computing), a chain of data-processing stages or a CPU optimization found on
** Instruction pipelining, a technique for implementing instruction-level parallelism within a s ...
of a
lexer followed by a
parser
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
, executing
concurrently. A
language grammar is scannerless if it uses a single formalism to express both the
lexical
Lexical may refer to:
Linguistics
* Lexical corpus or lexis, a complete set of all words in a language
* Lexical item, a basic unit of lexicographical classification
* Lexicon, the vocabulary of a person, language, or branch of knowledge
* Lex ...
(word level) and phrase level structure of the language.
Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include
TeX
Tex may refer to:
People and fictional characters
* Tex (nickname), a list of people and fictional characters with the nickname
* Joe Tex (1933–1982), stage name of American soul singer Joseph Arrington Jr.
Entertainment
* ''Tex'', the Italian ...
, most
wiki
A wiki ( ) is an online hypertext publication collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be either open to the pu ...
grammars,
makefile
In software development, Make is a build automation tool that automatically builds executable programs and libraries from source code by reading files called ''Makefiles'' which specify how to derive the target program. Though integrated develo ...
s, simple application-specific
scripting language
A scripting language or script language is a programming language that is used to manipulate, customize, and automate the facilities of an existing system. Scripting languages are usually interpreted at runtime rather than compiled.
A scripting ...
s, and
Raku.
Advantages
* Only one
metalanguage
In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quot ...
is needed
* Non-regular lexical structure is handled easily
* "Token classification" is unneeded which removes the need for design accommodations such as "
the lexer hack In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual i ...
" and language
reserved word
In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a re ...
s (such as "while" in
C)
* Grammars can be compositional (can be merged without human intervention)
Disadvantages
* Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be more complicated and thus harder to understand and debug. The same will hold for the associated grammar, if a grammar is used to generate the parser.
* The resulting parser tends to be significantly less efficient than a lexer-parser pipeline with regard to both time and memory.
Implementations
SGLRis a parser for the modular Syntax Definition Formalism
SDF, and is part of the
ASF+SDF Meta-Environment and the
Stratego/XT program transformation system.
JSGLR a pure Java implementation of SGLR, also based on
SDF.
*
TXL supports character-level parsing.
dparsergenerates ANSI C code for scannerless
GLR parser
A GLR parser (GLR standing for "Generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundatio ...
s.
*
Spirit
Spirit or spirits may refer to:
Liquor and other volatile liquids
* Spirits, a.k.a. liquor, distilled alcoholic drinks
* Spirit or tincture, an extract of plant or animal material dissolved in ethanol
* Volatile (especially flammable) liquids, ...
allows for both scannerless and scanner-based parsing.
*
SBP is a scannerless parser for
boolean grammar Boolean grammars, introduced by , are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with conjunction and negation operations. Besides these explicit operations, Boole ...
s (a superset of context-free grammars), written in Java.
Lajais a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
* The
Raku Grammars feature of the general purpose programming language
Raku.
PyParsingis a scannerless parser written in pure Python.
*
META II
META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as:
Each ''syntax equation'' ...
Has built in token parsers functions.
*
TREE-META
The TREE-META (or Tree Meta, TREEMETA) Translator Writing System is a compiler-compiler system for context-free languages originally developed in the 1960s. Parsing statements of the metalanguage resemble augmented Backus–Naur form with embedded ...
Like META II also is scannerless having builtin lexer functions.
*
CWIC Compiler for Writing and Implementing Compilers. Has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure.
Notes
* This is because parsing at the character level makes the language recognized by the parser a single
context-free language
In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).
Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
defined on characters, as opposed to a context-free language of sequences of strings in
regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.
References
Further reading
*
{{Parsers
Parsing algorithms