HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a
pipeline A pipeline is a system of Pipe (fluid conveyance), pipes for long-distance transportation of a liquid or gas, typically to a market area for consumption. The latest data from 2014 gives a total of slightly less than of pipeline in 120 countries ...
of a lexer followed by a
parser Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term '' ...
, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word level) and phrase level structure of the language. Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include
TeX Tex, TeX, TEX, may refer to: People and fictional characters * Tex (nickname), a list of people and fictional characters with the nickname * Tex Earnhardt (1930–2020), U.S. businessman * Joe Tex (1933–1982), stage name of American soul singer ...
, most
wiki A wiki ( ) is a form of hypertext publication on the internet which is collaboratively edited and managed by its audience directly through a web browser. A typical wiki contains multiple pages that can either be edited by the public or l ...
grammars,
makefile In software development, Make is a command-line interface software tool that performs actions ordered by configured Dependence analysis, dependencies as defined in a configuration file called a ''makefile''. It is commonly used for build automati ...
s, simple application-specific
scripting language In computing, a script is a relatively short and simple set of instructions that typically automation, automate an otherwise manual process. The act of writing a script is called scripting. A scripting language or script language is a programming ...
s, and Raku.


Advantages

* Only one metalanguage is needed * Non-regular lexical structure is handled easily * "Token classification" is unneeded which removes the need for design accommodations such as "
the lexer hack In computer programming, the lexer hack is a solution to parsing context-sensitive grammars such as C, where classifying a sequence of characters as a variable name or a type name requires contextual information, by feeding contextual information ...
" and language
reserved word In a programming language, a reserved word (sometimes known as a reserved identifier) is a word that cannot be used by a programmer as an identifier, such as the name of a variable, function, or label – it is "reserved from use". In brief, an '' ...
s (such as "while" in C) * Grammars can be compositional (can be merged without human intervention)


Disadvantages

* Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be more complicated and thus harder to understand and debug. The same will hold for the associated grammar, if a grammar is used to generate the parser. * The resulting parser tends to be significantly less efficient than a lexer-parser pipeline with regard to both time and memory.


Implementations

* SGLR is a parser for the modular Syntax Definition Formalism (SDF), and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system. * JSGLR, a pure Java implementation of SGLR, also based on SDF. * TXL supports character-level parsing. * dparser generates ANSI C code for scannerless GLR parsers. * Spirit allows for both scannerless and scanner-based parsing. * SBP is a scannerless parser for Boolean grammars (a superset of context-free grammars), written in Java. * Laja is a two-phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java. * The Raku grammars feature of the general purpose programming language Raku. * PyParsing is a scannerless parser written in pure Python. *
META II META II is a Domain-specific language, domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at University of California, Los Angeles (UCLA). META II uses what Schorre called ''Syntax (progr ...
Has built in token parsers functions. *
TREE-META The TREE-META (or Tree Meta, TREEMETA) Translator Writing System is a compiler-compiler system for context-free languages originally developed in the 1960s. Parsing statements of the metalanguage resemble augmented Backus–Naur form with embedd ...
Like META II also is scannerless having builtin lexer functions. * CWIC compiler for writing and implementing compilers. Has token rules as part of its language. Rules in CWIC were compiled into Boolean functions returning success or failure.


Notes

* This is because parsing at the character level makes the language recognized by the parser a single
context-free language In formal language theory, a context-free language (CFL), also called a Chomsky type-2 language, is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, mos ...
defined on characters, as opposed to a context-free language of sequences of strings in
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.


References


Further reading

* {{Parsers Parsing algorithms