HOME

TheInfoList



OR:

re2c is a
free and open-source Free and open-source software (FOSS) is software available under a Software license, license that grants users the right to use, modify, and distribute the software modified or not to everyone free of charge. FOSS is an inclusive umbrella term ...
lexer generator Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful ''lexical tokens'' belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives ...
for C, C++, D, Go,
Haskell Haskell () is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell pioneered several programming language ...
,
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
,
JavaScript JavaScript (), often abbreviated as JS, is a programming language and core technology of the World Wide Web, alongside HTML and CSS. Ninety-nine percent of websites use JavaScript on the client side for webpage behavior. Web browsers have ...
,
OCaml OCaml ( , formerly Objective Caml) is a General-purpose programming language, general-purpose, High-level programming language, high-level, Comparison of multi-paradigm programming languages, multi-paradigm programming language which extends the ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
,
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH) ...
, V and Zig. It compiles declarative regular expression specifications to
deterministic finite automata In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state auto ...
. Originally written by Peter Bumbulis and described in his paper, re2c was put in
public domain The public domain (PD) consists of all the creative work to which no Exclusive exclusive intellectual property rights apply. Those rights may have expired, been forfeited, expressly Waiver, waived, or may be inapplicable. Because no one holds ...
and has been since maintained by volunteers. It is the lexer generator adopted by projects such as
PHP PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...
,
SpamAssassin Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It ...
, Ninja build system and others. Together with the
Lemon The lemon (''Citrus'' × ''limon'') is a species of small evergreen tree in the ''Citrus'' genus of the flowering plant family Rutaceae. A true lemon is a hybrid of the citron and the bitter orange. Its origins are uncertain, but some ...
parser generator, re2c is used in
BRL-CAD BRL-CAD is a constructive solid geometry (CSG) solid modeling computer-aided design (CAD) system. It includes an interactive geometry editor, Ray tracing (graphics), ray tracing support for rendering (computer graphics), graphics rendering and g ...
. This combination is also used with STEPcode, an implementation of
ISO 10303 ISO 10303 (Automation systems and integration — Product data representation and exchange)ISO 10303-1:1994 Industrial automation systems and integration -- Product data representation and exchange -- Part 1: Overview and fundamental principle ...
standard.


Philosophy

The main goal of re2c is generating ''fast'' lexers: at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated
finite-state machine A finite-state machine (FSM) or finite-state automaton (FSA, plural: ''automata''), finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number o ...
directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart and much easier to debug and understand. Moreover, this approach often results in smaller lexers, as re2c applies a number of optimizations such as
DFA minimization In automata theory (a branch of theoretical computer science), DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equiva ...
and the construction of ''tunnel automaton.'' Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a ''zero-cost'' abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.


Features

* Submatch extraction: re2c supports both POSIX-compliant capturing groups and standalone ''tags'' (with leftmost greedy disambiguation and optional handling of repeated submatch). The implementation is based on the lookahead-TDFA algorithm. * Encoding support: re2c supports ASCII, UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC. * Flexible user interface: the generated code uses a few primitive operations in order to interface with the environment (read input characters, advance to the next input position, etc.); users can redefine these primitives to whatever they need. * Storable state: re2c supports both ''pull-model'' lexers (when lexer runs without interrupts and pulls more input as necessary) and ''push-model'' lexers (when lexer is periodically stopped and resumed to parse new chunks of input). * Start conditions: re2c can generate multiple interrelated lexers, where each lexer is triggered by a certain ''condition'' in program. * Self-validation: re2c has a special mode in which it ignores all used-defined interface code and generates a self-contained ''skeleton'' program. Additionally, re2c generates two files: one with the input strings derived from the regular grammar, and one with compressed match results that are used to verify lexer behavior on all inputs. Input strings are generated so that they extensively cover DFA transitions and paths. Data generation happens right after DFA construction and prior to any optimizations, but the lexer itself is fully optimized, so skeleton programs are capable of revealing any errors in optimizations and code generation. * Warnings: re2c performs static analysis of the program and warns its users about possible deficiencies or bugs, such as undefined control flow, unreachable code, ill-formed escape symbols and potential misuse of the interface primitives. * Debugging. Besides generating human-readable lexers, re2c has a number of options that dump various intermediate representations of the generated lexer, such as NFA, multiple stages of DFA and the resulting program graph in DOT format.


Syntax

re2c program can contain any number of /*!re2c ... */ blocks. Each block consists of a sequence of ''rules'', ''definitions'' and ''configurations'' (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP or REGEXP := CODE; where REGEXP is a ''regular expression'' and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the ''default rule'' with * instead of REGEXP; it is triggered if no other rules matches. re2c has ''greedy'' matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP; (and also NAME in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string. For more advanced usage see the official re2c manual.


Regular expressions

re2c uses the following syntax for regular expressions: *"foo" case-sensitive string literal *'foo' case-insensitive string literal * -xyz/code>, a-xyz/code> character class (possibly negated) *. any character except newline *R \ S difference of character classes *R* zero or more occurrences of R *R+ one or more occurrences of R *R? zero or one occurrence of R *R repetition of R exactly n times *R repetition of R at least n times *R repetition of R from n to m times *(R) just R; parentheses are used to override precedence or for POSIX-style submatch *R S concatenation: R followed by S *R , S alternative: R or S *R / S lookahead: R followed by S, but S is not consumed *name the regular expression defined as name (except in Flex compatibility mode) *@stag an ''s-tag'': saves the last input position at which @stag matches in a variable named stag *#mtag an ''m-tag'': saves all input positions at which #mtag matches in a variable named mtag Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.


Example

Here is a very simple program in re2c (example.re). It checks that all input arguments are hexadecimal numbers. The code for re2c is enclosed in comments /*!re2c ... */, all the rest is plain C code. See the official re2c website for more complex examples. #include static int lex(const char *YYCURSOR) int main(int argc, char **argv) Given that, re2c -is -o example.c example.re generates the code below (example.c). The contents of the comment /*!re2c ... */ are substituted with a
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state auto ...
encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options; normally re2c uses switch statements, but it can use nested if statements (as in this example with -s option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment. /* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */ #include static int lex(const char *YYCURSOR) int main(int argc, char **argv)


See also

*
Ragel Ragel ( IPA: ) is a finite-state machine compiler and a parser generator. Initially Ragel supported output for C, C++ and Assembly source code, later expanded to support several other languages including Objective-C, D, Go, Ruby, and Java. Ad ...
*
Lex Lex or LEX may refer to: Computing * Amazon Lex, a service for building conversational interfaces into any application using voice and text * LEX (cipher), a stream cipher based on the round transformation of AES * Lex (software), a computer pro ...
*
Comparison of parser generators This is a list of notable lexer generators and parser generators for various language classes. Regular languages Regular languages are a category of languages (sometimes termed Chomsky Type 3) which can be matched by a state machine (more spe ...


References


External links

* {{Official website, re2c.org Compiling tools Public-domain software Lexical analysis