Flex Lexical Analyser
   HOME

TheInfoList



OR:

Flex (fast
lexical analyzer In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
generator) is a
free and open-source software Free and open-source software (FOSS) is a term used to refer to groups of software consisting of both free software and open-source software where anyone is freely licensed to use, copy, study, and change the software in any way, and the source ...
alternative to
lex Lex or LEX may refer to: Arts and entertainment * ''Lex'', a daily featured column in the ''Financial Times'' Games * Lex, the mascot of the word-forming puzzle video game ''Bookworm'' * Lex, the protagonist of the word-forming puzzle video ga ...
. It is a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer program ...
that generates
lexical analyzer In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
s (also known as "scanners" or "lexers"). It is frequently used as the lex implementation together with
Berkeley Yacc Berkeley Yacc (byacc) is a Unix parser generator designed to be compatible with Yacc. It was originally written by Robert Corbett and released in 1989. Due to its liberal license and because it was faster than the AT&T Yacc, it quickly became the ...
parser generator In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine. The most common type of compiler- ...
on
BSD The Berkeley Software Distribution or Berkeley Standard Distribution (BSD) is a discontinued operating system based on Research Unix, developed and distributed by the Computer Systems Research Group (CSRG) at the University of California, Berk ...
-derived operating systems (as both lex and yacc are part of
POSIX The Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. POSIX defines both the system- and user-level application programming interf ...
), or together with
GNU bison GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation (a context-free language), warns about any parsing ambiguities, and generates a parser that reads sequence ...
(a version of
yacc Yacc (Yet Another Compiler-Compiler) is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a Look Ahead Left-to-Right Rightmost Derivation (LALR) parser generator, generating a LALR parser (the part of a com ...
) in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the
GNU Project The GNU Project () is a free software, mass collaboration project announced by Richard Stallman on September 27, 1983. Its goal is to give computer users freedom and control in their use of their computers and computing devices by collaborati ...
and is not released under the
GNU General Public License The GNU General Public License (GNU GPL or simply GPL) is a series of widely used free software licenses that guarantee end users the Four Freedoms (Free software), four freedoms to run, study, share, and modify the software. The license was th ...
, although a manual for Flex was produced and published by the Free Software Foundation.


History

Flex was written in C around 1987 by
Vern Paxson Vern Edward Paxson is a Professor of Computer Science at the University of California, Berkeley. He also leads the Networking and Security Group at the International Computer Science Institute in Berkeley, California. His interests range from tr ...
, with the help of many ideas and much inspiration from
Van Jacobson Van Jacobson (born 1950) is an American computer scientist, renowned for his work on TCP/IP network performance and scaling.
. Original version by
Jef Poskanzer Jeffrey A. Poskanzer is a computer programmer. He was the first person to post a weekly FAQ to Usenet. He developed the portable pixmap file format and pbmplus (the precursor to the Netpbm package) to manipulate it. He has also worked on the team ...
. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson.


Example lexical analyzer

This is an example of a Flex scanner for the instructional programming language
PL/0 PL/0 is a programming language, intended as an educational programming language, that is similar to but much simpler than Pascal, a general-purpose programming language. It serves as an example of how to construct a compiler. It was originally in ...
. The tokens recognized are: '+', '-', '*', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>='; numbers: 0-9 ; identifiers: a-zA-Z and keywords: begin, call, const, do, end, if, odd, procedure, then, var, while. % digit -9letter -zA-Z %% "+" "-" "*" "/" "(" ")" ";" "," "." ":=" "=" "<>" "<" ">" "<=" ">=" "begin" "call" "const" "do" "end" "if" "odd" "procedure" "then" "var" "while" (, )* + \t\n\r /* skip whitespace */ . %% int yywrap(void)


Internals

These programs perform character parsing and tokenizing via the use of a
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state autom ...
(DFA). A DFA is a theoretical machine accepting
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. These machines are a subset of the collection of
Turing machine A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...
s. DFAs are equivalent to
read-only right moving Turing machines In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automa ...
. The syntax is based on the use of
regular expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" o ...
. See also
nondeterministic finite automaton In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if * each of its transitions is ''uniquely'' determined by its source state and input symbol, and * reading an input symbol is required for each state ...
.


Issues


Time complexity

A Flex lexical analyzer usually has time complexity O(n) in the length of the input. That is, it performs a constant number of operations for each input symbol. This constant is quite low: GCC generates 12 instructions for the DFA match loop. Note that the constant is independent of the length of the token, the length of the regular expression and the size of the DFA. However, using the REJECT macro in a scanner with the potential to match extremely long tokens can cause Flex to generate a scanner with non-linear performance. This feature is optional. In this case, the programmer has explicitly told Flex to "go back and try again" after it has already matched some input. This will cause the DFA to backtrack to find other accept states. The REJECT feature is not enabled by default, and because of its performance implications its use is discouraged in the Flex manual.


Reentrancy

By default the scanner generated by Flex is not reentrant. This can cause serious problems for programs that use the generated scanner from different threads. To overcome this issue there are options that Flex provides in order to achieve reentrancy. A detailed description of these options can be found in the Flex manual.


Usage under non-Unix environments

Normally the generated scanner contains references to the ''unistd.h'' header file, which is
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and ot ...
specific. To avoid generating code that includes ''
unistd.h In the C and C++ programming languages, unistd.h is the name of the header file that provides access to the POSIX operating system API. It is defined by the POSIX.1 standard, the base of the Single Unix Specification, and should therefore be a ...
'', ''%option nounistd'' should be used. Another issue is the call to '' isatty'' (a Unix library function), which can be found in the generated code. The ''%option never-interactive'' forces flex to generate code that does not use ''isatty''.


Using flex from other languages

Flex can only generate code for C and
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
. To use the scanner code generated by flex from other languages a
language binding In programming and software design, binding is an application programming interface (API) that provides glue code specifically made to allow a programming language to use a foreign library or operating system service (one that is not native to th ...
tool such as SWIG can be used.


Unicode support

Flex is limited to matching 1-byte (8-bit) binary values and therefore does not support
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
. RE/flex and other alternatives do support Unicode matching.


Flex++

flex++ is a similar lexical scanner for
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
which is included as part of the flex package. The generated code does not depend on any runtime or external
library A library is a collection of materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a vir ...
except for a memory allocator (
malloc C dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely , , , and . The C++ programming language includes ...
or a user-supplied alternative) unless the input also depends on it. This can be useful in embedded and similar situations where traditional
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also in ...
or C runtime facilities may not be available. The flex++ generated C++ scanner includes the header file FlexLexer.h, which defines the interfaces of the two C++ generated classes.


See also

*
Comparison of parser generators This is a list of notable lexer generators and parser generators for various language classes. Regular languages Regular languages are a category of languages (sometimes termed Chomsky Type 3) which can be matched by a state machine (more sp ...
*
Lex Lex or LEX may refer to: Arts and entertainment * ''Lex'', a daily featured column in the ''Financial Times'' Games * Lex, the mascot of the word-forming puzzle video game ''Bookworm'' * Lex, the protagonist of the word-forming puzzle video ga ...
*
yacc Yacc (Yet Another Compiler-Compiler) is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a Look Ahead Left-to-Right Rightmost Derivation (LALR) parser generator, generating a LALR parser (the part of a com ...
*
GNU Bison GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation (a context-free language), warns about any parsing ambiguities, and generates a parser that reads sequence ...
*
Berkeley Yacc Berkeley Yacc (byacc) is a Unix parser generator designed to be compatible with Yacc. It was originally written by Robert Corbett and released in 1989. Due to its liberal license and because it was faster than the AT&T Yacc, it quickly became the ...


References


Further reading

* *M. E. Lesk and E. Schmidt, ''LEX - Lexical Analyzer Generator'' *Alfred Aho, Ravi Sethi and Jeffrey Ullman, ''Compilers: Principles, Techniques and Tools'', Addison-Wesley (1986). Describes the pattern-matching techniques used by flex (deterministic finite automata)


External links

*
ANSI-C Lex Specification

JFlex: Fast Scanner Generator for Java

Brief description of Lex, Flex, YACC, and Bison
{{DEFAULTSORT:Flex Lexical Analyser Free compilers and interpreters Compiling tools Free software programmed in C Finite automata Software using the BSD license Lexical analysis