In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a ''lexer'', ''tokenizer'', or ''scanner'', although ''scanner'' is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.


Applications

A lexer forms the first phase of a compiler frontend in modern processing. Analysis generally occurs in one pass. In older languages such as ALGOL, the initial stage was instead line reconstruction, which performed unstropping and removed whitespace and comments (and had scannerless parsers, with no separate lexer). These steps are now done as part of the lexer. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: the ''scanning'', which segments the input string into syntactic units called ''lexemes'' and categorizes these into token classes; and the ''evaluating'', which converts lexemes into processed values. Lexers are generally quite simple, with most of the complexity deferred to the parser or semantic analysis phases, and can often be generated by a lexer generator, notably lex or derivatives. However, lexers can sometimes include some complexity, such as phrase structure processing to make input easier and simplify the parser, and may be written partly or fully by hand, either to support more features or for performance.

Lexical analysis is also an important early stage in natural language processing, where text or sound waves are segmented into words and other units. This requires a variety of decisions which are not fully standardized, and the number of tokens systems produce varies for strings like "1/2", "chair's", "can't", "and/or", "1/1/2010", "2x4", "...,", and many others. This is in contrast to lexical analysis for programming and similar languages, where exact rules are commonly defined and known.


Lexeme

A ''lexeme'' is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token (page 111, ''Compilers: Principles, Techniques, & Tools'', 2nd ed., by Aho, Lam, Sethi and Ullman, as quoted at https://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme). Some authors term this a "token", using "token" interchangeably to mean both the string being tokenized and the token data structure resulting from putting this string through the tokenization process. The word lexeme in computer science is defined differently than lexeme in linguistics. A lexeme in computer science roughly corresponds to a word in linguistics (not to be confused with a word in computer architecture), although in some cases it may be more similar to a morpheme. In some natural languages (for example, in English), the linguistic lexeme is similar to the lexeme in computer science, but this is generally not true (for example, in Chinese, it is highly non-trivial to find word boundaries due to the lack of word separators).


Token

A ''lexical token'' or simply ''token'' is a string with an assigned and thus identified meaning. It is structured as a pair consisting of a ''token name'' and an optional ''token value''. The token name is a category of lexical unit. Common token names are:
* identifier: names the programmer chooses;
* keyword: names already in the programming language;
* separator (also known as punctuator): punctuation characters and paired delimiters;
* operator: symbols that operate on arguments and produce results;
* literal: numeric, logical, textual, and reference literals;
* comment: line or block comments (depending on the compiler, comments may be emitted as tokens or stripped entirely).

Consider this expression in the C programming language:

    x = a + b * 2;

The lexical analysis of this expression yields the following sequence of tokens:

    (identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)

A token name is what might be termed a part of speech in linguistics.
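
To make the token pairs above concrete, here is a minimal sketch in Python of a regular-expression-based tokenizer for this expression; the token names and patterns are illustrative assumptions, not a full C lexer:

    import re

    # Each token name maps to a pattern; order matters, since the
    # alternation tries the patterns left to right.
    TOKEN_SPEC = [
        ("literal",    r"\d+"),
        ("identifier", r"[A-Za-z_]\w*"),
        ("operator",   r"[=+\-*/]"),
        ("separator",  r"[;(),]"),
        ("skip",       r"\s+"),          # whitespace produces no token
    ]
    MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

    def tokenize(text):
        for m in MASTER.finditer(text):
            if m.lastgroup != "skip":
                yield (m.lastgroup, m.group())

    print(list(tokenize("x = a + b * 2;")))
    # [('identifier', 'x'), ('operator', '='), ('identifier', 'a'),
    #  ('operator', '+'), ('identifier', 'b'), ('operator', '*'),
    #  ('literal', '2'), ('separator', ';')]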


Lexical grammar

The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions; they define the set of possible character sequences (lexemes) of a token. A lexer recognizes strings, and for each kind of string found the lexical program takes an action, most simply producing a token.

Two important common lexical categories are white space and comments. These are also defined in the grammar and processed by the lexer, but may be discarded (not producing any tokens) and considered ''non-significant'', at most separating two tokens (as in if x instead of ifx). There are two important exceptions to this. First, in off-side rule languages that delimit blocks with indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; see phrase structure, below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for example, a prettyprinter also needs to output the comments, and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably for ALGOL, whitespace and comments were eliminated as part of the line reconstruction phase (the initial phase of the compiler frontend), but this separate phase has been eliminated and these are now handled by the lexer.
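
As a sketch, part of a lexical grammar can be written down as named regular expressions; the rule names and patterns below are illustrative assumptions for a generic C-like language, not any particular specification:

    # Illustrative lexical-grammar rules as regular expressions; names
    # and patterns are assumptions for a generic C-like language.
    LEXICAL_GRAMMAR = {
        "KEYWORD":    r"\b(?:if|else|while|return)\b",
        "IDENTIFIER": r"[A-Za-z_]\w*",
        "NUMBER":     r"\d+(?:\.\d+)?",
        "COMMENT":    r"//[^\n]*",   # matched by the grammar, then discarded
        "WHITESPACE": r"[ \t\n]+",   # non-significant: at most separates tokens
    }

Even though WHITESPACE and COMMENT produce no tokens, matching them still matters: the space is what makes if x two lexemes rather than the single identifier ifx.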


Tokenization

''Tokenization'' is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

For example, in the text string:

    The quick brown fox jumps over the lazy dog

the string isn't implicitly segmented on spaces, as a natural language speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s/).

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in semantic analysis. The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree. This is necessary in order to avoid information loss in the case where numbers may also be valid identifiers.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include: regular expressions; specific sequences of characters termed a flag; specific separating characters called delimiters; and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.

Tokens are often categorized by character content or by context within the data stream. Categories are defined by the rules of the lexer. Categories often involve grammar elements of the language used in the data stream. Programming languages often categorize tokens as identifiers, operators, grouping symbols, or by data type. Written languages commonly categorize tokens as nouns, verbs, adjectives, or punctuation. Categories are used for post-processing of the tokens either by the parser or by other functions in the program.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")". When a lexer feeds tokens to the parser, the representation used is typically an enumerated type: for example, "Identifier" is represented with 0, "Assignment operator" with 1, "Addition operator" with 2, etc.

Tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is termed ''tokenizing''. If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.
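
Returning to the sentence above, a minimal Python sketch of this explicit split, assuming simple whitespace delimiters:

    import re

    sentence = "The quick brown fox jumps over the lazy dog"
    # Split on runs of whitespace, matching the /\s/-style pattern above.
    tokens = re.split(r"\s+", sentence)
    print(len(tokens), tokens)
    # 9 ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']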


Scanner

The first stage, the ''scanner'', is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termed lexemes). For example, an ''integer'' lexeme may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed the ''maximal munch'', or ''longest match'', rule). In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.
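
A sketch of the maximal-munch rule for an integer token, assuming the scanner simply consumes digits until the first unacceptable character:

    def scan_integer(text, pos):
        """Maximal munch: consume digits until a non-digit is reached,
        returning the longest integer lexeme starting at pos (a sketch)."""
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return text[start:pos], pos

    lexeme, end = scan_integer("123+45", 0)
    print(lexeme, end)  # '123' 3 -- stops at '+', the first unacceptable character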


Evaluator

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the ''evaluator'', which goes over the characters of the lexeme to produce a ''value''. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some unstropping. The evaluators for integer literals may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal incorporates a lexer, which unescapes the escape sequences.

For example, in the source code of a computer program, the string

    net_worth_future = (assets - liabilities);

might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

    IDENTIFIER net_worth_future
    EQUALS
    OPEN_PARENTHESIS
    IDENTIFIER assets
    MINUS
    IDENTIFIER liabilities
    CLOSE_PARENTHESIS
    SEMICOLON

Due to licensing restrictions of existing parsers, it may be necessary to write a lexer by hand. This is practical if the list of tokens is small, but in general, lexers are generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state transition table for a finite-state machine (which is plugged into template code for compiling and executing).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "''n'' opening parentheses, followed by a statement, followed by ''n'' closing parentheses." They are unable to keep count and verify that ''n'' is the same on both sides, unless a finite set of permissible values exists for ''n''. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see the example in ''Structure and Interpretation of Computer Programs'').
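
As for the evaluator stage itself, a few evaluator functions can be sketched in Python; the function names are hypothetical, and the escaped-string case leans on Python's own escape decoding as a stand-in for a dedicated unescaping lexer:

    # Minimal evaluator sketches (illustrative names). Each takes a
    # lexeme string and produces the token's value.
    def eval_integer(lexeme):
        return int(lexeme)      # could also handle bases, e.g. int(lexeme, 0)

    def eval_quoted_string(lexeme):
        return lexeme[1:-1]     # simple literal: strip the quotes only

    def eval_escaped_string(lexeme):
        # An escaped literal needs a small lexer of its own to unescape;
        # here Python's unicode_escape codec stands in for that lexer.
        return lexeme[1:-1].encode().decode("unicode_escape")

    print(eval_integer("42"))                      # 42
    print(eval_quoted_string('"hi"'))              # hi
    print(eval_escaped_string(r'"a\tb"'))          # a<TAB>b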


Obstacles

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:
* Punctuation and whitespace may or may not be included in the resulting list of tokens.
* All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
* Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens). A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.

Tokenization is particularly difficult for languages written in ''scriptio continua'', which exhibit no word boundaries, such as Ancient Greek, Chinese, or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated.

Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.
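
To illustrate how the choice of heuristic changes the token count, here is a small comparison of two naive tokenizers (both are toy rules, not standard NLP tokenizers):

    import re

    text = "New York-based, can't, 1/2"
    # Two naive heuristics give different token counts for the same input:
    on_space = text.split()                          # whitespace only
    on_punct = re.findall(r"\w+|[^\w\s]", text)      # split out punctuation too
    print(on_space)  # ['New', 'York-based,', "can't,", '1/2']
    print(on_punct)  # ['New', 'York', '-', 'based', ',', 'can', "'", 't', ',', '1', '/', '2']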


Lexer generator

Lexers are often generated by a ''lexer generator'', analogous to parser generators, and such tools often come together. The most established is lex, paired with the yacc parser generator, or rather some of their many reimplementations, like flex (often paired with GNU Bison). These generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer.

These tools yield very fast development, which is especially important in early development, both to get a working lexer and because a language specification may change often. Further, they often provide advanced features, such as pre- and post-conditions, which are hard to program by hand. However, an automatically generated lexer may lack flexibility, and thus may require some manual modification, or an all-manually written lexer.

Lexer performance is a concern, and optimizing is worthwhile, more so in stable languages where the lexer is run very often (such as C or HTML). lex/flex-generated lexers are reasonably fast, but improvements of two to three times are possible using more tuned generators. Hand-written lexers are sometimes used, but modern lexer generators produce faster lexers than most hand-coded ones. The lex/flex family of generators uses a table-driven approach, which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c have proven to produce engines that are between two and three times faster than flex-produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.
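
The idea of a lexer generator – a specification of (name, pattern) rules in, a working lexer out – can be sketched in a few lines of Python; this is a toy illustration of the concept, not the actual interface of lex or flex:

    import re

    def make_lexer(rules):
        """A toy 'lexer generator': takes (name, regex) rules -- a
        miniature domain-specific specification -- and returns a lexer
        function. A rule named SKIP is discarded by convention."""
        master = re.compile("|".join(f"(?P<{n}>{p})" for n, p in rules))
        def lex(text):
            pos = 0
            while pos < len(text):
                m = master.match(text, pos)
                if not m:
                    raise SyntaxError(f"invalid token at {pos}: {text[pos]!r}")
                if m.lastgroup != "SKIP":
                    yield (m.lastgroup, m.group())
                pos = m.end()
        return lex

    lexer = make_lexer([("NUM", r"\d+"), ("ID", r"[a-z]+"),
                        ("OP", r"[+*=]"), ("SKIP", r"\s+")])
    print(list(lexer("x = 2 + y")))
    # [('ID', 'x'), ('OP', '='), ('NUM', '2'), ('OP', '+'), ('ID', 'y')]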


Phrase structure

Lexical analysis mainly segments the input stream of characters into tokens, simply grouping the characters into pieces and categorizing them. However, lexing may be significantly more complex; most simply, lexers may omit tokens or insert added tokens. Omitting tokens, notably whitespace and comments, is very common when these are not needed by the compiler. Less commonly, added tokens may be inserted. This is done mainly to group tokens into statements, or statements into blocks, to simplify the parser.


Line continuation

Line continuation is a feature of some languages where a newline is normally a statement terminator. Most often, ending a line with a backslash (immediately followed by a newline) results in the line being ''continued'' – the following line is ''joined'' to the prior line. This is generally done in the lexer: the backslash and newline are discarded, rather than the newline being tokenized. Examples include bash, other shell scripts, and Python.
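
A sketch of the lexer-level treatment: under the assumption that continuation is just a backslash immediately before a newline, the pair can simply be deleted from the input before (or during) tokenization:

    import re

    source = "total = 1 +\\\n        2\nprint(total)\n"
    # A backslash immediately followed by a newline is discarded,
    # joining the line to the next; the newline yields no token.
    joined = re.sub(r"\\\n", "", source)
    print(joined)
    # total = 1 +        2
    # print(total)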


Semicolon insertion

Many languages use the semicolon as a statement terminator. Most often this is mandatory, but in some languages the semicolon is optional in many contexts. This is mainly done at the lexer level, where the lexer outputs a semicolon into the token stream, despite one not being present in the input character stream, and is termed ''semicolon insertion'' or ''automatic semicolon insertion''. In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer. Optional semicolons or other terminators or separators are also sometimes handled at the parser level, notably in the case of trailing commas or semicolons.

Semicolon insertion is a feature of BCPL and its distant descendant Go, though it is absent in B or C ("Semicolons in Go", golang-nuts, Rob 'Commander' Pike, 12/10/09). Semicolon insertion is present in JavaScript, though the rules are somewhat complex and much-criticized; to avoid bugs, some recommend always using semicolons, while others use initial semicolons, termed defensive semicolons, at the start of potentially ambiguous statements.

Semicolon insertion (in languages with semicolon-terminated statements) and line continuation (in languages with newline-terminated statements) can be seen as complementary: semicolon insertion adds a token even though newlines generally do ''not'' generate tokens, while line continuation prevents a token from being generated even though newlines generally ''do'' generate tokens.
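
A sketch of Go-style insertion in Python, assuming per-line token lists and a simplified set of statement-ending token kinds (the real Go rule covers more cases):

    # Emit a ';' token after a line whose final token could end a
    # statement (identifier, literal, ')', '}' ...). Simplified rule set.
    ENDERS = {"IDENT", "NUMBER", "RPAREN", "RBRACE"}

    def insert_semicolons(lines_of_tokens):
        for line in lines_of_tokens:
            yield from line
            if line and line[-1][0] in ENDERS:
                yield ("SEMICOLON", ";")   # inserted; not in the input text

    line1 = [("IDENT", "x"), ("OP", "="), ("NUMBER", "1")]
    line2 = [("IDENT", "f"), ("LPAREN", "("), ("IDENT", "x"), ("RPAREN", ")")]
    print(list(insert_semicolons([line1, line2])))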


Off-side rule

The off-side rule (blocks determined by indenting) can be implemented in the lexer, as in Python, where increasing the indenting results in the lexer emitting an INDENT token and decreasing the indenting results in the lexer emitting a DEDENT token. These tokens correspond to the opening and closing braces in languages that use braces for blocks, and mean that the phrase grammar does not depend on whether braces or indenting are used. This requires that the lexer hold state, namely the current indent level, so that it can detect changes in indenting; hence the lexical grammar is not context-free: INDENT and DEDENT depend on the contextual information of the prior indent level.
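
Python's own standard-library tokenizer makes these tokens visible:

    import io
    import tokenize

    src = "if x:\n    y = 1\nz = 2\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # Among the output: an INDENT token before 'y = 1' and a DEDENT
    # before 'z = 2', emitted purely from the change in indentation.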


Context-sensitive lexing

Generally, lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. This also allows simple one-way communication from lexer to parser, without needing any information flowing back to the lexer.

There are exceptions, however. Simple examples include: semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python, which requires holding one token in a buffer before emitting it (to see if the next token is another string literal); and the off-side rule in Python, which requires maintaining a count of indent level (indeed, a stack of each indent level). These examples all require only lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases.

A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase, since typedef names and variable names are lexically identical but constitute different token classes. Thus in the hack, the lexer calls the semantic analyzer (say, the symbol table) and checks whether the sequence is a typedef name. In this case, information must flow back not just from the parser, but from the semantic analyzer back to the lexer, which complicates the design.
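
The Python string-concatenation case above can be sketched as a one-token buffer between scanner and parser; the token kind names here are assumptions:

    # Hold a STRING token back until the lexer sees whether the next
    # token is another STRING; if so, merge them (a minimal sketch).
    def concat_strings(tokens):
        pending = None
        for kind, value in tokens:
            if kind == "STRING":
                pending = value if pending is None else pending + value
                continue
            if pending is not None:
                yield ("STRING", pending)
                pending = None
            yield (kind, value)
        if pending is not None:
            yield ("STRING", pending)

    toks = [("STRING", "foo"), ("STRING", "bar"), ("OP", "+"), ("ID", "x")]
    print(list(concat_strings(toks)))
    # [('STRING', 'foobar'), ('OP', '+'), ('ID', 'x')]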


See also

* List of parser generators


External links

* Word Mention Segmentation Task, an analysis