In computer science, a lexical grammar or lexical structure is a formal grammar defining the syntax of tokens. A program is written using characters that are defined by the lexical structure of the language used. The character set is analogous to the alphabet of a written language. The lexical grammar lays down the rules governing how a character sequence is divided into subsequences of characters, each of which represents an individual token. This is frequently defined in terms of regular expressions.
For instance, the lexical grammar for many programming languages specifies that a string literal starts with a " character and continues until a matching " is found (escaping makes this more complicated), that an identifier is an alphanumeric sequence (letters and digits, usually also allowing underscores, and disallowing initial digits), and that an integer literal is a sequence of digits. So in the character sequence "abc" xyz1 23 the tokens are ''string'', ''identifier'' and ''number'' (plus whitespace tokens), because the space character terminates the sequence of characters forming the identifier. Further, certain sequences are categorized as keywords. These generally have the same form as identifiers (usually alphabetical words) but are categorized separately; formally they have a different token type.
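As a rough illustration of how such rules divide a character sequence into tokens, the following Python sketch tries each rule at the current position and reclassifies identifiers that appear in a keyword set. The rule names, the `KEYWORDS` set, the `tokenize` function and the sample input are assumptions made for this example; they are not taken from any particular language specification.

```python
import re

# Hypothetical token rules for illustration; real languages define their own.
TOKEN_RULES = [
    ("string", r'"[^"]*"'),                     # quote, non-quotes, quote (no escapes)
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),  # no initial digit
    ("number", r"[0-9]+"),                      # sequence of digits
    ("whitespace", r"\s+"),
]

# Assumed keyword set: same form as identifiers, but a different token type.
KEYWORDS = {"if", "else", "while", "return"}

# One alternation with a named group per rule; lastgroup tells us which rule matched.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(text):
    """Split text into (token_type, lexeme) pairs, failing on unmatched input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = match.lastgroup
        lexeme = match.group()
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, lexeme))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('"abc" xyz1 23'):
        print(token)
```

Treating keywords as a post-classification step on identifiers is only one possible design; a lexer could equally list each keyword as its own rule ahead of the identifier rule.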
Examples
Regular expressions for common lexical rules follow (for example, C).
Unescaped string literal (quote, followed by non-quotes, ending in a quote):
"[^"]*"
Escaped string literal (quote, followed by escaped characters or non-quotes, ending in a quote):
"(\\.|[^\\"])*"
Integer literal:
[0-9]+
Decimal integer literal (no leading zero):
[1-9][0-9]*|0
Hexadecimal integer literal:
0[Xx][0-9A-Fa-f]+
Octal integer literal:
0[0-7]+
Identifier:
[A-Za-z_$][A-Za-z0-9_$]*
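The rules in this section can be tried out with Python's `re` module. The sketch below writes each pattern out in full as a raw string and tests whether a whole lexeme matches a given rule. The dictionary keys and the `matches` helper are names invented for this example, and Python's regular expression dialect is assumed to be close enough to the notation used here.

```python
import re

# The patterns of this section, spelled out as Python raw strings.
# The rule names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "unescaped_string": r'"[^"]*"',
    "escaped_string": r'"(\\.|[^\\"])*"',
    "integer": r"[0-9]+",
    "decimal_integer": r"[1-9][0-9]*|0",
    "hex_integer": r"0[Xx][0-9A-Fa-f]+",
    "octal_integer": r"0[0-7]+",
    "identifier": r"[A-Za-z_$][A-Za-z0-9_$]*",
}


def matches(rule, text):
    """Return True if the whole of text is matched by the named rule."""
    return re.fullmatch(PATTERNS[rule], text) is not None


if __name__ == "__main__":
    print(matches("escaped_string", r'"a\"b"'))  # backslash-escaped quote inside
    print(matches("decimal_integer", "007"))     # leading zeros are rejected
    print(matches("identifier", "1xyz"))         # initial digit is rejected
```

Using `re.fullmatch` rather than `re.match` matters for rules such as the decimal integer literal, where one alternative (`0`) would otherwise match only a prefix of an input like `007`.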
See also
* Lexical analysis
External links
ANSI C grammar, Lex specification