Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term ''parsing'' comes from Latin ''pars'' (''orationis''), meaning part (of speech).

The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.

Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic information. Some parsing algorithms generate a ''parse forest'' or list of parse trees from a string that is syntactically ambiguous.

The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing which linguistic cues help speakers interpret garden-path sentences.

Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters. The term may also be used to describe a split or separation.

In data analysis, the term is often used to refer to a process extracting desired information from data, e.g., creating a time series signal from an XML document.


Human languages


Traditional methods

The traditional grammatical exercise of parsing, sometimes known as ''clause analysis'', involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part. This is determined in large part from study of the language's conjugations and declensions, which can be quite intricate for heavily inflected languages. To parse a phrase such as "man bites dog" involves noting that the singular noun "man" is the subject of the sentence, the verb "bites" is the third person singular of the present tense of the verb "to bite", and the singular noun "dog" is the object of the sentence. Techniques such as sentence diagrams are sometimes used to indicate the relation between elements in the sentence.

Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language.


Computational methods

In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language, whose usage is to convey meaning (or semantics) amongst a potentially unlimited range of possibilities, only some of which are germane to the particular case. So an utterance "Man bites dog" versus "Dog bites man" is definite on one detail, but in another language might appear as "Man dog bites", with a reliance on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behaviour even though it is clear that some rules are being followed.

In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance, some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. ''(See machine learning.)'' Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and neural nets. Most of the more successful systems use ''lexical'' statistics (that is, they consider the identities of the words involved, as well as their part of speech). However, such systems are vulnerable to overfitting and require some kind of smoothing to be effective.

Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CYK algorithm, usually with some heuristic to prune away unlikely analyses to save time. ''(See chart parsing.)'' However, some systems trade accuracy for speed using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking, in which the parser proposes some large number of analyses, and a more complex system selects the best option.

In natural language understanding applications, semantic parsers convert the text into a representation of its meaning.
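The CYK algorithm mentioned above can be illustrated with a minimal recognizer. This is only a sketch under stated assumptions: the toy grammar, the lexicon, and the function name `cyk_recognize` are hypothetical and chosen to match the "man bites dog" example; the grammar is assumed to be in Chomsky normal form, and no probabilities or pruning heuristics are included.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustration only).
# Binary rules: (B, C) -> set of parents A such that A -> B C.
BINARY = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
# Lexical rules: word -> set of categories.
LEXICON = {
    "man": {"NP"},
    "dog": {"NP"},
    "bites": {"V"},
}


def cyk_recognize(words, start="S"):
    """Return True if the toy grammar derives the given word sequence."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every nonterminal that derives words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    # Fill in longer spans from shorter ones, trying every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n - 1]
```

With this grammar, `cyk_recognize("man bites dog".split())` succeeds, while the verb-final order `"man dog bites"` is rejected because no rule combines the constituents in that order. A practical statistical parser would additionally store scores and back-pointers in each table cell so that the most likely tree can be recovered.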


Psycholinguistics

In psycholinguistics, parsing involves not just the assignment of words to categories (formation of ontological insights), but the evaluation of the meaning of a sentence according to the rules of syntax drawn by inferences made from each word in the sentence (known as connotation). This normally occurs as words are being heard or read.

Neurolinguistics generally understands parsing to be a function of working memory, meaning that parsing is used to keep several parts of one sentence at play in the mind at one time, all readily accessible to be analyzed as needed. Because human working memory has limitations, so does the function of sentence parsing. This is evidenced by several different types of syntactically complex sentences that demonstrate potential issues for mental parsing of sentences.

The first, and perhaps most well-known, type of sentence that challenges parsing ability is the garden-path sentence. These sentences are designed so that the most common interpretation of the sentence appears grammatically faulty, but upon further inspection, these sentences are grammatically sound. Garden-path sentences are difficult to parse because they contain a phrase or a word with more than one meaning, often with the most typical meaning being a different part of speech [Pritchett, B. L. (1988). Garden Path Phenomena and the Grammatical Basis of Language Processing. Language, 64(3), 539–576. https://doi.org/10.2307/414532]. For example, in the sentence "the horse raced past the barn fell", "raced" is initially interpreted as a past tense verb, but in this sentence it functions as part of an adjective phrase. Since parsing is used to identify parts of speech, these sentences challenge the parsing ability of the reader.

Another type of sentence that is difficult to parse is an attachment ambiguity, which includes a phrase that could potentially modify different parts of a sentence, and therefore presents a challenge in identifying syntactic relationship (e.g. "The boy saw the lady with the telescope", in which the ambiguous phrase "with the telescope" could modify either "the boy saw" or "the lady").

A third type of sentence that challenges parsing ability is center embedding, in which phrases are placed in the center of other similarly formed phrases (e.g. "The rat the cat the man hit chased ran into the trap"). Sentences with two or, in the most extreme cases, three center embeddings are challenging for mental parsing, again because of ambiguity of syntactic relationship.

Within neurolinguistics there are multiple theories that aim to describe how parsing takes place in the brain. One such model is a more traditional generative model of sentence processing, which theorizes that within the brain there is a distinct module designed for sentence parsing, which is preceded by access to lexical recognition and retrieval, and then followed by syntactic processing that considers a single syntactic result of the parsing, only returning to revise that syntactic interpretation if a potential problem is detected. The opposing, more contemporary model theorizes that within the mind, the processing of a sentence is not modular and does not happen in strict sequence. Rather, it posits that several different syntactic possibilities can be considered at the same time, because lexical access, syntactic processing, and determination of meaning occur in parallel in the brain. In this way these processes are integrated.

Although there is still much to learn about the neurology of parsing, studies have shown evidence that several areas of the brain might play a role in parsing. These include the left anterior temporal pole, the left inferior frontal gyrus, the left superior temporal gyrus, the left superior frontal gyrus, the right posterior cingulate cortex, and the left angular gyrus. Although it has not been absolutely proven, it has been suggested that these different structures might favor either phrase-structure parsing or dependency-structure parsing, meaning different types of parsing could be processed in different ways which have yet to be understood.


Discourse analysis

Discourse analysis examines ways to analyze language use and semiotic events. Persuasive language may be called rhetoric.


Computer languages


Parser

A parser is a software component that takes input data (typically text) and builds a data structure, often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step. The parser is often preceded by a separate lexical analyser, which creates tokens from the sequence of input characters; alternatively, these can be combined in scannerless parsing. Parsers may be programmed by hand or may be automatically or semi-automatically generated by a parser generator. Parsing is complementary to templating, which produces formatted ''output.'' These may be applied to different domains, but often appear together, such as the scanf/printf pair, or the input (front end parsing) and output (back end code generation) stages of a compiler.

The input to a parser is typically text in some computer language, but may also be text in a natural language or less structured textual data, in which case generally only certain parts of the text are extracted, rather than a parse tree being constructed. Parsers range from very simple functions such as scanf, to complex programs such as the frontend of a C++ compiler or the HTML parser of a web browser. An important class of simple parsing is done using regular expressions, in which a group of regular expressions defines a regular language and a regular expression engine automatically generates a parser for that language, allowing pattern matching and extraction of text. In other contexts regular expressions are instead used prior to parsing, as the lexing step whose output is then used by the parser.

The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages. In the case of programming languages, a parser is a component of a compiler or interpreter, which parses the source code of a computer programming language to create some form of internal representation; the parser is a key step in the compiler frontend. Programming languages tend to be specified in terms of a deterministic context-free grammar because fast and efficient parsers can be written for them. For compilers, the parsing itself can be done in one pass or multiple passes (see one-pass compiler and multi-pass compiler).

The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups, where provision is made for code relocation during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Conversely, a backward GOTO does not require a fix-up, as the location will already be known.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step.

For example, in Python the following is syntactically valid code:

    x = 1
    print(x)

The following code, however, is syntactically valid in terms of the context-free grammar, yielding a syntax tree with the same structure as the previous, but violates the semantic rule requiring variables to be initialized before use:

    x = 1
    print(y)
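The difference between the two snippets can be observed with Python's standard `ast` module. The sketch below is illustrative only: the helper names `shape` and `run` are hypothetical, and it relies on the fact that CPython reports an undefined name at run time (as a `NameError`) rather than in a separate static semantic-analysis pass, so the run-time check stands in here for the contextual-analysis step described above.

```python
import ast

valid_src = "x = 1\nprint(x)\n"
undefined_src = "x = 1\nprint(y)\n"


def shape(source):
    """Return the node-type names of the module's top-level statements."""
    return [type(node).__name__ for node in ast.parse(source).body]


def run(source):
    """Execute the source in a fresh namespace and report the outcome."""
    try:
        exec(compile(source, "<example>", "exec"), {})
        return "ok"
    except NameError as err:
        # The parser accepted the program; the name check fails only later.
        return f"NameError: {err}"
```

Both sources parse successfully and `shape` returns the same list of statement types for each, which reflects the identical tree structure. Only `run(undefined_src)` reports a problem, because the rule that `y` must be bound before use lies outside the context-free grammar.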


Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12 * (3 + 4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.

The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.

The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
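The lexical stage for the calculator input can be sketched as follows. The token pattern and the function name tokenize are assumptions made for this illustration; a real lexer would normally also record token categories and source positions.

```python
import re

# Assumed token grammar: an unsigned integer or a single operator/parenthesis.
TOKEN_PATTERN = re.compile(r"\d+|[-+*/^()]")


def tokenize(text):
    """Split an arithmetic expression into a list of token strings."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace separates tokens, nothing more
            pos += 1
            continue
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


tokens = tokenize("12 * (3 + 4)^2")
# tokens == ['12', '*', '(', '3', '+', '4', ')', '^', '2']
```

Because an operator character always ends the current number, fragments such as "12*" or "(3" never appear as single tokens.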


Types of parsers

The ''task'' of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

Top-down parsing: Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules (Aho, A.V., Sethi, R. and Ullman, J.D. (1986), ''Compilers: Principles, Techniques, and Tools'', Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA). This is known as the primordial soup approach. Very similar to sentence diagramming, primordial soup breaks down the constituencies of sentences.

Bottom-up parsing: A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.

LL parsers and recursive-descent parsers are examples of top-down parsers that cannot accommodate left-recursive production rules. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and Callaghan (Frost, R., Hafiz, R. and Callaghan, P. (2007), "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars", ''10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE'', pages 109-120, June 2007, Prague; Frost, R., Hafiz, R. and Callaghan, P. (2008), "Parser Combinators for Ambiguous Left-Recursive Grammars", ''10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN'', Volume 4902/2008, pages 167-181, January 2008, San Francisco), which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given context-free grammar.

An important distinction with regard to parsers is whether a parser generates a ''leftmost derivation'' or a ''rightmost derivation'' (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).

Some parsing algorithms have been designed for visual programming languages. Parsers for visual languages are sometimes based on graph grammars.

Adaptive parsing algorithms have been used to construct "self-extending" natural language user interfaces.
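A top-down parser of the recursive-descent kind can be sketched for a small arithmetic grammar. The grammar below is an assumption chosen for this illustration; it is written with repetition instead of left recursion precisely because a simple recursive-descent parser would loop forever on a rule such as E → E + T. The function name parse_expression and the tuple-based tree shape are likewise hypothetical.

```python
def parse_expression(tokens):
    """Recursive-descent parser for an assumed grammar:

        expr   := term ('+' term)*
        term   := factor ('*' factor)*
        factor := NUMBER | '(' expr ')'

    Returns nested tuples such as ('+', 1, ('*', 2, 3)).
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == "+":  # loop replaces the left-recursive rule
            advance()
            node = ("+", node, term())
        return node

    def term():
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():
        token = peek()
        if token == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token is not None and token.isdigit():
            return int(advance())
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree


tree = parse_expression(["1", "+", "2", "*", "3"])
# tree == ('+', 1, ('*', 2, 3))
```

Each grammar rule becomes one function, and the call stack plays the role of the partially built derivation, which is why this style is described as top-down.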


Implementation

A simple parser implementation reads the entire input file, performs an intermediate computation or translation, and then writes the entire output file, as in-memory multi-pass compilers do.

Alternative parser implementation approaches:
* Push parsers call registered handlers (callbacks) as soon as the parser detects relevant tokens in the input stream. A push parser may skip parts of the input that are irrelevant (an example is Expat).
* Pull parsers, such as those typically used by compiler front-ends, work by "pulling" input text.
* Incremental parsers (such as incremental chart parsers) do not need to completely re-parse the entire file as its text is edited by a user.
* Active versus passive parsers (Song-Chun Zhu, "Classic Parsing Algorithms").

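The difference between the push and pull styles can be sketched with two modules from the Python standard library: xml.parsers.expat, a binding to the Expat parser mentioned above, and xml.dom.pulldom. The sample document and the two function names are assumptions made for this illustration.

```python
import xml.parsers.expat
from xml.dom import pulldom

DOC = "<list><item>a</item><item>b</item></list>"  # assumed sample input


def push_parse(text):
    """Push style: the parser drives and calls our handler for each element."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    parser.Parse(text, True)  # True marks this as the final chunk of input
    return seen


def pull_parse(text):
    """Pull style: our loop drives and asks the parser for the next event."""
    seen = []
    for event, node in pulldom.parseString(text):
        if event == pulldom.START_ELEMENT:
            seen.append(node.tagName)
    return seen


pushed = push_parse(DOC)
pulled = pull_parse(DOC)
```

Both functions collect the same element names; what differs is which side owns the control flow, the parser (push) or the calling code (pull).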

Parser development software

Some of the well-known parser development tools include the following:
* ANTLR
* Bison
* Coco/R
* Definite clause grammar
* GOLD
* JavaCC
* Lemon
* Lex
* LuZc
* Parboiled
* Parsec
* Ragel
* Spirit Parser Framework
* Syntax Definition Formalism
* SYNTAX
* XPL
* Yacc


Lookahead

Lookahead establishes the maximum number of incoming tokens that a parser can use to decide which rule it should use. Lookahead is especially relevant to LL, LR, and LALR parsers, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1).

Most programming languages, the primary target of parsers, are carefully defined in such a way that a parser with limited lookahead, typically one token, can parse them, because parsers with limited lookahead are often more efficient. One important change to this trend came in 1990 when Terence Parr created ANTLR for his Ph.D. thesis, a parser generator for efficient LL(''k'') parsers, where ''k'' is any fixed value.

LR parsers typically have only a few actions after seeing each token. They are shift (add this token to the stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce).

Lookahead has two advantages.
* It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause.
* It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states.

Example: Parsing the expression 1 + 2 * 3

The set of expression parsing rules (called the grammar) is as follows:

Rule1: E → E + E    Expression is the sum of two expressions.
Rule2: E → E * E    Expression is the product of two expressions.
Rule3: E → number   Expression is a simple number.
Rule4: + has less precedence than *

Most programming languages (except for a few such as APL and Smalltalk) and algebraic formulas give higher precedence to multiplication than addition, in which case the correct interpretation of the example above is 1 + (2 * 3). Note that Rule4 above is a semantic rule. It is possible to rewrite the grammar to incorporate this into the syntax. However, not all such rules can be translated into syntax.

Simple non-lookahead parser actions

Initially Input = [1, +, 2, *, 3]
1. Shift "1" onto stack from input (in anticipation of rule3). Input = [+, 2, *, 3] Stack = [1]
2. Reduce "1" to expression "E" based on rule3. Stack = [E]
3. Shift "+" onto stack from input (in anticipation of rule1). Input = [2, *, 3] Stack = [E, +]
4. Shift "2" onto stack from input (in anticipation of rule3). Input = [*, 3] Stack = [E, +, 2]
5. Reduce stack element "2" to expression "E" based on rule3. Stack = [E, +, E]
6. Reduce stack items [E, +, E] to "E" based on rule1. Stack = [E]
7. Shift "*" onto stack from input (in anticipation of rule2). Input = [3] Stack = [E, *]
8. Shift "3" onto stack from input (in anticipation of rule3). Input = [] (empty) Stack = [E, *, 3]
9. Reduce stack element "3" to expression "E" based on rule3. Stack = [E, *, E]
10. Reduce stack items [E, *, E] to "E" based on rule2. Stack = [E]

The parse tree and the code resulting from it are not correct according to the language semantics, because the addition was reduced before the multiplication.

To parse correctly without lookahead, there are three solutions:
* The user has to enclose expressions within parentheses. This often is not a viable solution.
* The parser needs to have more logic to backtrack and retry whenever a rule is violated or not complete. A similar method is followed in LL parsers.
* Alternatively, the parser or grammar needs to have extra logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. This correctly parses the expression but with many more states and increased stack depth.

Lookahead parser actions

1. Shift 1 onto stack on input 1 in anticipation of rule3. It does not reduce immediately.
2. Reduce stack item 1 to a simple expression on input + based on rule3. The lookahead is +, so we are on the path to E +, so we can reduce the stack to E.
3. Shift + onto stack on input + in anticipation of rule1.
4. Shift 2 onto stack on input 2 in anticipation of rule3.
5. Reduce stack item 2 to an expression on input * based on rule3. The lookahead * expects only E before it.
6. Now the stack has E + E and the input is still *. There are two choices: shift based on rule2 or reduce based on rule1. Since * has higher precedence than + based on rule4, we shift * onto the stack in anticipation of rule2.
7. Shift 3 onto stack on input 3 in anticipation of rule3.
8. Reduce stack item 3 to an expression after seeing end of input based on rule3.
9. Reduce stack items E * E to E based on rule2.
10. Reduce stack items E + E to E based on rule1.

The parse tree generated is correct, and it is obtained more simply than with a non-lookahead parser. This is the strategy followed in LALR parsers.
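The two walk-throughs above can be compared in a small shift-reduce sketch. It is a hand-written illustration under simplifying assumptions, not a table-driven LR or LALR parser: numbers are treated as already reduced by rule3, trees are nested tuples, and the names shift_reduce, consult_lookahead and evaluate are hypothetical.

```python
PRECEDENCE = {"+": 1, "*": 2}  # Rule4: + has less precedence than *


def is_expr(item):
    # Assumption: an int stands for a number already reduced by rule3,
    # and a tuple is an expression built by rule1 or rule2.
    return isinstance(item, (int, tuple))


def shift_reduce(tokens, consult_lookahead):
    """Parse tokens such as [1, '+', 2, '*', 3] into a nested tuple."""
    stack = []
    pos = 0
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        can_reduce = (len(stack) >= 3 and stack[-2] in PRECEDENCE
                      and is_expr(stack[-1]) and is_expr(stack[-3]))
        if can_reduce:
            op = stack[-2]
            # With lookahead, delay the reduction when the next operator
            # binds more tightly than the one on the stack.
            delay = (consult_lookahead and lookahead in PRECEDENCE
                     and PRECEDENCE[lookahead] > PRECEDENCE[op])
            if not delay:
                right = stack.pop()
                stack.pop()
                left = stack.pop()
                stack.append((op, left, right))  # reduce by rule1 or rule2
                continue
        if lookahead is None:
            break
        stack.append(lookahead)  # shift
        pos += 1
    if len(stack) != 1:
        raise SyntaxError("input is not a complete expression")
    return stack[0]


def evaluate(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


greedy = shift_reduce([1, "+", 2, "*", 3], consult_lookahead=False)
guided = shift_reduce([1, "+", 2, "*", 3], consult_lookahead=True)
# greedy == ('*', ('+', 1, 2), 3), which evaluates to 9
# guided == ('+', 1, ('*', 2, 3)), which evaluates to 7
```

The only difference between the two runs is the decision taken when E + E is on the stack and * is the next token, which is step 6 of the lookahead walk-through.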


List of parsing algorithms


See also

* Backtracking
* Chart parser
* Compiler-compiler
* Deterministic parsing
* DMS Software Reengineering Toolkit
* Grammar checker
* Inverse parser
* LALR parser
* Left corner parser
* Lexical analysis
* Parsing expression grammar
* Pratt parser
* Program transformation
* Shallow parsing
* Sentence processing
* Source code generation


References


Further reading

* Chapman, Nigel P. ''LR Parsing: Theory and Practice''. Cambridge University Press, 1987.
* Grune, Dick; Jacobs, Ceriel J.H. ''Parsing Techniques - A Practical Guide''. Vrije Universiteit Amsterdam, Amsterdam, the Netherlands. Originally published by Ellis Horwood, Chichester, England, 1990.


External links


* The Lemon LALR Parser Generator
* Stanford Parser: The Stanford Parser
* Turin University Parser: Natural language parser for Italian, open source, developed in Common Lisp by Leonardo Lesmo, University of Torino, Italy.
