Maximal Munch
   HOME

TheInfoList



OR:

In
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed. The earliest known use of this term is by R.G.G. Cattell in his PhD thesis on automatic derivation of code generators for
compilers In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
.


Application

For instance, the
lexical syntax In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
of many
programming languages A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
requires that tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
s such as -z (one or more lower-case letters). The term is also used in
compilers In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
in the
instruction selection __NOTOC__ In computer science, ''instruction selection'' is the stage of a compiler backend that transforms its middle-level intermediate representation (IR) into a low-level IR. In a typical compiler, instruction selection precedes both instruction ...
stage to describe a method of "tiling" — determining how a structured tree representing a program in an
intermediate language An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A " ...
should be converted into linear
machine code In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a very ...
. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles", each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch".


Drawbacks

In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the C programming language, the statement x=y/*z; (without any whitespace) will probably lead to a syntax error, since the /* character sequence initiates a (unintended) comment that is either unterminated or terminated by the end token */ of some later, unrelated ''actual'' comment (comments in C do not nest). What was actually meant in the statement was to assign to the variable x the result of dividing the value in y by the value obtained by dereferencing pointer z; this would be valid code. It can be stated by making use of whitespace, or using x=y/(*z);. Another example, in
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
, uses the "angle bracket" characters < and > in the syntax for
template specialization Generic programming is a style of computer programming in which algorithms are written in terms of types ''to-be-specified-later'' that are then ''instantiated'' when needed for specific types provided as parameters. This approach, pioneered by ...
, but two consecutive > characters are interpreted as the right-shift operator >>. Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens: std::vector> my_mat_11; //Incorrect in C++03, correct in C++11. std::vector > my_mat_03; //Correct in either C++03 or C++11. The
C++11 C++11 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by ...
standard adopted in August 2011 amended the grammar so that a right-shift token is accepted as synonymous with a pair of right angle brackets (as in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
), which complicates the grammar but allows the continued use of the maximal munch principle. An exception to the maximal munch rule had to be added anyway to deal with the sequence <:: which can appear in templates. In that case, unless the sequence is followed by : or > the character < is interpreted as its own token instead of part of the token <:.


Alternatives

Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions", which instead of directly taking the longest match will put some restrictions on what characters can ''follow'' a valid match. For example, stipulating that strings matching -z cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression. (In the context of regular expressions, the maximal munch principle is referred to as ''greediness'' and contrasted with ''
laziness Laziness (also known as indolence) is disinclination to activity or exertion despite having the ability to act or to exert oneself. It is often used as a pejorative; terms for a person seen to be lazy include "couch potato", "slacker", and "b ...
''.) Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (''e.g.'', the right-shift token in Java would not be matched in the context of a generics expression, where it is syntactically invalid).Van Wyk ''et al.'', 63.


References


Bibliography

* * * * * {{Cite book , last=Van Wyk , first=Eric , last2=Schwerdfeger , first2=August , title=Context-Aware Scanning for Parsing Extensible Languages , journal=GPCE '07: Proceedings of the 6th International Conference on Generative Programming and Component Engineering , year=2007 , pages=63–72 , publisher=ACM , location=New York , doi=10.1145/1289971.1289983, isbn=9781595938558 Compilers