Maximal munch

In computer programming and computer science, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed.

The earliest known use of this term is by R.G.G. Cattell in his PhD thesis [1] on automatic derivation of code generators for compilers.

Application

For instance, the lexical syntax of many programming languages requires that tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used regular expressions such as [a-z]+ (one or more lower-case letters). [2]
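
For illustration, a hand-written lexer can implement maximal munch by extending the current token for as long as the next character can still be part of it, and only then emitting it. The following is a minimal sketch for the [a-z]+ case above; the function name and structure are illustrative, not taken from any particular lexer or lexer generator.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Scan [a-z]+ tokens with maximal munch: keep consuming characters while
// they can still extend the current token, and only then emit it.
std::vector<std::string> lex_lowercase_words(const std::string& input) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (std::islower(static_cast<unsigned char>(input[i]))) {
            std::size_t start = i;
            while (i < input.size() &&
                   std::islower(static_cast<unsigned char>(input[i]))) {
                ++i;  // greedily extend the match
            }
            tokens.push_back(input.substr(start, i - start));
        } else {
            ++i;  // skip characters that cannot start a token
        }
    }
    return tokens;
}

int main() {
    // "foobar" is returned as one token, never as "foo" followed by "bar".
    for (const auto& t : lex_lowercase_words("foobar baz"))
        std::cout << t << '\n';
}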

The term is also used in compilers in the instruction selection stage to describe a method of "tiling" — determining how a structured tree representing a program in an intermediate language should be converted into linear machine code. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles", each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch". [3]
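
As a rough sketch of the idea, the following code greedily covers a tiny expression tree with the largest available tile at each node. The node kinds, the single fused multiply-add tile, and the emitted mnemonics are illustrative assumptions, not those of any real instruction set or code generator.

#include <iostream>
#include <memory>

// A tiny expression-tree IR: constants and binary ADD/MUL nodes.
struct Node {
    enum Kind { CONST, ADD, MUL } kind;
    int value = 0;                      // used only by CONST
    std::unique_ptr<Node> left, right;  // used only by ADD/MUL
};

std::unique_ptr<Node> constant(int v) {
    auto n = std::make_unique<Node>();
    n->kind = Node::CONST; n->value = v;
    return n;
}

std::unique_ptr<Node> binary(Node::Kind k, std::unique_ptr<Node> l,
                             std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->kind = k; n->left = std::move(l); n->right = std::move(r);
    return n;
}

// Maximal munch: at each node try the largest tile first, and recurse only
// into the parts of the tree that the chosen tile does not cover.
void munch(const Node* n) {
    // Largest (hypothetical) tile: ADD(MUL(a, b), c) -> one fused multiply-add.
    if (n->kind == Node::ADD && n->left->kind == Node::MUL) {
        munch(n->left->left.get());
        munch(n->left->right.get());
        munch(n->right.get());
        std::cout << "MADD\n";
        return;
    }
    switch (n->kind) {
        case Node::ADD: munch(n->left.get()); munch(n->right.get());
                        std::cout << "ADD\n"; break;
        case Node::MUL: munch(n->left.get()); munch(n->right.get());
                        std::cout << "MUL\n"; break;
        case Node::CONST: std::cout << "LOADI " << n->value << '\n'; break;
    }
}

int main() {
    // (2 * 3) + 4 is covered by the single MADD tile plus two operand loads,
    // rather than by separate MUL and ADD instructions.
    auto tree = binary(Node::ADD,
                       binary(Node::MUL, constant(2), constant(3)),
                       constant(4));
    munch(tree.get());
}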

Drawbacks

In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the C programming language, the statement x=y/*z; (without any whitespace) will probably lead to a syntax error, since the /* character sequence (unintentionally) initiates a comment that is either unterminated or terminated by the end token */ of some later, unrelated comment (comments in C do not nest). What was actually meant was to assign to the variable x the result of dividing the value in y by the value obtained by dereferencing the pointer z; this would be valid code. It can be expressed by inserting whitespace or by writing x=y/(*z);.
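
The snippet below, valid in both C and C++, illustrates the three spellings; the variable names are arbitrary, and the commented-out line is the one that maximal munch turns into the start of a comment.

int main(void) {
    int y = 10, value = 5;
    int *z = &value;

    // int x = y/*z;   without whitespace, maximal munch reads "/*" as the start of a comment
    int x1 = y / *z;   // whitespace separates the "/" and "*" tokens
    int x2 = y/(*z);   // parentheses make the dereference explicit
    return x1 + x2;
}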

Another example, from C++, involves the "angle bracket" characters < and > used in the syntax for template specialization: two consecutive > characters are interpreted as the right-shift operator >>. [4] Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:

std::vector<std::vector<int>> my_mat_11;   // Incorrect in C++03, correct in C++11.
std::vector<std::vector<int> > my_mat_03;  // Correct in both C++03 and C++11.

The C++11 standard, adopted in August 2011, amended the grammar so that a right-shift token is accepted as synonymous with a pair of right angle brackets (as in Java), which complicates the grammar but allows the continued use of the maximal munch principle. An exception to the maximal munch rule had to be added anyway to deal with the sequence <::, which can appear in templates when a template argument is qualified with a leading :: (the digraph <: is an alternative spelling of [). In that case, unless the sequence is followed by : or >, the character < is interpreted as its own token rather than as part of the token <:.
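
For instance, the following declaration relies on that exception. Without it, maximal munch would read <:: as the digraph <: (an alternative spelling of [) followed by :, and the declaration would not parse; the example is chosen purely to make the <:: sequence appear, by qualifying the template argument with a leading ::.

#include <string>
#include <vector>

// The "<::" sequence is lexed as "<" followed by "::" thanks to the
// C++11 exception to maximal munch, so this declaration is valid.
std::vector<::std::string> names;

int main() {
    names.push_back("maximal munch");
    return 0;
}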

Alternatives

Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to use "follow restrictions", which, instead of directly taking the longest match, restrict which characters may follow a valid match. For example, stipulating that strings matching [a-z]+ cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression. [5] (In the context of regular expressions, the maximal munch principle is referred to as greediness and contrasted with laziness.) Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (e.g., the right-shift token in Java would not be matched in the context of a generics expression, where it is syntactically invalid). [6]
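
The greedy/lazy contrast can be seen directly with regular expression quantifiers. The following sketch uses the C++ standard regex library to match the same input with a greedy and a lazy quantifier; the input string and patterns are chosen purely for illustration.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "<a><b>";

    // Greedy (maximal munch): ".*" consumes as much as possible,
    // so the entire string "<a><b>" is matched.
    std::smatch greedy;
    std::regex_search(input, greedy, std::regex("<.*>"));
    std::cout << "greedy: " << greedy.str() << '\n';   // prints <a><b>

    // Lazy: ".*?" consumes as little as possible, so only "<a>" is matched.
    std::smatch lazy;
    std::regex_search(input, lazy, std::regex("<.*?>"));
    std::cout << "lazy:   " << lazy.str() << '\n';     // prints <a>
}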

Related Research Articles

Atlas Autocode (AA) is a programming language developed around 1963 at the University of Manchester. A variant of the language ALGOL, it was developed by Tony Brooker and Derrick Morris for the Atlas computer. The initial AA and AB compilers were written by Jeff Rohl and Tony Brooker using the Brooker-Morris Compiler-compiler, with a later hand-coded non-CC implementation (ABC) by Jeff Rohl.

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.

Yacc is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a lookahead left-to-right rightmost derivation (LALR) parser generator, generating an LALR parser based on a formal grammar, written in a notation similar to Backus–Naur form (BNF). Yacc is supplied as a standard utility on BSD and AT&T Unix. GNU-based Linux distributions include Bison, a forward-compatible Yacc replacement.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer programming, an S-expression is an expression in a like-named notation for nested list (tree-structured) data. S-expressions were invented for and popularized by the programming language Lisp, which uses them for source code as well as data.

In computer science, augmented Backus–Naur form (ABNF) is a metalanguage based on Backus–Naur form (BNF), but consisting of its own syntax and derivation rules. The motive principle for ABNF is to describe a formal system of a language to be used as a bidirectional communications protocol. It is defined by Internet Standard 68, which as of December 2010 is RFC 5234, and it often serves as the definition language for IETF communication protocols.

In computer science, Backus–Naur form or Backus normal form (BNF) is a metasyntax notation for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols. It is applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.

Lexical tokenization is the conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, and so on. In the case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization used in data preprocessing for large language models, which encodes text into numerical tokens using byte pair encoding.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternative notations for specifying string literals, especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

In logic and computer science, a metasyntax describes the allowable structure and composition of phrases and sentences of a metalanguage, which is used to describe either a natural language or a computer programming language. Some of the widely used formal metalanguages for computer languages are Backus–Naur form (BNF), extended Backus–Naur form (EBNF), Wirth syntax notation (WSN), and augmented Backus–Naur form (ABNF).

A computer programming language is said to adhere to the off-side rule of syntax if blocks in that language are expressed by their indentation. The term was coined by Peter Landin, possibly as a pun on the offside rule in association football. This is contrasted with free-form languages, notably curly-bracket programming languages, where indentation has no computational meaning and indent style is only a matter of coding conventions and formatting. Off-side-rule languages are also described as having significant indentation.

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.

C++11 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

This comparison of programming languages compares the features of language syntax (format) for over 50 computer programming languages.

Wirth syntax notation (WSN) is a metasyntax, that is, a formal way to describe formal languages. It was originally proposed by Niklaus Wirth in 1977 as an alternative to Backus–Naur form (BNF). It has several advantages over BNF: it contains an explicit iteration construct, and it avoids the use of an explicit symbol for the empty string.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

In computer programming languages, an identifier is a lexical token that names the language's entities. Some of the kinds of entities an identifier might denote include variables, data types, labels, subroutines, and modules.

References

  1. Cattell, R. G. G. “Formalization and Automatic Derivation of Code Generators”. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1978.
  2. Aho et al., 168.
  3. Page, 470.
  4. Vandevoorde.
  5. Van den Brand et al., 26.
  6. Van Wyk et al., 63.

Bibliography