Operator-precedence parser

Last updated

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).

Contents

Edsger Dijkstra's shunting yard algorithm is commonly used to implement operator precedence parsers.

Relationship to other parsers

An operator-precedence parser is a simple shift-reduce parser that is capable of parsing a subset of LR(1) grammars. More precisely, the operator-precedence parser can parse all LR(1) grammars where two consecutive nonterminals and epsilon never appear in the right-hand side of any rule.

Operator-precedence parsers are not used often in practice; however they do have some properties that make them useful within a larger design. First, they are simple enough to write by hand, which is not generally the case with more sophisticated right shift-reduce parsers. Second, they can be written to consult an operator table at run time, which makes them suitable for languages that can add to or change their operators while parsing. (An example is Haskell, which allows user-defined infix operators with custom associativity and precedence; consequently, an operator-precedence parser must be run on the program after parsing of all referenced modules.)

Raku sandwiches an operator-precedence parser between two recursive descent parsers in order to achieve a balance of speed and dynamism. GCC's C and C++ parsers, which are hand-coded recursive descent parsers, are both sped up by an operator-precedence parser that can quickly examine arithmetic expressions. Operator precedence parsers are also embedded within compiler-compiler-generated parsers to noticeably speed up the recursive descent approach to expression parsing. [1]

Precedence climbing method

The precedence climbing method is a compact, efficient, and flexible algorithm for parsing expressions that was first described by Martin Richards and Colin Whitby-Strevens. [2]

An infix-notation expression grammar in EBNF format will usually look like this:

expression::=equality-expressionequality-expression::=additive-expression(('=='|'!=')additive-expression)*additive-expression::=multiplicative-expression(('+'|'-')multiplicative-expression)*multiplicative-expression::=primary(('*'|'/')primary)*primary::='('expression')'|NUMBER|VARIABLE|'-'primary

With many levels of precedence, implementing this grammar with a predictive recursive-descent parser can become inefficient. Parsing a number, for example, can require five function calls: one for each non-terminal in the grammar until reaching primary.

An operator-precedence parser can do the same more efficiently. [1] The idea is that we can left associate the arithmetic operations as long as we find operators with the same precedence, but we have to save a temporary result to evaluate higher precedence operators. The algorithm that is presented here does not need an explicit stack; instead, it uses recursive calls to implement the stack.

The algorithm is not a pure operator-precedence parser like the Dijkstra shunting yard algorithm. It assumes that the primary nonterminal is parsed in a separate subroutine, like in a recursive descent parser.

Pseudocode

The pseudocode for the algorithm is as follows. The parser starts at function parse_expression. Precedence levels are greater than or equal to 0.

parse_expression()return parse_expression_1(parse_primary(), 0)
parse_expression_1(lhs, min_precedence)lookahead := peek next token     whilelookahead is a binary operator whose precedence is >= min_precedenceop := lookahead         advance to next token         rhs := parse_primary ()         lookahead := peek next token         whilelookahead is a binary operator whose precedence is greater                  than op's, or a right-associative operator                  whose precedence is equal to op's             rhs := parse_expression_1 (rhs, precedence of op + (1 if lookahead precedence is greater, else 0))             lookahead := peek next token         lhs := the result of applying op with operands lhs and rhsreturnlhs

Note that in the case of a production rule like this (where the operator can only appear once):

equality-expression::=additive-expression('=='|'!=')additive-expression

the algorithm must be modified to accept only binary operators whose precedence is > min_precedence.

Example execution of the algorithm

An example execution on the expression 2 + 3 * 4 + 5 == 19 is as follows. We give precedence 0 to equality expressions, 1 to additive expressions, 2 to multiplicative expressions.

parse_expression_1 (lhs = 2, min_precedence = 0)

  • the lookahead token is *, with precedence 2. the outer while loop is entered.
  • op is * (precedence 2) and the input is advanced
  • rhs is 4
  • the next token is +, with precedence 1. the inner while loop is not entered.
  • lhs is assigned 3*4 = 12
  • the next token is +, with precedence 1. the outer while loop is left.
  • 12 is returned.

1 is returned.

Pratt parsing

Another precedence parser known as Pratt parsing was first described by Vaughan Pratt in the 1973 paper "Top down operator precedence", [3] based on recursive descent. Though it predates precedence climbing, it can be viewed as a generalization of precedence climbing. [4]

Pratt designed the parser originally to implement the CGOL programming language, and it was treated in much more depth in a Masters Thesis under his supervision. [5]

Tutorials and implementations:

Alternative methods

There are other ways to apply operator precedence rules. One is to build a tree of the original expression and then apply tree rewrite rules to it.

Such trees do not necessarily need to be implemented using data structures conventionally used for trees. Instead, tokens can be stored in flat structures, such as tables, by simultaneously building a priority list which states what elements to process in which order.

Full parenthesization

Another approach is to first fully parenthesize the expression, inserting a number of parentheses around each operator, such that they lead to the correct precedence even when parsed with a linear, left-to-right parser. This algorithm was used in the early FORTRAN I compiler: [7]

The Fortran I compiler would expand each operator with a sequence of parentheses. In a simplified form of the algorithm, it would

  • replace + and with ))+(( and ))-((, respectively;
  • replace * and / with )*( and )/(, respectively;
  • add (( at the beginning of each expression and after each left parenthesis in the original expression; and
  • add )) at the end of the expression and before each right parenthesis in the original expression.

Although not obvious, the algorithm was correct, and, in the words of Knuth, “The resulting formula is properly parenthesized, believe it or not.” [8]

Example code of a simple C application that handles parenthesisation of basic math operators (+, -, *, /, ^, ( and )):

#include<stdio.h>#include<string.h>// The command-line argument boundary is our lexer.intmain(intargc,char*argv[]){inti;printf("((((");for(i=1;i!=argc;i++){// strlen(argv[i]) == 2if(argv[i]&&!argv[i][1]){switch(*argv[i]){case'(':printf("((((");continue;case')':printf("))))");continue;case'^':printf(")^(");continue;case'*':printf("))*((");continue;case'/':printf("))/((");continue;case'+':// unary check: either first or had an operator expecting secondary argumentif(i==1||strchr("(^*/+-",*argv[i-1]))printf("+");elseprintf(")))+(((");continue;case'-':if(i==1||strchr("(^*/+-",*argv[i-1]))printf("-");elseprintf(")))-(((");continue;}}printf("%s",argv[i]);}printf("))))\n");return0;}

First, you need to compile your program. Assuming your program is written in C and the source code is in a file named program.c, you would use the following command:

gcc program.c -o program

The above command tells gcc to compile program.c and create an executable named program.

Command to run the program with parameters, For example; a * b + c ^ d / e

./program a '*' b + c '^' d / e

it produces

((((a))*((b)))+(((c)^(d))/((e))))

as output on the console.

A limitation to this strategy is that unary operators must all have higher precedence than infix operators. The "negative" operator in the above code has a higher precedence than exponentiation. Running the program with this input

- a ^ 2

produces this output

((((-a)^(2))))

which is probably not what is intended.

Related Research Articles

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

Mary is a programming language designed and implemented by RUNIT at Trondheim, Norway in the 1970s. It borrowed many features from ALGOL 68 but was designed for systems programming.

Rebol is a cross-platform data exchange language and a multi-paradigm dynamic programming language designed by Carl Sassenrath for network communications and distributed computing. It introduces the concept of dialecting: small, optimized, domain-specific languages for code and data, which is also the most notable property of the language according to its designer Carl Sassenrath:

Although it can be used for programming, writing functions, and performing processes, its greatest strength is the ability to easily create domain-specific languages or dialects

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

In computer programming, M-expressions were an early proposed syntax for the Lisp programming language, inspired by contemporary languages such as Fortran and ALGOL. The notation was never implemented into the language and, as such, it was never finalized.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In programming languages, scientific calculators and similar common operator notation or operator grammar is a way to define and analyse mathematical and other formal expressions. In this model a linear sequence of tokens are divided into two classes: operators and operands.

In computer programming, operators are constructs defined within programming languages which behave generally like functions, but which differ syntactically or semantically.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In computer science, the shunting yard algorithm is a method for parsing arithmetical or logical expressions, or a combination of both, specified in infix notation. It can produce either a postfix notation string, also known as Reverse Polish notation (RPN), or an abstract syntax tree (AST). The algorithm was invented by Edsger Dijkstra and named the "shunting yard" algorithm because its operation resembles that of a railroad shunting yard. Dijkstra first described the shunting yard algorithm in the Mathematisch Centrum report MR 34/61.

In computer science, tail recursive parsers are a derivation from the more common recursive descent parsers. Tail recursive parsers are commonly used to parse left recursive grammars. They use a smaller amount of stack space than regular recursive descent parsers. They are also easy to write. Typical recursive descent parsers make parsing left recursive grammars impossible. Tail recursive parsers use a node reparenting technique that makes this allowable.

META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as:

Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found.

CGOL is an alternative syntax featuring an extensible algebraic notation for the Lisp programming language. It was designed for MACLISP by Vaughan Pratt and subsequently ported to Common Lisp.

In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.

<span class="mw-page-title-main">History of compiler construction</span>

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was put in public domain and has been since maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of ISO 10303 standard.

References

  1. 1 2 Harwell, Sam (2008-08-29). "Operator precedence parser". ANTLR3 Wiki. Retrieved 2017-10-25.
  2. Richards, Martin; Whitby-Strevens, Colin (1979). BCPL — the language and its compiler. Cambridge University Press. ISBN   9780521219655.
  3. Pratt, Vaughan. "Top down operator precedence." Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (1973).
  4. Norvell, Theodore. "Parsing Expressions by Recursive Descent". www.engr.mun.ca. The purpose of this post is to [... start] with precedence climbing and refactoring it to use the command pattern until we arrive at a Pratt parser. [This is the author who coined the term "precedence climbing".]
  5. Van De Vanter, Michael L. "A Formalization and Correctness Proof of the CGOL Language System." (Master's Thesis). MIT Laboratory for Computer Science Technical Report MIT-LCS-TR-147 (Cambridge, Massachusetts). 1975.
  6. Crockford, D (2007-02-21). "Top Down Operator Precedence".
  7. Padua, David (2000). "The Fortran I Compiler" (PDF). Computing in Science & Engineering. 2 (1): 70–75. Bibcode:2000CSE.....2a..70P. doi:10.1109/5992.814661.
  8. Knuth, Donald E. (1962). "A HISTORY OF WRITING COMPILERS". Computers and Automation. Edmund C. Berkeley. 11 (12): 8–14.