Recursive ascent parser

Last updated

In computer science, recursive ascent parsing is a technique for implementing an LR parser which uses mutually-recursive functions rather than tables. Thus, the parser is directly encoded in the host language similar to recursive descent. Direct encoding usually yields a parser which is faster than its table-driven equivalent [1] for the same reason that compilation is faster than interpretation. It is also (nominally) possible to hand edit a recursive ascent parser, whereas a tabular implementation is nigh unreadable to the average human.

Contents

Recursive ascent was first described by Thomas Pennello in his article Pennello, Thomas J. (1986). "Very fast LR parsing". Proceedings of the 1986 SIGPLAN symposium on Compiler construction - SIGPLAN '86. pp. 145–151. doi:10.1145/12276.13326. ISBN   0897911970. S2CID   17214407. in 1986. He was not intending to create a hand-editable implementation of an LR parser, but rather a maintainable and efficient parser implemented in assembly language. The technique was later expounded upon by G.H. Roberts [2] in 1988 as well as in an article by Leermakers, Augusteijn, Kruseman Aretz [3] in 1992 in the journal Theoretical Computer Science. An extremely readable description of the technique was written by Morell and Middleton [4] in 2003. A good exposition can also be found in a TOPLAS article by Sperber and Thiemann. [5]

Recursive ascent has also been merged with recursive descent, yielding a technique known as recursive ascent/descent. This implementation technique is arguably easier to hand-edit due to the reduction in states and fact that some of these states are more intuitively top-down rather than bottom up. It can also yield some minimal performance improvements over conventional recursive ascent. [6]

Summary

Intuitively, recursive ascent is a literal implementation of the LR parsing concept. Each function in the parser represents a single LR automaton state. Within each function, a multi-branch statement is used to select the appropriate action based on the current token read from the input stream. Once the token has been identified, action is taken based on the state being encoded. There are two different fundamental actions which may be taken based on the token in question:

There is also a third LR automaton action which may be taken in a given state, but only after a reduce where the shift counter has decremented to zero (indicating that the current state should handle the result). This is the goto action, which is essentially a special case of shift designed to handle non-terminals in a production. This action must be handled after the multi-branch statement, since this is where any reduction results will "resurface" from farther down the call stack.

Example

Consider the following grammar in bison syntax:

expr : expr '+' term   { $$ = $1 + $3; }      | expr '-' term   { $$ = $1 - $3; }      | term            { $$ = $1; }      ;  term : '(' expr ')'    { $$ = $2; }      | num             { $$ = $1; }      ;  num : '0'              { $$ = 0; }     | '1'              { $$ = 1; }     ;

This grammar is LR(0) in that it is left-recursive (in the expr non-terminal) but does not require any lookahead. Recursive ascent is also capable of handling grammars which are LALR(1) in much the same way that table-driven parsers handle such cases (by pre-computing conflict resolutions based on possible lookahead).

The following is a Scala implementation of a recursive ascent parser based on the above grammar:

objectExprParser{privatetypeResult=(NonTerminal,Int)privatesealedtraitNonTerminal{valv:Int}privatecaseclassNTexpr(v:Int,in:Stream[Char])extendsNonTerminalprivatecaseclassNTterm(v:Int,in:Stream[Char])extendsNonTerminalprivatecaseclassNTnum(v:Int,in:Stream[Char])extendsNonTerminalclassParseException(msg:String)extendsRuntimeException(msg){defthis()=this("")defthis(c:Char)=this(c.toString)}defparse(in:Stream[Char])=state0(in)._1.v/*   * 0 $accept: . expr $end   *   * '('  shift, and go to state 1   * '0'  shift, and go to state 2   * '1'  shift, and go to state 3   *   * expr  go to state 4   * term  go to state 5   * num   go to state 6   */privatedefstate0(in:Stream[Char])=inmatch{casecur#::tail=>{defloop(tuple:Result):Result={val(res,goto)=tupleif(goto==0){loop(resmatch{caseNTexpr(v,in)=>state4(in,v)caseNTterm(v,in)=>state5(in,v)caseNTnum(v,in)=>state6(in,v)})}else(res,goto-1)}loop(curmatch{case'('=>state1(tail)case'0'=>state2(tail)case'1'=>state3(tail)casec=>thrownewParseException(c)})}caseStream()=>thrownewParseException}/*   * 4 term: '(' . expr ')'   *   * '('  shift, and go to state 1   * '0'  shift, and go to state 2   * '1'  shift, and go to state 3   *   * expr  go to state 7   * term  go to state 5   * num   go to state 6   */privatedefstate1(in:Stream[Char]):Result=inmatch{casecur#::tail=>{defloop(tuple:Result):Result={val(res,goto)=tupleif(goto==0){loop(resmatch{caseNTexpr(v,in)=>state7(in,v)caseNTterm(v,in)=>state5(in,v)caseNTnum(v,in)=>state6(in,v)})}else(res,goto-1)}loop(curmatch{case'('=>state1(tail)case'0'=>state2(tail)case'1'=>state3(tail)casec=>thrownewParseException(c)})}caseStream()=>thrownewParseException}/*   * 6 num: '0' .   *   * $default  reduce using rule 6 (num)   */privatedefstate2(in:Stream[Char])=(NTnum(0,in),0)/*   * 7 num: '1' .   *   * $default  reduce using rule 7 (num)   */privatedefstate3(in:Stream[Char])=(NTnum(1,in),0)/*   * 0 $accept: expr . $end   * 1 expr: expr . '+' term   * 2     | expr . '-' term   *   * $end  shift, and go to state 8   * '+'   shift, and go to state 9   * '-'   shift, and go to state 10   */privatedefstate4(in:Stream[Char],arg1:Int):Result=inmatch{casecur#::tail=>{decrement(curmatch{case'+'=>state9(tail,arg1)case'-'=>state10(tail,arg1)casec=>thrownewParseException(c)})}caseStream()=>state8(arg1)}/*   * 3 expr: term .   *   * $default  reduce using rule 3 (expr)   */privatedefstate5(in:Stream[Char],arg1:Int)=(NTexpr(arg1,in),0)/*   * 5 term: num .   *   * $default  reduce using rule 5 (term)   */privatedefstate6(in:Stream[Char],arg1:Int)=(NTterm(arg1,in),0)/*   * 1 expr: expr . '+' term   * 2     | expr . '-' term   * 4 term: '(' expr . ')'   *   * '+'  shift, and go to state 9   * '-'  shift, and go to state 10   * ')'  shift, and go to state 11   */privatedefstate7(in:Stream[Char],arg1:Int):Result=inmatch{casecur#::tail=>{decrement(curmatch{case'+'=>state9(tail,arg1)case'-'=>state10(tail,arg1)case')'=>state11(tail,arg1)casec=>thrownewParseException(c)})}caseStream()=>thrownewParseException}/*   * 0 $accept: expr $end .   *   * $default  accept   */privatedefstate8(arg1:Int)=(NTexpr(arg1,Stream()),1)/*   * 1 expr: expr '+' . term   *   * '('  shift, and go to state 1   * '0'  shift, and go to state 2   * '1'  shift, and go to state 3   *   * term  go to state 12   * num   go to state 6   */privatedefstate9(in:Stream[Char],arg1:Int)=inmatch{casecur#::tail=>{defloop(tuple:Result):Result={val(res,goto)=tupleif(goto==0){loop(resmatch{caseNTterm(v,in)=>state12(in,arg1,v)caseNTnum(v,in)=>state6(in,v)case_=>thrownewAssertionError})}else(res,goto-1)}loop(curmatch{case'('=>state1(tail)case'0'=>state2(tail)case'1'=>state3(tail)casec=>thrownewParseException(c)})}caseStream()=>thrownewParseException}/*   * 2 expr: expr '-' . term   *   * '('  shift, and go to state 1   * '0'  shift, and go to state 2   * '1'  shift, and go to state 3   *   * term  go to state 13   * num   go to state 6   */privatedefstate10(in:Stream[Char],arg1:Int)=inmatch{casecur#::tail=>{defloop(tuple:Result):Result={val(res,goto)=tupleif(goto==0){loop(resmatch{caseNTterm(v,in)=>state13(in,arg1,v)caseNTnum(v,in)=>state6(in,v)case_=>thrownewAssertionError})}else(res,goto-1)}loop(curmatch{case'('=>state1(tail)case'0'=>state2(tail)case'1'=>state3(tail)casec=>thrownewParseException(c)})}caseStream()=>thrownewParseException}/*   * 4 term: '(' expr ')' .   *   * $default  reduce using rule 4 (term)   */privatedefstate11(in:Stream[Char],arg1:Int)=(NTterm(arg1,in),2)/*   * 1 expr: expr '+' term .   *   * $default  reduce using rule 1 (expr)   */privatedefstate12(in:Stream[Char],arg1:Int,arg2:Int)=(NTexpr(arg1+arg2,in),2)/*   * 2 expr: expr '-' term .   *   * $default  reduce using rule 2 (expr)   */privatedefstate13(in:Stream[Char],arg1:Int,arg2:Int)=(NTexpr(arg1-arg2,in),2)privatedefdecrement(tuple:Result)={val(res,goto)=tupleassert(goto!=0)(res,goto-1)}}

The following is a Prolog implementation of a recursive ascent parser based on the above grammar:

state(S),[S]-->[S].state(S0,S),[S]-->[S0]./*    0. S --> E$    1. E --> E + T    2. E --> E - T    3. E --> T    4. T --> (E)    5. T --> N    6. N --> 0    7. N --> 1*/accept-->state(s([],[e(_)])).r(1)-->state(s(Ts,[t(A1),'+',e(A0)|Ss]),s(Ts,[e(A0+A1)|Ss])).r(2)-->state(s(Ts,[t(A1),'-',e(A0)|Ss]),s(Ts,[e(A0-A1)|Ss])).r(3)-->state(s(Ts,[t(A)|Ss]),s(Ts,[e(A)|Ss])).r(4)-->state(s(Ts,[')',e(A),'('|Ss]),s(Ts,[t(A)|Ss])).r(5)-->state(s(Ts,[n(A)|Ss]),s(Ts,[t(A)|Ss])).r(6)-->state(s(Ts,['0'|Ss]),s(Ts,[n(0)|Ss])).r(7)-->state(s(Ts,['1'|Ss]),s(Ts,[n(1)|Ss])).t(T)-->state(s([T|Ts],Ss),s(Ts,[T|Ss]))./*    S --> .E$    E --> .E + T    E --> .E - T    E --> .T    T --> .(E)    T --> .N    N --> .0    N --> .1*/s0-->t('('),s3,s2,s1.s0-->t('0'),s11,s10,s2,s1.s0-->t('1'),s12,s10,s2,s1./*    S --> E.$    E --> E. + T    E --> E. - T*/s1-->accept.s1-->t('+'),s7,s1.s1-->t('-'),s8,s1./*    E --> T.*/s2-->r(3)./*    T --> (.E)    E --> .E + T    E --> .E - T    E --> .T    T --> .(E)    T --> .N    N --> .0    N --> .1*/s3-->t('('),s3,s2,s4.s3-->t('0'),s11,s10,s2,s4.s3-->t('1'),s12,s10,s2,s4./*    T --> (E.)    E --> E .+ T    E --> E .- T*/s4-->t(')'),s9.s4-->t('+'),s7,s4.s4-->t('-'),s8,s4./*    E --> E + T.*/s5-->r(1)./*    E --> E - T.*/s6-->r(2)./*    E --> E + .T    T --> .(E)    T --> .N    N --> .0    N --> .1*/s7-->t('('),s3,s5.s7-->t('0'),s11,s10,s5.s7-->t('1'),s12,s10,s5./*    E --> E - .T    T --> .(E)    T --> .N    N --> .0    N --> .1*/s8-->t('('),s3,s6.s8-->t('0'),s11,s10,s6.s8-->t('1'),s12,s10,s6./*    T --> (E).*/s9-->r(4)./*    T --> N.*/s10-->r(5)./*    N --> '0'.*/s11-->r(6)./*    N --> '1'.*/s12-->r(7).parser(Cs,T):-length(Cs,_),phrase(s0,[s(Cs,[])],[s([],[e(T)])]).% state(S0, S), [S] --> [S0, S].%?- S0 = [s("(1+1)", [])|_], phrase(s0, S0, [S]), maplist(portray_clause, S0).

See also

Related Research Articles

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, Backus–Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable, in the language specification to which the computer code adheres. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

In computer science, corecursion is a type of operation that is dual to recursion. Whereas recursion works analytically, starting on data further from a base case and breaking it down into smaller data and repeating until one reaches a base case, corecursion works synthetically, starting from a base case and building it up, iteratively producing data further removed from a base case. Put simply, corecursive algorithms use the data that they themselves produce, bit by bit, as they become available, and needed, to produce further bits of data. A similar but distinct concept is generative recursion which may lack a definite "direction" inherent in corecursion and recursion.

In mathematics and in computer programming, a variadic function is a function of indefinite arity, i.e., one which accepts a variable number of arguments. Support for variadic functions differs widely among programming languages.

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

<span class="mw-page-title-main">Scala (programming language)</span> General-purpose programming language

Scala is a strong statically typed high-level general-purpose programming language that supports both object-oriented programming and functional programming. Designed to be concise, many of Scala's design decisions are intended to address criticisms of Java.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

In computer science, tail recursive parsers are a derivation from the more common recursive descent parsers. Tail recursive parsers are commonly used to parse left recursive grammars. They use a smaller amount of stack space than regular recursive descent parsers. They are also easy to write. Typical recursive descent parsers make parsing left recursive grammars impossible. Tail recursive parsers use a node reparenting technique that makes this allowable.

C++11 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

stdarg.h is a header in the C standard library of the C programming language that allows functions to accept an indefinite number of arguments. It provides facilities for stepping through a list of function arguments of unknown number and type. C++ provides this functionality in the header cstdarg.

<span class="mw-page-title-main">ATS (programming language)</span> Programming language

In computing, ATS is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function conforms to its specification.

In computer programming, variadic templates are templates that take a variable number of arguments.

SLR grammars are the class of formal grammars accepted by a Simple LR parser. SLR grammars are a superset of all LR(0) grammars and a subset of all LALR(1) and LR(1) grammars.

Nemerle is a general-purpose, high-level, statically typed programming language designed for platforms using the Common Language Infrastructure (.NET/Mono). It offers functional, object-oriented, aspect-oriented, reflective and imperative features. It has a simple C#-like syntax and a powerful metaprogramming system.

jq (programming language) Programming language for JSON

jq is a very high-level lexically scoped functional programming language in which every JSON value is a constant. jq supports backtracking and managing indefinitely long streams of JSON data. It is related to the Icon and Haskell programming languages. The language supports a namespace-based module system and has some support for closures. In particular, functions and functional expressions can be used as parameters of other functions.

References

  1. Thomas J Penello (1986). "Very fast LR parsing". ACM SIGPLAN Notices. 21 (7): 145–151. doi: 10.1145/13310.13326 .
  2. G.H. Roberts (1988). "Recursive ascent: an LR analog to recursive descent". ACM SIGPLAN Notices. 23 (8): 23–29. doi: 10.1145/47907.47909 . S2CID   12740771.
  3. Leermakers, Augusteijn, Kruseman Aretz (1992). "A functional LR parser". Theoretical Computer Science. 104 (2): 313–323. doi: 10.1016/0304-3975(92)90128-3 .{{cite journal}}: CS1 maint: multiple names: authors list (link)
  4. Larry Morell & David Middleton (2003). "Recursive-ascent parsing". Journal of Computing Sciences in Colleges. Vol. 18, no. 6. pp. 186–201.
  5. Sperber and Thiemann (2000). "Generation of LR parsers by partial evaluation". ACM Transactions on Programming Languages and Systems. 22 (2): 224–264. doi: 10.1145/349214.349219 . S2CID   14955687.
  6. John Boyland & Daniel Spiewak (2009). "ScalaBison Recursive Ascent-Descent Parser Generator" (PDF). Archived from the original (PDF) on 2009-07-18.