Recursive descent parser

Last updated

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes. [1]

Contents

A predictive parser is a recursive descent parser that does not require backtracking. [2] Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar. A predictive parser runs in linear time.

Recursive descent with backtracking is a technique that determines which production to use by trying each production in turn. Recursive descent with backtracking is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backtracking may require exponential time.

Although predictive parsers are widely used, and are frequently chosen if writing a parser by hand, programmers often prefer to use a table-based parser produced by a parser generator,[ citation needed ] either for an LL(k) language or using an alternative parser, such as LALR or LR. This is particularly the case if a grammar is not in LL(k) form, as transforming the grammar to LL to make it suitable for predictive parsing is involved. Predictive parsers can also be automatically generated, using tools like ANTLR.

Predictive parsers can be depicted using transition diagrams for each non-terminal symbol where the edges between the initial and the final states are labelled by the symbols (terminals and non-terminals) of the right side of the production rule. [3]

Example parser

The following EBNF-like grammar (for Niklaus Wirth's PL/0 programming language, from Algorithms + Data Structures = Programs ) is in LL(1) form:

program =block ".".block =["const"ident "="number {","ident "="number}";"]["var"ident {","ident}";"]{"procedure"ident ";"block ";"}statement .statement =ident ":="expression|"call"ident|"begin"statement {";"statement }"end"|"if"condition "then"statement|"while"condition "do"statement .condition ="odd"expression|expression ("="|"#"|"<"|"<="|">"|">=")expression .expression =["+"|"-"]term {("+"|"-")term}.term =factor {("*"|"/")factor}.factor =ident|number|"("expression ")".

Terminals are expressed in quotes. Each nonterminal is defined by a rule in the grammar, except for ident and number, which are assumed to be implicitly defined.

C implementation

What follows is an implementation of a recursive descent parser for the above language in C. The parser reads in source code, and exits with an error message if the code fails to parse, exiting silently if the code parses correctly.

Notice how closely the predictive parser below mirrors the grammar above. There is a procedure for each nonterminal in the grammar. Parsing descends in a top-down manner until the final nonterminal has been processed. The program fragment depends on a global variable, sym, which contains the current symbol from the input, and the function nextsym, which updates sym when called.

The implementations of the functions nextsym and error are omitted for simplicity.

typedefenum{ident,number,lparen,rparen,times,slash,plus,minus,eql,neq,lss,leq,gtr,geq,callsym,beginsym,semicolon,endsym,ifsym,whilesym,becomes,thensym,dosym,constsym,comma,varsym,procsym,period,oddsym}Symbol;Symbolsym;voidnextsym(void);voiderror(constcharmsg[]);intaccept(Symbols){if(sym==s){nextsym();return1;}return0;}intexpect(Symbols){if(accept(s))return1;error("expect: unexpected symbol");return0;}voidfactor(void){if(accept(ident)){;}elseif(accept(number)){;}elseif(accept(lparen)){expression();expect(rparen);}else{error("factor: syntax error");nextsym();}}voidterm(void){factor();while(sym==times||sym==slash){nextsym();factor();}}voidexpression(void){if(sym==plus||sym==minus)nextsym();term();while(sym==plus||sym==minus){nextsym();term();}}voidcondition(void){if(accept(oddsym)){expression();}else{expression();if(sym==eql||sym==neq||sym==lss||sym==leq||sym==gtr||sym==geq){nextsym();expression();}else{error("condition: invalid operator");nextsym();}}}voidstatement(void){if(accept(ident)){expect(becomes);expression();}elseif(accept(callsym)){expect(ident);}elseif(accept(beginsym)){do{statement();}while(accept(semicolon));expect(endsym);}elseif(accept(ifsym)){condition();expect(thensym);statement();}elseif(accept(whilesym)){condition();expect(dosym);statement();}else{error("statement: syntax error");nextsym();}}voidblock(void){if(accept(constsym)){do{expect(ident);expect(eql);expect(number);}while(accept(comma));expect(semicolon);}if(accept(varsym)){do{expect(ident);}while(accept(comma));expect(semicolon);}while(accept(procsym)){expect(ident);expect(semicolon);block();expect(semicolon);}statement();}voidprogram(void){nextsym();block();expect(period);}

Examples

Some recursive descent parser generators:

See also

Related Research Articles

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is related to the type of tokenization used in Large language models (LLMs), but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

PL/0 is a programming language, intended as an educational programming language, that is similar to but much simpler than Pascal, a general-purpose programming language. It serves as an example of how to construct a compiler. It was originally introduced in the book, Algorithms + Data Structures = Programs, by Niklaus Wirth in 1976. It features quite limited language constructs: there are no real numbers, very few basic arithmetic operations and no control-flow constructs other than "if" and "while" blocks. While these limitations make writing real applications in this language impractical, it helps the compiler remain compact and simple.

The dangling else is a problem in programming of parser generators in which an optional else clause in an if–then(–else) statement results in nested conditionals being ambiguous. Formally, the reference context-free grammar of the language is ambiguous, meaning there is more than one correct parse tree.

Coco/R is a compiler generator that takes wirth syntax notation grammars of a source language and generates a scanner and a parser for that language.

Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. Birman originally named his formalism the TMG Schema (TS), after TMG, an early parser generator, but it was later given the name TDPL by Aho and Ullman in their classic anthology The Theory of Parsing, Translation and Compiling.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language and a suffix. For instance, can be recognized as a sum because it can be broken into , also a sum, and , a suitable suffix.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

This is a list of notable lexer generators and parser generators for various language classes.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

References

  1. Burge, W.H. (1975). Recursive Programming Techniques . ISBN   0-201-14450-6.
  2. Watson, Des (22 March 2017). A Practical Approach to Compiler Construction. Springer. ISBN   978-3-319-52789-5.
  3. Aho, Alfred V.; Sethi, Ravi; Ullman, Jeffrey (1986). Compilers: Principles, Techniques and Tools (first ed.). Addison Wesley. p.  183.

General references